THU0556 MISSING DATA AND MULTIPLE IMPUTATION IN RHEUMATOID ARTHRITIS REGISTRIES USING SEQUENTIAL RANDOM FOREST METHOD

Ahmed Al-Saber,Adeeba Al-Herz,Jiazhu Pan,Khulood Saleh,Adel Al-Awadhi,Waleed Al-Kandari,Eman Hasan,Aqeel Ghanem,Mohammed Hussain,Yaser Ali,E. Nahar,Ahmad Alenizi,Sawsan Hayat,Fatemah Abutiban,A. Aledei,A. Al-Qadhi,Hebah Alhajeri,H. Behbehani,Naser Alhadhood

2020

Background: Missing data in clinical epidemiological research violate the intention-to-treat principle, reduce statistical power and can induce bias if they are related to the patient's response to treatment. In multiple imputation (MI), covariates are included in the imputation equation to predict the values of missing data.

Objectives: To find the best approach to estimate and impute the missing values in patient data from the Kuwait Registry for Rheumatic Diseases (KRRD).

Methods: A number of methods were implemented for dealing with missing data. These included multivariate imputation by chained equations (MICE), K-Nearest Neighbors (KNN), Bayesian Principal Component Analysis (BPCA), EM with Bootstrapping (Amelia II), Sequential Random Forest (MissForest) and mean imputation. The best imputation method was taken to be the one with the lowest Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Kolmogorov–Smirnov D test statistic (KS) between the imputed data points and the original data points that had subsequently been set to missing.

Results: A total of 1,685 rheumatoid arthritis (RA) patients and 10,613 hospital visits were included in the registry. Among the registry variables, a number had missing values exceeding 5% of the total values. These included duration of RA (13.0%), smoking history (26.3%), rheumatoid factor (7.93%), anti-citrullinated peptide antibodies (20.5%), anti-nuclear antibodies (20.4%), sicca symptoms (19.2%), family history of a rheumatic disease (28.5%), steroid therapy (5.94%), ESR (5.16%), CRP (22.9%) and SDAI (38.0%). Among the methods used, MissForest estimated the missing values most accurately. It had the lowest imputation errors for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction differences when the models used imputed laboratory values.
In both data sets, MICE had the second-lowest imputation errors and prediction differences, followed by KNN and mean imputation.

Conclusion: MissForest is a highly accurate method of imputation for missing data in KRRD and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in clinical predictive models. This approach can be used in registries to improve the accuracy of data, including those for rheumatoid arthritis patients.

Disclosure of Interests: None declared
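
The evaluation described in the Methods (set known values to missing, impute them, then compare imputed and original values by RMSE, MAE and the KS D statistic) can be illustrated with a minimal Python sketch. This is not the authors' code. The data are synthetic, mean imputation stands in for the compared methods because it needs no external library, and every function name below is hypothetical.

```python
import math
import random
from bisect import bisect_right


def rmse(true, pred):
    """Root mean square error between paired values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error between paired values."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def mean_impute(values):
    """Baseline imputer: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def evaluate(complete, impute, missing_fraction, seed=0):
    """Mask a random fraction of known values, impute, and score the masked cells."""
    rng = random.Random(seed)
    n_mask = max(1, round(missing_fraction * len(complete)))
    masked_idx = sorted(rng.sample(range(len(complete)), n_mask))
    masked_set = set(masked_idx)
    with_gaps = [None if i in masked_set else v for i, v in enumerate(complete)]
    imputed = impute(with_gaps)
    true = [complete[i] for i in masked_idx]
    pred = [imputed[i] for i in masked_idx]
    return {
        "rmse": rmse(true, pred),
        "mae": mae(true, pred),
        "ks": ks_statistic(true, pred),
    }


# Synthetic stand-in for a fully observed continuous variable (not registry data).
data_rng = random.Random(42)
synthetic = [data_rng.gauss(30.0, 10.0) for _ in range(500)]
scores = evaluate(synthetic, mean_impute, missing_fraction=0.2)
print(scores)
```

Any imputer that takes a list containing `None` and returns a completed list could be passed to `evaluate` in place of `mean_impute`, and repeating the call over several values of `missing_fraction` mirrors the comparison "at each frequency of missingness". Lower values on all three metrics indicate imputed values closer to the originals.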
