      Is Open Access

      Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

      research-article


          Abstract

          Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
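The "missing at random" step described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation design: the variable names, the logistic missingness model, and the `intercept` and `slope` values are all assumptions chosen for illustration. The point it shows is that the probability of a value being missing depends only on a fully observed variable (`x`), never on the value being removed (`y`).

```python
import math
import random


def make_mar(rows, rng, intercept=-1.0, slope=1.5):
    """Return a copy of rows in which y is replaced by None with probability
    logistic(intercept + slope * x).

    The missingness probability uses only the fully observed x, which is
    what makes the mechanism "missing at random" (MAR). The intercept and
    slope are hypothetical values, not taken from the article.
    """
    out = []
    for x, y in rows:
        p_missing = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        out.append((x, None if rng.random() < p_missing else y))
    return out


rng = random.Random(0)

# Hypothetical complete data: y depends on x in a nonlinear way,
# loosely echoing the setting of the second simulation study.
complete = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    y = x ** 2 + rng.gauss(0, 1)
    complete.append((x, y))

with_missing = make_mar(complete, rng)
n_missing = sum(1 for _, y in with_missing if y is None)
print(f"{n_missing} of {len(with_missing)} y values set to missing")
```

An imputation method (parametric or random forest based) would then be applied to `with_missing`, and its parameter estimates compared with those obtained from `complete`.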


                Author and article information

                Journal: American Journal of Epidemiology (Am J Epidemiol)
                Publisher: Oxford University Press
                ISSN: 0002-9262, 1476-6256
                Dates: 15 March 2014; 12 January 2014
                Volume: 179
                Issue: 6
                Pages: 764-774
                Author notes
                * Correspondence to Dr. Anoop D. Shah, Clinical Epidemiology Group, Department of Epidemiology and Public Health, School of Life and Medical Sciences, University College London, Wolfson House, 2-10 Stephenson Way, London NW1 2HE, United Kingdom (e-mail: anoop@doctors.org.uk).

                Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MAR, missing at random; MICE, multivariate imputation by chained equations.

                Article
                Publisher ID: kwt312
                DOI: 10.1093/aje/kwt312
                PMCID: PMC3939843
                PMID: 24589914
                c39f8c5d-ab88-4f84-83c8-9d062ea6ece5
                © The Author 2014. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                5 April 2013
                20 November 2013
                Page count
                Pages: 11
                Categories
                Practice of Epidemiology

                Public health
                Keywords: angina, stable; imputation; missing data; missingness at random; regression trees; simulation; survival
