      Is Open Access

      A Benchmark for Data Imputation Methods

      research-article


          Abstract

          With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. Its importance has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when not detected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are scarce. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing novel deep learning approaches with classical ML imputation methods when either only the test data or both the training and test data are affected by missing values. Each imputation method is evaluated with respect to imputation quality and the impact of imputation on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers select data preprocessing methods for automated data quality improvement.
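          The evaluation idea in the abstract (remove known values under a chosen missingness condition, impute them, and score the imputation against the held-back truth) can be illustrated with a minimal sketch. This is not the authors' benchmark code. The synthetic data, the helper names (`make_missing_mcar`, `make_missing_mnar`, `impute_mean`, `imputation_rmse`), the 30% missing fraction, the use of mean imputation as the baseline, and the choice of "drop the largest values" as an MNAR mechanism are all assumptions made for illustration. MAR is omitted because it needs a second column to condition on.

```python
import math
import random
import statistics


def make_missing_mcar(values, fraction, rng):
    """MCAR: hide a uniformly random subset of entries (marked as None)."""
    n_missing = round(fraction * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(values)]


def make_missing_mnar(values, fraction):
    """MNAR (one simple, assumed mechanism): hide the largest values,
    so whether an entry is missing depends on the value itself."""
    n_missing = round(fraction * len(values))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    hidden = set(order[:n_missing])
    return [None if i in hidden else v for i, v in enumerate(values)]


def impute_mean(observed):
    """Replace every missing entry with the mean of the observed entries."""
    mean = statistics.fmean(v for v in observed if v is not None)
    return [mean if v is None else v for v in observed]


def imputation_rmse(truth, observed, imputed):
    """Root mean squared error, computed only over the entries that were hidden."""
    squared_errors = [
        (t - p) ** 2 for t, o, p in zip(truth, observed, imputed) if o is None
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic ground truth: a single numeric column (hypothetical data).
rng = random.Random(0)
truth = [rng.gauss(50.0, 10.0) for _ in range(1000)]

results = {}
for name, observed in {
    "MCAR": make_missing_mcar(truth, 0.3, rng),
    "MNAR": make_missing_mnar(truth, 0.3),
}.items():
    results[name] = imputation_rmse(truth, observed, impute_mean(observed))
    print(f"{name}: RMSE of mean imputation = {results[name]:.2f}")
```

          In this sketch, mean imputation scores noticeably worse under the MNAR mechanism than under MCAR, because the observed mean is biased away from the hidden values. A fuller comparison would swap `impute_mean` for other imputers and additionally measure the effect on a downstream model, as the abstract describes.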


                Author and article information

                Contributors
                Journal: Frontiers in Big Data (Front. Big Data)
                Publisher: Frontiers Media S.A.
                ISSN: 2624-909X
                Published: 08 July 2021
                Volume: 4
                Article number: 693674
                Affiliations
                Beuth University of Applied Sciences, Berlin, Germany
                Author notes

                Edited by: Jinsung Yoon, Google, United States

                Reviewed by: Jason Poulos, Duke University, United States

                Qi Chen, Victoria University of Wellington, New Zealand

                *Correspondence: Sebastian Jäger, sebastian.jaeger@beuth-hochschule.de

                This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data

                Article
                Article number: 693674
                DOI: 10.3389/fdata.2021.693674
                PMCID: PMC8297389
                PMID: 34308343
                d314bfce-5ad8-4964-954c-1174ca0d84f0
                Copyright © 2021 Jäger, Allhorn and Bießmann.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                Received: 11 April 2021
                Accepted: 15 June 2021
                Categories
                Big Data
                Original Research

                Keywords: data quality, data cleaning, imputation, missing data, benchmark, MCAR, MNAR, MAR
