A Benchmark for Data Imputation Methods


Abstract

With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications as well, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when not detected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only the test data or both the training and test data are affected by missing values. Each imputation method is evaluated with regard to its imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers select data preprocessing methods for automated data quality improvement.
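To make the notion of "imputation quality" concrete, the following is a minimal sketch, not the authors' benchmark code. It assumes a synthetic two-column dataset, values removed completely at random (one possible missingness condition, chosen here only for illustration), and two toy imputers (column mean and a least-squares line fitted on the complete rows). All function names are hypothetical. Quality is measured as the root mean squared error (RMSE) on the cells that were removed.

```python
import math
import random


def make_data(n, seed=0):
    """Synthetic rows (x, y) with y linearly related to x plus noise (assumed data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.5)
        rows.append((x, y))
    return rows


def random_mask(n, fraction, seed=1):
    """True marks a y value to be treated as missing, chosen independently of the data."""
    rng = random.Random(seed)
    return [rng.random() < fraction for _ in range(n)]


def impute_mean(rows, mask):
    """Replace each missing y with the mean of the observed y values."""
    observed = [y for (_, y), m in zip(rows, mask) if not m]
    mean_y = sum(observed) / len(observed)
    return [mean_y if m else y for (_, y), m in zip(rows, mask)]


def impute_linear(rows, mask):
    """Replace each missing y with the prediction of a line fitted on complete rows."""
    obs = [(x, y) for (x, y), m in zip(rows, mask) if not m]
    mx = sum(x for x, _ in obs) / len(obs)
    my = sum(y for _, y in obs) / len(obs)
    sxx = sum((x - mx) ** 2 for x, _ in obs)
    sxy = sum((x - mx) * (y - my) for x, y in obs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * x if m else y for (x, y), m in zip(rows, mask)]


def rmse_on_missing(rows, mask, imputed):
    """RMSE between true and imputed y, computed only on the masked cells."""
    errs = [(y_hat - y) ** 2 for (_, y), m, y_hat in zip(rows, mask, imputed) if m]
    return math.sqrt(sum(errs) / len(errs))


rows = make_data(500)
mask = random_mask(len(rows), fraction=0.3)
rmse_mean = rmse_on_missing(rows, mask, impute_mean(rows, mask))
rmse_linear = rmse_on_missing(rows, mask, impute_linear(rows, mask))
print(f"mean imputation RMSE:   {rmse_mean:.3f}")
print(f"linear imputation RMSE: {rmse_linear:.3f}")
```

Because the true values are known before masking, the error can be computed directly on the removed cells. Measuring the impact on a downstream ML task would additionally require training a predictive model on the imputed data, which this sketch leaves out.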


Author and article information

Contributors

Sebastian Jäger: URI : https://loop.frontiersin.org/people/1264525/overview

Arndt Allhorn: URI : https://loop.frontiersin.org/people/1302895/overview

Felix Bießmann: URI : https://loop.frontiersin.org/people/1342430/overview

Journal

Journal ID (nlm-ta): Front Big Data

Journal ID (iso-abbrev): Front Big Data

Journal ID (publisher-id): Front. Big Data

Title: Frontiers in Big Data

Publisher: Frontiers Media S.A.

ISSN (Electronic): 2624-909X

Publication date (Electronic): 08 July 2021

Publication date Collection: 2021

Volume: 4

Electronic Location Identifier: 693674

Affiliations

Beuth University of Applied Sciences, Berlin, Germany

Author notes

Edited by: Jinsung Yoon, Google, United States

Reviewed by: Jason Poulos, Duke University, United States; Qi Chen, Victoria University of Wellington, New Zealand

*Correspondence: Sebastian Jäger, sebastian.jaeger@beuth-hochschule.de

This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data

Article

Publisher ID: 693674

DOI: 10.3389/fdata.2021.693674

PMC ID: 8297389

PubMed ID: 34308343

SO-VID: d314bfce-5ad8-4964-954c-1174ca0d84f0

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
