
      Leakage and the reproducibility crisis in machine-learning-based science

      Research article
      Sayash Kapoor, Arvind Narayanan
      Patterns (Elsevier)
      Keywords: reproducibility, machine learning, leakage


          Summary

          Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models.
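          One of the textbook errors in the taxonomy is pre-processing or feature selection performed on the full dataset before the train/test split. The sketch below is a synthetic illustration, not code from the paper (the dataset, parameters, and nearest-centroid classifier are all invented for this example): selecting the "best" features against all rows, test rows included, produces apparently strong test accuracy even though the labels are pure noise, while the clean pipeline that selects features on the training split alone stays near chance.

          ```python
          import numpy as np

          def pipeline(seed, leak):
              """Train/evaluate a nearest-centroid classifier on pure-noise data.

              leak=True reproduces the textbook error: the top-k features are
              chosen by correlation with the labels on ALL rows, including test.
              """
              rng = np.random.default_rng(seed)
              n, p, k = 200, 2000, 20
              X = rng.normal(size=(n, p))          # random features
              y = rng.integers(0, 2, size=n)       # random labels: no true signal
              idx = rng.permutation(n)
              tr, te = idx[:100], idx[100:]

              pool = np.arange(n) if leak else tr  # rows used for feature selection
              yc = y[pool] - y[pool].mean()
              corr = yc @ X[pool]                  # unnormalized label correlation
              top = np.argsort(-np.abs(corr))[:k]  # keep the k "best" features

              mu1 = X[tr][y[tr] == 1][:, top].mean(axis=0)
              mu0 = X[tr][y[tr] == 0][:, top].mean(axis=0)
              d1 = ((X[te][:, top] - mu1) ** 2).sum(axis=1)
              d0 = ((X[te][:, top] - mu0) ** 2).sum(axis=1)
              return ((d1 < d0).astype(int) == y[te]).mean()

          leaky = np.mean([pipeline(s, leak=True) for s in range(5)])
          clean = np.mean([pipeline(s, leak=False) for s in range(5)])
          # leaky lands well above chance; clean stays near 0.5
          ```

          Averaged over five seeds, the leaky pipeline scores well above chance on labels that carry no signal at all, which is exactly the kind of overoptimism the survey documents.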

          Highlights

          • Data leakage is a flaw in machine learning that leads to overoptimistic results

          • Our survey of prior reviews shows leakage affects 294 papers across 17 scientific fields

          • We provide a taxonomy of leakage and introduce model info sheets to mitigate it

          • We show how leakage can lead to overoptimism with a case study on civil war prediction

          The bigger picture

          Machine learning (ML) is widely used across dozens of scientific fields. However, a common issue called “data leakage” can lead to errors in data analysis. We surveyed a variety of research that uses ML and found that data leakage affects at least 294 studies across 17 fields, leading to overoptimistic findings. We classified these errors into eight different types. We propose a solution: model info sheets that can be used to identify and prevent each of these eight types of leakage. We also tested the reproducibility of ML in a specific field: predicting civil wars, where complex ML models were thought to outperform traditional statistical models. Interestingly, when we corrected for data leakage, the supposed superiority of ML models disappeared: they did not perform any better than older methods. Our work serves as a cautionary note against taking results in ML-based science at face value.
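          Another leakage type the taxonomy covers is non-independence between training and test samples, for example when repeated observations of the same unit (such as country-years in civil war prediction) are split at random across train and test. A minimal synthetic sketch, again not from the paper (the grouped dataset and 1-nearest-neighbor classifier are invented for illustration):

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          n_groups, per_group, dim = 40, 5, 5
          centers = rng.normal(size=(n_groups, dim))   # one "unit" per group
          glabels = rng.integers(0, 2, size=n_groups)  # label is a group property
          X = np.repeat(centers, per_group, axis=0)
          X += 0.05 * rng.normal(size=X.shape)         # near-duplicate rows per group
          y = np.repeat(glabels, per_group)
          groups = np.repeat(np.arange(n_groups), per_group)

          def acc_1nn(tr, te):
              """1-nearest-neighbor accuracy of test rows te against train rows tr."""
              d = np.linalg.norm(X[te][:, None, :] - X[tr][None, :, :], axis=2)
              return (y[tr][d.argmin(axis=1)] == y[te]).mean()

          # Leaky: random row-wise split, so near-duplicates of test rows sit in train
          idx = rng.permutation(len(y))
          leaky = acc_1nn(idx[:150], idx[150:])

          # Clean: hold out whole groups, so test units were never seen in training
          te_rows = np.where(groups >= 28)[0]
          tr_rows = np.where(groups < 28)[0]
          clean = acc_1nn(tr_rows, te_rows)
          ```

          With a random row-wise split, the classifier simply matches each test row to a near-duplicate of the same unit in the training set, so accuracy looks near-perfect; holding out whole groups drops it back toward chance, since the group labels are random.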

          Abstract

          Kapoor and Narayanan show that leakage is a widespread failure mode in machine-learning (ML)-based science. Based on a survey of past reviews, they find that it affects at least 294 papers across 17 disciplines. They provide a taxonomy of eight types of leakage and propose model info sheets to mitigate it. They show that leakage can lead to severe overoptimism through a case study of civil war prediction. Several papers claimed that ML models drastically outperform older regression models. This is no longer the case when leakage is fixed.


                Author and article information

                Contributors
                Sayash Kapoor [1],[2],[∗]; Arvind Narayanan [1]
                Journal
                Patterns (N Y), Elsevier; ISSN 2666-3899
                Published online 04 August 2023; in issue 08 September 2023
                Volume 4, Issue 9, Article 100804
                Affiliations
                [1] Department of Computer Science and Center for Information Technology Policy, Princeton University, Princeton, NJ 08540, USA
                Author notes
                [∗] Corresponding author: sayashk@princeton.edu
                [2] Lead contact

                Article
                PII: S2666-3899(23)00159-9
                DOI: 10.1016/j.patter.2023.100804
                PMCID: PMC10499856
                PMID: 37720327
                © 2023 The Author(s)

                This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

                History
                Received: 3 March 2023; Revised: 18 May 2023; Accepted: 5 July 2023
                Categories
                Article

