
      Rethink reporting of evaluation results in AI


          Abstract

          Aggregate metrics and lack of access to results limit understanding


          Artificial intelligence (AI) systems have begun to be deployed in high-stakes contexts, including autonomous driving and medical diagnosis. In contexts such as these, the consequences of system failures can be devastating. It is therefore vital that researchers and policy-makers have a full understanding of the capabilities and weaknesses of AI systems so that they can make informed decisions about where these systems are safe to use and how they might be improved. Unfortunately, current approaches to AI evaluation make it exceedingly difficult to build such an understanding, for two key reasons. First, aggregate metrics make it hard to predict how a system will perform in a particular situation. Second, the instance-by-instance evaluation results that could be used to unpack these aggregate metrics are rarely made available (1). Here, we propose a path forward in which results are presented in more nuanced ways and instance-by-instance evaluation results are made publicly available.
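          To make the contrast concrete, here is a minimal sketch using hypothetical per-instance records (the condition labels and numbers are invented for illustration): the same instance-level data support both the headline aggregate metric and the disaggregated view that reveals where a system actually fails.

              # Minimal sketch with hypothetical data: an aggregate accuracy of 0.75
              # looks acceptable, but the instance-by-instance results show the
              # system fails in every night-time case.
              import pandas as pd

              results = pd.DataFrame({
                  "instance_id": range(8),
                  "condition":   ["daylight"] * 6 + ["night"] * 2,  # situation metadata
                  "correct":     [1, 1, 1, 1, 1, 1, 0, 0],          # per-instance outcome
              })

              print("Aggregate accuracy:", results["correct"].mean())  # 0.75
              print(results.groupby("condition")["correct"].mean())
              # daylight    1.0
              # night       0.0  -> a failure mode invisible in the aggregate score

          Publishing tables like this alongside the headline metric would let readers compute any such breakdown for the situations they care about.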


          Most cited references (11)


          International evaluation of an AI system for breast cancer screening


            Unmasking Clever Hans predictors and assessing what machines really learn

            Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly intelligent behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted, to well-informed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem-solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem that it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner.
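            As a rough illustration of the general idea (a hypothetical sketch, not the authors' SpRAy implementation), per-instance attribution maps produced by any explanation method can be clustered so that groups of instances sharing a decision strategy, including spurious "Clever Hans" ones, become visible for inspection.

                # Hypothetical sketch: cluster per-instance attribution maps to surface
                # groups of similar decision strategies for manual inspection.
                import numpy as np
                from sklearn.cluster import SpectralClustering

                def cluster_strategies(attributions, n_clusters=4, side=8):
                    """attributions: array of shape (n_instances, H, W), with H and W
                    assumed divisible by `side`."""
                    n, h, w = attributions.shape
                    # Average-pool each map to a coarse side x side grid so clustering
                    # reflects the overall spatial strategy rather than pixel-level noise.
                    coarse = attributions.reshape(n, side, h // side, side, w // side).mean(axis=(2, 4))
                    labels = SpectralClustering(n_clusters=n_clusters,
                                                affinity="nearest_neighbors",
                                                random_state=0).fit_predict(coarse.reshape(n, -1))
                    return labels  # small or unusual clusters are candidates for spurious cues

            Inspecting a handful of instances from each cluster is what turns the grouping into an assessment of what the model has actually learned.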

              Model Cards for Model Reporting


                Author and article information

                Journal
                Science
                American Association for the Advancement of Science (AAAS)
                ISSN: 0036-8075 (print); 1095-9203 (online)
                April 14, 2023
                Volume 380, Issue 6641, Pages 136-138
                Affiliations
                [1 ]Leverhulme Centre for the Future of Intelligence, University of Cambridge, Cambridge, UK.
                [2 ]Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València, València, Spain.
                [3 ]Centre for the Study of Existential Risk, University of Cambridge, Cambridge, UK.
                [4 ]Department of Psychology, Harvard University, Cambridge, MA, USA.
                [5 ]Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
                [6 ]Department of Psychology, University of Cambridge, Cambridge, UK.
                [7 ]Brain Team, Google, Mountain View, CA, USA.
                [8 ]Santa Fe Institute, Santa Fe, NM, USA.
                [9 ]Stanford University, Stanford, CA, USA.
                [10 ]DeepMind, London, UK.
                [11 ]Department of Computing, Imperial College London, London, UK.
                [12 ]National Institute of Standards and Technology (Retired), Gaithersburg, MD, USA.
                [13 ]School of Computing, University of Leeds, Leeds, UK.
                [14 ]Alan Turing Institute, London, UK.
                [15 ]Tongji University, Shanghai, China.
                [16 ]Shandong University, Jinan, China.
                Article
                DOI: 10.1126/science.adf6369
                PMID: 37053341
                © 2023
