Rethink reporting of evaluation results in AI

Abstract

Aggregate metrics and lack of access to results limit understanding

Artificial intelligence (AI) systems have begun to be deployed in high-stakes contexts, including autonomous driving and medical diagnosis. In contexts such as these, the consequences of system failures can be devastating. It is therefore vital that researchers and policy-makers have a full understanding of the capabilities and weaknesses of AI systems so that they can make informed decisions about where these systems are safe to use and how they might be improved. Unfortunately, current approaches to AI evaluation make it exceedingly difficult to build such an understanding, for two key reasons. First, aggregate metrics make it hard to predict how a system will perform in a particular situation. Second, the instance-by-instance evaluation results that could be used to unpack these aggregate metrics are rarely made available (1). Here, we propose a path forward in which results are presented in more nuanced ways and instance-by-instance evaluation results are made publicly available.
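The first point in the abstract, that an aggregate metric can hide how a system behaves in a particular situation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the instance identifiers, the `daylight` and `night` conditions, and the outcomes are invented and do not come from the article or from any real evaluation.

```python
from collections import defaultdict

# Hypothetical instance-level results: (instance_id, condition, correct).
# All values are invented for illustration only.
results = [
    ("img-01", "daylight", True),
    ("img-02", "daylight", True),
    ("img-03", "daylight", True),
    ("img-04", "daylight", True),
    ("img-05", "daylight", True),
    ("img-06", "daylight", True),
    ("img-07", "daylight", True),
    ("img-08", "daylight", True),
    ("img-09", "night", False),
    ("img-10", "night", False),
]


def aggregate_accuracy(rows):
    """Fraction of all instances answered correctly (a single aggregate number)."""
    return sum(correct for _, _, correct in rows) / len(rows)


def accuracy_by_condition(rows):
    """Accuracy computed separately for each condition label."""
    groups = defaultdict(list)
    for _, condition, correct in rows:
        groups[condition].append(correct)
    return {cond: sum(vals) / len(vals) for cond, vals in groups.items()}


print(aggregate_accuracy(results))
print(accuracy_by_condition(results))
```

In this toy data the single aggregate figure looks respectable, while the per-condition breakdown shows that every failure falls in one condition. Such a breakdown is only possible when the instance-level rows are available, which is the abstract's second point.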

Most cited references (11)

International evaluation of an AI system for breast cancer screening

Scott Mayer McKinney, Marcin Sieniek, Varun Godbole … (2020)

Unmasking Clever Hans predictors and assessing what machines really learn

Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder … (2019)

Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly intelligent behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted, to well-informed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem that it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner.

Model Cards for Model Reporting

Margaret Mitchell, Simone Wu, Andrew Zaldivar … (2019)

Author and article information

Journal

Title: Science

Abbreviated Title: Science

Publisher: American Association for the Advancement of Science (AAAS)

ISSN (Print): 0036-8075

ISSN (Electronic): 1095-9203

Publication date Created: April 14 2023

Publication date (Print): April 14 2023

Volume: 380

Issue: 6641

Pages: 136-138

Affiliations

[1] Leverhulme Centre for the Future of Intelligence, University of Cambridge, Cambridge, UK.

[2] Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València, València, Spain.

[3] Centre for the Study of Existential Risk, University of Cambridge, Cambridge, UK.

[4] Department of Psychology, Harvard University, Cambridge, MA, USA.

[5] Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.

[6] Department of Psychology, University of Cambridge, Cambridge, UK.

[7] Brain team, Google, Mountain View, CA, USA.

[8] Santa Fe Institute, Santa Fe, NM, USA.

[9] Stanford University, Stanford, CA, USA.

[10] DeepMind, London, UK.

[11] Department of Computing, Imperial College London, London, UK.

[12] National Institute of Standards and Technology (Retired), Gaithersburg, MD, USA.

[13] School of Computing, University of Leeds, Leeds, UK.

[14] Alan Turing Institute, London, UK.

[15] Tongji University, Shanghai, China.

[16] Shandong University, Jinan, China.

Article

DOI: 10.1126/science.adf6369

PubMed ID: 37053341

SO-VID: 7cffa492-8f7e-4de1-82e9-9671e5a9cb48
