      Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

Research article

          Abstract

          We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.
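As a rough illustration of the kind of evaluation the abstract describes, the sketch below scores a model's multiple-choice answers against an approximate passing threshold. All data and helper names are hypothetical; the study itself graded ChatGPT's free-text outputs with physician adjudicators rather than by string matching.

```python
# Minimal sketch of benchmarking a language model on multiple-choice exam
# items. Data and names are hypothetical; the study graded free-text answers
# with physician adjudicators, not exact matching as done here.

PASS_THRESHOLD = 0.60  # approximate USMLE passing accuracy cited by the authors

def score_model(items, ask_model):
    """Return the fraction of items answered correctly.

    items     -- iterable of (question_text, correct_choice) pairs
    ask_model -- callable mapping a question to the model's chosen letter
    """
    items = list(items)
    correct = sum(ask_model(question) == answer for question, answer in items)
    return correct / len(items)

# Toy usage with a stub standing in for a real model call:
sample = [("Q1 ...", "C"), ("Q2 ...", "B"), ("Q3 ...", "C")]
acc = score_model(sample, lambda question: "C")
print(f"accuracy = {acc:.2f}; at/above threshold: {acc >= PASS_THRESHOLD}")
```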

          Author summary

Artificial intelligence (AI) systems hold great promise to improve medical care and health outcomes. As such, it is crucial to ensure that the development of clinical AI is guided by the principles of trust and explainability. Measuring AI medical knowledge in comparison to that of expert human clinicians is a critical first step in evaluating these qualities. To accomplish this, we evaluated the performance of ChatGPT, a language-based AI, on the United States Medical Licensing Exam (USMLE). The USMLE is a set of three standardized tests of expert-level knowledge, which are required for medical licensure in the United States. We found that ChatGPT performed at or near the passing threshold of 60% accuracy. As the first AI to achieve this benchmark, ChatGPT marks a notable milestone in AI maturation. Impressively, ChatGPT achieved this result without specialized input from human trainers. Furthermore, ChatGPT displayed comprehensible reasoning and valid clinical insights, lending increased confidence to trust and explainability. Our study suggests that large language models such as ChatGPT may potentially assist human learners in a medical education setting, as a prelude to future integration into clinical decision-making.
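The reasoning and insight qualities mentioned above were judged per question in the study. A minimal sketch of aggregating such per-item adjudications follows; the label names, values, and rubric here are invented for illustration and are not the paper's own.

```python
# Sketch of tallying per-item adjudication labels. Label names and example
# values are invented; the paper defines its own concordance/insight rubric.

adjudications = [
    {"accurate": True,  "concordant": True,  "insightful": True},
    {"accurate": True,  "concordant": True,  "insightful": False},
    {"accurate": False, "concordant": True,  "insightful": True},
]

def rate(key):
    """Proportion of items flagged True for the given label."""
    return sum(item[key] for item in adjudications) / len(adjudications)

for key in ("accurate", "concordant", "insightful"):
    print(f"{key}: {rate(key):.0%}")
```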


                Author and article information

                Contributors
Roles: Conceptualization, Methodology, Supervision, Validation, Writing – original draft, Writing – review & editing
Roles: Conceptualization, Supervision, Writing – original draft, Writing – review & editing
Roles: Data curation, Methodology, Validation
Roles: Data curation, Methodology, Project administration
Roles: Data curation
Roles: Data curation
Roles: Data curation
Roles: Investigation
Roles: Data curation
Roles: Data curation, Formal analysis, Methodology, Software, Validation, Visualization
Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing
Role: Editor
Journal
PLOS Digital Health (PLOS Digit Health)
Publisher: Public Library of Science (San Francisco, CA, USA)
eISSN: 2767-3170
Published: 9 February 2023 (February 2023 issue)
Volume: 2
Issue: 2
Article: e0000198
Affiliations
[1] AnsibleHealth, Inc., Mountain View, California, United States of America
[2] Department of Anesthesiology, Massachusetts General Hospital, Harvard School of Medicine, Boston, Massachusetts, United States of America
[3] Warren Alpert Medical School, Brown University, Providence, Rhode Island, United States of America
[4] Department of Medical Education, UWorld, LLC, Dallas, Texas, United States of America
Beth Israel Deaconess Medical Center, United States
                Author notes

                The authors have declared that no competing interests exist.

Author information
ORCID: https://orcid.org/0000-0003-0211-512X
Article
Manuscript: PDIG-D-22-00371
DOI: 10.1371/journal.pdig.0000198
PMCID: PMC9931230
PMID: 36812645
© 2023 Kung et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History
Received: 19 December 2022
Accepted: 23 January 2023
                Page count
                Figures: 3, Tables: 0, Pages: 12
                Funding
                The authors received no specific funding for this work.
Categories
Research Article
Artificial Intelligence
Human Learning
Language
Language Acquisition
Reasoning
Physicians
Medical Education
Programming Languages
Data availability
The data analyzed in this study were obtained from publicly available USMLE sample question sets. The question indices, raw inputs, raw AI outputs, and special annotations are available in S1 Data.
