Abstract
The release of ChatGPT has prompted new thinking about AI-based chatbots and their applications, and has drawn considerable public attention worldwide. Over the past few months, researchers and clinicians have begun to consider the promise and application of AI-based large language models in medicine. This comprehensive review provides an overview of chatbots and ChatGPT and their current role in medicine. First, the general concept of chatbots, their evolution, architecture, and medical uses are discussed. Second, ChatGPT is examined with particular emphasis on its application in medicine, including its architecture and training methods, its use in medical diagnosis and treatment, research and ethical issues, and a comparison with other NLP models. The article also discusses the limitations and prospects of ChatGPT. Large language models such as ChatGPT hold immense promise for healthcare, but more research is needed in this direction.
Background
Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.
Objective
This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination Step 1 and Step 2 exams, as well as to analyze responses for user interpretability.
Methods
We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and performance relative to its user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.
Results
Across the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. Logical justification for ChatGPT's answer selection was present in 100% of outputs for the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively.
Conclusions
ChatGPT marks a significant improvement in natural language processing models on the task of medical question answering. By performing above a 60% threshold on the NBME-Free-Step1 data set, the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context for the majority of its answers. Taken together, these findings make a compelling case for the potential application of ChatGPT as an interactive medical education tool to support learning.
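To make the reported evaluation concrete, the following minimal Python sketch reproduces the accuracy bookkeeping described above: per-data-set accuracy as correct answers over questions attempted, checked against the approximate 60% passing threshold cited for NBME-Free-Step1. The DatasetResult class and the printed comparison are illustrative assumptions for exposition, not the authors' actual evaluation code; only the counts come from the abstract.

```python
# Hypothetical sketch of the accuracy calculation described in the abstract.
# Dataset names and counts are taken from the reported results; the helper
# class and pass-threshold check are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class DatasetResult:
    name: str
    correct: int  # questions the model answered correctly
    total: int    # questions attempted

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


# ChatGPT's reported performance on the four question sets.
results = [
    DatasetResult("AMBOSS-Step1", 44, 100),
    DatasetResult("AMBOSS-Step2", 42, 100),
    DatasetResult("NBME-Free-Step1", 56, 87),
    DatasetResult("NBME-Free-Step2", 59, 102),
]

PASS_THRESHOLD = 0.60  # approximate passing level cited for NBME-Free-Step1

for r in results:
    status = "above" if r.accuracy > PASS_THRESHOLD else "below"
    print(f"{r.name}: {r.accuracy:.1%} ({r.correct}/{r.total}), {status} the 60% threshold")
```

Running this reproduces the percentages quoted in the Results (44%, 42%, 64.4%, and 57.8%) and flags only NBME-Free-Step1 as exceeding the 60% passing level.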
1. Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal, India
2. School of Mechanical Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
3. Department of Zoology, Fakir Mohan University, Balasore, Odisha, India
4. Institute for Skeletal Aging and Orthopedic Surgery, Hallym University Chuncheon Sacred Heart Hospital, Chuncheon-si, Gangwon-do, Republic of Korea
Author notes
Edited by: Thomas Hartung, Johns Hopkins University, United States
Reviewed by: Hosna Salmani, Iran University of Medical Sciences, Iran; Alvise Sernicola,
University of Padua, Italy
This is an open-access article distributed under the terms of the Creative Commons
Attribution License (CC BY). The use, distribution or reproduction in other forums
is permitted, provided the original author(s) and the copyright owner(s) are credited
and that the original publication in this journal is cited, in accordance with accepted
academic practice. No use, distribution or reproduction is permitted which does not
comply with these terms.
This study was supported by the Hallym University Research Fund and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2020R1I1A3074575).