Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination

Abstract

Background

Chat Generative Pre-Trained Transformer (ChatGPT) is a state-of-the-art large language model that has been evaluated across various medical fields, with mixed performance on licensing examinations. This study aimed to assess the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions from the Taiwan Plastic Surgery Board Examination.

Methods

The study evaluated the performance of ChatGPT-3.5 and ChatGPT-4 on 1375 questions from the past 8 years of the Taiwan Plastic Surgery Board Examination, including 985 single-choice and 390 multiple-choice questions. We obtained the responses between June and July 2023, launching a new chat session for each question to eliminate memory retention bias.
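The per-question protocol described above can be sketched as a small scoring loop. This is only an illustrative sketch, not the authors' procedure: `Question`, `parse_choices`, `correct_rate`, and the `new_session` factory are hypothetical names, the option letters A to E are assumed, and all-or-nothing scoring of multiple-choice questions is an assumption the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable


@dataclass(frozen=True)
class Question:
    text: str
    correct: FrozenSet[str]  # one letter for single-choice, several for multiple-choice


def parse_choices(reply: str) -> FrozenSet[str]:
    """Hypothetical parser: treat every standalone A-E token as a chosen option."""
    tokens = reply.upper().replace(",", " ").split()
    return frozenset(t for t in tokens if t in {"A", "B", "C", "D", "E"})


def correct_rate(questions: Iterable[Question],
                 new_session: Callable[[], Callable[[str], str]]) -> float:
    """Ask each question in a fresh session; return the fraction answered correctly."""
    total = 0
    right = 0
    for q in questions:
        ask = new_session()           # fresh session: no memory of earlier questions
        chosen = parse_choices(ask(q.text))
        right += chosen == q.correct  # assumed all-or-nothing scoring
        total += 1
    return right / total if total else 0.0
```

Here `new_session` stands in for whatever opens a new chat; creating it inside the loop is what keeps one question's answer from influencing the next.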

Results

Overall, ChatGPT-4 outperformed ChatGPT-3.5, achieving a 59% correct answer rate compared to 41% for ChatGPT-3.5. ChatGPT-4 passed five out of eight yearly exams, whereas ChatGPT-3.5 failed all of them. On single-choice questions, ChatGPT-4 scored 66% correct, compared to 48% for ChatGPT-3.5. On multiple-choice questions, ChatGPT-4 achieved a 43% correct rate, nearly double the 23% of ChatGPT-3.5.
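As a rough consistency check, the overall rates can be recovered from the per-type rates by weighting each by its question count (985 single-choice and 390 multiple-choice, as given in the Methods). The sketch below assumes the reported percentages are rounded, so the pooled values can only be expected to match to within about a percentage point; the function name `pooled_rate` is illustrative.

```python
# Question counts from the Methods; per-type correct rates from the Results.
N_SINGLE, N_MULTIPLE = 985, 390

RATES = {
    "ChatGPT-3.5": {"single": 0.48, "multiple": 0.23},
    "ChatGPT-4": {"single": 0.66, "multiple": 0.43},
}


def pooled_rate(single: float, multiple: float) -> float:
    """Weight each per-type rate by its number of questions."""
    total = N_SINGLE + N_MULTIPLE
    return (single * N_SINGLE + multiple * N_MULTIPLE) / total


for model, r in RATES.items():
    print(f"{model}: {pooled_rate(r['single'], r['multiple']):.1%}")
```

The pooled values come to roughly 41% for ChatGPT-3.5 and just under 60% for ChatGPT-4, in line with the reported overall figures given that the inputs are already rounded.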

Conclusion

As ChatGPT evolves, its performance on the Taiwan Plastic Surgery Board Examination is expected to improve further. The study suggests potential reforms, such as incorporating more problem-based scenarios, leveraging ChatGPT to refine exam questions, and integrating AI-assisted learning into candidate preparation. These advancements could enhance the assessment of candidates' critical thinking and problem-solving abilities in the field of plastic surgery.

Related collections

SICOT-J (Orthopedic surgery and traumatology)

Most cited references 27

ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models

Namkee Oh, Gyu-Seong Choi, Woo Yong Lee (2023)

Purpose: This study aimed to assess the performance of ChatGPT, specifically the GPT-3.5 and GPT-4 models, in understanding complex surgical clinical information and its potential implications for surgical education and training. Methods: The dataset comprised 280 questions from the Korean general surgery board exams conducted between 2020 and 2022. Both GPT-3.5 and GPT-4 models were evaluated, and their performances were compared using McNemar test. Results: GPT-3.5 achieved an overall accuracy of 46.8%, while GPT-4 demonstrated a significant improvement with an overall accuracy of 76.4%, indicating a notable difference in performance between the models (P < 0.001). GPT-4 also exhibited consistent performance across all subspecialties, with accuracy rates ranging from 63.6% to 83.3%. Conclusion: ChatGPT, particularly GPT-4, demonstrates a remarkable ability to understand complex surgical clinical information, achieving an accuracy rate of 76.4% on the Korean general surgery board exam. However, it is important to recognize the limitations of large language models and ensure that they are used in conjunction with human expertise and judgment.

Performance of ChatGPT on the pharmacist licensing examination in Taiwan

Ying-Mei Wang, Hung-Wei Shen, Tzeng-Ji Chen (2023)

Background: ChatGPT is an artificial intelligence model trained for conversations. ChatGPT has been widely applied in general medical education and cardiology, but its application in pharmacy has been lacking. This study examined the accuracy of ChatGPT on the Taiwanese Pharmacist Licensing Examination and investigated its potential role in pharmacy education. Methods: ChatGPT was used on the first Taiwanese Pharmacist Licensing Examination in 2023 in Mandarin and English. The questions were entered manually one by one. Graphical questions, chemical formulae, and tables were excluded. Textual questions were scored according to the number of correct answers. Chart question scores were determined by multiplying the number and the correct rate of text questions. This study was conducted from March 5 to March 10, 2023, by using ChatGPT 3.5. Results: The correct rate of ChatGPT in Chinese and English questions was 54.4% and 56.9% in the first stage, and 53.8% and 67.6% in the second stage. On the Chinese test, only pharmacology and pharmacochemistry sections received passing scores. The English test scores were higher than the Chinese test scores across all subjects and were significantly higher in dispensing pharmacy and clinical pharmacy as well as therapeutics. Conclusion: ChatGPT 3.5 failed the Taiwanese Pharmacist Licensing Examination. Although it is not able to pass the examination, it can be improved quickly through deep learning. It reminds us that we should not only use multiple-choice questions to assess a pharmacist’s ability, but also use more variety of evaluations in the future. Pharmacy education should be changed in line with the examination, and students must be able to use AI technology for self-learning. More importantly, we need to help students develop humanistic qualities and strengthen their ability to interact with patients, so that they can become warm-hearted healthcare professionals.

Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study

Yasutaka Yanagita, Daiki Yokokawa, Shun Uchida … (2023)

Background: ChatGPT (OpenAI) has gained considerable attention because of its natural and intuitive responses. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers, as stated by OpenAI as a limitation. However, considering that ChatGPT is an interactive AI that has been trained to reduce the output of unethical sentences, the reliability of the training data is high and the usefulness of the output content is promising. Fortunately, in March 2023, a new version of ChatGPT, GPT-4, was released, which, according to internal evaluations, was expected to increase the likelihood of producing factual responses by 40% compared with its predecessor, GPT-3.5. The usefulness of this version of ChatGPT in English is widely appreciated. It is also increasingly being evaluated as a system for obtaining medical information in languages other than English. Although it does not reach a passing score on the national medical examination in Chinese, its accuracy is expected to gradually improve. Evaluation of ChatGPT with Japanese input is limited, although there have been reports on the accuracy of ChatGPT’s answers to clinical questions regarding the Japanese Society of Hypertension guidelines and on the performance of the National Nursing Examination. Objective: The objective of this study is to evaluate whether ChatGPT can provide accurate diagnoses and medical knowledge for Japanese input. Methods: Questions from the National Medical Licensing Examination (NMLE) in Japan, administered by the Japanese Ministry of Health, Labour and Welfare in 2022, were used. All 400 questions were included. Exclusion criteria were figures and tables that ChatGPT could not recognize; only text questions were extracted. We instructed GPT-3.5 and GPT-4 to input the Japanese questions as they were and to output the correct answers for each question. The output of ChatGPT was verified by 2 general practice physicians. In case of discrepancies, they were checked by another physician to make a final decision. The overall performance was evaluated by calculating the percentage of correct answers output by GPT-3.5 and GPT-4. Results: Of the 400 questions, 292 were analyzed. Questions containing charts, which are not supported by ChatGPT, were excluded. The correct response rate for GPT-4 was 81.5% (237/292), which was significantly higher than the rate for GPT-3.5, 42.8% (125/292). Moreover, GPT-4 surpassed the passing standard (>72%) for the NMLE, indicating its potential as a diagnostic and therapeutic decision aid for physicians. Conclusions: GPT-4 reached the passing standard for the NMLE in Japan, entered in Japanese, although it is limited to written questions. As the accelerated progress in the past few months has shown, the performance of the AI will improve as the large language model continues to learn more, and it may well become a decision support system for medical professionals by providing more accurate information.

Author and article information

Contributors

Ching-Hua Hsieh

Hsiao-Yun Hsieh

Hui-Ping Lin

Journal

Journal ID (nlm-ta): Heliyon

Journal ID (iso-abbrev): Heliyon

Title: Heliyon

Publisher: Elsevier

ISSN (Electronic): 2405-8440

Publication date PMC-release: 18 July 2024

Publication date Collection: 30 July 2024

Publication date (Electronic): 18 July 2024

Volume: 10

Issue: 14

Electronic Location Identifier: e34851

Affiliations

[1]Department of Plastic Surgery, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University and College of Medicine, Kaohsiung, 83301, Taiwan

Author notes

[*]Corresponding author. Department of Plastic Surgery, Kaohsiung Chang Gung Memorial Hospital and Chang Gung University College of Medicine, Taiwan, No.123, Ta-Pei Road, Niao-Song District, Kaohsiung City, 833, Taiwan. m93chinghua@gmail.com

Article

Publisher Item ID: S2405-8440(24)10882-1 Publisher ID: e34851

DOI: 10.1016/j.heliyon.2024.e34851

PMC ID: 11324965

PubMed ID: 39149010

SO-VID: 0da832b2-fde7-4f25-a962-73942777941f

License:

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

History

Date received : 23 May 2024

Date revision received : 27 June 2024

Date accepted : 17 July 2024
