Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings

Author(s): Fares Antaki, MD, CM ¹,²,³,⁴; Samir Touma, MD, CM ¹,²,³; Daniel Milad, MD ¹,²,³; Jonathan El-Khoury, MD ¹,²,³; Renaud Duval, MD ¹,²,∗

Publication date (Electronic): 05 May 2023

Journal: Ophthalmology Science

Publisher: Elsevier

Keywords: Artificial intelligence, ChatGPT, Generative Pretrained Transformer, Medical education, Ophthalmology

Abstract

Purpose

Foundation models are a novel class of artificial intelligence algorithms in which models are pretrained at scale on unannotated data and then fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space.

Design

Evaluation of diagnostic test or technology.

Participants

ChatGPT is a publicly available LLM.

Methods

We tested 2 versions of ChatGPT (January 9 “legacy” and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams, one from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and one from the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey’s test to determine whether there were meaningful differences between the tested subspecialties.
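As an illustration of the likelihood ratio chi-square reported for the logistic regression, the following is a minimal sketch and not the authors' analysis code. It assumes a deliberately simplified model with a single categorical predictor (examination section), for which the fitted probabilities are simply the per-section proportions correct; the study's model also included cognitive level and difficulty index. The function names and the counts are hypothetical.

```python
import math


def log_likelihood(n_correct, n_total, p):
    """Bernoulli log-likelihood of n_correct successes in n_total trials."""
    ll = 0.0
    if n_correct:
        ll += n_correct * math.log(p)
    if n_total - n_correct:
        ll += (n_total - n_correct) * math.log(1.0 - p)
    return ll


def lr_chi_square(counts):
    """Likelihood ratio statistic comparing a section-only logistic model
    with an intercept-only model.

    counts maps section name -> (n_correct, n_total).
    Returns (LR statistic, degrees of freedom).
    """
    all_correct = sum(c for c, _ in counts.values())
    all_total = sum(n for _, n in counts.values())
    # Null model: one shared probability of a correct answer.
    ll_null = log_likelihood(all_correct, all_total, all_correct / all_total)
    # Full model: each section has its own probability (its observed proportion).
    ll_full = sum(log_likelihood(c, n, c / n) for c, n in counts.values())
    return 2.0 * (ll_full - ll_null), len(counts) - 1


# Hypothetical counts, for illustration only.
toy_counts = {"Section A": (8, 10), "Section B": (4, 10)}
lr_stat, dof = lr_chi_square(toy_counts)
print(f"LR = {lr_stat:.2f} on {dof} degree(s) of freedom")
```

A larger statistic means that knowing the section improves the fit more over the intercept-only model. Converting it to a P value requires the chi-square distribution with the returned degrees of freedom, which a statistics package would normally supply.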

Main Outcome Measures

We reported the accuracy of ChatGPT for each examination section as the percentage of correct answers, obtained by comparing ChatGPT’s outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at P < 0.05.
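The per-section accuracy described above can be sketched as follows. This is a minimal illustration under assumed data structures (dictionaries keyed by a question identifier), not the authors' scoring procedure; all names and values are hypothetical.

```python
from collections import defaultdict


def accuracy_by_section(responses, answer_key, sections):
    """Return {section: percentage correct}.

    responses:  question id -> answer given by the model
    answer_key: question id -> correct answer from the question bank
    sections:   question id -> examination section
    A question with no recorded response counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for question_id, key in answer_key.items():
        section = sections[question_id]
        total[section] += 1
        if responses.get(question_id) == key:
            correct[section] += 1
    return {s: 100.0 * correct[s] / total[s] for s in total}


# Hypothetical toy data, for illustration only.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
sections = {"q1": "Cornea", "q2": "Cornea", "q3": "Retina", "q4": "Retina"}
responses = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}

result = accuracy_by_section(responses, answer_key, sections)
print(result)
```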

Results

The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT’s answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.

Conclusion

ChatGPT has encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties.

Financial Disclosure(s)

Proprietary or commercial disclosure may be found after the references.

Author and article information

Contributors

Renaud Duval

Journal

Journal ID (nlm-ta): Ophthalmol Sci

Journal ID (iso-abbrev): Ophthalmol Sci

Title: Ophthalmology Science

Publisher: Elsevier

ISSN (Electronic): 2666-9145

Publication date PMC-release: 05 May 2023

Publication date Collection: December 2023

Publication date (Electronic): 05 May 2023

Volume: 3

Issue: 4

Electronic Location Identifier: 100324

Affiliations

[1] Department of Ophthalmology, Université de Montréal, Montréal, Quebec, Canada

[2] Centre Universitaire d’Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, CIUSSS de l’Est-de-l’Île-de-Montréal, Montréal, Quebec, Canada

[3] Department of Ophthalmology, Centre Hospitalier de l'Université de Montréal (CHUM), Montréal, Quebec, Canada

[4] The CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l'Université de Montréal (CHUM), Montréal, Quebec, Canada

Author notes

[∗ ]Correspondence: Renaud Duval, MD, Centre Universitaire d’Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, 5415 Boulevard de l'Assomption, Montréal, Québec, Canada, H1T 2M4. renaud.duval@ 123456gmail.com

Article

Publisher Item ID: S2666-9145(23)00056-8 Publisher ID: 100324

DOI: 10.1016/j.xops.2023.100324

PMC ID: 10272508

PubMed ID: 37334036

SO-VID: 0c5d20cc-63b9-43dd-90c0-65437f1ac6b7

License:

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

History

Date received: 3 February 2023

Date revision received: 21 April 2023

Date accepted: 25 April 2023
