      Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings


          Abstract

          Purpose

          Foundation models are a novel class of artificial intelligence algorithms in which models are pretrained at scale on unannotated data and fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space.

          Design

          Evaluation of diagnostic test or technology.

          Participants

          ChatGPT is a publicly available LLM.

          Methods

          We tested 2 versions of ChatGPT (January 9 “legacy” and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey’s test to assess whether there were significant differences between the tested subspecialties.
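
          As a minimal illustrative sketch (not the authors' code), the analysis above maps onto a standard regression workflow in Python with statsmodels; the input file and its column names (correct as 0/1, section, cognitive_level, difficulty) are hypothetical assumptions:

          # Illustrative sketch only, not the authors' code; file and
          # column names are assumptions.
          import pandas as pd
          import statsmodels.formula.api as smf
          from statsmodels.stats.multicomp import pairwise_tukeyhsd

          df = pd.read_csv("chatgpt_okap_results.csv")  # hypothetical file

          # Logistic regression of answer accuracy on examination section,
          # cognitive level, and difficulty index.
          model = smf.logit(
              "correct ~ C(section) + C(cognitive_level) + difficulty", data=df
          ).fit()
          print(model.summary())

          # Post hoc pairwise comparisons between subspecialties (Tukey's test).
          tukey = pairwise_tukeyhsd(endog=df["correct"], groups=df["section"], alpha=0.05)
          print(tukey.summary())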

          Main Outcome Measures

          We reported the accuracy of ChatGPT for each examination section as the percentage of questions answered correctly, scoring ChatGPT’s outputs against the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05.
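
          For concreteness, here is a minimal sketch of these outcome computations under the same hypothetical table (assumed columns chatgpt_answer and answer_key); the LR chi-square for a term is twice the log-likelihood difference between the full model and the model with that term dropped:

          # Illustrative sketch only; file and column names are assumptions.
          import pandas as pd
          import statsmodels.formula.api as smf

          df = pd.read_csv("chatgpt_okap_results.csv")  # hypothetical file
          df["correct"] = (df["chatgpt_answer"] == df["answer_key"]).astype(int)

          # Percentage correct, overall and per examination section.
          print(f"Overall accuracy: {100 * df['correct'].mean():.1f}%")
          print((100 * df.groupby("section")["correct"].mean()).round(1))

          # LR chi-square for the 'section' term: twice the difference in
          # log-likelihood between the full and reduced logistic models.
          full = smf.logit("correct ~ C(section) + difficulty", data=df).fit(disp=0)
          reduced = smf.logit("correct ~ difficulty", data=df).fit(disp=0)
          print(f"LR chi-square (section): {2 * (full.llf - reduced.llf):.2f}")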

          Results

          The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT’s answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.

          Conclusion

          ChatGPT has encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties.

          Financial Disclosure(s)

          Proprietary or commercial disclosure may be found after the references.


                Author and article information

                Journal
                Ophthalmology Science (Ophthalmol Sci), Elsevier
                ISSN: 2666-9145
                Published online: 5 May 2023 (December 2023 issue)
                Volume 3, Issue 4, Article 100324
                Affiliations
                [1] Department of Ophthalmology, Université de Montréal, Montréal, Quebec, Canada
                [2] Centre Universitaire d’Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, CIUSSS de l’Est-de-l’Île-de-Montréal, Montréal, Quebec, Canada
                [3] Department of Ophthalmology, Centre Hospitalier de l'Université de Montréal (CHUM), Montréal, Quebec, Canada
                [4] The CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l'Université de Montréal (CHUM), Montréal, Quebec, Canada
                Author notes
                Correspondence: Renaud Duval, MD, Centre Universitaire d’Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, 5415 Boulevard de l'Assomption, Montréal, Québec, Canada, H1T 2M4. renaud.duval@gmail.com
                Article
                PII: S2666-9145(23)00056-8
                DOI: 10.1016/j.xops.2023.100324
                PMCID: PMC10272508
                PMID: 37334036
                © 2023 by the American Academy of Ophthalmology.

                This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

                History
                Received: 3 February 2023
                Revised: 21 April 2023
                Accepted: 25 April 2023
                Categories
                Original Article

                Keywords
                artificial intelligence, ChatGPT, generative pretrained transformer, medical education, ophthalmology
