
      Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery

      Research article


          Abstract

          Purpose

          ChatGPT is a large language model trained on a large dataset covering a broad range of topics, including the medical literature. We aim to examine its accuracy and reproducibility in answering patient questions regarding bariatric surgery.

          Materials and methods

          Questions were gathered from nationally regarded professional societies and health institutions, as well as Facebook support groups. Board-certified bariatric surgeons graded the accuracy and reproducibility of the responses. The grading scale included the following: (1) comprehensive, (2) correct but inadequate, (3) some correct and some incorrect, and (4) completely incorrect. Reproducibility was determined by asking the model each question twice and examining the difference in grading category between the two responses.
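
          As an illustration of this grading scheme, the sketch below tallies paired grades per question. It is a minimal, hypothetical example: the Grade scale mirrors the four categories above, and it assumes a question counts as reproducible when both responses fall into the same grading category, which is one reading of the criterion described here. The data structures and names are illustrative, not the study's actual pipeline.

from enum import IntEnum

class Grade(IntEnum):
    # Four-point accuracy scale described in the Methods
    COMPREHENSIVE = 1            # (1) comprehensive
    CORRECT_BUT_INADEQUATE = 2   # (2) correct but inadequate
    MIXED = 3                    # (3) some correct and some incorrect
    COMPLETELY_INCORRECT = 4     # (4) completely incorrect

def is_reproducible(first: Grade, second: Grade) -> bool:
    # Assumption: reproducible means both responses land in the
    # same grading category when the question is asked twice.
    return first == second

# Hypothetical paired grades (response 1, response 2); not study data.
paired_grades = [
    (Grade.COMPREHENSIVE, Grade.COMPREHENSIVE),
    (Grade.COMPREHENSIVE, Grade.CORRECT_BUT_INADEQUATE),
    (Grade.MIXED, Grade.MIXED),
]

n_reproducible = sum(is_reproducible(a, b) for a, b in paired_grades)
print(f"Reproducible: {n_reproducible}/{len(paired_grades)} "
      f"({100 * n_reproducible / len(paired_grades):.1f}%)")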

          Results

          In total, 151 questions related to bariatric surgery were included. The model provided “comprehensive” responses to 131 of 151 questions (86.8%). When examined by category, the model provided “comprehensive” responses to 93.8% of questions related to “efficacy, eligibility and procedure options”; 93.3% related to “preoperative preparation”; 85.3% related to “recovery, risks, and complications”; 88.2% related to “lifestyle changes”; and 66.7% related to “other”. The model provided reproducible answers to 137 of 151 questions (90.7%).
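
          To show how category-level figures like those above could be tabulated, here is a minimal sketch over hypothetical graded records. The category names follow this abstract, but the records and field layout are illustrative only, not the study dataset.

from collections import defaultdict

# Hypothetical (category, grade) records; grade 1 = "comprehensive".
graded = [
    ("efficacy, eligibility and procedure options", 1),
    ("preoperative preparation", 1),
    ("recovery, risks, and complications", 3),
    ("lifestyle changes", 1),
    ("other", 2),
]

totals = defaultdict(int)
comprehensive = defaultdict(int)
for category, grade in graded:
    totals[category] += 1
    if grade == 1:
        comprehensive[category] += 1

for category, total in totals.items():
    pct = 100 * comprehensive[category] / total
    print(f"{category}: {comprehensive[category]}/{total} ({pct:.1f}%)")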

          Conclusion

          The large language model ChatGPT often provided accurate and reproducible responses to common questions related to bariatric surgery. ChatGPT may serve as a helpful adjunct information resource for patients regarding bariatric surgery, in addition to the standard of care provided by licensed healthcare professionals. We encourage future studies to examine how to leverage this disruptive technology to improve patient outcomes and quality of life.

          Graphical Abstract

          Supplementary Information

          The online version contains supplementary material available at 10.1007/s11695-023-06603-5.


                Author and article information

                Contributors
                Corresponding author: jamil.samaan@gmail.com
                Journal
                Obesity Surgery (Obes Surg)
                Springer US (New York)
                ISSN: 0960-8923 (print); 1708-0428 (electronic)
                Published: 27 April 2023
                Volume 33, Issue 6, pages 1790–1796
                Affiliations
                [1] Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048, USA
                [2] Division of Upper GI and General Surgery, Department of Surgery, Health Care Consultation Center, Keck School of Medicine of USC, 1510 San Pablo St. #514, Los Angeles, CA 90033, USA
                [3] Bristol Medical School, University of Bristol, 5 Tyndall Ave, Bristol BS8 1UD, UK
                [4] Department of Surgery, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048, USA
                [5] Department of Psychiatry and Behavioral Sciences, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048, USA
                [6] Division of Health Services Research, Department of Medicine, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048, USA
                Author information
                ORCID: http://orcid.org/0000-0002-6191-2631
                Article
                DOI: 10.1007/s11695-023-06603-5
                PMCID: PMC10234918
                PMID: 37106269
                © The Author(s) 2023

                Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 21 February 2023
                Revised: 10 April 2023
                Accepted: 17 April 2023
                Funding
                Funded by: Cedars-Sinai Medical Center
                Categories
                Original Contributions
                Custom metadata
                © Springer Science+Business Media, LLC, part of Springer Nature 2023

                Surgery
                Keywords: artificial intelligence, ChatGPT, language learning models, bariatric surgery, weight loss, health literacy
