      Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation


          Abstract

          Background

Large language models (LLMs) have made great progress in natural language processing tasks and have demonstrated potential for clinical applications. Despite their capabilities, LLMs in the medical domain are prone to generating hallucinations (not fully reliable responses). Hallucinations in LLMs’ responses create substantial risks, potentially threatening patients’ physical safety. Thus, to detect and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and to build a systematic evaluation.

          Objective

          We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks.

          Methods

First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, the candidate criteria were optimized by using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets to interact with LLMs. Finally, benchmarking experiments were conducted on the data sets. The responses generated by LLM-based chatbots were recorded for blind evaluation by 5 licensed medical experts. The resulting evaluation criteria covered medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory.
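The procedure above (blinded expert scoring of anonymized chatbot responses against 16 indicators grouped into 4 capability dimensions, then aggregation per chatbot) can be pictured in code. The following Python snippet is a minimal sketch, not the authors' implementation: the indicator names, their grouping, the 1-5 rating scale, and the mean-based aggregation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical grouping of the 16 indicators into the 4 capability dimensions
# named in the abstract; the indicator names and their assignment to dimensions
# are assumptions for illustration only.
DIMENSIONS = {
    "medical_professional": ["accuracy", "informativeness", "harmlessness", "logic"],
    "social_comprehensive": ["comprehension", "tone", "empathy", "ethics"],
    "contextual": ["relevance", "coherence", "memory", "follow_up"],
    "computational_robustness": ["consistency", "error_handling", "stability", "latency"],
}


def aggregate_blind_scores(ratings):
    """Aggregate blinded expert ratings into per-dimension mean scores.

    `ratings` maps chatbot name -> list of rating sheets, one sheet per
    (evaluator, case) pair; each sheet maps indicator -> score (assumed 1-5).
    Evaluators rate anonymized responses, so the chatbot label is attached
    only after scoring.
    """
    summary = {}
    for bot, sheets in ratings.items():
        per_indicator = defaultdict(list)
        for sheet in sheets:
            for indicator, score in sheet.items():
                per_indicator[indicator].append(score)
        summary[bot] = {}
        for dim, indicators in DIMENSIONS.items():
            dim_scores = [mean(per_indicator[i]) for i in indicators if per_indicator[i]]
            summary[bot][dim] = mean(dim_scores) if dim_scores else None
    return summary


if __name__ == "__main__":
    # Toy data: one rating sheet per chatbot stands in for the 5 experts and
    # 34 cases (27 dialogues + 7 case reports) used in the study.
    toy = {
        "chatbot_a": [{"accuracy": 4, "empathy": 3, "relevance": 4, "consistency": 5}],
        "chatbot_b": [{"accuracy": 5, "empathy": 4, "relevance": 4, "consistency": 4}],
    }
    for bot, dims in aggregate_blind_scores(toy).items():
        print(bot, dims)
```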

          Results

Dr PJ outperformed ChatGPT and ERNIE Bot in both the multiple-turn medical dialogue and case report scenarios. Dr PJ also outperformed ChatGPT in semantic consistency rate and complete error rate, indicating better robustness. However, Dr PJ had slightly lower scores in medical professional capabilities than ChatGPT in the multiple-turn dialogue scenario.
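The two robustness metrics named above can be read as simple proportions over per-response judgments. The sketch below assumes hypothetical definitions (semantic consistency rate as the share of responses judged semantically consistent, complete error rate as the share judged entirely wrong); the paper's exact formulas are not given in this abstract and may differ.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One expert judgment of a single chatbot response (hypothetical fields)."""
    semantically_consistent: bool  # response is semantically consistent with the question/context
    completely_wrong: bool         # response judged to be entirely incorrect


def robustness_rates(judgments):
    """Return (semantic consistency rate, complete error rate) as fractions of all judgments."""
    n = len(judgments)
    if n == 0:
        return 0.0, 0.0
    consistency_rate = sum(j.semantically_consistent for j in judgments) / n
    complete_error_rate = sum(j.completely_wrong for j in judgments) / n
    return consistency_rate, complete_error_rate


# Toy usage: a higher consistency rate and a lower complete error rate indicate better robustness.
sample = [Judgment(True, False), Judgment(True, False), Judgment(False, True)]
print(robustness_rates(sample))  # (0.666..., 0.333...)
```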

          Conclusions

MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in both social and professional contexts. Such an assessment system can therefore be easily adopted by researchers in this community to augment the open-source data set.


                Author and article information

Journal: JMIR Medical Informatics (JMIR Med Inform)
ISSN: 2291-9694
Published: 28 June 2024
Volume 12: e57674
                Affiliations
[1] Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China
[2] Clinical Research and Innovation Unit, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
[3] West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
[4] Med-X Center for Informatics, Sichuan University, Chengdu, China
                Author notes
Shaoting Zhang, PhD, Shanghai Artificial Intelligence Laboratory, OpenMedLab, West Bank International Artificial Intelligence Center, 701 Yunjin Road, Shanghai, 200032, China; Phone: 86 021-23537800; Email: zhangshaoting@pjlab.org.cn

Conflicts of interest: None declared.

                Author information
                http://orcid.org/0000-0001-9233-4363
                http://orcid.org/0000-0002-8834-1947
                http://orcid.org/0000-0001-9868-0136
                http://orcid.org/0000-0001-5757-4804
                http://orcid.org/0009-0005-0399-7656
                http://orcid.org/0009-0005-4902-1034
                http://orcid.org/0000-0003-3845-8079
                http://orcid.org/0000-0002-8136-9816
                http://orcid.org/0009-0003-7223-5298
                http://orcid.org/0000-0002-8719-448X
Article
Article ID: 57674
DOI: 10.2196/57674
PMCID: PMC11225096
PMID: 38952020
                Copyright © Jie Xu, Lu Lu, Xinwei Peng, Jinru Ding, Jiali Pang, Lingrui Yang, Huan Song, Kang Li, Xin Sun, Shaoting Zhang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org)

                This is an open-access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

History
25 February 2024; 3 April 2024; 4 May 2024
                Categories
                Original Paper
                Natural Language Processing
                Generative Language Models Including ChatGPT
                AI Language Models in Health Care
                Machine Learning
                Chatbots and Conversational Agents
                Artificial Intelligence
                Formative Evaluation of Digital Health Interventions

Keywords: ChatGPT, LLM, assessment, data set, benchmark, medicine
