Inviting an author to review:
Find an author and click ‘Invite to review selected article’ near their name.
Search for authorsSearch for similar articles
1
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      The Two Word Test as a semantic benchmark for large language models

      research-article
      1 , 2 , 2 ,
      Scientific Reports
      Nature Publishing Group UK
      Psychology, Computer science

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Large language models (LLMs) have shown remarkable abilities recently, including passing advanced professional exams and demanding benchmark tests. This performance has led many to suggest that they are close to achieving humanlike or “true” understanding of language, and even artificial general intelligence (AGI). Here, we provide a new open-source benchmark, the Two Word Test (TWT), that can assess semantic abilities of LLMs using two-word phrases in a task that can be performed relatively easily by humans without advanced training. Combining multiple words into a single concept is a fundamental linguistic and conceptual operation routinely performed by people. The test requires meaningfulness judgments of 1768 noun-noun combinations that have been rated as meaningful (e.g., baby boy) or as having low meaningfulness (e.g., goat sky) by human raters. This novel test differs from existing benchmarks that rely on logical reasoning, inference, puzzle-solving, or domain expertise. We provide versions of the task that probe meaningfulness ratings on a 0–4 scale as well as binary judgments. With both versions, we conducted a series of experiments using the TWT on GPT-4, GPT-3.5, Claude-3-Optus, and Gemini-1-Pro-001. Results demonstrated that, compared to humans, all models performed relatively poorly at rating meaningfulness of these phrases. GPT-3.5-turbo, Gemini-1.0-Pro-001 and GPT-4-turbo were also unable to make binary discriminations between sensible and nonsense phrases, with these models consistently judging nonsensical phrases as making sense. Claude-3-Opus made a substantial improvement in binary discrimination of combinatorial phrases but was still significantly worse than human performance. The TWT can be used to understand and assess the limitations of current LLMs, and potentially improve them. The test also reminds us that caution is warranted in attributing “true” or human-level understanding to LLMs based only on tests that are challenging for humans.

          Related collections

          Most cited references11

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

          We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.
            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            Comparing an Individual's Test Score Against Norms Derived from Small Samples

              Bookmark
              • Record: found
              • Abstract: not found
              • Article: not found

              ChatGPT Goes to Law School

                Bookmark

                Author and article information

                Contributors
                rutvik@sc.edu
                Journal
                Sci Rep
                Sci Rep
                Scientific Reports
                Nature Publishing Group UK (London )
                2045-2322
                16 September 2024
                16 September 2024
                2024
                : 14
                : 21593
                Affiliations
                [1 ]Department of Communication Sciences and Disorders, University of South Carolina, ( https://ror.org/02b6qw903) Columbia, 29208 USA
                [2 ]Department of Psychology, University of South Carolina, ( https://ror.org/02b6qw903) Columbia, 29208 USA
                Article
                72528
                10.1038/s41598-024-72528-3
                11405709
                39284863
                8020c527-3b8d-4af6-9233-a1fba1cf8c69
                © The Author(s) 2024

                Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

                History
                : 7 May 2024
                : 9 September 2024
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/100000002, National Institutes of Health;
                Award ID: 5R01DC017162-02
                Award ID: 5R01DC017162-02
                Award ID: 5R01DC017162-02
                Award Recipient :
                Categories
                Article
                Custom metadata
                © Springer Nature Limited 2024

                Uncategorized
                psychology,computer science
                Uncategorized
                psychology, computer science

                Comments

                Comment on this article

                scite_
                0
                0
                0
                0
                Smart Citations
                0
                0
                0
                0
                Citing PublicationsSupportingMentioningContrasting
                View Citations

                See how this article has been cited at scite.ai

                scite shows how a scientific paper has been cited by providing the context of the citation, a classification describing whether it supports, mentions, or contrasts the cited claim, and a label indicating in which section the citation was made.

                Similar content127

                Most referenced authors76