
      Performance analysis of large language models in the domain of legal argument mining

      research-article


          Abstract

          Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining have not yet been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the models' performance considering diverse prompt formulations and example selection in the prompt via semantic search, using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our results statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT-3.5 and GPT-4 in F1-score for the premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that this performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection, suggesting that local embeddings are as semantically rich as those from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.
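          The example-selection step the abstract describes (retrieving semantically similar labeled sentences to include in the prompt) can be sketched as follows. This is a minimal sketch: the sentences, labels, and embedding vectors are hypothetical stand-ins, not the paper's data or its actual embedding models.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_examples(query_vec, pool_vecs, pool_items, k=2):
    # Rank the labeled pool by similarity to the query embedding and
    # keep the top-k sentences as in-context examples.
    scores = [cosine_sim(query_vec, v) for v in pool_vecs]
    top = np.argsort(scores)[::-1][:k]
    return [pool_items[i] for i in top]

def build_prompt(examples, query_text):
    # Format the selected examples plus the query sentence as a
    # few-shot argument-component classification prompt.
    lines = ["Classify each sentence as 'premise' or 'conclusion'.\n"]
    for text, label in examples:
        lines.append(f"Sentence: {text}\nLabel: {label}\n")
    lines.append(f"Sentence: {query_text}\nLabel:")
    return "\n".join(lines)
```

          In the study itself the vectors would come from an OpenAI embedding model or a local sentence-transformers model; the ranking logic is the same for any fixed-length vectors.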


          Most cited references: 74


          What is a support vector machine?

          Support vector machines (SVMs) are becoming popular in a wide variety of biological applications. But, what exactly are SVMs and how do they work? And what are their most promising applications in the life sciences?
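            As a minimal illustration of the classifier this blurb alludes to: a trained linear SVM separates the classes with a maximum-margin hyperplane and classifies a point by the sign of w·x + b. The weights below are hand-set for illustration, not learned from data.

```python
import numpy as np

# Illustrative, hand-set hyperplane parameters (a real SVM learns
# these by maximizing the margin on labeled training data).
w = np.array([1.0, -1.0])  # normal vector of the separating hyperplane
b = 0.0                    # bias term

def svm_predict(x):
    # Linear SVM decision rule: which side of the hyperplane x lies on.
    return 1 if np.dot(w, x) + b >= 0 else -1
```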

            Language Models are Few-Shot Learners

            Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
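            The few-shot setting described above, where the task and demonstrations are specified purely via text interaction with no gradient updates, amounts to prompt string construction. The Q/A format below is one common convention, used here only as a sketch, not GPT-3's required input format.

```python
def few_shot_prompt(task_description, demonstrations, query):
    # Assemble a prompt from a task description, K worked examples,
    # and the new query; the model is expected to complete the final "A:".
    parts = [task_description]
    for question, answer in demonstrations:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)
```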

              Language models are unsupervised multitask learners


                Author and article information

                Journal
                Frontiers in Artificial Intelligence (Front. Artif. Intell.)
                Frontiers Media S.A.
                ISSN: 2624-8212
                Published: 17 November 2023
                Volume: 6
                Article: 1278796
                Affiliations
                1. Faculty of Computer Science and Mathematics, Chair of Data Science, University of Passau, Passau, Germany
                2. Group for Human Computer Interaction, Institute for Artificial Intelligence Research and Development of Serbia, Novi Sad, Serbia
                Author notes

                Edited by: Juliano Rabelo, University of Alberta, Canada

                Reviewed by: Masaharu Yoshioka, Hokkaido University, Japan; Constantin Orasan, University of Surrey, United Kingdom

                *Correspondence: Abdullah Al Zubaer abdullahal.zubaer@uni-passau.de
                Article
                DOI: 10.3389/frai.2023.1278796
                PMCID: 10691378
                PMID: 38045763
                Copyright © 2023 Al Zubaer, Granitzer and Mitrović.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                Received: 16 August 2023
                Accepted: 25 October 2023
                Page count
                Figures: 4, Tables: 9, Equations: 1, References: 93, Pages: 18, Words: 13789
                Funding
                The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work has been funded by the German Federal Ministry of Education and Research (BMBF) under the projects DeepWrite (Grant No. 16DHBKI059) and CAROLL (Grant No. 01-S20049). The authors are responsible for the content of this publication.
                Categories
                Artificial Intelligence
                Original Research
                Custom metadata
                Technology and Law

                natural language processing (NLP), argument mining, legal data, European Court of Human Rights (ECHR), sequence classification, GPT-4, ChatGPT, large language models
