A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available^1.

Related collections

Author and article information

Journal

Publication date Created: 21 July 2024

Article

ArXiV ID: 2407.15136

SO-VID: fcbe1a1f-9b03-4517-8c94-95d04555d218

License:

http://creativecommons.org/licenses/by-nc-sa/4.0/

History

Custom metadata

Categories cs.CL

ScienceOpen disciplines: Theoretical computer science

Data availability:

ScienceOpen disciplines: Theoretical computer science

A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

Read this article at

Abstract

Related collections

Blockchain in Healthcare Today

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 70