Automatic Identification of Closely-related Indian Languages: Resources and Experiments

Abstract

In this paper, we describe an attempt to develop an automatic language identification system for five closely related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources, and we discuss the method used to create these corpora in detail. Using these corpora, we developed a language identification system, which currently gives a state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the five languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.

Author and article information

Publication date (created): 26 March 2018

Article

arXiv ID: 1803.09405

SO-VID: 6364a81f-86f6-4cba-8f46-39a1ce8fdf7f

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Comments: Paper accepted at the 4th Workshop in Indian Languages Data and Resources (WILDRE-4), 11th edition of the Language Resources and Evaluation Conference (LREC 2018), 7-12 May 2018, Miyazaki (Japan)

Categories: cs.CL
