Formal Languages in Information Extraction and Graph Databases

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

This abstract covers two areas of data management research in which formal language theory plays a central role, namely in Information Extraction and Graph Databases. Information Extraction. Automata-based foundations of Information Extraction (e.g., [9, 16, 34]) have become a popular research topic over the last years. One framework that has been studied in this context is that of document spanners [16]. Document spanners model information extraction tasks as functions that map input text documents to a relation of spans, i.e., intervals of start and end positions in the text. A particular interesting class of spanners is the class of regular spanners, which is based on regular languages with capture variables. This class satisfies a number of interesting complexity and expressiveness properties and therefore caused a revival of automata- and formal language techniques in database research. Examples of such work are on the enumeration of answers [1, 18], expressiveness [20, 21, 33], complexity issues [22, 29, 32], integration of weights [15], and distributed evaluation [14]. That said, the document spanners framework is not the only one that is studied in this context, and there are other elegant frameworks that can express information extraction functions beyond the spanner framework, e.g., [9, 34]. Graph Databases. Formal languages have played a central role in Graph Databases since the SIGMOD 1987 paper of Cruz et al. [12], which is one of the first and most influential papers on the topic. Indeed, this paper introduced regular expressions for querying paths (later named regular path queries or RPQs), which are still used in graph query languages today [13, 19, 24]. This early work on Graph Databases only allowed RPQs to match simple paths in graphs, i.e., paths without repeated nodes. However, after discovering that this restriction already makes simple queries difficult to evaluate [30], it was largely abandoned by the research community, and a huge body of research on fundamental problems followed, in which RPQs were allowed to match all paths. This line of work is too extensive to discuss here, but its state until 2013 is nicely surveyed by Barceló [7]. It still produces high-quality and exciting results today (e.g., [8, 17]). Perhaps ironically, the simple paths and the similar trail restriction (which only allows paths without repeated edges) resurfaced on the systems side of graph databases. Indeed, an early incarnation1 of SPARQL 1.1 [24] used (a variant of) the simple path restriction, whereas the default semantics of Cypher [13] uses the trail restriction. This new development on the practical side of graph query languages motivated several research groups to build a scientific basis that can be used to guide the design of RPQs in graph query languages [5, 6, 26, 27]. Furthermore, it seems that, in order to understand RPQ evaluation in practical graph query languages, it is very useful to combine fundamental research with query log analysis [10, 11, 28]. To conclude, it seems that the research communities’ efforts to connect theory and practice (which go far beyond what I’ve been able to mention here, see, e.g. [2–4, 25, 31, 35]) are paying off, in the sense that we are now experiencing an increased interaction between researchers and practitioners in the Graph Query Language (GQL) Standard initiative [23] and the process that has led to it.2 The GQL initiative was recently inaugurated as an official ISO project that aims at becoming an international standard for graph database querying. Furthermore, the story does not stop here at all—a large number of initiatives is currently brainstorming on next-generation logical foundations of graph databases and their query languages, schema languages for graphs, etc.

Related collections

Most cited references 9

Record: found
Abstract: not found
Article: not found

Semantics and complexity of SPARQL

Claudio Gutierrez, Jorge Pérez, Marcelo Arenas (2009)

0 comments Cited 103 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: not found
Article: not found

Finding Regular Simple Paths in Graph Databases

Alberto O. Mendelzon, Peter T. Wood (1995)

0 comments Cited 29 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: not found
Article: not found

Foundations of Modern Query Languages for Graph Databases

Renzo Angles, Marcelo Arenas, Pablo Barcelo … (2017)

0 comments Cited 23 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

Marcella Anselmo: manselmo@unisa.it

Gianluca Della Vedova: gianluca.dellavedova@unimib.it

Florin Manea: flmanea@gmail.com

Arno Pauly: arno.m.pauly@gmail.com

Wim Martens: wim.martens@uni-bayreuth.de

Journal

Journal ID (publisher-id): 978-3-030-51466-2

Journal ID (doi): 10.1007/978-3-030-51466-2

Journal ID (nlm-ta): Beyond the Horizon of Computability

Title: Beyond the Horizon of Computability

Subtitle: 16th Conference on Computability in Europe, CiE 2020, Fisciano, Italy, June 29–July 3, 2020, Proceedings

ISBN (Print): 978-3-030-51465-5

ISBN (Electronic): 978-3-030-51466-2

Publication date (Electronic): 24 June 2020

Volume: 12098

Pages: 306-309

Affiliations

[8 ]GRID grid.11780.3f, ISNI 0000 0004 1937 0335, University of Salerno, ; Fisciano, Italy

[9 ]GRID grid.7563.7, ISNI 0000 0001 2174 1754, University of Milano-Bicocca, ; Milan, Italy

[10 ]GRID grid.7450.6, ISNI 0000 0001 2364 4210, University of Göttingen, ; Göttingen, Germany

[11 ]GRID grid.4827.9, ISNI 0000 0001 0658 8800, Swansea University, ; Swansea, UK

GRID grid.7384.8, ISNI 0000 0004 0467 6972, University of Bayreuth, ; Bayreuth, Germany

Article

Publisher ID: 28

DOI: 10.1007/978-3-030-51466-2_28

PMC ID: 7309490

SO-VID: f54054fd-c3af-4b29-bf0d-8a88aa022c0c

License:

This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.

History

Custom metadata

Data availability:

Comments

Comment on this article

scite_

Smart Citations

Citing PublicationsSupportingMentioningContrasting

View Citations

See how this article has been cited at scite.ai

scite shows how a scientific paper has been cited by providing the context of the citation, a classification describing whether it supports, mentions, or contrasts the cited claim, and a label indicating in which section the citation was made.

Most referenced authors 57

See all reference authors

Formal Languages in Information Extraction and Graph Databases

Read this article at

Abstract

Related collections

African Languages

Most cited references 9

Semantics and complexity of SPARQL

Finding Regular Simple Paths in Graph Databases

Foundations of Modern Query Languages for Graph Databases

Author and article information

Contributors

Journal

Affiliations

Article

History

Categories

Custom metadata

Comments

Comment on this article

Similar content 27

Most referenced authors 57