This abstract covers two areas of data management research in which formal language
theory plays a central role, namely in Information Extraction and Graph Databases.
Information Extraction. Automata-based foundations of Information Extraction (e.g.,
[9, 16, 34]) have become a popular research topic over the last years. One framework
that has been studied in this context is that of document spanners [16]. Document
spanners model information extraction tasks as functions that map input text documents
to a relation of spans, i.e., intervals of start and end positions in the text. A
particular interesting class of spanners is the class of regular spanners, which is
based on regular languages with capture variables. This class satisfies a number of
interesting complexity and expressiveness properties and therefore caused a revival
of automata- and formal language techniques in database research. Examples of such
work are on the enumeration of answers [1, 18], expressiveness [20, 21, 33], complexity
issues [22, 29, 32], integration of weights [15], and distributed evaluation [14].
That said, the document spanners framework is not the only one that is studied in
this context, and there are other elegant frameworks that can express information
extraction functions beyond the spanner framework, e.g., [9, 34].
Graph Databases. Formal languages have played a central role in Graph Databases since
the SIGMOD 1987 paper of Cruz et al. [12], which is one of the first and most influential
papers on the topic. Indeed, this paper introduced regular expressions for querying
paths (later named regular path queries or RPQs), which are still used in graph query
languages today [13, 19, 24]. This early work on Graph Databases only allowed RPQs
to match simple paths in graphs, i.e., paths without repeated nodes. However, after
discovering that this restriction already makes simple queries difficult to evaluate
[30], it was largely abandoned by the research community, and a huge body of research
on fundamental problems followed, in which RPQs were allowed to match all paths. This
line of work is too extensive to discuss here, but its state until 2013 is nicely
surveyed by Barceló [7]. It still produces high-quality and exciting results today
(e.g., [8, 17]).
Perhaps ironically, the simple paths and the similar trail restriction (which only
allows paths without repeated edges) resurfaced on the systems side of graph databases.
Indeed, an early incarnation1 of SPARQL 1.1 [24] used (a variant of) the simple path
restriction, whereas the default semantics of Cypher [13] uses the trail restriction.
This new development on the practical side of graph query languages motivated several
research groups to build a scientific basis that can be used to guide the design of
RPQs in graph query languages [5, 6, 26, 27]. Furthermore, it seems that, in order
to understand RPQ evaluation in practical graph query languages, it is very useful
to combine fundamental research with query log analysis [10, 11, 28].
To conclude, it seems that the research communities’ efforts to connect theory and
practice (which go far beyond what I’ve been able to mention here, see, e.g. [2–4,
25, 31, 35]) are paying off, in the sense that we are now experiencing an increased
interaction between researchers and practitioners in the Graph Query Language (GQL)
Standard initiative [23] and the process that has led to it.2 The GQL initiative was
recently inaugurated as an official ISO project that aims at becoming an international
standard for graph database querying. Furthermore, the story does not stop here at
all—a large number of initiatives is currently brainstorming on next-generation logical
foundations of graph databases and their query languages, schema languages for graphs,
etc.