Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

Abstract

Rapid advances in next-generation sequencing technologies have dramatically changed our ability to perform genome-scale analyses. The human reference genome used for most genomic analyses represents only a small number of individuals, limiting its usefulness for genotyping. We designed a novel method, HISAT2, for representing and searching an expanded model of the human reference genome, in which a large catalogue of known genomic variants and haplotypes is incorporated into the data structure used for searching and alignment. This strategy for representing a population of genomes, along with a fast and memory-efficient search algorithm, enables more detailed and accurate variant analyses than previous methods. We demonstrate two initial applications of HISAT2: HLA typing, a critical need in human organ transplantation, and DNA fingerprinting, widely used in forensics. These applications are part of HISAT-genotype, with performance not only surpassing earlier computational methods, but matching or exceeding the accuracy of laboratory-based assays.

Most cited references 5

Why are de Bruijn graphs useful for genome assembly?

Phillip Compeau, Pavel A Pevzner, Glenn Tesler (2011)

Assembling billions of short sequencing reads into a contiguous genome is a formidable challenge.

The development of algorithmic ideas for Next-Generation Sequencing (NGS) can be traced back three hundred years to the Prussian city of Königsberg (present-day Kaliningrad, Russia), where seven bridges joined the four parts of the city located on opposing banks of the Pregel River and two river islands (Fig. 1a). Königsberg’s residents enjoyed strolling through the city, and they wondered: is it possible to visit every part of the city by walking across each of the seven bridges exactly once and returning to one’s starting location? Remarkably, the conceptual breakthrough used in 1735 to solve this Bridges of Königsberg Problem by the great mathematician Leonhard Euler [1] also enables the assembly of billions of short sequencing reads.

Euler’s first insight was to represent each landmass as a point (called a node) and each bridge as a line segment (called an edge) connecting the appropriate two points. This creates a graph: a network of nodes connected by edges (Fig. 1b). By describing a procedure for determining whether an arbitrary graph contains an Eulerian cycle (a path through the graph that visits every edge exactly once and returns back where it started), Euler not only resolved the Bridges of Königsberg Problem but also effectively launched the entire branch of mathematics known today as graph theory [2].

Computational issues arise from alignment-based assembly

To illustrate why graphs are useful for genome assembly, we will use a simple example with five very short reads (CGTGCAA, ATGGCGT, CAATGGC, GGCGTGC and TGCAATG) sequenced from a small circular genome, ATGGCGTGCA (Fig. 2a). Current NGS methods produce reads that vary in length, but the most popular technology today generates approximately 100-nucleotide reads.
A straightforward method for assembling reads into longer contiguous sequences (and the one used for assembling the human genome [3,4] in 2001 as well as for all other projects based on Sanger sequencing) uses a graph in which each read is represented by a node and overlap between reads is represented by an arrow (called a directed edge) joining two reads. For instance, two nodes representing reads may be connected with a directed edge if the reads overlap by at least five nucleotides (Fig. 2b). Visualizing an ant walking along the edges of this graph provides a useful illustrative aid for understanding a broad class of algorithms used to derive insights from graphs. In the case of genome assembly, the ant’s path traces a series of overlapping reads, and thus represents a candidate assembly. Specifically, if the ant follows the path ATGGCGT → GGCGTGC → CGTGCAA → TGCAATG → CAATGGC → ATGGCGT, its walk induces a Hamiltonian cycle in our graph, which is a cycle that travels to every node exactly once (but closes with the starting node), meaning that each read will be included once in the assembly. The circular “genome” ATGGCGTGCA resulting from a Hamiltonian cycle contains all five reads and thus reconstructs the original genome (although we may have to “wrap around” the genome, for example in order to locate CAATGGC in ATGGCGTGCA).

Modern assemblers usually work with strings of a particular length k (k-mers), which are shorter than entire reads (see Box 2 for an explanation of why researchers prefer k-mers to reads). For example, a 100-nucleotide read may be divided into 46 overlapping 55-mers. We can generalize the Hamiltonian Cycle approach to k-mers by constructing a graph as follows. First, from a set of reads, form a node for every k-mer appearing in these reads. Second, given a k-mer, define its prefix as the string formed by all its nucleotides except the final one and its suffix as the string formed by all its nucleotides except the first one.
Connect one k-mer to another with a directed edge if the suffix of the former equals the prefix of the latter, that is, if the two k-mers completely overlap except for one nucleotide at each end (Fig. 2c). Third, look for a Hamiltonian cycle, which represents a candidate genome because it visits each detected k-mer; moreover, that path will also have minimal length because a Hamiltonian cycle travels to each k-mer exactly once.

However, this method is not as easy to implement as it might seem. Imagine attempting to create a similar graph for a single run of an Illumina sequencer that generates many reads. A million (10^6) reads will require a trillion (10^12) pairwise alignments. A billion (10^9) reads necessitate a quintillion (10^18) alignments. What’s more, there is no known efficient algorithm for finding a Hamiltonian cycle in a large graph with millions (let alone billions) of nodes. The Hamiltonian cycle approach [5,6] was feasible for sequencing the first microbial genome [7] in 1995 and the human genome in 2001, as well as for all other projects based on Sanger sequencing. However, the computational burden was so large that most NGS sequencing projects have abandoned the Hamiltonian cycle approach.

And here is where genome sequencing faces the limits of modern computer science: the computational problem of finding a Hamiltonian cycle belongs to a class of problems that are collectively called NP-Complete (see ref. 2 for further background). To this day, some of the world’s top computer scientists have worked to find an efficient solution to any NP-Complete problem, with no success. What makes their failure doubly frustrating is that neither has anyone been able to prove that NP-Complete problems are intractable; efficient solutions to these problems may actually exist, but such solutions have not yet been discovered.
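The overlap-graph approach described above can be sketched in a few lines of Python for the five example reads. This is only an illustration under stated assumptions: the function names are hypothetical, overlaps are exact suffix-prefix matches rather than real alignments, and the Hamiltonian cycle is found by brute-force enumeration, which is workable for five nodes only.

```python
from itertools import permutations

# The five example reads and the minimum overlap quoted in the text.
READS = ["CGTGCAA", "ATGGCGT", "CAATGGC", "GGCGTGC", "TGCAATG"]
MIN_OVERLAP = 5


def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_overlap):
    """Map each read to {successor read: overlap length} for sufficient overlaps."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b)
                if n >= min_overlap:
                    graph[a][b] = n
    return graph


def hamiltonian_cycle(graph):
    """Brute force: try every ordering of the nodes (factorial time)."""
    nodes = sorted(graph)
    start, rest = nodes[0], nodes[1:]
    for order in permutations(rest):
        path = [start, *order]
        size = len(path)
        if all(path[(i + 1) % size] in graph[path[i]] for i in range(size)):
            return path
    return None


def spell_circular(path, graph):
    """Each read contributes the characters not covered by the next read."""
    size = len(path)
    return "".join(
        read[: len(read) - graph[read][path[(i + 1) % size]]]
        for i, read in enumerate(path)
    )


if __name__ == "__main__":
    g = overlap_graph(READS, MIN_OVERLAP)
    cycle = hamiltonian_cycle(g)
    print(" -> ".join(cycle))
    print(spell_circular(cycle, g))
```

On these reads the cycle starts at ATGGCGT and spells the circular genome ATGGCGTGCA. Note that building the graph already compares every ordered pair of reads, which is the quadratic cost the text describes, before the (much harder) cycle search even begins.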
Scalable assembly with de Bruijn graphs

We have observed that finding a cycle visiting all nodes of a graph exactly once (called the Hamiltonian cycle problem) is a difficult computational problem; however, as we will soon see, finding a cycle visiting all edges of a graph exactly once is much easier. This algorithmic contrast has motivated computer scientists to cast fragment assembly as such a problem. So instead of assigning each k-mer to a node, we will now assign each k-mer located within a read to an edge. This allows the construction of a de Bruijn graph, which we call E, as follows. First, form a node for every distinct prefix or suffix of a k-mer, meaning that a given sequence of length k − 1 can appear only once as a node of the graph. Then, connect node x to node y with a directed edge if some k-mer has prefix x and suffix y, and label the edge with this k-mer (Fig. 2d). For a discussion on the origin of de Bruijn graphs, see Box 1.

Now imagine an ant that follows a different strategy: instead of visiting every node of the graph (as before), it now visits every edge of E exactly once. Sound familiar? This is exactly the kind of path that would solve the Bridges of Königsberg Problem and is called an Eulerian cycle. Since it visits all edges of E, which represent all possible k-mers, this new ant also spells out a candidate genome: for each edge that the ant traverses, one tacks on the first nucleotide of the k-mer assigned to that edge.

Euler considered graphs for which there exists a path between every two nodes (called connected graphs). He proved that a connected graph with undirected edges contains an Eulerian cycle exactly when every node in the graph has an even number of edges touching it. For the Königsberg Bridge Graph, this is not the case because each of the four nodes has an odd number of edges touching it (Fig. 1b), and so the desired stroll through the city does not exist.

The case of directed graphs (i.e., graphs with directed edges) is similar. For any node v in a directed graph, define the indegree of v as the number of edges leading into v and the outdegree of v as the number of edges leaving v. A graph in which indegrees are equal to outdegrees for all nodes is called balanced. Euler’s theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced. In particular, Euler’s theorem implies that the graph E contains an Eulerian cycle as long as we have located all k-mers present in the genome. Indeed, in this case, for any node, both its indegree and outdegree represent the number of times the (k − 1)-mer assigned to that node occurs in the genome.

It is easy to see that a graph possessing an Eulerian cycle is balanced because every time an ant traversing an Eulerian cycle passes through a particular vertex, it enters on one edge of the cycle and exits on the next edge. This pairs up all the edges touching each vertex, showing that half the edges on the vertex lead into it and half lead out from it. It is a bit harder to see that every connected balanced graph contains an Eulerian cycle. To prove this fact, Euler sent an ant to randomly explore the graph under a single constraint: the ant cannot traverse a previously traversed edge. Sooner or later, the ant must get stuck at a certain node (with all outgoing edges previously traversed), and Euler noticed that because the graph is balanced, this “no exit” node is exactly the vertex where the ant started, no matter how the ant traveled through the graph. This implies that the ant has completed a cycle; if this cycle happens to traverse all edges, then the ant has found an Eulerian cycle! Otherwise, Euler sent another ant to randomly traverse unexplored edges and thereby to trace a second cycle in the graph. Euler further showed that the two cycles discovered by the two ants can be combined into a single cycle.
If this (larger) cycle contains all the edges in the graph, then the two ants have together found an Eulerian cycle! If not, Euler’s method recruits a third (fourth, fifth, etc.) ant, and eventually finds an Eulerian cycle.

On modern computers, this algorithm can efficiently find Eulerian cycles in huge graphs having billions of nodes, thus avoiding the quagmire of NP-Completeness. Therefore, simply recasting our original problem into a slightly different framework has converted fragment assembly into a tractable computational problem; this is a commonly used strategy in computer science. The run time required by a computer implementation of Euler’s algorithm is roughly proportional to the number of edges in the graph E. In the Hamiltonian approach, the time is potentially a lot larger, due to the large number of pairwise alignments needed to construct the graph, and to the NP-Completeness of finding a Hamiltonian cycle. A more detailed comparison of these approaches is given in ref. 8.

Unfortunately, de Bruijn graphs are not a cure-all. Throughout our exposition, we have made several simplifying assumptions, which require much work to iron out formally. Yet for every apparent complication to sequence assembly, it has proven fruitful to apply some cousin of de Bruijn graphs to transform a question involving Hamiltonian cycles into a different question regarding Eulerian cycles (Box 2). Moreover, analogs of de Bruijn graphs have been useful in many other bioinformatics problems, including antibody sequencing [9], synteny block reconstruction [10], and RNA assembly [11]. In each of these applications, the de Bruijn graph represents the experimental data in a manner that leads to a tractable computational problem.

As new sequencing technologies emerge, the best computational strategies for assembling genomes from reads may change.
The factors that influence the choice of algorithms include the quantity of data (measured by read length and coverage); quality of data (including error rates); and genome structure (such as number and size of repeated regions, and GC content). Short read sequencing technologies produce very large numbers of reads, which currently favors the use of de Bruijn graphs. De Bruijn graphs are also well suited to representing genomes with repeats, whereas overlap methods need to mask repeats that are longer than the read length. However, if a future sequencing technology produces high quality reads with tens of thousands of bases, a smaller number of reads would be needed, and the pendulum could swing back towards favoring overlap-based approaches for assembly.
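The de Bruijn construction and the "ants" argument above can be sketched as follows. This is a minimal illustration, not the article's implementation: the function names are hypothetical, the cycle search is a stack-based variant of the procedure usually attributed to Hierholzer, and the choice k = 4 is an assumption made here so that the toy graph has exactly one Eulerian cycle. The sketch also assumes error-free reads in which every k-mer occurs once in the genome.

```python
from collections import defaultdict

READS = ["CGTGCAA", "ATGGCGT", "CAATGGC", "GGCGTGC", "TGCAATG"]


def kmers_from_reads(reads, k):
    """Distinct k-mers seen in the reads (assumes one genomic copy of each)."""
    return sorted({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})


def de_bruijn(kmers):
    """Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def is_balanced(graph):
    """True when indegree equals outdegree at every node."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    nodes = set(graph) | set(indegree)
    return all(len(graph.get(n, [])) == indegree[n] for n in nodes)


def eulerian_cycle(graph, start):
    """Walk unused edges until stuck, then back up and splice in side cycles."""
    remaining = {node: list(succs) for node, succs in graph.items()}
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            cycle.append(stack.pop())  # dead end: node joins the final cycle
    cycle.reverse()
    return cycle  # node sequence whose first and last entries coincide


def spell_genome(cycle):
    """Tack on the first nucleotide of the k-mer on each traversed edge."""
    return "".join(node[0] for node in cycle[:-1])


if __name__ == "__main__":
    graph = de_bruijn(kmers_from_reads(READS, 4))
    print(is_balanced(graph))
    print(spell_genome(eulerian_cycle(graph, "ATG")))
```

With k = 4 and the walk started at node ATG, the spelled sequence is ATGGCGTGCA. With k = 3 the graph contains branching nodes (TG and GC), so more than one Eulerian cycle exists and the spelled genome depends on the order in which edges are followed; this is one of the simplifying assumptions the text alludes to. The running time is proportional to the number of edges, since each edge is pushed and popped once.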

Cited 189 times

Is Open Access

dbVar and DGVa: public archives for genomic structural variation

Ilkka Lappalainen, John Lopez, Lisa Skipper … (2012)

Much has changed in the last two years at DGVa (http://www.ebi.ac.uk/dgva) and dbVar (http://www.ncbi.nlm.nih.gov/dbvar). We are now processing direct submissions rather than only curating data from the literature and our joint study catalog includes data from over 100 studies in 11 organisms. Studies from human dominate with data from control and case populations, tumor samples as well as three large curated studies derived from multiple sources. During the processing of these data, we have made improvements to our data model, submission process and data representation. Additionally, we have made significant improvements in providing access to these data via web and FTP interfaces.

Cited 131 times

Indexing Graphs for Path Queries with Applications in Genome Research.

Veli Mäkinen, Niko Välimäki, Jouni Sirén (2014)

We propose a generic approach to replace the canonical sequence representation of genomes with graph representations, and study several applications of such extensions. We extend the Burrows-Wheeler transform (BWT) of strings to acyclic directed labeled graphs, to support path queries as an extension to substring searching. We develop, apply, and tailor this technique to a) read alignment on an extended BWT index of a graph representing pan-genome, i.e., reference genome and known variants of it; and b) split-read alignment on an extended BWT index of a splicing graph. Other possible applications include probe/primer design, alignments to assembly graphs, and alignments to phylogenetic tree of partial-order graphs. We report several experiments on the feasibility and applicability of the approach. Especially on highly-polymorphic genome regions our pan-genome index is making a significant improvement in alignment accuracy.
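For orientation, the string case that this abstract generalizes can be sketched as below: a Burrows-Wheeler transform of a single sequence, searched with backward search. This is not the graph extension the authors propose, and it is not HISAT2's index; the function names are hypothetical, and the naive suffix sorting and rank counting are chosen for readability, not efficiency.

```python
def bwt_index(text):
    """Build a naive suffix array, the BWT, and first-column offsets."""
    text += "$"  # sentinel, sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total  # row where suffixes starting with ch begin
        total += counts[ch]
    return suffix_array, bwt, first


def occ(bwt, ch, i):
    """Number of occurrences of ch in bwt[:i] (linear scan, for clarity)."""
    return bwt[:i].count(ch)


def find(pattern, suffix_array, bwt, first):
    """Backward search: narrow the suffix-array interval one character at a time."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ(bwt, ch, lo)
        hi = first[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


if __name__ == "__main__":
    index = bwt_index("ATGGCGTGCA")
    print(find("GC", *index))
```

For the text ATGGCGTGCA, searching for GC returns the 0-based start positions 3 and 7. Each step of the loop extends the match by one character to the left, which is the substring-search operation that the abstract describes extending to path queries on graphs.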

Cited 118 times

Author and article information

Journal

Title: Nature Biotechnology

Abbreviated Title: Nat Biotechnol

Publisher: Springer Science and Business Media LLC

ISSN (Print): 1087-0156

ISSN (Electronic): 1546-1696

Publication date (Created): August 2019

Publication date (Electronic): August 2, 2019

Publication date (Print): August 2019

Volume: 37

Issue: 8

Pages: 907-915

Article

DOI: 10.1038/s41587-019-0201-4

PMC ID: 7605509

PubMed ID: 31375807

SO-VID: 095be246-be11-4c64-8879-72d65f620217

License:

http://www.springer.com/tdm

Cited by 3,933
