      Is Open Access

      Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm

      research-article


          Abstract

          Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
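The abstract reports contiguity as an N50 contig size. As a reminder of what that statistic measures, here is a minimal sketch in Python, assuming the usual definition: the length L such that contigs of length L or longer contain at least half of the total assembled bases. The function name `n50` and the example lengths are illustrative only and are not taken from the article.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: the largest length L such that contigs of
    length >= L together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downward until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (made-up lengths): total is 290, so half is 145.
# 80 alone is short of 145; 80 + 70 = 150 reaches it, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract pairs it with optical-map and BAC-based accuracy checks.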

          Most cited references (23)

          Is Open Access

          Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

          Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
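The Canu abstract mentions an overlapping strategy based on tf-idf weighted MinHash. The sketch below, in Python, illustrates only plain, unweighted MinHash over k-mer sets; it is not Canu's weighted variant or its implementation. All function names, the k-mer size, and the number of hash functions are assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, kmer):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    digest = hashlib.sha1(f"{seed}:{kmer}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(seq, k=5, num_hashes=64):
    """Signature: for each seeded hash function, the minimum hash over all k-mers."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(seed, km) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates k-mer Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


read_a = "ACGTACGTTAGCCGATAGGCTA"
read_b = "ACGTACGTTAGCCGATAGGCTT"  # differs from read_a only in the last base
sig_a = minhash_signature(read_a)
sig_b = minhash_signature(read_b)
print(estimate_jaccard(sig_a, sig_a))  # identical signatures give 1.0
print(estimate_jaccard(sig_a, sig_b))  # a value between 0 and 1
```

The point of the technique is that two fixed-size signatures can be compared much faster than two full k-mer sets, which matters when every read must be screened against every other read.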
            Is Open Access

            SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

              Background: There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions. Findings: To overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and optimizes for large genome. Conclusions: Benchmark using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive to other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and memory consumption was ~2/3 lower during the point of largest memory consumption.

              Versatile and open software for comparing large genomes

              Background Genome sequence comparison has been an important method for understanding gene function and genome evolution since the early days of gene sequencing. The pairwise sequence-comparison methods implemented in BLAST [1] and FASTA [2] have proved invaluable in discovering the evolutionary relationships and functions of thousands of proteins from hundreds of different species. The most commonly used application of these sequence-analysis programs is for comparing a single gene (either a DNA sequence or the protein translation of that sequence) to a large database of other genes. The results of such protein and nucleotide database searches have been used in recent years as the basis for assigning function to most of the newly discovered genes emerging from genome projects. In recent years, an important new sequence-analysis task has emerged: comparing an entire genome with another. Until 1999, each new genome published was so distant from all previous genomes that aligning them would not yield interesting results. With the publication of the second strain of Helicobacter pylori [3] in 1999, following the publication of the first strain [4] in 1997, the scientific world had its first chance to look at two complete bacterial genomes whose DNA sequences lined up very closely. Comparison of these genomes revealed an overall genomic structure that was very similar, but showed evidence of two large inversion events centered on the replication origin. The comparison also made it clear that a new type of bioinformatics program was needed, one that could efficiently compare two megabase-scale sequences, something that BLAST cannot do. In response to this need, TIGR released MUMmer 1.0, the first system that could perform genome comparisons of this scale [5]. The first two releases of MUMmer had over 1,600 site licensees, a number that has grown since moving to an open-source license in May 2003. 
The number of pairs of closely related genomes has increased dramatically in recent years, with a corresponding increase in the number of scientific studies of genome structure and evolution, facilitated by new software that permits the comparisons of these genomes. As of mid-2003, there are more than 150 complete published genomes, with over 380 prokaryotic genome projects and 240 eukaryotic projects under way. Many of these involve species that are closely related to published genomes. The published databases already include 33 species for which at least one other closely related species has been sequenced; for a detailed list see [6]. More distantly related pairs of species, for example, Plasmodium falciparum and P. yoelii, fail to show DNA sequence similarity but do show large-scale similarity when their translated protein sequences are aligned, as described in earlier studies [7,8]. Related to the growing number of closely related species that have been sequenced is a rapid growth in the number of known species whose genomes are similar but have undergone significant rearrangement. The human and mouse genomes, for example, are both available in draft form, and the chromosomes of either species can be aligned with the other at the DNA level. Various lines of evidence in the past have pointed to massive genome rearrangements separating the species, and the latest analysis indicates that the mouse genome can be split into 217 large segments that can be rearranged to produce the same gene order as in the human genome [9]. This very large-scale similarity interrupted by rearrangements places additional demands on genome-comparison programs: essentially, one must produce all pairs of similar regions in the sequences (in form of local alignments), not merely a single 'best' or longest global alignment of the entire sequences. 
In addition to the need for whole-genome alignment programs, another need has become evident recently - a means of reliably evaluating and comparing genome assemblies. The explosion of genome sequencing has brought with it an explosion in genome-assembly programs, with several new assemblers either under development or recently released [10-12]. Unlike the previous generation of assemblers (TIGR Assembler [13], phrap [14], and CAP3 [15]), these second-generation assemblers are designed to handle large eukaryotic genomes. Assembly of large genomes is a major technical challenge, and once an assembly has been produced, evaluating it can be almost as difficult. Debates over the relative quality of assemblies produced by different assemblers are ongoing, and whole-genome comparison algorithms represent a critical tool in these analyses. Different assemblies of the same data should be nearly 100% identical, making the comparison problem analogous to the problem of comparing closely related species. Assembly differences may represent errors in one of the algorithms, and are useful for providing insights into the strengths and weaknesses of different methods. The large-scale comparison problem also occurs for assemblies delivered by the same software but from different inputs; for example, assemblies at threefold (3×) coverage and sixfold (6×) coverage of the same genome. With larger eukaryotic projects, multiple assemblies are run at different stages of the project, and comparisons of the successive assemblies provide a map showing how to transfer any analyses (such as gene predictions) from one assembly to another. A third use for rapid, large-scale alignment programs has come up in our own applications. As part of our annotation 'pipeline' at TIGR, we routinely rebuild a database containing the results of all-against-all BLAST searches for all known proteins. 
Each time a new genome is added to the public archives, many thousands of searches need to be re-run to incorporate the newly sequenced genes. Because of the size of the archive, these additional searches take a relatively long time. A rapid method for identifying potential hits is used as a pre-screen as follows: for each new gene that is being added to the database, we use the high-speed method (MUMmer) to determine if it has any potential hits. If it does not, then it can be omitted from subsequent BLAST searches. If a new genome has a large number of novel proteins, this pre-screening step can substantially reduce the time required to search it against the database. The new MUMmer system, version 3.0, addresses all of the above uses and more, including new graphical modules for viewing assembly comparisons and for looking at more distantly related species alignments. In addition, the implementations of all the fundamental search operations are now either optimal or nearly optimal, in the sense of running in time proportional to the sum of their input and output sizes. Other parts of the code have also been rewritten to improve their efficiency. What may be the most significant change with MUMmer 3.0 is that it is now an open-source system. All code is publicly available without restriction on its use or redistribution, and we encourage others to add to the code base and distribute their own improvements. The modularity of the code base makes it easily extendable as well. Others can build on our matching algorithm, for example, and create their own clustering and extension steps. Results Since the development of MUMmer 1.0 in 1999, several other programs for large-scale genome comparison have been developed, for example, SSAHA [16], AVID [17], MGA [18], BLASTZ [19], and LAGAN [20] (see also [21] for a review). 
Most of these programs follow an anchor-based approach, which can be divided into three phases: (1) computation of potential anchors; (2) computation of a colinear sequence of non-overlapping potential anchors, which form the basis of the alignment; and (3) alignment of the gaps in between the anchors. The traditional methods to compute potential anchors, that is, maximal matches of some length l or longer, use a generate-and-test approach. In a first step, all matches of some fixed length k [...] c.) The parameters l, g, and c can all be set on the command line. The chain matches are then extended using an implementation of the Smith-Waterman dynamic programming algorithm [28], which is applied to the regions between the exact matches and also to the boundaries of the chains, which may be extended outward. This 'match and extend' step in the algorithm is essentially the same as that used by FASTA [29], BLAST [30], and many other sequence-alignment programs. When two species are very similar, such as the two isolates of the Bacillus anthracis Ames strain sequenced at TIGR [31-33], then MUMmer is ideally suited for aligning the genomes. In that comparison of anthrax isolates, only four single-nucleotide differences separated the two 5.3 Mbp main chromosomes from one another. Similarly, in our comparison of a clinical isolate of Mycobacterium tuberculosis to a laboratory strain [31], MUMmer quickly found the approximately 1,100 SNPs and a handful of IS elements that distinguished the strains. However, when the species being compared are more distant, Nucmer and Promer provide much more detailed and more useful alignments than MUMmer alone. In the examples described below, we show how each of the programs described here may be run for genomes at different evolutionary distances. Fly versus fly The 130 Mbp genome of D. melanogaster is largely complete, with the six main chromosome arms containing only a few gaps. 
Recently, the Human Genome Sequencing Center at Baylor College of Medicine completed the shotgun sequencing of D. pseudoobscura, a closely related species with a genome of approximately the same size. These two species are close enough that almost all genes are shared, and exons show a high level of sequence identity. However, they are sufficiently distant that intergenic regions and introns do not align well, and there have been hundreds of large-scale chromosomal rearrangements since the species diverged. Thus, one cannot simply align each chromosome arm to its counterpart. Complicating matters further, the D. pseudoobscura shotgun assembly consists of thousands of scaffolds and contigs. To facilitate comparison, the first computational task is to align all the scaffolds to each of the D. melanogaster arms. (The comprehensive analysis of D. pseudoobscura, organized by the sequencing center scientists and their collaborators, will appear in a future paper. The description here is primarily intended to illustrate the use and capabilities of Nucmer.) We ran the Nucmer program with a minimum match length of 25, which was adequate to capture virtually all matching exons. Because matching genes are much longer, we required cluster chains to contain at least 100 matching nucleotides. To account for long introns and to allow the program to cluster together multiple genes, we allowed the gap between exact matches to be as long as 3,000 bp. At the time of our analysis (before completion of the sequencing project), the D. pseudoobscura assembly contained 4,653 scaffolds spanning 150 Mbp. We ran Nucmer separately to align the full set of scaffolds to each D. melanogaster chromosome arm. Using these settings, the program takes about 6 minutes per arm and uses approximately 490 Mb of memory on a 2.8 GHz desktop Pentium 4 PC running Linux. 
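The three Nucmer settings quoted above (minimum match length 25, minimum cluster length 100, maximum gap 3,000 bp) correspond to the `-l`, `-c`, and `-g` options of the `nucmer` script in MUMmer 3. The Python sketch below only assembles such a command line; it does not run it. The helper name `build_nucmer_command`, the FASTA file names, and the output prefix are hypothetical placeholders, not the files used in the study.

```python
import shlex


def build_nucmer_command(reference, query, prefix,
                         min_match=25, min_cluster=100, max_gap=3000):
    """Assemble a nucmer argument list using the settings quoted in the text.

    -l: minimum length of a single exact match
    -c: minimum length of a cluster of matches
    -g: maximum gap between adjacent matches in a cluster
    -p: prefix for the output files
    """
    return [
        "nucmer",
        "-l", str(min_match),
        "-c", str(min_cluster),
        "-g", str(max_gap),
        "-p", prefix,
        reference,
        query,
    ]


# Hypothetical file names: one D. melanogaster arm as the reference and
# the full set of D. pseudoobscura scaffolds as the query.
cmd = build_nucmer_command("dmel_arm_2L.fasta", "dpse_scaffolds.fasta", "dmel2L_vs_dpse")
print(shlex.join(cmd))
```

Running one such command per chromosome arm mirrors the per-arm runs described above; passing the list to `subprocess.run` would execute it on a machine where MUMmer is installed.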
Fly versus mosquito When the two species are more distantly related, the only means of detecting large-scale similarity is through comparisons on the amino acid level. One example of this phenomenon arose during our comparison of the genomes of the malaria mosquito, Anopheles gambiae, and the fruit fly D. melanogaster. Because Anopheles was the second insect genome to be sequenced, the only available species for comparison was fruit fly. Our detailed analysis, done jointly with colleagues at the European Molecular Biology Laboratory in Heidelberg, was based on a combination of BLAST and MUMmer analysis [34]. These two species diverged about 250 million years ago, and they have an average protein sequence identity of 56%, less than that shared between humans and pufferfish. Although the two insects have the same number of chromosomes, the Anopheles genome is approximately twice as large, and the gene order has been almost completely shuffled, as our alignments revealed. Only small, but numerous, regions of 'microsynteny' remain: we reported 948 regions, the largest containing 8 genes in Anopheles and 31 in Drosophila. An interesting finding, though, was that despite extensive shuffling, each chromosome arm had a clear predominance of homologs on a single arm in the other species, indicating that intrachromosome gene shuffling was the primary force affecting gene order (see Figure 7 of [34]). Fungus versus fungus In a current application, we are using both Nucmer and Promer to compare two related fungal genomes, Aspergillus fumigatus (a human pathogen) and A. nidulans (a non-pathogenic model organism). Shotgun sequencing of these two genomes has been completed, and A. fumigatus is in the process of being completely finished; that is, all gaps are being closed. (A. fumigatus is a joint sequencing project of TIGR and The Sanger Institute, while A. nidulans is being sequenced at the Whitehead/MIT Genome Center.) At the time of our most recent comparison, the A. 
fumigatus genome had progressed to the point where it was assembled into 19 scaffolds spanning 28 Mbp, and the A. nidulans genome was assembled into 238 contigs spanning 30 Mbp. For this comparison, we first ran Nucmer and found that most of the two genomes mapped onto one another quite clearly: there are sufficient matches to reveal large segments of similarity in a simple dot plot. There has been extensive rearrangement of the chromosomes, but large-scale synteny is still present. For example, the largest contig (A1058) in A. fumigatus, at 2.9 Mbp, representing an essentially complete chromosome, maps onto five different scaffolds in A. nidulans. If one looks only at the Nucmer alignment of the largest of these, a 2.1 Mbp scaffold containing 10 contigs, it appears to be rearranged into multiple segments, but the matches are so scattered that it is difficult to tell how many segments there are (Figure 1, left-hand side). The syntenic alignment is much more clearly visible, however, if we use Promer instead. The simplest summary is just the number of bases included in the alignments: if we look at the Nucmer alignment between the scaffolds, the total number of matching bases is 81 kbp. In contrast, the Promer alignment covers 1.87 Mbp of A1058, beginning at nucleotide position 1,000,000 and continuing to the end of the chromosome. A graphical illustration is shown in Figure 1, which displays both the Promer and Nucmer alignments between the 2.1 Mbp scaffold from A. nidulans and scaffold A1058 of A. fumigatus. As the figure makes clear, the amino-acid-based alignment covers much more of the sequence of both species, and is therefore much more useful for determining homologous relationships between genes and chromosomal relationships. Human versus human One of the most challenging computational tasks one can perform today is the cross-comparison of mammalian genomes. 
The human and mouse genomes are sufficiently complete that much ongoing research is based on mappings between these two species. As shown in Table 1, MUMmer 3.0 can compare human and mouse chromosomes in a matter of minutes. The table shows the time (7 minutes 10 seconds, on a 2.4 GHz Pentium processor) required to align mouse chromosome 16 (Mm16) to human chromosome 21 (Hs21). These two were chosen because nearly all of Hs21 maps to one end of Mm16; in fact, researchers have developed a mouse model of Down syndrome that has an extra copy of this part of Mm16. We ran a benchmark test of MUMmer 3.0 in which we compared the human genome (version of 3 January 2003, downloaded from GenBank) to itself by computing all maximal matches of length at least 300 between each chromosome and all the others. The resulting 631,975 matches allow one to identify both large- and small-scale interchromosomal duplications. Note that the run-times reported in [6] are only for the match-finding part of MUMmer. The time for processing clusters and performing alignments in the gaps between matches are omitted as these vary widely depending on the parameters used. For this test, we needed a maximum of about 4 GB of memory. As we did not have a PC available with this amount of memory, we used a Sun-Sparc computer running the Solaris operating system, with 64 GB of memory and a 950 MHz processor. We ran the alignment as follows. Each human chromosome was used as a reference, and the rest of the genome was used as a query and streamed against it. To avoid duplication, we only included chromosomes in the query if they had not already been compared; thus we first used chromosome 1 as a reference, and streamed the other 23 chromosomes against it. Then we used chromosome 2 as a reference, and streamed chromosomes 3-22, X, and Y against that, and so on. The total length of all human chromosomes for this test was 2,839 Mbp. The time required to build all the suffix trees was 4.7 hours. 
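The reference/query scheme just described (each chromosome serves once as the reference, and only chromosomes not yet compared are streamed against it) can be sketched in Python as follows. The function name `reference_query_plan` is illustrative, and the chromosome ordering 1-22, X, Y is an assumption consistent with the example in the text.

```python
def reference_query_plan(names):
    """Pair each reference with the chromosomes that follow it in the list.

    Every unordered pair of distinct chromosomes is compared exactly once;
    the last chromosome never serves as a reference because nothing
    remains to stream against it.
    """
    return [(ref, names[i + 1:]) for i, ref in enumerate(names[:-1])]


chromosomes = [str(n) for n in range(1, 23)] + ["X", "Y"]
plan = reference_query_plan(chromosomes)

first_ref, first_queries = plan[0]
print(first_ref, len(first_queries))   # prints: 1 23
second_ref, second_queries = plan[1]
print(second_ref, second_queries[0])   # prints: 2 3

# Total number of chromosome-versus-chromosome comparisons: 24 * 23 / 2.
print(sum(len(queries) for _, queries in plan))  # prints 276
```

Since each reference in the plan is independent of the others, the entries can be dispatched to separate machines.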
The space requirement for the suffix tree was remarkably constant, with about 15.5 bytes per base-pair (with only one exception). The total query time was 101.5 hours, and memory usage never exceeded 3.9 GB (see [6] for details). Thus, in approximately 4.5 days on a single processor, we matched the human genome against itself. This could easily be divided up among multiple computers, with each chromosome handled separately, bringing the time down to just 11 hours. Graphical viewers Because the text-format output of MUMmer 3.0 is often voluminous, we have developed two graphical viewers, one for the purpose of comparing two genome assemblies or near-identical sequences, and the other for comparing more distantly related genomes, such as two distinct species. The first viewer, DisplayMUMs, is an open-source, platform-independent Java program. It has been tested on a variety of Unix/Linux platforms and also runs on Apple Macintosh (OS X) or Microsoft Windows computers. The program, which takes as input the results of running MUMmer, allows the user to align and view the results of two different assemblies of the same or very closely related genomes and to tile one set of contigs onto the other. This provides a powerful graphical front end for assembly comparison, a function that is frequently used in the process of assembling and finishing genomes. It allows a user to visualize the tiling of sequence reads onto an assembly in order to understand why contigs might not have properly merged together. Alternatively, one can compare the output of different genome assemblers on the same data, a task that can be quite bewildering when the genome is large and the assemblers disagree. DisplayMUMs creates a stand-alone display, illustrated in Figure 2. It contains three main areas. The upper area can show a variety of types of information, including zoomed-in nucleotide alignments. The central panel shows a summary of the alignment, with the reference shown as a gray bar. 
The matches of the queries to the reference are shown as green (forward) and red (reverse) rectangles, with gaps indicated in gray. A second gray bar shows the gaps in blue, which may seem redundant but is useful when the scale is zoomed out; for example, if the sequence has only one small gap and the scale shows 1 Mbp, then the small gap will be invisible in the upper bar but will still be visible on the lower bar. The lower panel shows the tiling of all the query sequences on the reference, with red and green colors indicating the forward and reverse matching substrings. As Figure 2 shows, some sequences might match for only a small portion of their length, while others will match across their entire length. DisplayMUMs has many other features, including mouse-over and searching functions, all of which are documented in the software. As this example makes clear, its primary purpose is to improve the utility of MUMmer for genome-assembly analysis. The second viewer, MapView, creates a picture of the mapping between two species based on Nucmer or Promer output. The motivation for creating this viewer was the rapidly increasing number of genome projects that are undertaken to enhance our understanding of another, already completed genome. In these projects, the second genome may have only faint DNA sequence similarity to the first, and in some cases the similarity may be detectable only through protein sequence alignments, such as those produced by Promer. A good example of such a project is the recent effort to sequence D. pseudoobscura mentioned above. The primary motivation for this project is to improve the annotation of D. melanogaster, and MUMmer is one of the tools being used to map the newly assembled D. pseudoobscura onto it. Because the reference genome is well annotated, we included in the viewer the option to display the locations of the genes (and their identifiers) along with the mapping at either the DNA or amino acid sequence level. 
A snapshot of this alignment by MapView is in Figure 3, which makes it clear that the amino acid conservation between these two species closely matches the annotated exon structure. This viewer can be used to highlight areas of a genome where exons might have been missed in previous analyses. The MapView program can produce output in three formats: fig (for viewing with the Unix xfig program), PostScript, or PDF. The most flexible format, fig, allows for unlimited scrolling and zooming, and for export to a wide range of additional formats. This makes it easy to view the mapping between a large collection of contigs and a large chromosome. Conclusions As the examples above show, the capabilities of MUMmer 3.0 enable a researcher to compare virtually any two genomes, or collections of genomic sequences, using computers widely available today. Bacterial genomes and relatively small eukaryotes can be aligned on a standard desktop computer, while larger genomes may require larger, server-class machines. With the state of the art representation of the suffix-tree data structure, the memory usage of MUMmer 3.0 is close to the minimum possible, while retaining optimal or near-optimal worst-case run time, depending on the match algorithm used. The additional features in MUMmer 3.0 allow one to find non-unique and non-exact matches, greatly enhancing the flexibility of the system. Finally, by making the system open source, we hope to encourage others to expand upon and improve the code base, which is freely available to all.

                Author and article information

                Journal
                Genome Res
                Genome Research
                Cold Spring Harbor Laboratory Press
                1088-9051
                1549-5469
                May 2017
                Volume: 27
                Issue: 5
                Pages: 787-792
                Affiliations
                [1 ]Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA;
                [2 ]Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA;
                [3 ]Department of Plant Sciences, University of California, Davis, California 95616, USA;
                [4 ]National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA;
                [5 ]Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA;
                [6 ]Departments of Mathematics and Physics, University of Maryland, College Park, Maryland 20742, USA;
                [7 ]Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21218, USA
                Author notes
                Corresponding author: salzberg@jhu.edu
                Article
                9509184
                DOI: 10.1101/gr.213405.116
                5411773
                28130360
                3d308b08-8501-4975-9707-44a5a8d5e5cd
                © 2017 Zimin et al.; Published by Cold Spring Harbor Laboratory Press

                This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

                History
                Received: 26 July 2016
                Accepted: 18 January 2017
                Funding
                Funded by: National Science Foundation http://dx.doi.org/10.13039/100000001
                Award ID: IOS-1238231
                Funded by: National Institutes of Health http://dx.doi.org/10.13039/100000002
                Award ID: R01 HG006677
                Funded by: National Human Genome Research Institute http://dx.doi.org/10.13039/100000051
                Funded by: NIH http://dx.doi.org/10.13039/100000002
                Categories
                Method
