      Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

      Genome Research
      Cold Spring Harbor Laboratory Press

          Abstract

          The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

                Author and article information

                Journal
                Genome Research (Genome Res)
                Cold Spring Harbor Laboratory Press
                ISSN: 1088-9051, 1549-5469
                May 2017
                Volume 27, Issue 5, Pages 849-864
                Affiliations
                [1] National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
                [2] McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA;
                [3] Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom;
                [4] European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom;
                [5] National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA;
                [6] Pacific Biosciences, Menlo Park, California 94025, USA;
                [7] Broad Institute, Cambridge, Massachusetts 02142, USA;
                [8] Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA;
                [9] Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
                Author notes

                Present addresses: [10] Nationwide Children's Hospital, Columbus, OH 43205, USA; [11] King's College London, London WC2R 2LS, UK; [12] Ontario Institute for Cancer Research, Toronto, Ontario, Canada M5G 0A3; [13] Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 2E4; [14] 10X Genomics, Pleasanton, CA 94566, USA

                Corresponding author: schneiva@ncbi.nlm.nih.gov
                Article
                DOI: 10.1101/gr.213611.116
                PMCID: PMC5411779
                PMID: 28396521
                © 2017 Schneider et al.; Published by Cold Spring Harbor Laboratory Press

                This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 29 July 2016
                Accepted: 14 March 2017
                Funding
                Funded by: National Institutes of Health http://dx.doi.org/10.13039/100000002
                Funded by: National Library of Medicine http://dx.doi.org/10.13039/100000092
                Funded by: Wellcome Trust http://dx.doi.org/10.13039/100004440
                Award ID: WT095908
                Award ID: WT098051
                Award ID: WT104947/Z/14/Z
                Funded by: European Molecular Biology Laboratory
                Funded by: National Human Genome Research Institute http://dx.doi.org/10.13039/100000051
                Funded by: National Institutes of Health http://dx.doi.org/10.13039/100000002
                Award ID: 5U54HG003079
                Award ID: 5U41HG007635
                Funded by: National Institutes of Health http://dx.doi.org/10.13039/100000002
                Award ID: HG002385
                Award ID: HG007635
                Funded by: Howard Hughes Medical Institute http://dx.doi.org/10.13039/100000011
                Categories
                Resource
