      Is Open Access

      Scalable, accessible, and reproducible reference genome assembly and evaluation in Galaxy

      Preprint
      research-article
      bioRxiv
      Cold Spring Harbor Laboratory
      Genome assembly, opensource, large genomes, public, scalable, accessible, modularity, reproducibility


          Abstract

          Improvements in genome sequencing and assembly are enabling high-quality reference genomes for all species. However, the assembly process is still laborious, computationally and technically demanding, lacks standards for reproducibility, and is not readily scalable. Here we present the latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ~500 million years. The pipeline is versatile and combines PacBio HiFi long reads and Hi-C-based haplotype phasing in a new graph-based paradigm. Standardized quality control is performed automatically to troubleshoot assembly issues and assess biological complexities. We make the pipeline freely accessible through Galaxy, accommodating researchers even without local computational resources and enhancing reproducibility by democratizing the training and assembly process. We demonstrate the flexibility and reliability of the pipeline by assembling reference genomes for 51 vertebrate species from major taxonomic groups (fish, amphibians, reptiles, birds, and mammals).


                Author and article information

                Journal
                bioRxiv
                Cold Spring Harbor Laboratory
                30 June 2023
                2023.06.28.546576
                Affiliations
                [1] Dept. of Biochemistry and Molecular Biology, Pennsylvania State University, USA
                [2] Vertebrate Genome Laboratory, The Rockefeller University, USA
                [3] Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-University Freiburg, Freiburg, Germany
                [4] Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
                [5] Departments of Biology and Computer Science, Johns Hopkins University, USA
                [6] Department of Medicine and Life Sciences (MELIS), Institut de Biologia Evolutiva, Universitat Pompeu Fabra-CSIC, Barcelona 08003, Spain
                [7] Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
                [8] Department of Quantitative and Computational Biology, University of Southern California
                [9] Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
                [10] Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
                [11] Wellcome Sanger Institute, Cambridge CB10 1SA, United Kingdom
                [12] Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russia
                [13] Department of Biosciences, University of Milan, Milan, Italy
                [14] BMRI, Weill Cornell Medical College, New York, 10021, USA
                [15] eGnome, Inc, Seoul, Republic of Korea
                [16] Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
                [17] Laboratory of Neurogenetics of Language, The Rockefeller University, New York City, NY, 10065, USA
                [18] Catalan Institution of Research and Advanced Studies (ICREA), Barcelona 08010, Spain
                [19] CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona 08028, Spain
                [20] Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Cerdanyola del Vallès 08193, Spain
                [21] Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus
                [22] University of Florence, Department of Biology, Via Madonna del Piano 6, Sesto Fiorentino (FI)
                [23] Tree of Life, Wellcome Sanger Institute, Cambridge CB10 1SA, United Kingdom
                [24] Department of Ecology & Evolution and Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
                Author notes
                [] Co-Corresponding author
                [*] Co-First author

                Author contributions

                D. L. built the assembly pipeline with support from G. F., L. A., C. G., B. G., A. O., H. C., M. D. S., B. D. P., A. R., M. V. D. B., and the VGP assembly working group. L. A., A. D., G. R. G., A. M. G., G. M. G., N. J., C. J., B. O., D. D. P., S. S., M. S., and T. T. generated one or several assemblies used in the analyses. B. J. K., K. R., and M. C. validated the zebra finch assemblies. J. C. performed the manual curation of the zebra finch assembly. L. A. assembled and evaluated the mitochondrial genomes. N. B. established the decontamination pipeline and performed the contamination analyses. N. B. and M. P-F. compared the scaffolding strategies. A. N. performed the analyses on XBP1. C. G. and B. D. P. developed the training material with support from the user community. J. R. B., N. J., T. T., B. O., O. F., C. L., H. K., T. M-B., and R. M. W. generated the PacBio and Hi-C data. G. F., M. C. S., A. N., A. M. P., and E. D. J. conceived the study and drafted the manuscript. All authors contributed to the manuscript and approved it.

                Article
                DOI: 10.1101/2023.06.28.546576
                PMC ID: 10327048
                PubMed ID: 37425881
                ScienceOpen record ID: 3850ad15-cb9e-43e3-92d0-df05caebd0fb

                This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
