
      RESCRIPt: Reproducible sequence taxonomy reference database management

      research-article


          Abstract

          Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

          Author summary

          Generating and managing sequence and taxonomy reference data presents a bottleneck to many researchers, whether they are generating custom databases or attempting to format existing, curated reference databases for use with standard sequence analysis tools. Evaluating database quality and choosing the “best” database can be an equally formidable challenge. We developed RESCRIPt to alleviate this bottleneck, supporting reproducible, streamlined generation, curation, and evaluation of reference sequence databases. RESCRIPt uses QIIME 2 artifact file formats, which store all processing steps as data provenance within each file, allowing researchers to retrace the computational steps used to generate any given file. We used RESCRIPt to benchmark several commonly used marker-gene sequence databases for 16S rRNA genes, ITS, and COI sequences, demonstrating the utility of RESCRIPt both to streamline use of these databases and to evaluate several qualitative and quantitative characteristics of each database. We show that larger databases are not always best, and curation steps to reduce redundancy and filter out noisy sequences may be beneficial for some applications. We anticipate that RESCRIPt will streamline the use, management, and evaluation/selection of reference database materials for microbiomics, diet metabarcoding, eDNA, and other diverse applications.
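The curation steps mentioned above (filtering noisy sequences and reducing redundancy) can be illustrated with a minimal, standard-library Python sketch. This is not RESCRIPt's API: the function names, the record layout of `(seq_id, sequence, taxonomy)` tuples, and the thresholds `max_ambiguous` and `max_homopolymer` are all hypothetical choices made for illustration, and the values are not taken from the article.

```python
# Characters other than A, C, G, T (or U) are treated as ambiguous bases here.
UNAMBIGUOUS = set("ACGTU")


def count_ambiguous(seq):
    """Return the number of characters in seq that are not A, C, G, T, or U."""
    return sum(1 for base in seq.upper() if base not in UNAMBIGUOUS)


def longest_homopolymer(seq):
    """Return the length of the longest run of a single repeated base."""
    longest = run = 0
    previous = None
    for base in seq.upper():
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def curate(records, max_ambiguous=5, max_homopolymer=8):
    """Drop noisy sequences, then collapse exact duplicates.

    records: iterable of (seq_id, sequence, taxonomy) tuples (assumed layout).
    Two records are duplicates only if both sequence and taxonomy match, so
    identical sequences with different taxonomy labels are all retained.
    The thresholds are illustrative defaults, not values from the article.
    """
    kept = {}
    for seq_id, seq, taxonomy in records:
        seq = seq.upper()
        if count_ambiguous(seq) > max_ambiguous:
            continue
        if longest_homopolymer(seq) > max_homopolymer:
            continue
        # Keep the first identifier seen for each unique (sequence, taxonomy) pair.
        kept.setdefault((seq, taxonomy), seq_id)
    return [(seq_id, seq, tax) for (seq, tax), seq_id in kept.items()]


records = [
    ("a", "ACGTACGT", "g__X"),
    ("b", "ACGTACGT", "g__X"),    # exact duplicate of "a"
    ("c", "ACGTACGT", "g__Y"),    # same sequence, different taxonomy
    ("d", "ANNNNNNT", "g__Z"),    # six ambiguous bases
    ("e", "AAAAAAAAAC", "g__W"),  # homopolymer run of nine
]
curated = curate(records)
```

Keying duplicates on the sequence and taxonomy together is one of several possible design choices; a stricter dereplication would key on the sequence alone and then need a rule for reconciling conflicting taxonomy labels.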


                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Methodology, Software, Validation, Writing – original draft, Writing – review & editing
                Roles: Data curation, Methodology, Resources, Software, Validation, Writing – review & editing
                Roles: Supervision, Writing – review & editing
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing
                Role: Editor
                Journal
                PLoS Computational Biology (PLoS Comput Biol)
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 1553-734X, 1553-7358
                Published: 8 November 2021
                Volume 17, Issue 11, e1009581
                Affiliations
                [1 ] University of Arkansas for Medical Sciences, Department of Biomedical Informatics, Little Rock, Arkansas, United States of America
                [2 ] Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, Arizona, United States of America
                [3 ] School of Science, University of New South Wales, Canberra, Australia
                [4 ] Laboratory of Food Systems Biotechnology, Institute of Food, Nutrition, and Health, ETH Zürich, Switzerland
                Johns Hopkins University, UNITED STATES
                Author notes

                The authors declare that they have no competing interests.

                Author information
                https://orcid.org/0000-0001-7119-6301
                https://orcid.org/0000-0002-0214-5073
                https://orcid.org/0000-0002-5318-9551
                https://orcid.org/0000-0001-6285-8852
                https://orcid.org/0000-0002-7713-1952
                https://orcid.org/0000-0002-1784-8935
                Article
                Manuscript: PCOMPBIOL-D-21-00735
                DOI: 10.1371/journal.pcbi.1009581
                PMCID: PMC8601625
                PMID: 34748542
                96b08f75-2656-446b-bdf1-3ae97be44ccc
                © 2021 Robeson, II et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 22 April 2021
                Accepted: 21 October 2021
                Page count
                Figures: 14, Tables: 0, Pages: 37
                Funding
                The authors received no specific funding for this work.
                Categories
                Research Article
                Biology and Life Sciences > Taxonomy
                Computer and Information Sciences > Data Management > Taxonomy
                Research and Analysis Methods > Database and Informatics Methods > Biological Databases > Sequence Databases
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Databases
                Research and Analysis Methods > Database and Informatics Methods > Biological Databases > Genomic Databases
                Biology and Life Sciences > Computational Biology > Genome Analysis > Genomic Databases
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis > Genomic Databases
                Biology and Life Sciences > Biochemistry > Nucleic Acids > RNA > Non-coding RNA > Ribosomal RNA
                Biology and Life Sciences > Biochemistry > Ribosomes > Ribosomal RNA
                Biology and Life Sciences > Cell Biology > Cellular Structures and Organelles > Ribosomes > Ribosomal RNA
                Biology and Life Sciences > Taxonomy > Microbial Taxonomy
                Computer and Information Sciences > Data Management > Taxonomy > Microbial Taxonomy
                Physical Sciences > Physics > Thermodynamics > Entropy
                Research and Analysis Methods > Database and Informatics Methods
                Biology and Life Sciences > Genetics > Genomics
                Custom metadata
                vor-update-to-uncorrected-proof
                2021-11-18
                Data reporting: All data analysed herein were retrieved using RESCRIPt (for SILVA [https://www.arb-silva.de/] and NCBI [https://www.ncbi.nlm.nih.gov/genbank/] data), by direct download of release data (for UNITE [https://unite.ut.ee/], Greengenes [ftp://greengenes.microbio.me/greengenes_release/gg_13_5/], and GTDB [https://gtdb.ecogenomic.org/]), or by direct download (for BOLD [https://www.boldsystems.org/] data; accessed July 1, 2020 and updated August 8, 2020). Availability of data and materials: Workflows and data from our benchmarks can be found at https://github.com/bokulich-lab/db-benchmarks-2020 and https://github.com/devonorourke/COIdatabases/. Code reporting: Source code, installation and usage instructions, and tutorials for RESCRIPt can be found at the project page: https://github.com/bokulich-lab/RESCRIPt.

                Quantitative & Systems biology