RESCRIPt: Reproducible sequence taxonomy reference database management

Abstract

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.
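
One of the quantitative characteristics mentioned above, the information content of a database's taxonomic labels, can be illustrated with a small sketch. This is not RESCRIPt's implementation or API. The function name `rank_entropy`, the semicolon-delimited lineage format, and the choice of Shannon entropy in bits are assumptions made for illustration only.

```python
import math
from collections import Counter


def rank_entropy(taxonomies, sep=";"):
    """Return {rank depth: Shannon entropy in bits} for a list of lineages.

    Hypothetical helper for illustration; the lineage format is an assumption.
    """
    by_rank = {}
    for lineage in taxonomies:
        labels = [part.strip() for part in lineage.split(sep)]
        for depth in range(1, len(labels) + 1):
            # Use the cumulative lineage prefix so that identical names
            # under different parents are counted as distinct labels.
            prefix = sep.join(labels[:depth])
            by_rank.setdefault(depth, Counter())[prefix] += 1

    entropies = {}
    for depth, counts in sorted(by_rank.items()):
        total = sum(counts.values())
        entropies[depth] = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
    return entropies


if __name__ == "__main__":
    example = ["k__A; p__X", "k__A; p__Y", "k__A; p__X", "k__A; p__Y"]
    for depth, value in rank_entropy(example).items():
        print(depth, round(value, 3))
```

Higher entropy at a given rank means the labels at that rank are more evenly spread across distinct taxa. Comparing these values between two databases is one rough way to see how much taxonomic information each carries at each rank.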

Author summary

Generating and managing sequence and taxonomy reference data presents a bottleneck to many researchers, whether they are generating custom databases or attempting to format existing, curated reference databases for use with standard sequence analysis tools. Evaluating database quality and choosing the “best” database can be an equally formidable challenge. We developed RESCRIPt to alleviate this bottleneck, supporting reproducible, streamlined generation, curation, and evaluation of reference sequence databases. RESCRIPt uses QIIME 2 artifact file formats, which store all processing steps as data provenance within each file, allowing researchers to retrace the computational steps used to generate any given file. We used RESCRIPt to benchmark several commonly used marker-gene sequence databases for 16S rRNA genes, ITS, and COI sequences, demonstrating the utility of RESCRIPt both to streamline use of these databases and to evaluate several qualitative and quantitative characteristics of each database. We show that larger databases are not always best, and that curation steps to reduce redundancy and filter out noisy sequences may be beneficial for some applications. We anticipate that RESCRIPt will streamline the use, management, and evaluation/selection of reference database materials for microbiomics, diet metabarcoding, eDNA, and other diverse applications.
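
The two curation ideas named in this summary, filtering out noisy sequences and reducing redundancy, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not RESCRIPt's code. The function names (`cull`, `dereplicate`), the thresholds (`max_degenerate=4`, `max_homopolymer=7`), and the dictionary-based data layout are all hypothetical.

```python
import re

# IUPAC ambiguity codes; anything other than A, C, G, T is treated as noisy.
DEGENERATE = set("RYSWKMBDHVN")


def count_degenerate(seq):
    """Count ambiguous (degenerate) bases in a nucleotide sequence."""
    return sum(1 for base in seq.upper() if base in DEGENERATE)


def longest_homopolymer(seq):
    """Return the length of the longest run of one repeated base."""
    longest = 0
    for match in re.finditer(r"(.)\1*", seq.upper()):
        longest = max(longest, len(match.group(0)))
    return longest


def cull(records, max_degenerate=4, max_homopolymer=7):
    """Drop sequences exceeding the (assumed) noise thresholds."""
    return {
        seq_id: seq
        for seq_id, seq in records.items()
        if count_degenerate(seq) <= max_degenerate
        and longest_homopolymer(seq) <= max_homopolymer
    }


def dereplicate(records, taxonomy):
    """Keep one record per unique (sequence, taxonomy) pair.

    Identical sequences with different taxonomy labels are all retained,
    so no taxonomic information is discarded in this sketch.
    """
    first_seen = {}
    for seq_id, seq in records.items():
        key = (seq.upper(), taxonomy[seq_id])
        first_seen.setdefault(key, seq_id)
    return {seq_id: records[seq_id] for seq_id in first_seen.values()}


if __name__ == "__main__":
    records = {
        "a": "ACGTACGT",
        "b": "ACGTACGT",
        "c": "ACGTACGT",
        "d": "A" * 9 + "C",
        "e": "ACNNNNNGT",
    }
    taxonomy = {"a": "g__X", "b": "g__X", "c": "g__Y", "d": "g__X", "e": "g__X"}
    print(sorted(dereplicate(cull(records), taxonomy)))
```

Keeping identical sequences that carry different labels is a deliberate choice in this sketch. An alternative is to collapse them to a consensus label, which trades database size against taxonomic resolution.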

Author and article information

Contributors

Michael S. Robeson II:

ORCID: https://orcid.org/0000-0001-7119-6301

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Devon R. O’Rourke:

ORCID: https://orcid.org/0000-0002-0214-5073

Roles: Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing

Benjamin D. Kaehler:

ORCID: https://orcid.org/0000-0002-5318-9551

Roles: Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Michal Ziemski:

ORCID: https://orcid.org/0000-0001-6285-8852

Roles: Methodology, Software, Validation, Writing – original draft, Writing – review & editing

Matthew R. Dillon:

ORCID: https://orcid.org/0000-0002-7713-1952

Roles: Data curation, Methodology, Resources, Software, Validation, Writing – review & editing

Jeffrey T. Foster:

Roles: Supervision, Writing – review & editing

Nicholas A. Bokulich:

ORCID: https://orcid.org/0000-0002-1784-8935

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

Mihaela Pertea: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (iso-abbrev): PLoS Comput Biol

Journal ID (publisher-id): plos

Title: PLoS Computational Biology

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date (Electronic): 8 November 2021

Publication date Collection: November 2021

Volume: 17

Issue: 11

Electronic Location Identifier: e1009581

Affiliations

[1] University of Arkansas for Medical Sciences, Department of Biomedical Informatics, Little Rock, Arkansas, United States of America

[2] Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, Arizona, United States of America

[3] School of Science, University of New South Wales, Canberra, Australia

[4] Laboratory of Food Systems Biotechnology, Institute of Food, Nutrition, and Health, ETH Zürich, Switzerland

Johns Hopkins University, UNITED STATES

Author notes

The authors declare that they have no competing interests.

* E-mail: nicholas.bokulich@hest.ethz.ch

Article

Publisher ID: PCOMPBIOL-D-21-00735

DOI: 10.1371/journal.pcbi.1009581

PMC ID: 8601625

PubMed ID: 34748542

SO-VID: 96b08f75-2656-446b-bdf1-3ae97be44ccc

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received : 22 April 2021

Date accepted : 21 October 2021

Page count

Figures: 14, Tables: 0, Pages: 37

Funding

The authors received no specific funding for this work.

Custom metadata

PLOS Publication Stage vor-update-to-uncorrected-proof

Publication Update 2021-11-18

Data Availability

Data reporting: All data analysed herein were retrieved either using RESCRIPt (for SILVA [https://www.arb-silva.de/] and NCBI [https://www.ncbi.nlm.nih.gov/genbank/] data), or by direct download of release data (for UNITE [https://unite.ut.ee/], Greengenes [ftp://greengenes.microbio.me/greengenes_release/gg_13_5/], and GTDB [https://gtdb.ecogenomic.org/]), or by direct download (for BOLD [https://www.boldsystems.org/] data; accessed July 1, 2020 and updated August 8, 2020).

Availability of data and materials: Workflows and data from our benchmarks can be found at https://github.com/bokulich-lab/db-benchmarks-2020 and https://github.com/devonorourke/COIdatabases/.

Code reporting: Source code, installation and usage instructions, and tutorials for RESCRIPt can be found at the project page: https://github.com/bokulich-lab/RESCRIPt.

ScienceOpen disciplines: Quantitative & Systems biology
