CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes


Abstract

Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of “marker” genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.
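The abstract does not give the scoring formulas, so the following is only a minimal sketch of how completeness and contamination might be estimated from collocated marker sets. It assumes that each marker set contributes equally, that completeness is the average fraction of markers found per set, and that contamination is the average number of extra copies per set. The function name, the marker identifiers, and the example data are hypothetical and are not taken from CheckM itself.

```python
from collections import Counter


def estimate_quality(marker_sets, observed_hits):
    """Estimate completeness and contamination as percentages.

    marker_sets   -- list of sets of marker-gene identifiers; each set is
                     assumed to group markers that are collocated in genomes
    observed_hits -- iterable of marker identifiers found in the genome,
                     repeated once per copy detected
    """
    if not marker_sets:
        raise ValueError("at least one marker set is required")

    copies = Counter(observed_hits)
    completeness = 0.0
    contamination = 0.0
    for marker_set in marker_sets:
        # Markers seen at least once count toward completeness.
        present = sum(1 for marker in marker_set if copies[marker] >= 1)
        # Every copy beyond the first counts toward contamination.
        extra = sum(copies[marker] - 1 for marker in marker_set if copies[marker] > 1)
        completeness += present / len(marker_set)
        contamination += extra / len(marker_set)

    n_sets = len(marker_sets)
    return 100.0 * completeness / n_sets, 100.0 * contamination / n_sets


# Hypothetical example: three collocated sets, one marker duplicated.
example_sets = [{"A", "B"}, {"C"}, {"D", "E", "F", "G"}]
example_hits = ["A", "B", "B", "C", "D", "E"]
completeness, contamination = estimate_quality(example_sets, example_hits)
print(f"completeness={completeness:.2f}% contamination={contamination:.2f}%")
```

Averaging over sets rather than over individual genes means that losing one contig carrying several collocated markers is penalized once rather than several times, which is the motivation for using collocation information in this sketch.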


Author and article information

Journal

Journal ID (nlm-ta): Genome Res

Journal ID (iso-abbrev): Genome Res

Journal ID (hwp): genome

Journal ID (pmc): genome

Journal ID (publisher-id): GENOME

Title: Genome Research

Publisher: Cold Spring Harbor Laboratory Press

ISSN (Print): 1088-9051

ISSN (Electronic): 1549-5469

Publication date (Print): July 2015

Publication date PMC-release: July 2015

Volume: 25

Issue: 7

Pages: 1043-1055

Affiliations

[1] Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St. Lucia, QLD 4072, Queensland, Australia;

[2] Institute for Molecular Bioscience, The University of Queensland, St. Lucia, QLD 4072, Queensland, Australia;

[3] Advanced Water Management Centre, The University of Queensland, St. Lucia, QLD 4072, Queensland, Australia

Author notes

Corresponding authors: d.parks@uq.edu.au, g.tyson@uq.edu.au

Article

Medline ID: 9509184

DOI: 10.1101/gr.186072.114

PMC ID: 4484387

PubMed ID: 25977477

SO-VID: 0ae1b9e8-b130-4ccd-8db6-ded3355d87de

License:

This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

History

Date received : 20 October 2014

Date accepted : 13 May 2015

Page count

Pages: 13

Funding

Funded by: Natural Sciences and Engineering Research Council of Canada http://dx.doi.org/10.13039/501100000038

Funded by: Great Barrier Reef Foundation

Funded by: Australian Research Council http://dx.doi.org/10.13039/501100000923

Funded by: Discovery Outstanding Researcher Award (DORA)

Funded by: Australian Research Council http://dx.doi.org/10.13039/501100000923

Award ID: DP120103498

Award ID: DP1093175
