Waste not, want not: why rarefying microbiome data is inadmissible.

Abstract

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package phyloseq.
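For readers unfamiliar with the two normalizations the abstract criticizes, the following is a minimal Python sketch and is not taken from the article. It assumes that "rarefying" means dropping samples below a chosen depth and subsampling the rest without replacement to that depth, and that "simple proportions" means dividing each count by its sample's library size. The toy count table, the depth of 1,000 reads, and the function names `rarefy` and `proportions` are all hypothetical. The Negative Binomial approach the abstract recommends (edgeR, DESeq) is not shown.

```python
import numpy as np


def rarefy(counts, depth, rng):
    """Subsample each sample (row) without replacement down to `depth` reads.

    Samples with fewer than `depth` reads cannot be subsampled and are
    dropped. Returns the rarefied table and a boolean mask of kept rows.
    """
    counts = np.asarray(counts, dtype=np.int64)
    keep = counts.sum(axis=1) >= depth
    rows = [rng.multivariate_hypergeometric(row, depth) for row in counts[keep]]
    rarefied = np.array(rows, dtype=np.int64).reshape(-1, counts.shape[1])
    return rarefied, keep


def proportions(counts):
    """Divide each count by its sample's library size (row sum)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Hypothetical toy table: 3 samples (rows) x 4 taxa (columns).
toy = np.array([
    [1000, 600, 300, 100],    # library size 2,000
    [5000, 3000, 1500, 500],  # library size 10,000
    [40, 30, 20, 10],         # library size 100 (below the chosen depth)
])
depth = 1000  # hypothetical rarefying depth

rarefied, keep = rarefy(toy, depth, rng)
discarded = 1 - rarefied.sum() / toy.sum()

print("samples kept:", keep)
print("rarefied table:\n", rarefied)
print(f"fraction of all reads discarded: {discarded:.3f}")
print("proportions:\n", proportions(toy))
```

In this toy table, rarefying drops the shallowest sample entirely and throws away most of the reads in the deepest one, which is the loss of data the title refers to. Proportions keep every sample but treat a proportion estimated from 100 reads the same as one estimated from 10,000, which is the heteroscedasticity the abstract mentions.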

Author and article information

Journal

Journal ID (iso-abbrev): PLoS Comput Biol

Title: PLoS computational biology

Publisher: Public Library of Science (PLoS)

ISSN (Electronic): 1553-7358

ISSN (Print): 1553-734X

Publication date (Electronic): Apr 2014

Volume: 10

Issue: 4

Affiliations

[1] Statistics Department, Stanford University, Stanford, California, United States of America.

Article

Publisher Item ID: PCOMPBIOL-D-13-01815

DOI: 10.1371/journal.pcbi.1003531

PMC ID: 3974642

PubMed ID: 24699258

SO-VID: bfd2b0da-811a-4515-ab5e-8e589de8f0b7
