      Waste not, want not: why rarefying microbiome data is inadmissible.

      PLoS computational biology
      Public Library of Science (PLoS)

          Abstract

          Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
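          The two normalizations the abstract criticizes can be illustrated with a minimal sketch. It is not the phyloseq, edgeR, or DESeq implementation; the function names `rarefy` and `proportions` and the toy count table are hypothetical, chosen only to show how much data rarefying discards and how proportions hide library size.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from a vector of taxon counts."""
    if depth > sum(counts):
        raise ValueError("depth exceeds library size")
    # Expand the count vector into one entry per read, labelled by taxon index.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    kept = Counter(rng.sample(reads, depth))
    return [kept.get(taxon, 0) for taxon in range(len(counts))]


def proportions(counts):
    """Divide each taxon count by the library size of its sample."""
    total = sum(counts)
    return [n / total for n in counts]


# Hypothetical toy data: three samples, four taxa, very unequal library sizes.
samples = {
    "A": [500, 300, 150, 50],       # 1,000 reads
    "B": [5000, 3000, 1500, 500],   # 10,000 reads
    "C": [40, 30, 20, 10],          # 100 reads
}

rng = random.Random(0)

# Rarefying to the smallest library size is assumed here as the depth rule.
depth = min(sum(c) for c in samples.values())
rarefied = {name: rarefy(c, depth, rng) for name, c in samples.items()}

# Fraction of all sequenced reads thrown away by rarefying.
total_reads = sum(sum(c) for c in samples.values())
discarded_fraction = 1 - (depth * len(samples)) / total_reads

# Proportions keep every read but erase the library size behind each estimate.
props = {name: proportions(c) for name, c in samples.items()}
```

          In this toy table, rarefying to the smallest library (100 reads) discards 10,800 of 11,100 reads, about 97%. Samples A and B yield identical proportions even though B's estimates rest on ten times as many reads, which is the unequal-variance information that a count model with an explicit library size term retains.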

                Author and article information

                Journal: PLoS computational biology (PLoS Comput Biol)
                Publisher: Public Library of Science (PLoS)
                ISSN: 1553-7358, 1553-734X
                Published: Apr 2014
                Volume: 10
                Issue: 4
                Affiliations
                [1] Statistics Department, Stanford University, Stanford, California, United States of America.
                Article
                Manuscript number: PCOMPBIOL-D-13-01815
                DOI: 10.1371/journal.pcbi.1003531
                PMCID: 3974642
                PMID: 24699258
                Record ID: bfd2b0da-811a-4515-ab5e-8e589de8f0b7