
      Inferring Correlation Networks from Genomic Survey Data

      PLoS Computational Biology
      Public Library of Science


          Abstract

          High-throughput sequencing based techniques, such as 16S rRNA gene profiling, have the potential to elucidate the complex inner workings of natural microbial communities - be they from the world's oceans or the human gut. A key step in exploring such data is the identification of dependencies between members of these communities, which is commonly achieved by correlation analysis. However, it has been known since the days of Karl Pearson that the analysis of the type of data generated by such techniques (referred to as compositional data) can produce unreliable results since the observed data take the form of relative fractions of genes or species, rather than their absolute abundances. Using simulated and real data from the Human Microbiome Project, we show that such compositional effects can be widespread and severe: in some real data sets many of the correlations among taxa can be artifactual, and true correlations may even appear with opposite sign. Additionally, we show that community diversity is the key factor that modulates the acuteness of such compositional effects, and develop a new approach, called SparCC (available at https://bitbucket.org/yonatanf/sparcc), which is capable of estimating correlation values from compositional data. To illustrate a potential application of SparCC, we infer a rich ecological network connecting hundreds of interacting species across 18 sites on the human body. Using the SparCC network as a reference, we estimated that the standard approach yields 3 spurious species-species interactions for each true interaction and misses 60% of the true interactions in the human microbiome data, and, as predicted, most of the erroneous links are found in the samples with the lowest diversity.
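The compositional effect described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the SparCC procedure and not the authors' simulation design: the log-normal abundance model, the sample size, and the helper names `simulate` and `mean_offdiag_corr` are all hypothetical choices made for illustration. Taxa are drawn independently, so every true correlation is zero, and each sample is then converted to relative fractions. The number of taxa serves here as a crude stand-in for community diversity.

```python
import numpy as np


def mean_offdiag_corr(data):
    """Mean Pearson correlation over all distinct pairs of columns (taxa)."""
    corr = np.corrcoef(data, rowvar=False)
    n = corr.shape[0]
    return (corr.sum() - np.trace(corr)) / (n * (n - 1))


def simulate(n_taxa, n_samples=2000, seed=0):
    """Draw independent absolute abundances, then close them to fractions.

    The log-normal model and its parameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Independent taxa: the true correlation between any two taxa is zero.
    absolute = rng.lognormal(mean=0.0, sigma=0.5, size=(n_samples, n_taxa))
    # Closure: divide each sample (row) by its total, as in survey data
    # that report relative fractions rather than absolute abundances.
    fractions = absolute / absolute.sum(axis=1, keepdims=True)
    return absolute, fractions


if __name__ == "__main__":
    for n_taxa in (3, 10, 100):
        absolute, fractions = simulate(n_taxa)
        print(
            f"taxa={n_taxa:4d}  "
            f"mean corr (absolute)={mean_offdiag_corr(absolute):+.3f}  "
            f"mean corr (fractions)={mean_offdiag_corr(fractions):+.3f}"
        )
```

Because the fractions in each sample sum to one, an increase in one taxon forces the others down, so the fractions are negatively correlated even though the underlying abundances are independent. In this toy model the induced correlation is close to -1/(n_taxa - 1), which is strong for a handful of taxa and fades as the number of taxa grows, in line with the abstract's point that low-diversity samples are the most affected.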

          Author Summary

          Genomic survey data, such as those obtained from 16S rRNA gene sequencing, are subject to underappreciated mathematical difficulties that can undermine standard data analysis techniques. We show that these effects can lead to erroneous correlations among taxa within the human microbiome despite the statistical significance of the associations. To overcome these difficulties, we developed SparCC, a novel procedure tailored to the properties of genomic survey data that allows inference of correlations between genes or species. We use SparCC to elucidate networks of interaction among microbial species living in or on the human body.


                Author and article information

                Contributors
                Role: Editor
                Journal
                Title: PLoS Computational Biology (PLoS Comput Biol)
                Publisher: Public Library of Science (San Francisco, USA)
                ISSN: 1553-734X, 1553-7358
                Published: 20 September 2012
                Volume: 8
                Issue: 9
                Article number: e1002687
                Affiliations
                [1] Computational & Systems Biology Initiative, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
                [2] Departments of Biological Engineering & Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
                [3] The Broad Institute, Cambridge, Massachusetts, United States of America
                University of Zurich and Swiss Institute of Bioinformatics, Switzerland
                Author notes

                The authors have declared that no competing interests exist.

                Conceived and designed the experiments: JF EJA. Performed the experiments: JF. Analyzed the data: JF. Wrote the paper: JF EJA.

                Article
                Manuscript ID: PCOMPBIOL-D-11-01310
                DOI: 10.1371/journal.pcbi.1002687
                PMCID: PMC3447976
                PMID: 23028285
                Record ID: 716d0905-4eaf-46f9-a990-e3014c31d91b
                Copyright © 2012

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 2 September 2011
                Accepted: 23 July 2012
                Page count
                Pages: 11
                Funding
                This work was conducted by ENIGMA (Ecosystems and Networks Integrated with Genes and Molecular Assemblies; http://enigma.lbl.gov), a Scientific Focus Area Program at Lawrence Berkeley National Laboratory, and was supported by the Office of Science, Office of Biological and Environmental Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. JF was supported by the Merck-MIT Fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology
                Computational Biology
                Genomics
                Metagenomics
                Ecology
                Community Ecology
                Community Assembly
                Species Interactions
                Microbial Ecology

                Quantitative & Systems biology
