      IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences

      research-article


          Abstract

          Background

          Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many microbial groups. Previous studies have shown that existing classification programs often assign sequences to reference groups even if they belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of “over classification” is particularly detrimental in microbiome studies because reference taxonomies are far from comprehensive.

          Results

          Here, we introduce IDTAXA, a novel approach to taxonomic classification that employs principles from machine learning to reduce over classification errors. Using multiple reference taxonomies, we demonstrate that IDTAXA has higher accuracy than popular classifiers such as BLAST, MAPSeq, QIIME, SINTAX, SPINGO, and the RDP Classifier. Similarly, IDTAXA yields far fewer over classifications on Illumina mock microbial community data when the expected taxa are absent from the training set. Furthermore, IDTAXA offers many practical advantages over other classifiers, such as maintaining low error rates across varying input sequence lengths and withholding classifications from input sequences composed of random nucleotides or repeats.

          Conclusions

          IDTAXA’s classifications may lead to different conclusions in microbiome studies because of the substantially reduced number of taxa that are incorrectly identified through over classification. Although misclassification error is relatively minor, we believe that many remaining misclassifications are likely caused by errors in the reference taxonomy. We describe how IDTAXA is able to identify many putative mislabeling errors in reference taxonomies, enabling training sets to be automatically corrected by eliminating spurious sequences. IDTAXA is part of the DECIPHER package for the R programming language, available through the Bioconductor repository or accessible online (http://DECIPHER.codes).

          Electronic supplementary material

          The online version of this article (10.1186/s40168-018-0521-5) contains supplementary material, which is available to authorized users.

                Author and article information

                Contributors
                murali5@wisc.edu
                aniruddha.j@gmail.com
                +1 (412) 383-4458 , eswright@pitt.edu
                Journal: Microbiome
                Publisher: BioMed Central (London)
                ISSN: 2049-2618
                Published: 9 August 2018
                Volume: 6
                Article number: 140
                Affiliations
                [1] ISNI 0000 0001 2167 3675, GRID grid.14003.36, Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53715 USA
                [2] ISNI 0000 0001 2167 3675, GRID grid.14003.36, Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53715 USA
                [3] ISNI 0000 0004 1936 9000, GRID grid.21925.3d, Department of Biomedical Informatics, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, 426 Bridgeside Point II, 450 Technology Dr, Pittsburgh, PA 15219 USA
                Author information
                http://orcid.org/0000-0002-1457-4019
                Article
                521
                10.1186/s40168-018-0521-5
                6085705
                30092815
                87d22e2c-aa56-4e82-b987-1d2e4c46b2ac
                © The Author(s). 2018

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                : 21 March 2018
                : 25 July 2018
                Categories
                Software

                Keywords: microbiome, 16S rRNA gene sequencing, ITS sequencing, classification, taxonomic assignment, reference taxonomy
