
      Profile Hidden Markov Models for the Detection of Viruses within Metagenomic Sequence Data

      research-article


          Abstract

          Rapid, sensitive, and specific virus detection is an important component of clinical diagnostics. Massively parallel sequencing enables new diagnostic opportunities that complement traditional serological and PCR-based techniques. While massively parallel sequencing promises the benefits of being more comprehensive and less biased than traditional approaches, it presents new analytical challenges, especially with respect to detection of pathogen sequences in metagenomic contexts. To a first approximation, the initial detection of viruses can be achieved simply through alignment of sequence reads or assembled contigs to a reference database of pathogen genomes with tools such as BLAST. However, recognition of highly divergent viral sequences is problematic, and may be further complicated by the inherently high mutation rates of some viral types, especially RNA viruses. In these cases, increased sensitivity may be achieved by leveraging position-specific information during the alignment process. Here, we constructed HMMER3-compatible profile hidden Markov models (profile HMMs) from all the virally annotated proteins in RefSeq in an automated fashion using a custom-built bioinformatic pipeline. We then tested the ability of these viral profile HMMs (“vFams”) to accurately classify sequences as viral or non-viral. Cross-validation experiments with full-length gene sequences showed that the vFams were able to recall 91% of left-out viral test sequences without erroneously classifying any non-viral sequences into viral protein clusters. Thorough reanalysis of previously published metagenomic datasets with a set of the best-performing vFams showed that they were more sensitive than BLAST for detecting sequences originating from more distant relatives of known viruses. To facilitate the use of the vFams for rapid detection of remote viral homologs in metagenomic data, we provide two sets of vFams, comprising more than 4,000 vFams each, in the HMMER3 format. We also provide the software necessary to build custom profile HMMs or update the vFams as more viruses are discovered (http://derisilab.ucsf.edu/software/vFam).
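
The classification and recall steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes the profile HMMs were searched against protein sequences with HMMER3's `hmmsearch` and that the per-sequence table written by its `--tblout` option is available, with the sequence name in column 1, the profile name in column 3, and the full-sequence E-value in column 5. The function names, the `1e-5` E-value cutoff, and the rule "a sequence is called viral if any profile hits it below the cutoff" are illustrative assumptions, not values taken from the article.

```python
from typing import Dict, Iterable, Set, Tuple


def best_hits(tblout_lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, str]:
    """Map each sequence to the profile HMM with its lowest full-sequence E-value.

    Expects lines from an `hmmsearch --tblout` table. The cutoff is an
    illustrative assumption, not a threshold from the article.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in tblout_lines:
        # Skip blank lines and the '#' comment/header lines HMMER writes.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 18:
            continue  # not a complete table row
        seq_name, hmm_name = fields[0], fields[2]
        evalue = float(fields[4])  # full-sequence E-value
        if evalue > max_evalue:
            continue
        if seq_name not in best or evalue < best[seq_name][1]:
            best[seq_name] = (hmm_name, evalue)
    return {seq: hmm for seq, (hmm, _) in best.items()}


def recall(predicted_viral: Set[str], true_viral: Set[str]) -> float:
    """Fraction of known viral sequences that were called viral."""
    if not true_viral:
        raise ValueError("true_viral must not be empty")
    return len(predicted_viral & true_viral) / len(true_viral)
```

In a cross-validation setting like the one summarized above, `true_viral` would hold the names of the left-out viral test sequences and `set(best_hits(...))` the sequences that received a hit. Any non-viral control sequence appearing among the hits would count as a false positive.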

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS ONE
                Public Library of Science (San Francisco, USA)
                ISSN: 1932-6203
                Published: 20 August 2014
                Volume 9, Issue 8, e105067
                Affiliations
                [1] Biological and Medical Informatics Graduate Program, University of California San Francisco, San Francisco, California, United States of America
                [2] Departments of Medicine, Biochemistry and Biophysics, and Microbiology, University of California San Francisco, San Francisco, California, United States of America
                [3] The J. David Gladstone Institutes, University of California San Francisco, San Francisco, California, United States of America
                [4] Institute for Human Genetics & Division of Biostatistics, University of California San Francisco, San Francisco, California, United States of America
                [5] Howard Hughes Medical Institute, Bethesda, Maryland, United States of America
                The University of Hong Kong, Hong Kong
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: PSC TJS KSP JLD. Performed the experiments: PSC. Analyzed the data: PSC. Contributed reagents/materials/analysis tools: PSC. Wrote the paper: PSC TJS KSP JLD. Designed the software used in analysis: PSC TJS. Wrote the software used in analysis: PSC.

                [¤a] Current address: Novartis Institutes for BioMedical Research, Emeryville, California, United States of America

                [¤b] Current address: Department of Microbiology, Oregon State University, Corvallis, Oregon, United States of America

                Article
                Manuscript: PONE-D-14-07331
                DOI: 10.1371/journal.pone.0105067
                PMCID: PMC4139300
                PMID: 25140992
                Copyright © 2014

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 16 February 2014
                Accepted: 20 July 2014
                Pages: 12
                Funding
                This work was supported by the Howard Hughes Medical Institute (JLD), the Gordon and Betty Moore Foundation (Grants #1660 and #3300), the National Science Foundation (Grant #DMS-1069303), and Gladstone Institutes (KSP, TJS), the Scleroderma Research Foundation and the PhRMA Foundation Pre-Doctoral Bioinformatics Fellowship program (PS-C). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences
                Computational Biology
                Genome Analysis
                Genomic Databases
                Genetics
                Genomics
                Metagenomics
                Microbiology
                Virology
                Emerging Viral Diseases
                Computer and Information Sciences
                Software Engineering
                Software Tools
                Physical Sciences
                Mathematics
                Probability Theory
                Markov Models
