Profile Hidden Markov Models for the Detection of Viruses within Metagenomic Sequence Data

Abstract

Rapid, sensitive, and specific virus detection is an important component of clinical diagnostics. Massively parallel sequencing enables new diagnostic opportunities that complement traditional serological and PCR-based techniques. While massively parallel sequencing promises the benefits of being more comprehensive and less biased than traditional approaches, it presents new analytical challenges, especially with respect to detection of pathogen sequences in metagenomic contexts. To a first approximation, the initial detection of viruses can be achieved simply through alignment of sequence reads or assembled contigs to a reference database of pathogen genomes with tools such as BLAST. However, recognition of highly divergent viral sequences is problematic, and may be further complicated by the inherently high mutation rates of some viral types, especially RNA viruses. In these cases, increased sensitivity may be achieved by leveraging position-specific information during the alignment process. Here, we constructed HMMER3-compatible profile hidden Markov models (profile HMMs) from all the virally annotated proteins in RefSeq in an automated fashion using a custom-built bioinformatic pipeline. We then tested the ability of these viral profile HMMs (“vFams”) to accurately classify sequences as viral or non-viral. Cross-validation experiments with full-length gene sequences showed that the vFams were able to recall 91% of left-out viral test sequences without erroneously classifying any non-viral sequences into viral protein clusters. Thorough reanalysis of previously published metagenomic datasets with a set of the best-performing vFams showed that they were more sensitive than BLAST for detecting sequences originating from more distant relatives of known viruses. To facilitate the use of the vFams for rapid detection of remote viral homologs in metagenomic data, we provide two sets of vFams, comprising more than 4,000 vFams each, in the HMMER3 format. We also provide the software necessary to build custom profile HMMs or update the vFams as more viruses are discovered (http://derisilab.ucsf.edu/software/vFam).
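
As a rough illustration of the classification step described in the abstract, the following sketch parses HMMER3 tabular output (the format written by `hmmscan --tblout`) and labels a sequence as viral when its best profile hit passes an E-value cutoff. It is not the authors' pipeline: the cutoff of 1e-5, the profile names (`vFam_12` and so on), the read identifiers, and the example rows are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def best_hits(
    tblout_lines: Iterable[str], max_evalue: float = 1e-5
) -> Dict[str, Tuple[str, float]]:
    """Return the best profile hit per query sequence.

    Assumes hmmscan --tblout layout: field 0 is the target (profile) name,
    field 2 is the query (sequence) name, field 4 is the full-sequence E-value.
    The default cutoff is an illustrative assumption, not a published value.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in tblout_lines:
        # Skip blank lines and the '#' header/footer lines HMMER writes.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        profile, seq_id, evalue = fields[0], fields[2], float(fields[4])
        if evalue > max_evalue:
            continue
        # Keep the hit with the lowest E-value for each sequence.
        if seq_id not in best or evalue < best[seq_id][1]:
            best[seq_id] = (profile, evalue)
    return best


def recall(hits: Dict[str, Tuple[str, float]], known_viral: Set[str]) -> float:
    """Fraction of known viral sequences that received a passing hit."""
    if not known_viral:
        return 0.0
    return len(known_viral & set(hits)) / len(known_viral)


# Hypothetical tblout rows (made-up names and scores, for illustration only).
EXAMPLE_TBLOUT = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "vFam_12 - read_001 - 2.1e-30 105.2 0.1 3.0e-30 104.7 0.1 1.0 1 0 0 1 1 1 1 -",
    "vFam_77 - read_001 - 4.0e-08 32.5 0.0 5.1e-08 32.1 0.0 1.1 1 0 0 1 1 1 1 -",
    "vFam_12 - read_002 - 0.5 8.9 0.0 0.7 8.4 0.0 1.0 1 0 0 1 1 0 0 -",
    "vFam_3 - read_003 - 1.0e-12 47.3 0.2 1.4e-12 46.8 0.2 1.0 1 0 0 1 1 1 1 -",
]

hits = best_hits(EXAMPLE_TBLOUT)
viral_truth = {"read_001", "read_002", "read_003", "read_004"}
print(hits)
print(recall(hits, viral_truth))
```

Keeping only the lowest E-value hit per sequence assigns each sequence to a single profile. A stricter or looser cutoff trades recall against false positives, so in practice it would need to be chosen by validation of the kind the abstract describes.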

Author and article information

Contributors

Herman Tse: Role: Editor

Journal

Journal ID (nlm-ta): PLoS One

Journal ID (iso-abbrev): PLoS ONE

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, USA)

ISSN (Electronic): 1932-6203

Publication date Collection: 2014

Publication date (Electronic): 20 August 2014

Volume: 9

Issue: 8

Electronic Location Identifier: e105067

Affiliations

[1] Biological and Medical Informatics Graduate Program, University of California San Francisco, San Francisco, California, United States of America

[2] Departments of Medicine, Biochemistry and Biophysics, and Microbiology, University of California San Francisco, San Francisco, California, United States of America

[3] The J. David Gladstone Institutes, University of California San Francisco, San Francisco, California, United States of America

[4] Institute for Human Genetics & Division of Biostatistics, University of California San Francisco, San Francisco, California, United States of America

[5] Howard Hughes Medical Institute, Bethesda, Maryland, United States of America

The University of Hong Kong, Hong Kong

Author notes

* E-mail: joe@derisilab.ucsf.edu

Competing Interests: The authors have declared that no competing interests exist.

Conceived and designed the experiments: PSC TJS KSP JLD. Performed the experiments: PSC. Analyzed the data: PSC. Contributed reagents/materials/analysis tools: PSC. Wrote the paper: PSC TJS KSP JLD. Designed the software used in analysis: PSC TJS. Wrote the software used in analysis: PSC.

[¤a] Current address: Novartis Institutes for BioMedical Research, Emeryville, California, United States of America

[¤b] Current address: Department of Microbiology, Oregon State University, Corvallis, Oregon, United States of America

Article

Publisher ID: PONE-D-14-07331

DOI: 10.1371/journal.pone.0105067

PMC ID: 4139300

PubMed ID: 25140992

SO-VID: 04f79d38-39b8-4934-8097-e9831d949d71

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 16 February 2014

Date accepted: 20 July 2014

Page count

Pages: 12

Funding

This work was supported by the Howard Hughes Medical Institute (JLD); the Gordon and Betty Moore Foundation (Grants #1660 and #3300), the National Science Foundation (Grant #DMS-1069303), and Gladstone Institutes (KSP, TJS); and the Scleroderma Research Foundation and the PhRMA Foundation Pre-Doctoral Bioinformatics Fellowship program (PS-C). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
