UPS-indel: a Universal Positioning System for Indels

Abstract

Storing biologically equivalent indels as distinct database entries creates redundancy and misleads downstream analysis. A unified system for identifying and representing equivalent indels is therefore desirable, both for removing such redundancy and for comparing the indel calls produced by different tools. This paper describes UPS-indel, a utility that creates a universal positioning system for indels so that equivalent indels are uniquely determined by their coordinates in the new system; the same coordinates can be used to compare indel call sets. UPS-indel identifies 15% redundant indels in dbSNP, 29% in COSMIC coding, and 13% in COSMIC noncoding datasets across all human chromosomes, higher than previously reported. Compared with the existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants, UPS-indel identifies 456,352 more redundant indels in dbSNP, 2,118 more in COSMIC coding, and 553 more in the COSMIC noncoding indel dataset, in addition to those reported jointly by these tools. Moreover, a comparison with state-of-the-art approaches for indel call set comparison demonstrates its superiority in finding common indels among call sets. UPS-indel is theoretically proven to find all equivalent indels and is thus exhaustive.
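To make the notion of equivalent indels concrete, the following minimal sketch treats two indels as equivalent when applying each of them to the same reference yields the same sequence. It is a brute-force illustration only, not the UPS-indel algorithm, which this abstract does not describe. The function names, the 0-based coordinates, and the toy reference strings are all assumptions made for the example.

```python
def apply_indel(ref: str, pos: int, kind: str, seq: str) -> str:
    """Apply one indel to ref and return the resulting sequence.

    Hypothetical convention: pos is 0-based. An insertion places seq
    before ref[pos]; a deletion removes len(seq) bases starting at ref[pos].
    """
    if kind == "ins":
        return ref[:pos] + seq + ref[pos:]
    if kind == "del":
        if ref[pos:pos + len(seq)] != seq:
            raise ValueError("deleted bases do not match the reference")
        return ref[:pos] + ref[pos + len(seq):]
    raise ValueError(f"unknown indel kind: {kind}")


def equivalent(ref: str, a: tuple, b: tuple) -> bool:
    """Two indels (pos, kind, seq) are equivalent if they produce the same sequence."""
    return apply_indel(ref, *a) == apply_indel(ref, *b)


def equivalent_positions(ref: str, pos: int, kind: str, seq: str) -> list:
    """Brute-force scan for every position where an indel of the same
    type and length produces the same mutated sequence."""
    target = apply_indel(ref, pos, kind, seq)
    n = len(seq)
    hits = []
    if kind == "ins":
        for p in range(len(ref) + 1):
            # The inserted bases at p must be whatever the target holds there.
            if ref[:p] + target[p:p + n] + ref[p:] == target:
                hits.append(p)
    else:
        for p in range(len(ref) - n + 1):
            if ref[:p] + ref[p + n:] == target:
                hits.append(p)
    return hits


# Toy reference with an "AC" repeat: deleting "AC" at 1 or "CA" at 2
# leaves the same sequence, so a database could store both as separate entries.
ref = "GACACACT"
print(equivalent(ref, (1, "del", "AC"), (2, "del", "CA")))
print(equivalent_positions(ref, 1, "del", "AC"))
```

In this toy case the deletion can be placed anywhere inside the repeat, which is why a single canonical coordinate per equivalence class is useful. The scan is quadratic in the reference length, so it is only suitable for short illustrative sequences.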

Most cited references 10

MTML-msBayes: Approximate Bayesian comparative phylogeographic inference from multiple taxa and multiple loci with rate heterogeneity

Wen Huang, Naoki Takebayashi, Yan Qi … (2011)

Background: MTML-msBayes uses hierarchical approximate Bayesian computation (HABC) under a coalescent model to infer temporal patterns of divergence and gene flow across codistributed taxon-pairs. Under a model of multiple codistributed taxa that diverge into taxon-pairs with subsequent gene flow or isolation, one can estimate hyper-parameters that quantify the mean and variability in divergence times or test models of migration and isolation. The software uses multi-locus DNA sequence data collected from multiple taxon-pairs and allows variation across taxa in demographic parameters as well as heterogeneity in DNA mutation rates across loci. The method also allows a flexible sampling scheme: different numbers of loci of varying length can be sampled from different taxon-pairs. Results: Simulation tests reveal increasing power with increasing numbers of loci when attempting to distinguish temporal congruence from incongruence in divergence times across taxon-pairs. These results are robust to DNA mutation rate heterogeneity. Estimating mean divergence times and testing simultaneous divergence was less accurate with migration, but improved if one specified the correct migration model. Simulation validation tests demonstrated that one can detect the correct migration or isolation model with high probability, and that this HABC model testing procedure was greatly improved by incorporating a summary statistic originally developed for this task (Wakeley's ΨW). The method is applied to an empirical data set of three Australian avian taxon-pairs and a result of simultaneous divergence with some subsequent gene flow is inferred. Conclusions: To retain flexibility and compatibility with existing bioinformatics tools, MTML-msBayes is a pipeline software package consisting of Perl, C and R programs that are executed via the command line. Source code and binaries are available for download at http://msbayes.sourceforge.net/ under an open source license (GNU Public License).

Unified representation of genetic variants.

Adrian Tan, Gonçalo R Abecasis, Hyun Min Kang (2015)

A genetic variant can be represented in the Variant Call Format (VCF) in multiple different ways. Inconsistent representation of variants between variant callers and analyses will magnify discrepancies between them and complicate variant filtering and duplicate removal. We present a software tool vt normalize that normalizes representation of genetic variants in the VCF. We formally define variant normalization as the consistent representation of genetic variants in an unambiguous and concise way and derive a simple general algorithm to enforce it. We demonstrate the inconsistent representation of variants across existing sequence analysis tools and show that our tool facilitates integration of diverse variant types and call sets.

SeqAn An efficient, generic C++ library for sequence analysis

Andreas Gogol-Döring, David Weese, Tobias Rausch … (2008)

Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. Results: To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use. Conclusion: We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This leverages not only the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.

Author and article information

Contributors

Mohammad Shabbir Hasan: shabbir5@vt.edu (ORCID: http://orcid.org/0000-0001-6263-631X)

Liqing Zhang: lqzhang@vt.edu

Journal

Journal ID (nlm-ta): Sci Rep

Journal ID (iso-abbrev): Sci Rep

Title: Scientific Reports

Publisher: Nature Publishing Group UK (London)

ISSN (Electronic): 2045-2322

Publication date (Electronic): 26 October 2017

Publication date PMC-release: 26 October 2017

Publication date Collection: 2017

Volume: 7

Electronic Location Identifier: 14106

Affiliations

[1] Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA (ISNI 0000 0001 0694 4940; GRID grid.438526.e)

[2] Department of Statistics, Virginia Tech, Blacksburg, VA 24061, USA (ISNI 0000 0001 0694 4940; GRID grid.438526.e)

[3] Department of Mathematics, Virginia Tech, Blacksburg, VA 24061, USA (ISNI 0000 0001 0694 4940; GRID grid.438526.e)

[4] Department of Aerospace and Ocean Engineering, Virginia Tech, Blacksburg, VA 24061, USA (ISNI 0000 0001 0694 4940; GRID grid.438526.e)

Article

Publisher ID: 14400

DOI: 10.1038/s41598-017-14400-1

PMC ID: 5658412

PubMed ID: 29074871

SO-VID: 8dc6737e-c8e1-4e92-a763-fdc09f6c3425

License:

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

History

Date received: 12 May 2017

Date accepted: 9 October 2017

Custom metadata

ScienceOpen disciplines: Uncategorized
