Lighter: fast and memory-efficient sequencing error correction without counting

Abstract

Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
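The sampling idea in the abstract can be illustrated with a small sketch. This is not Lighter's implementation: the class and function names, the SHA-256 double-hashing scheme, and the example numbers (a base fraction of 0.1 at 35x depth) are all assumptions chosen for illustration. Only the first of the two Bloom filters (the sample of input k-mers) is shown, because the abstract does not describe how the second filter is filled.

```python
import hashlib
import random


class BloomFilter:
    """Fixed-size Bloom filter over strings (illustrative, not Lighter's own)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive every bit position from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(read, k):
    """Yield every length-k substring of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def sampling_fraction(base_fraction, base_depth, depth):
    """Scale the sampling fraction in inverse proportion to sequencing depth."""
    return base_fraction * base_depth / depth


def sample_kmers(reads, k, fraction, num_bits, num_hashes, seed=0):
    """Insert each k-mer occurrence into a Bloom filter with probability `fraction`."""
    rng = random.Random(seed)
    bloom = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            if rng.random() < fraction:
                bloom.add(kmer)
    return bloom


def prob_sampled(count, fraction):
    """Probability that a k-mer seen `count` times is sampled at least once."""
    return 1.0 - (1.0 - fraction) ** count
```

Under these assumed numbers, doubling the depth from 35x to 70x halves the fraction from 0.1 to 0.05. A k-mer that occurs once per unit of depth is then sampled at least once with probability of about 0.97 in both cases (`prob_sampled(35, 0.1)` versus `prob_sampled(70, 0.05)`), while a k-mer seen only once, as a typical sequencing error would be, is sampled with probability 0.1 and 0.05 respectively. Because the expected number of sampled k-mer occurrences stays about the same as depth grows, the filter does not need to grow with it, which is the intuition behind holding Bloom filter size constant.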

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-014-0509-9) contains supplementary material, which is available to authorized users.

Most cited references 19

ART: a next-generation sequencing read simulator.

Weichun Huang, Leping Li, Jason R. Myers … (2012)

ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art.

Cited 671 times

Informed and automated k-mer size selection for genome assembly.

Rayan Chikhi, Paul Medvedev (2014)

Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies. Our tool KmerGenie is freely available at: http://kmergenie.bx.psu.edu/.

Cited 334 times

GAGE: A critical evaluation of genome assemblies and assembly algorithms.

S. L. Salzberg, A. M. Phillippy, A. Zimin … (2012)

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

Cited 291 times

Author and article information

Contributors

Li Song: lsong10@jhu.edu

Liliana Florea: florea@jhu.edu

Ben Langmead: langmea@cs.jhu.edu

Journal

Journal ID (nlm-ta): Genome Biol

Title: Genome Biology

Publisher: BioMed Central (London)

ISSN (Print): 1465-6906

ISSN (Electronic): 1465-6914

Publication date (Electronic): 15 November 2014

Publication date PMC-release: 15 November 2014

Publication date (Print): 2014

Volume: 15

Issue: 11

Electronic Location Identifier: 509

Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, 21218 USA

McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, 21205 USA

Article

Publisher ID: 509

DOI: 10.1186/s13059-014-0509-9

PMC ID: 4248469

PubMed ID: 25398208

SO-VID: 151e9d5e-e2d6-42de-a361-ae036e310fd1

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 27 May 2014

Date accepted: 29 September 2014

Custom metadata

ScienceOpen disciplines: Genetics

Cited by 103
