Sapling: accelerating suffix array queries with learned data models

Abstract

Motivation

As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query it requires widespread memory accesses, causing a large number of cache misses on large datasets.
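To make the access pattern concrete, here is a minimal Python sketch of a plain suffix array lookup by binary search. It is only an illustration, not code from the Sapling repository: the function names `build_suffix_array` and `find_occurrence` are hypothetical, and the naive construction is suitable only for toy inputs. Each probe jumps to a distant slot of the suffix array and then to an unrelated position in the text, which is the scattered memory access the paragraph above describes.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction by sorting the suffixes directly; fine for a toy
    example, far too slow for a genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text, sa, query):
    """Binary-search the suffix array for `query`.

    Returns (position, probes): the text position of one occurrence
    (or -1 if there is none) and the number of suffix array slots inspected.
    """
    lo, hi = 0, len(sa) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        start = sa[mid]
        # Compare the query against the same-length prefix of this suffix.
        prefix = text[start:start + len(query)]
        if prefix == query:
            return start, probes
        if prefix < query:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes
```

Calling `find_occurrence(text, build_suffix_array(text), query)` on a short DNA string returns one matching position together with the probe count, which grows with the logarithm of the text length.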

Results

Here, we present Sapling, an algorithm for sequence alignment, which uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, providing an analysis of different neural network models as well as providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple widely used read aligners on a diverse collection of genomes, including human, bacteria and plants, speeding up the algorithm by more than a factor of two while adding <1% to the suffix array’s memory footprint.
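The general idea of augmenting a suffix array with a piecewise linear model can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the class `LearnedIndex`, the parameters `k` and `num_bins`, the 2-bit k-mer encoding, and the equal-width binning are all choices made here for the example. The model predicts where a k-mer should sit in the suffix array, the largest prediction error is measured at build time, and the binary search is then confined to a small window around the prediction.

```python
import bisect

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base (keeps sort order)."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


class LearnedIndex:
    """Toy suffix array plus a piecewise linear position model (hypothetical)."""

    def __init__(self, text, k, num_bins):
        self.text, self.k = text, k
        # Suffix array restricted to suffixes with at least k characters.
        self.sa = sorted(range(len(text) - k + 1), key=lambda i: text[i:])
        keys = [encode(text[i:i + k]) for i in self.sa]
        # Equal-width bins over the key space [0, 4**k), using ceiling division.
        self.width = -(-(4 ** k) // num_bins)
        # knots[b] = rank of the first suffix whose key lies in bin b or later.
        self.knots = [bisect.bisect_left(keys, b * self.width)
                      for b in range(num_bins + 1)]
        # Largest prediction error, measured over every indexed suffix.
        self.max_err = max(abs(self.predict(key) - rank)
                           for rank, key in enumerate(keys))

    def predict(self, key):
        """Interpolate linearly between the two knots around `key`."""
        b = key // self.width
        offset = key - b * self.width
        span = self.knots[b + 1] - self.knots[b]
        guess = self.knots[b] + span * offset // self.width
        return min(guess, len(self.sa) - 1)

    def find(self, kmer):
        """Return a text position where `kmer` (length k) occurs, or -1."""
        guess = self.predict(encode(kmer))
        # Search only the window that the measured error bound allows.
        lo = max(0, guess - self.max_err)
        hi = min(len(self.sa) - 1, guess + self.max_err)
        while lo <= hi:
            mid = (lo + hi) // 2
            start = self.sa[mid]
            candidate = self.text[start:start + self.k]
            if candidate == kmer:
                return start
            if candidate < kmer:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1
```

For example, `LearnedIndex("ACGTACGTTTACGGATTACAGGCT", k=3, num_bins=8).find("GAT")` returns a position where `GAT` occurs. The model stores only `num_bins + 1` integers and one error bound, which is why such a structure can stay small relative to the suffix array itself.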

Availability and implementation

The source code and tutorial are available open-source at https://github.com/mkirsche/sapling.

Supplementary information

Supplementary data are available at Bioinformatics online.

Most cited references 24

STAR: ultrafast universal RNA-seq aligner.

Alexander Dobin, Carrie A. Davis, Felix Schlesinger … (2013)

Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. STAR is implemented as standalone C++ code. STAR is free open source software distributed under the GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

The Sequence Alignment/Map format and SAMtools

Heng Li, Bob Handsaker, Alec Wysoker … (2009)

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.

Availability: http://samtools.sourceforge.net

Contact: rd@sanger.ac.uk

Fast gapped-read alignment with Bowtie 2.

Ben Langmead, Steven L Salzberg (2012)

As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

Author and article information

Contributors

Melanie Kirsche

Arun Das

Journal

Title: Bioinformatics

Publisher: Oxford University Press (OUP)

ISSN (Print): 1367-4803

ISSN (Electronic): 1460-2059

Publication date (Other): March 15 2021

Publication date (Print): May 05 2021

Publication date (Electronic): October 27 2020

Volume: 37

Issue: 6

Pages: 744-749

Affiliations

[1] Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA

[2] Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA

[3] Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA

Article

DOI: 10.1093/bioinformatics/btaa911

PubMed ID: 33107913

SO-VID: 270c7033-43d4-4d39-8c05-face8d6ac9af

License:

https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model
