lra: A long read aligner for sequences and contigs

Abstract

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for structural variant (SV) discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, with runtimes that are comparable (1.05-3.76×) to those of current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and on GitHub (https://github.com/ChaissonLab/LRA).
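To make the difference between linear and concave gap costs concrete, here is a minimal sketch of chaining exact matches under a pluggable gap penalty. It is not lra's implementation: it uses a simple quadratic-time dynamic program rather than an efficient sparse one, and the names `chain_anchors`, `linear_gap_cost`, and `concave_gap_cost`, the logarithmic cost form, and all parameter values are assumptions chosen for illustration.

```python
import math


def linear_gap_cost(gap, per_base=1.0):
    """Linear penalty: every gapped base costs the same (hypothetical weight)."""
    return per_base * gap


def concave_gap_cost(gap, scale=4.0):
    """Concave penalty: cost grows with log of gap length (hypothetical form)."""
    return scale * math.log2(1 + gap)


def chain_anchors(anchors, gap_cost):
    """Find the best-scoring chain of exact matches.

    anchors: list of (read_pos, ref_pos, length) exact matches.
    gap_cost: function mapping a gap length to a penalty.
    Returns (best_score, chain), where chain is a list of anchors.
    Naive O(n^2) dynamic program, for illustration only.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0.0, []
    score = [float(a[2]) for a in anchors]  # a chain may start at any anchor
    prev = [-1] * n
    for j in range(n):
        qj, rj, lj = anchors[j]
        for i in range(j):
            qi, ri, li = anchors[i]
            # Anchor i must end before anchor j starts on both sequences.
            if qi + li > qj or ri + li > rj:
                continue
            # Gap length = difference between the reference and read spacing.
            gap = abs((rj - (ri + li)) - (qj - (qi + li)))
            cand = score[i] + lj - gap_cost(gap)
            if cand > score[j]:
                score[j] = cand
                prev[j] = i
    best = max(range(n), key=score.__getitem__)
    chain = []
    k = best
    while k != -1:
        chain.append(anchors[k])
        k = prev[k]
    chain.reverse()
    return score[best], chain


# Two 50-base matches separated by a 1000-base gap on the reference only.
example = [(0, 0, 50), (50, 1050, 50)]
linear_score, linear_chain = chain_anchors(example, linear_gap_cost)
concave_score, concave_chain = chain_anchors(example, concave_gap_cost)
```

In this toy case the linear penalty makes the 1000-base gap too expensive, so the best chain keeps only one match, whereas the concave penalty lets both matches join into one chain spanning the gap. That is the behavior the abstract describes as better reflecting large variants.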

Author summary

Any two human genomes will have sequence differences across multiple scales: from single-nucleotide variants to large gains, losses, or rearrangements of DNA called structural variants. Long-read single-molecule sequencing has been shown to help discover structural variation because the reads span the entire variant. The computational problem for discovering a structural variant is to find the optimal alignment of the read to the genome with gaps that accurately reflect the variant. Here we demonstrate a method, lra, that uses an efficient implementation of concave-cost alignment for structural variant discovery using long reads. On standardized benchmark data, we show that structural variant discovery is improved for multiple combinations of variant detection algorithms and long-read sequence data when using alignments generated by lra rather than by existing methods. Finally, we show that it is possible to use lra to accurately discover a complete spectrum of structural variants using de novo assemblies constructed from long-read sequence data. This implies a future model of comparative genomics where variants are discovered only by comparing de novo assemblies, not by comparing reads against a reference.

Most cited references 31

Minimap2: pairwise alignment for nucleotide sequences

Heng Li (2018)

Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms.

Assembly of long, error-prone reads using repeat graphs

Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin … (2019)

Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers.

Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm

Haoyu Cheng, Gregory Concepcion, Xiaowen Feng … (2021)

Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly.

Author and article information

Contributors

Jingwen Ren:

ORCID: https://orcid.org/0000-0002-3356-3008

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing

Mark J. P. Chaisson:

ORCID: https://orcid.org/0000-0001-5395-1457

Roles: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Supervision, Writing – original draft, Writing – review & editing

Ferhat Ay: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (iso-abbrev): PLoS Comput Biol

Journal ID (publisher-id): plos

Title: PLoS Computational Biology

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date (Collection): June 2021

Publication date (Electronic): 21 June 2021

Volume: 17

Issue: 6

Electronic Location Identifier: e1009078

Affiliations

[001] Department of Quantitative and Computational Biology (QCB), University of Southern California, Los Angeles, California, the United States of America

La Jolla Institute for Allergy and Immunology, UNITED STATES

Author notes

The authors have declared that no competing interests exist.

* E-mail: mchaisso@usc.edu

Article

Publisher ID: PCOMPBIOL-D-20-02184

DOI: 10.1371/journal.pcbi.1009078

PMC ID: 8248648

PubMed ID: 34153026

SO-VID: 2ca8f790-c981-4ec8-8c39-349c5b4f9e59

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 4 December 2020

Date accepted: 13 May 2021

Page count

Figures: 4, Tables: 6, Pages: 23

Funding

Funded by: National Human Genome Research Institute (funder ID: http://dx.doi.org/10.13039/100000051)

Award ID: U24HG007497

Award Recipient: Mark J. P. Chaisson (ORCID: https://orcid.org/0000-0001-5395-1457)

Funded by: National Human Genome Research Institute (funder ID: http://dx.doi.org/10.13039/100000051)

Award ID: 1U01HG010973

Award Recipient: Mark J. P. Chaisson (ORCID: https://orcid.org/0000-0001-5395-1457)

M.J.P.C. is supported by NHGRI U24HG007497 and NHGRI 1U01HG010973. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

PLOS Publication Stage: vor-update-to-uncorrected-proof

Publication Update: 2021-07-01

ScienceOpen disciplines: Quantitative & Systems biology
