      Alfred: interactive multi-sample BAM alignment statistics, feature counting and feature annotation for long- and short-read sequencing

      brief-report


          Abstract

          Summary

          Harmonizing quality control (QC) of large-scale second- and third-generation sequencing datasets is key for enabling downstream computational and biological analyses. We present Alfred, an efficient and versatile command-line application that computes multi-sample QC metrics in a read-group aware manner, across a wide variety of sequencing assays and technologies. In addition to standard QC metrics such as GC bias, base composition, insert size and sequencing coverage distributions, it supports haplotype-aware and allele-specific feature counting and feature annotation. The versatility of Alfred allows for easy pipeline integration in high-throughput settings, including DNA sequencing facilities and large-scale research initiatives, enabling continuous monitoring of sequence data quality and characteristics across samples. Alfred supports haplo-tagging of BAM/CRAM files to conduct haplotype-resolved analyses in conjunction with a variety of next-generation sequencing-based assays. Alfred’s companion web application enables interactive exploration of results and comparison to public datasets.
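
To illustrate what "read-group aware" QC means in practice, the following is a minimal Python sketch that aggregates two of the metrics named above (GC content and insert size) separately per read group. It is not Alfred's implementation and does not use Alfred's interface: the function name `per_read_group_qc`, the record layout and the toy values are assumptions made purely for illustration, and a real analysis would read alignments from a BAM/CRAM file rather than from a hard-coded list.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records standing in for aligned reads:
# (read_group, read_sequence, insert_size). Values are made up.
TOY_READS = [
    ("rg1", "ACGTGGCC", 310),
    ("rg1", "AATTGGCC", 290),
    ("rg1", "GGGGCCCC", 300),
    ("rg2", "ATATATAT", 450),
    ("rg2", "ATGCATGC", 470),
]


def per_read_group_qc(reads):
    """Aggregate simple QC metrics separately for each read group."""
    gc_bases = defaultdict(int)      # G/C bases seen per read group
    total_bases = defaultdict(int)   # all bases seen per read group
    inserts = defaultdict(list)      # insert sizes per read group

    for read_group, seq, insert_size in reads:
        gc_bases[read_group] += sum(base in "GC" for base in seq.upper())
        total_bases[read_group] += len(seq)
        inserts[read_group].append(insert_size)

    return {
        rg: {
            "reads": len(inserts[rg]),
            "gc_fraction": gc_bases[rg] / total_bases[rg] if total_bases[rg] else 0.0,
            "median_insert_size": median(inserts[rg]),
        }
        for rg in inserts
    }


if __name__ == "__main__":
    for rg, metrics in sorted(per_read_group_qc(TOY_READS).items()):
        print(rg, metrics)
```

Keeping the accumulators keyed by read group means that libraries or lanes with different characteristics are reported side by side instead of being averaged into a single per-file number.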

          Availability and implementation

          Alfred is open-source and freely available at https://tobiasrausch.com/alfred/.

          Supplementary information

          Supplementary data are available at Bioinformatics online.


                Author and article information

                Contributors
                Role: Associate Editor
                Journal
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                July 2019
                06 December 2018
                Volume: 35
                Issue: 14
                Pages: 2489-2491
                Affiliations
                [1] Genomics Core Facility, European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, Heidelberg, Germany
                [2] Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, Heidelberg, Germany
                Author notes
                To whom correspondence should be addressed. Email: tobias.rausch@embl.de

                The authors wish it to be known that, in their opinion, Tobias Rausch and Markus Hsi-Yang Fritz should be regarded as Joint First Authors.

                The authors wish it to be known that, in their opinion, Jan O. Korbel and Vladimir Benes should be regarded as Joint Co-senior Authors.

                Author information
                http://orcid.org/0000-0001-5773-5620
                Article
                Article ID: bty1007
                DOI: 10.1093/bioinformatics/bty1007
                PMCID: PMC6612896
                PMID: 30520945
                © The Author(s) 2018. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 25 September 2018
                Revised: 20 November 2018
                Accepted: 05 December 2018
                Page count
                Pages: 3
                Funding
                Funded by: NIH 10.13039/100000002
                Award ID: U41HG007497
                Categories
                Applications Notes
                Genome Analysis

                Bioinformatics & Computational biology
