Understanding sequencing data as compositions: an outlook and review

Abstract

Motivation

Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g. gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e. library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that, without normalization or transformation, renders invalid many conventional analyses, including distance measures, correlation coefficients and multivariate statistical models.
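The constraint described above can be illustrated with a minimal sketch. It is not taken from the article: the function names `closure` and `clr`, the three-gene abundances, and the depth factor are hypothetical, and the centered log-ratio (CLR) transform stands in here as one common example of a compositional transformation. The sketch shows that sequencing depth carries no information about relative abundance, and that a change in one component shifts the proportions of the others while leaving their ratio intact.

```python
import numpy as np

def closure(counts):
    """Rescale each row (sample) so its components sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def clr(counts, pseudocount=0.0):
    """Centered log-ratio transform of each row.

    `pseudocount` is an illustrative assumption: real count data contain
    zeros, which need some replacement strategy before taking logs.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)

# Hypothetical abundances of three genes in one sample.
sample = np.array([[100.0, 200.0, 700.0]])

# The same sample sequenced ten times deeper: counts scale, ratios do not.
deeper = sample * 10

same_proportions = np.allclose(closure(sample), closure(deeper))
same_clr = np.allclose(clr(sample), clr(deeper))

# Gene 3 doubles in absolute terms; genes 1 and 2 are unchanged.
perturbed = np.array([[100.0, 200.0, 1400.0]])

# The proportion of gene 1 falls although its abundance did not change...
gene1_before = closure(sample)[0, 0]
gene1_after = closure(perturbed)[0, 0]

# ...whereas the ratio of gene 1 to gene 2 is preserved.
ratio_before = sample[0, 0] / sample[0, 1]
ratio_after = perturbed[0, 0] / perturbed[0, 1]

print(same_proportions, same_clr)
print(gene1_before, gene1_after)
print(ratio_before, ratio_after)
```

In this sketch the proportion of gene 1 drops purely because gene 3 increased, which is the sense in which a component is only interpretable relative to the others in the same sample.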

Results

The purpose of this review is to summarize the principles of compositional data analysis (CoDA), provide evidence for why sequencing data are compositional, discuss compositionally valid methods available for analyzing sequencing data, and highlight future directions with regard to this field of study.

Supplementary information

Supplementary data are available at Bioinformatics online.

Author and article information

Contributors

Jonathan Wren: Role: Associate Editor

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 15 August 2018

Publication date (Electronic): 28 March 2018

Publication date PMC-release: 28 March 2018

Volume: 34

Issue: 16

Pages: 2870-2878

Affiliations

[1] Bioinformatics Core Research Group, Deakin University, Geelong, Australia

[2] Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain

[3] Universitat Pompeu Fabra (UPF), Barcelona, Spain

[4] Centre for Integrative Ecology, School of Life and Environmental Sciences, Deakin University, Geelong, Australia

[5] Poultry Hub Australia, University of New England, Armidale, Australia

Author notes

To whom correspondence should be addressed. Email: contacttomquinn@gmail.com

Author information

Thomas P Quinn http://orcid.org/0000-0003-0286-6329

Article

Publisher ID: bty175

DOI: 10.1093/bioinformatics/bty175

PMC ID: 6084572

PubMed ID: 29608657

SO-VID: 91971681-8b42-4fc9-b6c2-ba1119184dc2

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.

History

Date received: 24 October 2017

Date revision received: 20 March 2018

Date accepted: 26 March 2018

Page count

Pages: 9
