Transcript length bias in RNA-seq data confounds systems biology

Abstract

Background

Several recent studies have demonstrated the effectiveness of deep sequencing for transcriptome analysis (RNA-seq) in mammals. As RNA-seq becomes more affordable, whole genome transcriptional profiling is likely to become the platform of choice for species with good genomic sequences. As yet, a rigorous analysis methodology has not been developed and we are still in the stages of exploring the features of the data.

Results

We investigated the effect of transcript length bias in RNA-seq data using three different published data sets. For standard analyses using aggregated tag counts for each gene, the ability to call differentially expressed genes between samples is strongly associated with the length of the transcript.
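One way to see why aggregated counts would favour long transcripts is a back-of-envelope calculation. The sketch below is illustrative only and is not the analysis used in the article. It assumes that read counts for a gene are Poisson with mean proportional to transcript length, and it compares two samples with a simple normal-approximation z-statistic. The function names, the per-base read rate and the fold change are hypothetical values chosen for illustration.

```python
import math


def expected_z(length_bp, reads_per_base, fold_change):
    """Approximate expected z-statistic for a two-sample Poisson comparison.

    Assumption: counts in sample A are Poisson with mean
    reads_per_base * length_bp, and sample B has the same gene at
    fold_change times that mean. The statistic is (b - a) / sqrt(a + b).
    """
    mean_a = reads_per_base * length_bp
    mean_b = fold_change * mean_a
    return (mean_b - mean_a) / math.sqrt(mean_a + mean_b)


def approx_power(z_expected, z_crit=1.96):
    """Rough two-sided power, treating the statistic as normal with unit variance."""
    def phi(x):
        # Standard normal cumulative distribution function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return phi(z_expected - z_crit) + phi(-z_expected - z_crit)


# Same per-base expression and same fold change; only the length differs.
for length_bp in (500, 1000, 2000, 4000, 8000):
    z = expected_z(length_bp, reads_per_base=0.01, fold_change=1.5)
    print(f"{length_bp:>5} bp  expected z = {z:5.2f}  approx power = {approx_power(z):.2f}")
```

Under these assumptions the expected statistic grows with the square root of transcript length, so two genes with the same expression level and the same fold change differ in how easily they are called differentially expressed purely because of their length.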

Conclusion

Transcript length bias for calling differentially expressed genes is a general feature of current protocols for RNA-seq technology. This has implications for the ranking of differentially expressed genes, and in particular may introduce bias in gene set testing for pathway analysis and other multi-gene systems biology analyses.

Reviewers

This article was reviewed by Rohan Williams (nominated by Gavin Huttley), Nicole Cloonan (nominated by Mark Ragan) and James Bullard (nominated by Sandrine Dudoit).

Author and article information

Journal

Journal ID (nlm-ta): Biol Direct

Title: Biology Direct

Publisher: BioMed Central

ISSN (Electronic): 1745-6150

Publication date (Collection): 2009

Publication date (Electronic): 16 April 2009

Volume: 4

Page: 14

Affiliations

[1] Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Vic 3052, Australia

Article

Publisher ID: 1745-6150-4-14

DOI: 10.1186/1745-6150-4-14

PMC ID: 2678084

PubMed ID: 19371405

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 9 April 2009

Date accepted: 16 April 2009
