      Transcript length bias in RNA-seq data confounds systems biology

      research-article
      Oshlack [1], Wakefield [1]
      Biology Direct
      BioMed Central

          Abstract

          Background

          Several recent studies have demonstrated the effectiveness of deep sequencing for transcriptome analysis (RNA-seq) in mammals. As RNA-seq becomes more affordable, whole genome transcriptional profiling is likely to become the platform of choice for species with good genomic sequences. As yet, a rigorous analysis methodology has not been developed and we are still in the stages of exploring the features of the data.

          Results

          We investigated the effect of transcript length bias in RNA-seq data using three different published data sets. For standard analyses using aggregated tag counts for each gene, the ability to call differentially expressed genes between samples is strongly associated with the length of the transcript.
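          One way to see why transcript length could affect the ability to call differential expression is a toy calculation. This is an illustrative sketch, not the authors' analysis. It assumes that read counts are Poisson distributed and that a gene's expected count is proportional to both its expression rate and its transcript length. The function name `expected_z`, the `depth` parameter and the example rates are hypothetical and chosen only for illustration.

```python
import math


def expected_z(length_bp, rate_a, rate_b, depth=1.0):
    """Expected z-statistic for comparing one gene between two samples.

    Assumed toy model (not taken from the article): the expected read
    count is depth * rate * length_bp and counts are Poisson distributed.
    """
    mu_a = depth * rate_a * length_bp
    mu_b = depth * rate_b * length_bp
    # For Poisson counts the variance equals the mean, so the variance
    # of the difference between the two counts is mu_a + mu_b.
    return (mu_a - mu_b) / math.sqrt(mu_a + mu_b)


# Same two-fold expression difference, evaluated at several transcript lengths.
lengths = [500, 1000, 2000, 4000, 8000]
z_by_length = {n: expected_z(n, rate_a=0.02, rate_b=0.01) for n in lengths}

for n, z in z_by_length.items():
    print(f"{n:>5} bp  expected z = {z:.2f}")
```

          Under these assumptions the expected statistic grows with the square root of transcript length, so a fixed fold change is easier to detect in a long transcript than in a short one.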

          Conclusion

          Transcript length bias for calling differentially expressed genes is a general feature of current protocols for RNA-seq technology. This has implications for the ranking of differentially expressed genes, and in particular may introduce bias in gene set testing for pathway analysis and other multi-gene systems biology analyses.

          Reviewers

          This article was reviewed by Rohan Williams (nominated by Gavin Huttley), Nicole Cloonan (nominated by Mark Ragan) and James Bullard (nominated by Sandrine Dudoit).

              Author and article information

              Journal: Biology Direct (Biol Direct)
              Publisher: BioMed Central
              ISSN: 1745-6150
              Published: 16 April 2009
              Volume 4, article 14
              Affiliations
              [1] Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Vic 3052, Australia
              Article
              Publisher ID: 1745-6150-4-14
              DOI: 10.1186/1745-6150-4-14
              PMC ID: 2678084
              PubMed ID: 19371405
              Copyright © 2009 Oshlack and Wakefield; licensee BioMed Central Ltd.

              This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

              History
              9 April 2009
              16 April 2009
              Categories
              Research

              Life sciences