Comparing de novo assemblers for 454 transcriptome data

Abstract

Background

Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, they must be assembled correctly to estimate the sequences of all the transcripts. Most transcriptome assembly projects use only one program to assemble 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis.

Results

Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs and better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and contained more novel sequence than those of the other assemblers, but included an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies with significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs.
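The contig-level criteria named in these results (number of contigs, total assembly length, contig size) can be tabulated per assembly along the following lines. This is a minimal sketch, not the authors' pipeline: it assumes each assembly is available as FASTA-formatted text, and the function names `contig_lengths`, `n50` and `summarize` are hypothetical. N50 is included as a commonly used contig-size summary; the abstract itself does not name it.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def summarize(lengths: List[int]) -> Dict[str, int]:
    """Collect the size statistics used to compare assemblies side by side."""
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy input standing in for an assembler's FASTA output (made-up sequences).
    toy_assembly = [">contig1", "ACGTACGT", "ACGT", ">contig2", "GGCC"]
    print(summarize(contig_lengths(toy_assembly)))
```

Running the same summary over the output of each assembler gives a like-for-like table of sizes. Such a table says nothing about correctness, which is why the comparison above also relies on alignment to reference sequences.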

Conclusions

Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but they are often harder because individual contigs can have highly variable read coverage. Among single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. However, combining assemblies from different programs, each optimal in different respects, gave a more credible final product, and we recommend this strategy.

Author and article information

Journal

Journal ID (nlm-ta): BMC Genomics

Title: BMC Genomics

Publisher: BioMed Central

ISSN (Electronic): 1471-2164

Publication date (Collection): 2010

Publication date (Electronic): 16 October 2010

Volume: 11

Page: 571

Affiliations

[1] Institute of Evolutionary Biology, University of Edinburgh, West Mains Road, Edinburgh EH9 3JT, UK

Article

Publisher ID: 1471-2164-11-571

DOI: 10.1186/1471-2164-11-571

PMC ID: 3091720

PubMed ID: 20950480

SO-VID: 1f940d11-9491-4524-bbee-1a0b6e4c2321

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
