CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model

Abstract

Thousands of novel transcripts have been identified using deep transcriptome sequencing. The discovery of this large, ‘hidden’ transcriptome has renewed the demand for methods that can rapidly distinguish between coding and noncoding RNA. Here, we present a novel alignment-free method, Coding-Potential Assessment Tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT (sensitivity: 0.96, specificity: 0.97) outperformed other state-of-the-art alignment-based software such as Coding-Potential Calculator (sensitivity: 0.99, specificity: 0.74) and Phylo Codon Substitution Frequencies (sensitivity: 0.90, specificity: 0.63). In addition to its high accuracy, CPAT is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylo Codon Substitution Frequencies, enabling its users to process thousands of transcripts within seconds. The software accepts input sequences in either FASTA- or BED-formatted data files. We also developed a web interface for CPAT that allows users to submit sequences and receive the prediction results almost instantly.
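As a rough illustration of two of the four features named in the abstract, the sketch below computes open reading frame size and open reading frame coverage for one transcript and passes feature values through a logistic function. It is not the CPAT implementation. The ORF definition (first ATG to the next in-frame stop codon, forward strand only, stop codon counted), the function names, and any weights supplied to `coding_probability` are assumptions made for this example.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length (nt) of the longest ATG-to-stop ORF on the forward strand.

    Assumption for this sketch: an ORF must start with ATG and end with an
    in-frame stop codon, and the stop codon is included in the length.
    Returns 0 when no such ORF exists.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_features(seq: str) -> tuple:
    """Return (ORF size, ORF coverage), where coverage = ORF size / transcript length."""
    if not seq:
        return 0, 0.0
    orf_len = longest_orf_length(seq)
    return orf_len, orf_len / len(seq)


def coding_probability(features, weights, intercept: float) -> float:
    """Logistic function of a weighted feature sum.

    The weights and intercept are hypothetical placeholders; a real model
    would estimate them from labelled coding and noncoding transcripts.
    """
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A call such as `orf_features("CCATGAAATTTGGGTAACC")` returns the ORF size and its fraction of the transcript length. A full model would add the Fickett TESTCODE statistic and the hexamer usage bias as two further features before applying the logistic function.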

Most cited references 15

The transcriptional landscape of the mammalian genome.

P. Carninci, T. Kasukawa, S. Katayama … (2005)

This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.

RNA maps reveal new RNA classes and a possible function for pervasive transcription.

Philipp Kapranov, Jill Cheng, Sujit Dike … (2007)

Significant fractions of eukaryotic genomes give rise to RNA, much of which is unannotated and has reduced protein-coding potential. The genomic origins and the associations of human nuclear and cytosolic polyadenylated RNAs longer than 200 nucleotides (nt) and whole-cell RNAs less than 200 nt were investigated in this genome-wide study. Subcellular addresses for nucleotides present in detected RNAs were assigned, and their potential processing into short RNAs was investigated. Taken together, these observations suggest a novel role for some unannotated RNAs as primary transcripts for the production of short RNAs. Three potentially functional classes of RNAs have been identified, two of which are syntenically conserved and correlate with the expression state of protein-coding genes. These data support a highly interleaved organization of the human transcriptome.

Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs

Mitchell Guttman, Manuel Garber, Joshua Levin … (2010)

RNA-Seq provides an unbiased way to study a transcriptome, including both coding and non-coding genes. To date, most RNA-Seq studies have critically depended on existing annotations, and thus focused on expression levels and variation in known transcripts. Here, we present Scripture, a method to reconstruct the transcriptome of a mammalian cell using only RNA-Seq reads and the genome sequence. We apply it to mouse embryonic stem cells, neuronal precursor cells, and lung fibroblasts to accurately reconstruct the full-length gene structures for the vast majority of known expressed genes. We identify substantial variation in protein-coding genes, including thousands of novel 5′-start sites, 3′-ends, and internal coding exons. We then determine the gene structures of over a thousand lincRNA and antisense loci. Our results open the way to direct experimental manipulation of thousands of non-coding RNAs, and demonstrate the power of ab initio reconstruction to render a comprehensive picture of mammalian transcriptomes.

Author and article information

Journal

Journal ID (nlm-ta): Nucleic Acids Res

Journal ID (iso-abbrev): Nucleic Acids Res

Journal ID (publisher-id): nar

Journal ID (hwp): nar

Title: Nucleic Acids Research

Publisher: Oxford University Press

ISSN (Print): 0305-1048

ISSN (Electronic): 1362-4962

Publication date (Print): April 2013

Publication date (Electronic): 17 January 2013

Publication date PMC-release: 17 January 2013

Volume: 41

Issue: 6

Page: e74

Affiliations

¹Division of Biomedical Statistics and Informatics, Mayo Clinic College of Medicine, Rochester, MN 55905, USA, ²Division of Biostatistics, Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, TX 77030, USA, ³Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, USA and ⁴State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210000, China

Author notes

*To whom correspondence should be addressed. Tel: +1 507 538 8315; Fax: +1 507 284 0360; Email: kocher.jeanpierre@mayo.edu

Correspondence may also be addressed to Wei Li. Tel: +1 713 798 7854; Fax: +1 713 798 6822; Email: WL1@bcm.edu

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Article

Publisher ID: gkt006

DOI: 10.1093/nar/gkt006

PMC ID: 3616698

PubMed ID: 23335781

SO-VID: a6195af4-36c5-4cf6-837d-bb84f9ca5bfa

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 10 October 2012

Date revision received: 30 December 2012

Date accepted: 2 January 2013

Page count

Pages: 7
