Most transcripts assembled from eukaryotic and prokaryotic RNA-Seq data are expected to code for proteins. Predicting their coding regions is crucial for determining the molecular role that transcripts play in the cell. The most practical way to identify likely coding transcripts is a sequence homology search, for example with BLASTX, against sequences from well-annotated, related species. Unfortunately, such well-annotated related species are often not available for newly sequenced transcriptomes.
When we work with non-model organisms, a reference genome and transcriptome might not be available, so transcriptome assembly requires de novo strategies. Many of the proteins encoded by these newly targeted transcriptomes have no detectable homology to known proteins, so a homology search alone misses them. To capture those coding regions, we need methods that predict them from metrics tied to sequence composition, such as TransDecoder. This tool, Predict Coding Regions, is available in the Transcriptomics Module of OmicsBox.
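To make the idea of homology-free prediction concrete, the sketch below enumerates candidate open reading frames (ORFs) in all six reading frames of a transcript. It is a simplified illustration, not TransDecoder's implementation: it only reports ORFs that run from an ATG to an in-frame stop codon, it does no composition-based scoring, and the function names and the `min_codons` threshold are assumptions made for this example.

```python
from typing import List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, int]]:
    """List ATG-to-stop ORFs as (strand, start, end).

    Coordinates are 0-based, half-open, and relative to the scanned strand
    (for "-" they refer to the reverse-complemented sequence).
    The stop codon is included in the reported span.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


def longest_orf(seq: str, min_codons: int = 3) -> Optional[Tuple[str, int, int]]:
    """Return the longest ORF found, or None if there is none."""
    return max(find_orfs(seq, min_codons), key=lambda o: o[2] - o[1], default=None)


print(longest_orf("CCATGAAATTTGGGTAACC"))
```

A real predictor would additionally keep ORFs that run off either end of the transcript and rank the candidates by a coding-potential score rather than by length alone.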
Predict Coding Regions with TransDecoder
TransDecoder in OmicsBox
The TransDecoder methodology is available in OmicsBox via the
- CDSs: Nucleotide sequences for coding regions of the final candidate ORFs, in FASTA format.
- Proteins: Peptide sequences for the final candidate ORFs, in FASTA format.
- Coordinates: Positions within the target transcripts of the final selected ORFs, in GFF format.
We recommend the OmicsBox Genome Browser for viewing the candidate ORFs in the context of the transcriptome (Fig. 1).
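For readers who prefer to inspect the coordinates programmatically, here is a minimal sketch that pulls CDS features out of GFF-style text. It relies only on the generic nine-column, tab-separated GFF layout; the sample lines, transcript names, and the `read_cds_features` helper are made up for illustration and do not reproduce the exact records that OmicsBox or TransDecoder write.

```python
from typing import Iterable, List, NamedTuple


class CdsFeature(NamedTuple):
    transcript: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str


def read_cds_features(lines: Iterable[str]) -> List[CdsFeature]:
    """Collect CDS rows from GFF lines, skipping comments and other feature types."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        features.append(CdsFeature(cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return features


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "transcript_1\texample\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "transcript_1\texample\tCDS\t101\t700\t.\t+\t0\tID=cds1;Parent=mrna1",
    "transcript_2\texample\tCDS\t50\t349\t.\t-\t0\tID=cds2;Parent=mrna2",
]

for feature in read_cds_features(example):
    # Coordinates are inclusive, so the length is end - start + 1.
    print(feature.transcript, feature.strand, feature.end - feature.start + 1)
```

To read a real file, pass an open file handle instead of the `example` list.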
ORF Types
There are a few items to pay attention to in the above files. TransDecoder provides details about the predicted ORFs, such as the length and the strand on which the coding region was found. Furthermore, it classifies each predicted ORF according to its start and stop signals (Fig. 2):
- Complete ORF: Contains a start and a stop codon.
- 5′ partial ORF: Lacks the start codon and presumably part of the N-terminus.
- 3′ partial ORF: Lacks the stop codon and presumably part of the C-terminus.
- Internal ORF: Both 5′ and 3′ partial.
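The four categories above can be sketched as a small classifier over an in-frame coding sequence. This is an illustrative simplification that only inspects the first and last codon. It assumes ATG as the only start codon and the standard stop codons, and the label strings and the `classify_orf` name are chosen for this example rather than taken from the tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_orf(cds: str) -> str:
    """Classify an in-frame coding sequence by its start and stop signals."""
    cds = cds.upper()
    if len(cds) < 3 or len(cds) % 3 != 0:
        raise ValueError("expected an in-frame sequence whose length is a multiple of 3")
    has_start = cds.startswith("ATG")
    has_stop = cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"   # stop present, start codon missing
    if has_start:
        return "3prime_partial"   # start present, stop codon missing
    return "internal"             # neither start nor stop


for seq in ("ATGAAATAA", "AAAGGGTAA", "ATGAAAGGG", "AAAGGGCCC"):
    print(seq, classify_orf(seq))
```

Because a partial ORF can begin with an internal methionine codon by chance, a check on the first codon alone can mislabel some 5′ partial ORFs as complete; the transcript context is needed to tell them apart.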
In practice, after applying this strategy, OmicsBox extracts the most confident ORFs, which are then used to predict protein functions. In OmicsBox, this step can be linked directly to the homology-based functional annotation pipeline, which uses the widely known Blast2GO methodology (Fig. 3).
Example Use Case
Reanalyzing the A. galli transcriptomic response to an anthelmintic drug with OmicsBox
References
- Predict Coding Regions User Manual.
- Haas BJ et al. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols, 8(8), 1494-1512.
- Haas BJ and Papanicolaou A (2019). TransDecoder 5.5.0. https://github.com/TransDecoder/TransDecoder/wiki.
About the Author
With a biological and technological academic background, including a BSc in Biotechnology and an MSc in Bioinformatics, Enrique’s expertise lies in the areas of Long Reads and Genetic Variation.