Introduction
Given a new genome, one of the first and most important tasks is determining the structure of its protein-coding genes. Ab initio gene prediction algorithms play a critical role because they produce gene structures quickly, inexpensively, and with remarkable reliability. In OmicsBox, the Eukaryotic Gene Finding application is based on AUGUSTUS, which is one of the most accurate programs for the species on which it has been trained. AUGUSTUS can be used as an ab initio program since it includes pre-trained models for over 100 species. AUGUSTUS can also incorporate hints on the gene structure coming from extrinsic sources. The original publication states:
AUGUSTUS is a tool for finding protein-coding genes and their exon-intron structure in genomic sequences. It does not necessarily require additional experimental input, as it can be applied in so-called ab initio mode. However, extrinsic evidence from various sources such as transcriptome sequencing or the annotations of closely related genomes can be integrated in order to improve the accuracy and completeness of the annotation.
Since OmicsBox 2.0, the Eukaryotic Gene Finding (2nd version) bioinformatic pipeline, based on AUGUSTUS, is available in the Genome Analysis Module.
Eukaryotic Gene Finding with OmicsBox
- The Eukaryotic Gene Finding application accepts genomic sequences in FASTA or multi-FASTA format. Genomic sequences can be provided in the form of contigs, scaffolds, or chromosomes.
- The Eukaryotic Gene Finding application can incorporate hints on the gene structure coming from extrinsic sources: RNA-Seq, proteins, EST/cDNA, and IsoSeq data. These extrinsic sources are expected in common bioinformatic formats (FASTA, FASTQ…).
- Gene predictions are returned in several forms: GFF coordinates, CDS sequences, and protein sequences. These are complemented with additional results, such as charts and reports, that help to interpret the predictions.
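Before loading a genome, it can be useful to check that the FASTA or multi-FASTA file is well formed and to see how many contigs, scaffolds, or chromosomes it contains. The following is a minimal Python sketch of such a check; it is not part of OmicsBox, and the function name `fasta_lengths` and the toy sequences are assumptions made for illustration.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence ID in FASTA or multi-FASTA text to its length."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header without a sequence ID")
            # The ID is the first whitespace-delimited token after '>'.
            current = parts[0]
            if current in lengths:
                raise ValueError(f"duplicate sequence ID: {current}")
            lengths[current] = 0
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            lengths[current] += len(line)
    return lengths


# Toy input invented for this example (not real genomic data).
example = """>scaffold_1 toy record
ACGTACGT
ACGT
>scaffold_2
GGCC
""".splitlines()

lengths = fasta_lengths(example)
for seq_id, length in lengths.items():
    print(seq_id, length)
```

For a real file, the same function can be called on an open file handle, for example `fasta_lengths(open("genome.fasta"))`, where `genome.fasta` is a placeholder path.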
Note: Each type of extrinsic evidence is processed with software suited to it. Extrinsic evidence must be curated to ensure the most accurate prediction.
Note: Predicted gene sequences can be functionally characterized by taking advantage of the OmicsBox Functional Analysis module.
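Before passing predictions on to downstream analysis, a quick tally of the feature types in the exported GFF coordinates can serve as a sanity check. The sketch below assumes only the general GFF layout of nine tab-separated columns with the feature type in the third column and comment lines starting with `#`. The function name `count_features` and the toy records are invented for illustration; the exact feature types and attribute layout of an OmicsBox export are not confirmed here.

```python
from collections import Counter
from typing import Iterable


def count_features(gff_lines: Iterable[str]) -> Counter:
    """Count GFF records per feature type (third column)."""
    counts: Counter = Counter()
    for raw in gff_lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # skip lines that are not complete GFF records
        counts[cols[2]] += 1
    return counts


# Toy records invented for this example; real output will differ.
toy_gff = [
    "# toy GFF for illustration",
    "scaffold_1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tg1",
    "scaffold_1\tAUGUSTUS\ttranscript\t100\t900\t.\t+\t.\tg1.t1",
    "scaffold_1\tAUGUSTUS\tCDS\t100\t400\t.\t+\t0\tg1.t1",
    "scaffold_1\tAUGUSTUS\tCDS\t600\t900\t.\t+\t2\tg1.t1",
]

counts = count_features(toy_gff)
for feature, n in sorted(counts.items()):
    print(feature, n)
```

A gene count of zero, or CDS records without any gene records, would suggest that the file was exported or parsed incorrectly.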