This article explains what a Gene Set Enrichment Analysis (GSEA) is, how it works and how to use it with OmicsBox.
What is an enrichment analysis?
Enrichment analyses are a family of bioinformatics methods that aim to facilitate the biological interpretation of many bioinformatics results. Among these, the methods based on gene set analysis are particularly helpful and widespread.
The goal of gene set methods is to identify enriched or over-represented gene sets among a list of genes. These gene sets are groups of functionally related genes, according to current knowledge. Commonly used gene sets represent biological functions, such as Gene Ontology terms and pathways. However, they can also reflect another shared property, such as a disease, a chromosomal location, or common expression regulation.
How does a Gene Set Enrichment Analysis (GSEA) work?
GSEA is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. It identifies gene sets that are over-represented at the top or bottom of a longer, ranked list of genes. These sets may be associated with biological functions or disease phenotypes, and the procedure uses statistical tests to identify significantly enriched or depleted classes or functions.
The standard GSEA method involves three steps in the analytical process:
- Calculation of the enrichment score (ES): This metric represents the degree to which a gene set is over-represented at either the top or bottom of a ranked list of genes.
- Estimation of the statistical significance of the ES: This is calculated by performing a phenotype-based permutation test to produce a null distribution for the ES.
- Adjustment for multiple hypothesis testing: The enrichment scores for each set are normalized, and a false discovery rate is calculated to control type I errors (false positives).
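To make the three steps more concrete, here is a minimal Python sketch under stated assumptions. It computes a weighted running-sum enrichment score, builds a null distribution, and normalizes the score. Note that it permutes gene sets (drawing random sets of the same size) instead of phenotype labels, because that variant only needs the ranked list. The function names, the weight exponent `p`, and the toy data are illustrative, not part of OmicsBox or the Broad GSEA package. The FDR step across many gene sets is left out.

```python
import random

def running_profile(ranked, gene_set, p=1.0):
    """Running-sum statistic over `ranked`, a list of (gene, score) sorted high to low."""
    in_set = [gene in gene_set for gene, _ in ranked]
    # Hits are weighted by |score|^p; misses all carry the same penalty.
    weights = [abs(score) ** p if hit else 0.0
               for (_, score), hit in zip(ranked, in_set)]
    total_hit = sum(weights)
    n_miss = in_set.count(False)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set needs weighted hits and at least one miss")
    running, profile = 0.0, []
    for weight, hit in zip(weights, in_set):
        running += weight / total_hit if hit else -1.0 / n_miss
        profile.append(running)
    return profile

def enrichment_score(ranked, gene_set, p=1.0):
    """Step 1: the ES is the maximum deviation of the running sum from zero."""
    return max(running_profile(ranked, gene_set, p), key=abs)

def permutation_test(ranked, gene_set, n_perm=1000, p=1.0, seed=0):
    """Steps 2 and 3 (partly): nominal p-value and normalized ES (NES)."""
    rng = random.Random(seed)
    genes = [gene for gene, _ in ranked]
    members = [gene for gene in genes if gene in gene_set]
    observed = enrichment_score(ranked, set(members), p)
    # Null distribution: ES of random gene sets of the same size.
    null = [enrichment_score(ranked, set(rng.sample(genes, len(members))), p)
            for _ in range(n_perm)]
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan"), float("nan")
    p_value = sum(abs(es) >= abs(observed) for es in same_sign) / len(same_sign)
    nes = observed / (sum(abs(es) for es in same_sign) / len(same_sign))
    return observed, p_value, nes

# Toy ranked list: ten hypothetical genes with non-zero scores.
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]
ranked = [(f"g{i}", s) for i, s in enumerate(scores, start=1)]
es, p_value, nes = permutation_test(ranked, {"g1", "g2", "g3"}, n_perm=200)
print(es, p_value, nes)
```

A set concentrated at the top of the list gives an ES close to 1, and one concentrated at the bottom gives an ES close to -1. With real data, the NES values of all tested sets would then feed the false discovery rate calculation.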
A gene set enrichment analysis uses specific statistics and requires the corresponding implementations to run the analysis.
OmicsBox makes it very easy to perform a Gene Set Enrichment Analysis (GSEA)
OmicsBox is a complete bioinformatics solution developed by BioBam. This platform includes Gene Set Enrichment Analysis (GSEA), among many other tools.
Our GSEA implementation uses the GSEA software package developed by the Broad Institute of MIT and Harvard. Its integration in OmicsBox lets you run the analysis and review the results efficiently, so you can focus on interpreting them.
This short video explains the steps to perform a gene set enrichment analysis (GSEA) with OmicsBox.
The video also shows how to identify enriched functions in a tissue comparison by performing GSEA with OmicsBox. Every GSEA run requires a ranked list of functionally annotated genes. OmicsBox allows the creation of this list in different ways:
- One option is to enter the list of IDs and their numeric values into a spreadsheet and save it as a text file.
- Another option is to directly use the differential expression results from OmicsBox by ranking the genes according to their fold change and significance level.
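As an illustration of the second option, the following Python sketch ranks genes by combining fold change and significance. The scoring formula (the sign of the log fold change multiplied by -log10 of the p-value) is a commonly used convention and an assumption here; it is not necessarily the formula OmicsBox applies. The field names `gene`, `log_fc`, and `p_value` and the example values are hypothetical.

```python
import math

def rank_genes(de_results):
    """Build a ranked list of (gene, score) from differential expression rows."""
    ranked = []
    for row in de_results:
        # Clamp the p-value to avoid log10(0) for extremely significant genes.
        p = max(row["p_value"], 1e-300)
        # Magnitude from significance, direction from the fold change.
        score = math.copysign(-math.log10(p), row["log_fc"])
        ranked.append((row["gene"], score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

def to_ranked_text(ranked):
    """Format the ranked list as tab-separated text, one gene per line."""
    return "".join(f"{gene}\t{score:.4f}\n" for gene, score in ranked)

# Hypothetical differential expression results.
de_results = [
    {"gene": "A", "log_fc": 2.0, "p_value": 0.001},
    {"gene": "B", "log_fc": -1.5, "p_value": 0.0001},
    {"gene": "C", "log_fc": 0.3, "p_value": 0.5},
]
ranked = rank_genes(de_results)
text = to_ranked_text(ranked)
print(text)
```

Strongly up-regulated genes end up at the top of the list and strongly down-regulated genes at the bottom, which is the layout the enrichment score expects.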
Running a Gene Set Enrichment Analysis in OmicsBox
Launching the analysis
To start the GSEA, you have to load the functional annotations of your genes/proteins, whose IDs must match the IDs of your ranked list. Alternatively, you can load your pairwise differential expression results, and OmicsBox will create the ranked list from them.
Once the OmicsBox project is loaded and the ranked list is created, you are ready to run the enrichment analysis. Provide the analysis parameters and hit run:
- Specify the number of gene set permutations.
- Select the enrichment statistic used to calculate the ES.
- Choose the Gene Ontology categories you want to use.
- Set the minimum and maximum size of the gene sets (GOs) to be included in the analysis.
- Select the filter mode and the cut-off value.
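The size limits in the list above can be sketched in Python as follows. This is only an illustration of the idea, assuming that set size is counted over the genes actually present in the ranked list. The function name, the set names, and the example thresholds are hypothetical and do not reflect OmicsBox internals.

```python
def filter_gene_sets(gene_sets, ranked_genes, min_size, max_size):
    """Keep gene sets whose overlap with the ranked list is within the size limits."""
    universe = set(ranked_genes)
    kept = {}
    for name, members in gene_sets.items():
        # Only genes that appear in the ranked list can contribute to the ES.
        present = set(members) & universe
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept

# Hypothetical gene sets and ranked gene IDs.
ranked_genes = ["g1", "g2", "g3", "g4", "g5"]
gene_sets = {
    "set_A": ["g1", "g2", "g3"],
    "set_B": ["g1"],
    "set_C": ["g1", "g2", "x", "y"],
}
kept = filter_gene_sets(gene_sets, ranked_genes, min_size=2, max_size=3)
print(sorted(kept))
```

Very small sets tend to give unstable scores and very large sets are often too general to interpret, which is why both limits are usually set.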
Retrieving the results
When the analysis concludes, you will obtain a result table that shows all significantly over-represented functions among the IDs at the top and bottom of your ranked list. In addition to the GO ID and GO term of each function, the results provide several details:
- The first column indicates whether a GO is enriched at the top or the bottom of the ranked list.
- The Enrichment Score (ES) reflects the degree of over-representation of a GO at the extremes of the ranked list.
- The normalized ES is the primary statistic for this type of enrichment result.
- The FDR q-value is the adjusted p-value, a statistical value adjusted for multiple testing.
- The FWER q-value is a more conservative adjusted p-value than the FDR q-value.
Graphs and more detailed results
Right-clicking on a GO ID opens a new page with more details, such as the GO description and the GSEA result details. An “enrichment plot” provides a graphical view of a gene set's enrichment score (ES).
The enrichment plot shows a green line representing the running ES for a given GO as the analysis goes down the ranked list. The value at the peak is the final ES. The middle part shows where the members (genes) of the gene set appear in the ranked list. The genes that occur at or before the peak of the running ES form the Leading Edge Subset (for a negative ES, those at or after the peak). The lower part shows the value of the ranking metric as it moves down the list of ranked genes.
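The relationship between the running ES, its peak, and the Leading Edge Subset can be sketched in a few lines of Python. This is a simplified illustration, assuming a ranked list of (gene, score) pairs sorted from high to low and hits weighted by the absolute score. The function name and the toy data are hypothetical.

```python
def leading_edge(ranked, gene_set, p=1.0):
    """Return (ES, leading-edge genes) for one gene set."""
    in_set = [gene in gene_set for gene, _ in ranked]
    weights = [abs(score) ** p if hit else 0.0
               for (_, score), hit in zip(ranked, in_set)]
    total_hit = sum(weights)
    n_miss = in_set.count(False)
    # Running sum: step up at set members, step down everywhere else.
    running, profile = 0.0, []
    for weight, hit in zip(weights, in_set):
        running += weight / total_hit if hit else -1.0 / n_miss
        profile.append(running)
    # The peak is the position of the largest deviation from zero.
    peak = max(range(len(profile)), key=lambda i: abs(profile[i]))
    es = profile[peak]
    # Positive ES: members up to the peak. Negative ES: members from the peak on.
    window = ranked[: peak + 1] if es >= 0 else ranked[peak:]
    return es, [gene for gene, _ in window if gene in gene_set]

# Toy ranked list: ten hypothetical genes.
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]
ranked = [(f"g{i}", s) for i, s in enumerate(scores, start=1)]
es, edge = leading_edge(ranked, {"g1", "g2", "g6"})
print(es, edge)
```

In this toy case the running sum peaks after the second gene, so `g6` belongs to the set but not to the leading edge.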
Results management from the side panel
The result page has a toolbar with several options, such as creating charts, filtering the results, or saving them as a text file. The option ‘Reduce to most specific’ lets you filter the results based on their specificity; ‘Make an enrichment graph’ generates a GO graph for each GO category selected in the wizard, and ‘Show global statistics’ generates different statistical graphs.
These visualizations should assist you in interpreting the results, finding biological meaning, and communicating your findings.
If you want to try all this yourself, you can download OmicsBox.