Functional Analysis of Pancreatic Cancer Expression Profiles

This use case shows how to perform a functional analysis of pancreatic cancer expression data with Blast2GO. The performed steps are explained more in detail with short video tutorial linked to each section.

Malignant (PDAC), benign (CP) and normal tissue (NP) gene lists are compared against each other. The human genome functional annotation data is retrieved via BioMart and used as the reference. The data set is analysed using two different enrichment-analysis strategies, a Fisher’s Exact Test and GSEA and results are visualized in various different ways within Blast2GO.

Analysis Workflow

Extract and Review data with Blast2GO.
- Load the complete human genome GO annotation by BioMart.
- Make several Project Statistics to check coverage of annotated sequences.
- Reduce the functional information with GO-Slim.

Import differential expressed gene list.
- Import two datasets from Pancreatic expression Database: Pancreatic Ductal Adenocarcinoma (PDAC) vs Normal Pancreas (NP) and Chronic Pancreatitis (CP) vs Normal Pancreas (NP).

Enrichment analysis:
1. Fisher’s Exact Test
2. Gene Set Enrichment Analysis

Methods to visualize functional profiles.
- Word Cloud
- Coloured Graph

Conclusions

1. Extract and Review functional information

First, we will obtain the complete functional annotation (GO terms) of the human genome via the BioMart import function in Blast2GO. This will be our reference set for the functional analysis. BioMart provides free software and data services in order to foster collaboration and facilitate the scientific discovery process. Once we open the BioMart wizard in Blast2GO we have to introduce the parameters of the download. In our case, we select human, the type of ID (we choose Ensemble Gene IDs and UniProt), the sequence type “Proteins” and “Whole Genome” to download all available entries.

The download may take several minutes (or hours) depending on the type of IDs and the workload of the servers. For us, Ensembl took less than an hour while the UUniProtIDs took around 12h.

Once results are obtained for both, Ensembl IDs and UniProt IDs we will compare them with B2G. A Pie Chart of the data distribution is created to observe the distribution of annotated and non-annotated sequences.

Ensembl and Uniprot Dataset

We downloaded both data sets via BioMart. For Ensemble we obtained 21472 sequences with annotation and 1312 sequences without annotation. Uniprot was 24054 annotated sequences and only 663 sequences without annotation.

Due to this difference we looked for repeated sequences using the Blast2GO feature “Find duplicate sequences” and found several hundred identical sequences in the Uniprot data set. We finally decided to continue with the Ensemble IDs since these are the IDs used in the differential expression study used later on.

GO-Slim

GO-Slim is used to obtain a more general view of the function information of the Ensembl dataset. GO-Slim is a cut-down version of the Gene Ontology (GO) containing a subset of terms in the whole GO. It gives a broad overview of the ontology content without the details of the specific, fine-grained terms. We have to choose the corresponding GO-Slim subset, according to the data and organism we are working on. In our case, Generic GO-Slim is chosen because there is no specific human subset. Once applied, the sequence table turns from blue to yellow. To evaluate the reduction, we can compare the GO-level distribution with and without GO-Slim and observe the GO-level differences.

GO level distribution without GO-Slim	Go level distribution with GO-Slim

We observe a significant difference between both. The first chart shows levels from 1 to 15 with a lot of annotations in each level, whereas in the GO-Slim chart shows a maximum level of 11 with with fewer annotated sequences at more specific levels because the majority of the sequences have moved to up to more general GO term annotations.

For a better visualization of GO-Slim annotations, we do a combined graph of molecular functions for the annotations with and without GO-Slim.


Both graphs show the molecular function (MF) GO. The left image shows the entire MF graph (DAG), from most general to most specific levels, but the second one shows only more general functions. For example: ion binding is composed of anion binding, cation binding and other functions in the first graph, but in the second one all this function is summarized in a single function: Ion binding.

If we would continue working with the GO-Slim data, we would only obtain very general information about our differential expressed genes. For that reason, GO-Slim annotation is removed/undone since we want to use the full potential of specific functions.

Video Tutorial Step 1
BioMart and GO-Slim

2. Differential expressed genes

The data used in this study has been obtained from the Pancreatic Expression Database: “Microarray-based identification of differential expressed growth- and metastasis-associated genes in pancreatic cancer” by Friess H. et al.

The aim of this study was to identify the key mechanism of deregulated molecular functions in different types of pancreatic cancer to improve the diagnosis and treatment of PDAC and the understanding of the molecular mechanisms underlying pancreatic malignant growth.

The study contains a comparison between Pancreatic Ductal Adenocarcinoma (PDAC) versus Normal Pancreas (NP) and Chronic Pancreatitis (CP) versus Normal Pancreas (NP).

The Ensambl IDs of over- and under-expressed genes for the different cancer cell types were exported into a plain text document.

3. Enrichment Analysis

We will apply to different enrichment analysis: Fisher’s Exact Test and a Gene Set Enrichment Analysis (GSEA by MIT/Broad)

3.1 Fisher’s Exact Test

The Fisher’s Exact Test is a statically test used to discover if certain functions/labels are significantly over-represented among a list of genes when compared against another list. In our case, carcinogenic cells and normal cells are going to be compared, showing us which functions are more present in the first condition, based on the number of genes related to a given function (GO-Term).

To test if a GO-Term is statistically significantly over-represented, its FDR (false discovery rate) value, the p-value after multiple testing correction, has to pass a certain cut-off.

FDR (False Discovery Rate) is the statistical value we focus on to determinate if a certain biological function is being overrepresented in one condition against the other because it controls the expected proportion of incorrectly rejected null hypotheses. FDR is considered instead of a p-value because it is adjusted to multiple testing errors and it is more restrictive and reliable in this scenario.

Procedure to follow:

We will perform the following to comparisons: PDAC vs NP and CP vs NP.

In the next section we will see the parameters available in Blast2GO:

Select Test-Set: this text file contains all gene ids, one each line, of the first condition; the test set.
Select Reference: the file is used as a reference set. This is optional and if not provided, all annotated genes/sequences of the present Blast2GO project will be considered.
GO Categories: Select which GO categories to consider.
P-Value Filter Value and Mode: the statistics mode and its cut off. We use the default parameters which are FDR with a cut off of 0.05.
Two sides: if two-sided is selected both over- and under-represented functions would be shown. This is not necessary for this use-case because we are only interested in over-represented functions.

The results of the study show a table with all the enriched functions above the given cut-off.

GO ID: the ID of the function over-represented and above the cutoff. If we click on the number, a window will appear with its information and all the sequences which encode for that function.
GO Term: the description of the GO ID.
Type: the category of the function. “C: Cellular component, “P”: Biological process, “F”: Molecular function.
FDR: the significant statistical value. The smaller the value, the more significant is the function.
Over/Under: indicated if the function is over- or under-represented in the test-set. Here all functions are going to be over-represented since we performed a one-sided test.

On the right side we can see a toolbar with several options. If we click on “Reduce to most specific”, a similar result table will appear with only the most specific functions. All parent functions which have also a significant child GO-Term will be removed. This can be useful in case of very long lists since functions will be filtered by their specificity any only the most detailed functions will remain.

Blast2GO offers several different options to visualize these enrichment results in a very user-friendly way.

Enriched graph

The Enriched Graph shows for each GO category all relevant functions, from more general to more specific. Over-represented functions are shown in red; under-represented GO-terms (nodes) are green. The statistical significance is represented by the intensity of the color. Nodes also contain information about GO IDs, P-values and the FDR value. The toolbar allows for adjusting the appearance of the graph via filters, layout and additional content option.

Bar chart

The bar chart shows the over- and under-represented functions and the corresponding sequence percentages. The sequences belonging to the test-set are in blue; sequences in the reference-set are in red. If for a given function the percentage of sequences in the reference set is higher then in the test set (red larger than blue) this function is under-represented and the other way around.

Video Tutorial Step 3.1
Fisher’s Exact Test

3.2 GSEA

The Gene Set Enrichment Analysis (GSEA) is a statistical test to search for enriched functions (gene sets) in a list of ranked genes. The goal is to determine if the member of a gene-set (e.g. a GO function annotated to a number of genes/proteins) tend to occur toward the top (or bottom) of a sorted gene list.

For doing this analysis, we have to create a text file (.rnk) with the Ensembl IDs on the first column and the statistical value (decimal values separated with dots, not commas) in the second one. For this analysis, the fold change of the gene expression values is used to sort the list of gene IDs.

As parameters, we have to provide the ranked list we want to study and other features like the permutation, plots, filter value, etc. Of course, we will have to have a Blast2GO project with our annotated sequences.

For a first run, we recommend the default parameters. Pay attention to the filter values, because 0.25 is the usually the cut off for this type of analysis (instead of 0.05). Here an FDR of 25% indicates that the result is likely to be valid 3 out of 4 times, which is reasonable in finding candidate hypothesis and using a more stringent FDR cut off may lead you to overlook potentially significant results.

GSEA results:

ES: Enrichment Score reflects the degree of over-representation for a given function (gene-set) at the extremes of the entire list. The score is calculated by walking down the list, increasing a running-sum statistic when we encounter a gene annotated to the given function and decreasing it when we encounter genes not annotated to this particular gene-set (function).
NES: Normalized Enrichment Score is the primary statistic for examining gene set enrichment results. By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression data set.
FDR: False Discovery Rate is the estimated probability that a given function with a given NES represents a false positive finding.
Leading Edge: Genes that appear in the ranked list at or before the point at which the running reaches its maximum deviation from zero.

The results which are in green are over-represented at the bottom of the list. That means that those functions are under-represented. Since we are interested in up-regulated functions we will review only TOP results marked in red.

If we click on the GO term of each function, here for example “blood microparticle”, we will obtain the following chart:

The top portion of the plot shows the running Enrichment Score for the gene set (function) as the analysis walks down the ranked list. The score at the peak of the plot (the score furthest from 0.0) is the Enrichment Score for the function.
The middle portion of the plot shows where the members of the set appear in the ranked list of genes.The Leading edge subset for a positive ES is the set of members that appear in the ranked list prior to the peak score
The bottom portion of the plot shows the value of the ranking metric as you move down the list of ranked genes. The ranking metric measures a gene’s correlation with a phenotype. The value of the ranking metric goes from positive to negative as you move down the ranked list. A positive value indicates correlation with the first phenotype.

In addition to the previous plot we also obtain the “Random ES Distribution” and other GSEA details like a list of genes that contributed to the given GO ID, their list rank and metric score, the Running ES and the Core Enrichment. From the GSEA side-bar, we can create further graphs which show us the distribution of the data and statistical values.

Video Tutorial Step 3.2
GSEA

4. Methods to Visualize Functional Profiles

Blast2GO has several options to visualize list of functional terms.

Word Cloud

The “Word Cloud” feature creates a graphical representation of the most important functions shown in different colors and font sizes depending on its importance. This is an easy way to highlight important functions may be suitable for a more informal presentation. There are several options in the toolbar to change the graphical appearance like the number of words, the orientation and the shape of the cloud as well as the color scheme.

Coloured Graph

The Colored Graph generates a GO graph from a text file which contains a list of GOs, a group name and numeric value. The text file has to follow a simple structure, to be processed correctly. It has to contain three columns: the first one has to contain a GO ID, the second a number which represents the statistical value and the third column contains the group to which the function belongs. The columns must be separated with a tabulator character. The structure is shown in the following table.

We created a Colored Graph with the Top 10 Molecular Functions for both types of cancer. In the Color Wizard, we choose red for functions of the malignant tissue and blue for benign.

	1	2	3
	#GO	Value	Group
1	GO:0005178	1.0E-10	Malignant
2	GO:0005201	3.2E-09	Malignant
3	GO:0005201	5.4E-14	Benign
4	GO:0008201	2.0E-06	Benign

This results in the following visualization.


	Red node: only the malignant cancer has that function. Blue node: only the benign cancer has that function. Double coloured node: over/expressed genes from both cancer types significantly over-represented for these functions

Different charts can also be created from the Colored Graph to visualize results: Level Pie Chart and Bar Chart, and Multilevel Pie Chart and Bar Chart. The levels of the chart mean the specificity of the GO terms, the higher the number, the more specific is the function. The multilevel chart shows that significant functions from the general levels to the specific ones will be shown. For example:

Level 2 Pie Chart	Multilevel Pie Chart	Level 2 Bar Chart

The Level 2 Pie Chart provides the functions more general of the data set.	The Multi-level chart provides functions from different levels.	The Bar Chart provides the same information as the Pie Chart.

In the Coloured Graph toolbar we have some option to visualize the significant functions for each GO categories and adjust the parameters to change the appearance of the graph. Different summary charts are available.

Video Tutorial Step 4
Coloured Graph

5. Conclusions

In this use-case, we used a public human cancer data set and performed a complete functional analysis within Blast2GO. We obtained differential expressed genes from the Pancreatic Expression Database (PED) and the latest Ensembl Human Genome functional annotation with Blast2GO via BioMart. As a first step, we analysed the functional information via GO-Slim. GO-Slim is useful to obtain a general view of the data and allows to quickly summarize a set of Gene Ontology terms at a more general and uniform way. Once we prepared the ID lists for over- and under-expressed genes in the different tissue and cancer types we performed two different types of enrichment analysis: a Fisher’s Exact Test and a Gene Set Enrichment Analysis (GSEA). The Fisher’s Exact Test (FET) is a very useful statistical test to find significantly over-represented functions in a pairwise comparison (two groups of genes/proteins). GSEA allows identifying enriched functions without defining a test and reference set but by providing a ranked list. For this use case, we ordered our gene list by gene the expression fold-change.

The results of the FET and GSEA analysis are shown below. Enriched functions have been identified comparing cancer vs. normal tissue.

FET: We observe similarities between functions identified for malignant and benign cancer. Comparing both types of cancers, we can observe that malignant cancer has more complex functions, one of the most significant is glucuronidation. This function provides cancerous cells resistance to chemotherapy or other drug treatment because it facilitated the excretion of a certain substance.

GSEA: GSEA showed less enriched functions and we obtained only one significant result when comparing the list ranked by fold-change between genes of malignant and normal tissue. However, when analyzing the ranked list of benign vs. normal tissue we detected 3 over-represented functions. “Blood Microparticle” and “Heparin Binding” with a very low, highly significant (less than 0.05) FDR value and “Proteinaceous extracellular matrix” with an FDR close to the cutoff of 0.25. The latter may potentially be a false positive and should be treated with caution; manual review recommended.

Once we obtained these functional profiles we used several visualization methods to review the results. Especially the Colored Graph offered a good way to compare the functional characteristic of benign and malign cancer tissues.

Note: Most steps performed during this analysis are explained more in detail in the video tutorial linked to each section.

Fisher’s Exact Test: PDCA vs NP
GO term	Type	FDR
Extracellular exosome	C	8.1E-35
Extracellular matrix dissasembly	P	7.7E-15
Xenobiotic glucuronidation	P	2.5E-1
Negative regulation of cellular glucuronidation	P	2.8E-12
Integrin binding	F	1.0E-10

Fisher’s Exact Test: CP vs NP
GO term	Type	FDR
Extracelullar exosome	C	4.4E-25
Extracellular matrix dissasembly	P	1.1E-15
Extracellular matrix structural constituent	F	5.4E-14
Collagen catabolic process	P	3.3E-13
Integrin binding	F	2.1E-9

GSEA: PDCA vs NP
GO term	Type	FDR
Blood microparticle	C	1.6E-1

GSEA: CP vs NP
GO term	Type	FDR
Blood microparticle	C	9.9E-3
Heparin binding	F	5.0E-2
Preoteinaceous extracellular matrix	C	2.5E-1

About the author:

Lydia Fortea, Bachelor Student, Biotechnology, Valencia

I wrote this use-case during my 3 month internship at BioBam. As a biotechnology student at the Polytechnic University of Valencia (UPV) this was a great opportunity to introduce myself into the world of bioinformatics. It was a lot of fun to work with the Blast2GO Team and I had a good time learning and created these video tutorials.

PS: If this tutorial was useful for you please share and/or like it below!

Blog Categories:

News

Releases, Media, Announcements, etc.

Use Cases, Reviews, Tutorials

Product Tutorial, Quickstarts, New Features, etc.

Video Tutorials

Helpful Features, Tips and Tricks

Tips And Tricks

Mini-tutorials for common use-cases and to address frequently asked questions FAQs

Analysis Workflow

1. Extract and Review functional information

2. Differential expressed genes

3. Enrichment Analysis

3.1 Fisher’s Exact Test

3.2 GSEA

4. Methods to Visualize Functional Profiles

5. Conclusions

About the author:

Blog Categories:

Most Popular:

Differences between Fisher’s Exact Test and GSEA for Functional Enrichment

Reference-free analysis of long-read RNA sequencing for non-model species

OmicsBox in a New Scientific Publication on Sorghum Stress Responses

LongTREC Secondments at BioBam: Advancing Long-Read Transcriptomics Through Collaboration

Cell Type Prediction with CellKb

Company

OmicsBox

Blog

Info

Security