get-started
Thomas Rauter
20 September, 2024
get-started.Rmd
About this tutorial
This tutorial intends to showcase and explain the capabilities of the SplineOmics package by walking through a real and complete example, from start to finish.
Example Overview
The example involves a time-series proteomics experiment, where CHO (chinese hamster ovary) cells were cultivated in three bioreactors (three biological replicates). The experiment includes the following setup:
- Samples were taken both during the exponential and stationary growth phases.
- Samples were collected in triplicates from each reactor at defined
timepoints relative to cell feeding:
- 60 minutes before feeding
- 15, 60, 90, 120, and 240 minutes after feeding
Analysis Goals
The main goals of this analysis are:
- Identify proteins with significant temporal changes: Out of 7162 cellular proteins, the objective is to detect which proteins show a significant change over time after the CHO cells were fed (i.e., the impact of the feeding).
- Cluster hits based on temporal patterns: The proteins (hits) with significant temporal changes will be clustered according to their time-based patterns.
- Perform gene set enrichment analysis: For each cluster, a gene set enrichment analysis will be performed to determine if specific biological processes are up- or downregulated after feeding.
Note
The documentation of all the SplineOmics package functions can be viewed here
Load the packages
library(SplineOmics)
library(readxl) # for loading Excel files
library(here) # For managing filepaths
#> here() starts at /home/thomas/Documents/PhD/projects/DGTX/R_packages/SplineOmics
library(dplyr) # For data manipulation
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Load the files
In this example, the proteomics_data.rds file contains the numeric values (the intensities) and also the feature descriptions, such as gene and protein name (= annotation part). Usually, you would load the data from for example an Excel file, but the .rds file is more compressed, which is the reason this format was chosen here to limit the size of the SplineOmics package.
The file meta.xlsx contains the meta information, which are the descriptions of the columns of the numeric values of data.
(These example files are part of the package and don’t have to be present on your system).
Please note that this dataset is an actual experimental dataset, but the annotation information, such as gene names, has been removed since it was not yet published at the time of making the SplineOmics package public. Instead, the dataset includes randomly generated gene symbols and gene names corresponding to Cricetulus griseus (Chinese Hamster) for each row. This is intended to demonstrate the functionality of the package.
The left part of data contains the numeric values, and the right part the annotation info, which can be copied in a separate dataframe, as shown below.
data <- readRDS(system.file(
"extdata",
"proteomics_data.rds",
package = "SplineOmics"
))
meta <- read_excel(
system.file(
"extdata",
"proteomics_meta.xlsx",
package = "SplineOmics"
)
)
# Extract the annotation part from the dataframe.
first_na_col <- which(is.na(data[1,]))[1]
annotation <- data |>
dplyr::select((first_na_col + 1):ncol(data)) |>
dplyr::slice(-c(1:3))
print(head(data))
#> # A tibble: 6 × 40
#> `Sample ID` `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11`
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Reactor E09 E10 E12 E09 E10 E12 E09 E10 E12 E09 E10
#> 2 Time Point TP01 TP01 TP01 TP02 TP02 TP02 TP03 TP03 TP03 TP04 TP04
#> 3 Phase of Fe… Expo… Expo… Expo… Expo… Expo… Expo… Expo… Expo… Expo… Expo… Expo…
#> 4 NA 15.2… 15.2… 15.3… 15.1… 15.2… 15.0… 15.2… 15.2… 15.2… 15.1… 15.2…
#> 5 NA 15.0… 15.1… 15.2… 15.1… 15.1… 15.2… 15.2… 15.3… 15.2… 15.1… 15.1…
#> 6 NA 14.5… 14.7… 14.6… 14.5… 14.6… 14.6… 14.5… 14.6… 14.7… 14.5… 14.6…
#> # ℹ 28 more variables: `12` <chr>, `13` <chr>, `14` <chr>, `15` <chr>,
#> # `16` <chr>, `17` <chr>, `18` <chr>, `19` <chr>, `20` <chr>, `21` <chr>,
#> # `22` <chr>, `23` <chr>, `24` <chr>, `25` <chr>, `26` <chr>, `27` <chr>,
#> # `28` <chr>, `29` <chr>, `30` <chr>, `31` <chr>, `32` <chr>, `33` <chr>,
#> # `34` <chr>, `35` <chr>, `36` <chr>, ...38 <lgl>, Gene_symbol <chr>,
#> # Gene_name <chr>
print(meta)
#> # A tibble: 36 × 5
#> Sample.ID Reactor Time.Point Phase Time
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 E09_TP01_Exponential E09 TP01 Exponential -60
#> 2 E10_TP01_Exponential E10 TP01 Exponential -60
#> 3 E12_TP01_Exponential E12 TP01 Exponential -60
#> 4 E09_TP02_Exponential E09 TP02 Exponential 15
#> 5 E10_TP02_Exponential E10 TP02 Exponential 15
#> 6 E12_TP02_Exponential E12 TP02 Exponential 15
#> 7 E09_TP03_Exponential E09 TP03 Exponential 60
#> 8 E10_TP03_Exponential E10 TP03 Exponential 60
#> 9 E12_TP03_Exponential E12 TP03 Exponential 60
#> 10 E09_TP04_Exponential E09 TP04 Exponential 90
#> # ℹ 26 more rows
print(annotation)
#> # A tibble: 4,162 × 2
#> Gene_symbol Gene_name
#> <chr> <chr>
#> 1 LOC113838844 cone-rod homeobox protein-like
#> 2 Wdr83os WD repeat domain 83 opposite strand
#> 3 Cubn cubilin
#> 4 Dynlt1 dynein light chain Tctex-type 1
#> 5 Ostc oligosaccharyltransferase complex non-catalytic subunit
#> 6 Unc5cl unc-5 family C-terminal like
#> 7 Cfl1 cofilin 1
#> 8 LOC100752202 HEN methyltransferase 1
#> 9 LOC100755162 acyl-coenzyme A synthetase ACSM5, mitochondrial
#> 10 LOC100768921 40S ribosomal protein S21
#> # ℹ 4,152 more rows
Two comments about the characteristics the input data should have:
The data must not contain any NA values or other special values, and must consist only of numbers. For example, the original proteomics data contained some NA values, which were resolved in this case by imputation (replacing NA values with numbers).
All features of the data should ideally be normally distributed when analyzed with
limma
, which fits a linear model to each feature. These models rely on statistical tests that assume normality. Althoughlimma
can still function if the data is not normally distributed, the resulting p-values may become less reliable. For this reason, it is strongly recommended to transform your data using techniques such as log2 transformation when the features deviate from normality. Proper transformation helps ensure that the assumptions underlying the statistical tests are met, leading to more accurate and trustworthy results.
Bring the Inputs into the Standardized Format
Since data
is not in the format required by the
SplineOmics package, it needs some processing. The
SplineOmics package requires data to be a numeric matrix, so no element
is allowed to be anything else than a number. This can be done with a
few commands in R, but if your file has a specific structure, the
function extract_data()
can handle this automatically.
File Structure Requirements
If your file looks like the one used here, where:
- The data matrix field is on the left
- The annotation info is on the right
- These fields are separated by one empty column
Usage of the extract_data() function
Then, extract_data()
can:
- Identify the data matrix field and return it as a numeric matrix.
- Create column headers from the information written in the cells above the respective columns of the data matrix field.
-
Assign rowheaders:
- If no annotation columns are specified, rowheaders will be increasing numbers.
- If annotation columns are specified (like
"First.Protein.Description"
and"ID"
in this example), these will be combined to form the rowheaders (feature names).
Usage in Plotting
The generated rowheaders will be used to label any plots where a feature is shown individually, such as:
- Spline plots with the datapoints from an individual feature.
data <- SplineOmics::extract_data(
# The dataframe with the numbers on the left and info on the right.
data = data,
# Use this annotation column for the feature names.
feature_name_columns = c("Gene_name"),
# When TRUE, you must confirm that data is in the required format.
user_prompt = FALSE
)
Perform EDA (exploratory data analysis)
Now that we have the data in the required format (numeric matrix) we can go on.
The first step in analyzing data is typically Exploratory Data Analysis (EDA). EDA involves summarizing the main characteristics of the data, often through visualizations.
Common EDA Plots
Some common types of EDA plots include:
- Density distributions
- Boxplots
- PCA (Principal Component Analysis)
- Correlation heatmaps
Again, you can generate those plots yourself with a few lines of R
code. However, if you prefer, for convenience, the
explore_data()
function can handle this for you.
Using explore_data()
for EDA
The SplineOmics package provides the function
explore_data()
to perform EDA. This function requires the
following arguments:
- data: The numeric data matrix.
- meta: The metadata table.
- condition: The name of the column in the metadata that contains the levels of the experiment (e.g., “Exponential” and “Stationary”).
- report_info: A list that contains general information about the analysis, such as the name of the analyst and the datatype (e.g. proteomics)
Optional Arguments
In addition to the required arguments, explore_data()
offers several optional arguments:
meta_batch_column: The name of the column that contains the first batch effect.
-
meta_batch2_column: The name of the column that contains the second batch effect.
If at least one batch column is provided, the function will:
- Use the
removeBatchEffect()
function from limma to remove the batch effect from the data before plotting. - Generate two EDA HTML reports: one for the uncorrected data and one for the batch-corrected data.
- Use the
Output and Report Options
- By default, the reports are saved in the current working
directory, but this location can be changed using the
report_dir
argument. - The function also returns all plots generated during the analysis, so that you can modify them according to your own needs.
- If you do not want a report to be generated, you can set the
report
argument toFALSE
(when you for example just want the figures in the R environment)
# Those fields are mandatory, because we believe that when such a report is
# opened after half a year, those infos can be very helpful.
report_info <- list(
omics_data_type = "PTX",
data_description = "Proteomics data of CHO cells",
data_collection_date = "February 2024",
analyst_name = "Thomas Rauter",
contact_info = "thomas.rauter@plus.ac.at",
project_name = "DGTX"
)
report_dir <- here::here(
"results",
"explore_data"
)
SplineOmics Object
In the SplineOmics package, multiple functions take the same arguments as input. To make this easier and to avoid errors, we decided that those arguments are not provided individually to the functions, but are all stored in an R6 object (which is of type ‘SplineOmics’) and then this object is passed to the functions. Additionally, some functions generate intermediate output, which is just necessary for the next function in the workflow, which is then also just passed along by updating the SplineOmics object. But you don’t have to worry about this.
Functionality
The SplineOmics object can be seen as a container where all necessary arguments are stored. Each function retrieves the required arguments from the object and potentially adds new data or results back into it.
Documentation
The documentation of the function that creates the SplineOmics object can be found here and the documentation of the function that updates it [here
The documentation for each function that takes the SplineOmics object as input specifies which arguments must be present in the SplineOmics object when it is passed to the respective function.
Required Arguments create_splineomics()
- data: A matrix with the data
- meta: Metadata associated with the data.
- condition: Meta column name of the levels (e.g., Exponential and Stationary).
Optional Arguments create_splineomics()
-
rna_seq_data: An object containing the preprocessed
RNA-seq data, such as the output from
limma::voom
function. - annotation: A dataframe with the feature descriptions of data.
- report_info: A list containing general information about the analysis.
- meta_batch_column: Column for meta batch information.
- meta_batch2_column: Column for secondary meta batch information.
- design: A limma design formula
- spline_params: Parameters for the spline functions.
# splineomics now contains the SplineOmics object.
splineomics <- SplineOmics::create_splineomics(
data = data,
meta = meta,
annotation = annotation,
report_info = report_info,
condition = "Phase", # Column of meta that contains the levels.
meta_batch_column = "Reactor" # For batch effect removal
)
# Special print.SplineOmics function leads to selective printing
print(splineomics)
#> data:SplineOmics Object
#> -------------------
#> Number of features (rows): 4162
#> Number of samples (columns): 36
#> Meta data columns: 5
#> First few meta columns:
#> # A tibble: 3 × 5
#> Sample.ID Reactor Time.Point Phase Time
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 E09_TP01_Exponential E09 TP01 Exponential -60
#> 2 E10_TP01_Exponential E10 TP01 Exponential -60
#> 3 E12_TP01_Exponential E12 TP01 Exponential -60
#> Condition: Phase
#> No RNA-seq data provided.
#> Annotation provided with 4162 entries.
#> No spline parameters set.
#> P-value adjustment method: BH
Now that we have the SplineOmics object defined, we can perform our exploratory data analysis.
plots <- SplineOmics::explore_data(
splineomics = splineomics, # SplineOmics object
report_dir = report_dir
)
Here you can see the HTML report of the explore_data() function with the NOT batch-corrected data, and here the report for the batch-corrected data.
The EDA plots can tell you a range of things. The plots in the HTML report are grouped into three categories: Distribution and Variability Analysis, Time Series Analysis, and Dimensionality Reduction and Clustering.
If you look at the correlation heatmaps in the HTML report, you can see that the samples E12_TP05_Exponential and E10_TP10_Stationary stick out. Seeing this, you might want to remove them from the data. You can test out what happens when you do this, along with testing how other hyperparameter choices influence the results, with the package function screen_limma_hyperparams().
Finding the Best Hyperparameters
Before running the limma spline analysis, it is important to find the best “hyperparameters”. In this context, hyperparameters include:
- Degree of freedom (DoF)
- Different versions of the data (e.g., outlier removed vs. not removed)
- Different limma design formulas
Challenge of Hyperparameter Selection
Rationally determining the best combination of hyperparameters can be
very challenging. By rationally, I mean deciding upon the final
hyperparameters without ever testing any, just by scientific reasoning.
It is much easier just testing a few and seeing how they actually
behave. However, manually selecting combinations can be tedious, and you
have to work very systematically, which can be challenging. To solve
this problem, the screen_limma_hyperparams()
function was
written.
Using screen_limma_hyperparams()
The function screen_limma_hyperparams()
automates the
process of testing different combinations of hyperparameters. Here’s how
it works:
- Specify values: For each hyperparameter, you can specify all the values you want to test.
- Run combinations: The function runs the limma spline analysis with combinations formed from the hyperparameters you’ve provided in a semi combinatorial way.
Inner vs. Outer Hyperparameters
Semi combinatorial here means that not every possible combination is generated. Instead, there are inner and outer hyperparameters:
-
Outer hyperparameters: These include things like
different versions of the dataset (e.g., full dataset
vs. dataset with outliers removed).
- All possible combinations of outer hyperparameters are generated.
-
Inner hyperparameters: These include
adjusted p-value thresholds and spline
parameters (e.g., degree of freedom).
- For each version of the data (outer hyperparameter), all combinations of inner hyperparameters are tested.
This approach is neccessary, because otherwise the amount of combos would explode.
Example
For example, if you have two versions of a dataset (one full dataset, and one with some outliers removed), these versions are considered outer hyperparameters. Additionaly, lets say, you want to test two different limma design formulas, formula 1 and 2. The function will test out all combinations of those outer hyperparameters and compare them with each other, which results in a total of 6 combinations here:
Full Dataset Formula 1 vs Full Dataset Formula 2
Full Dataset Formula 1 vs Outliers Removed Dataset Formula 1
Full Dataset Formula 1 vs Outliers Removed Dataset Formula 2
Full Dataset Formula 2 vs Outliers Removed Dataset Formula 1
Full Dataset Formula 2 vs Outliers Removed Dataset Formula 2
Outliers Removed Dataset Formula 1 vs Outliers Removed Dataset Formula 2
Let’s say you specified the following inner hyperparameters:
- Spline parameters: Natural cubic splines with a degree of freedom of either 2 or 3.
- Adjusted p-value threshold: 0.05 or 0.1.
The function will generate and test all combinations of the spline parameters and p-value thresholds for all 4 combos:
Combo 1:
DoF = 2, threshold = 0.05
DoF = 3, threshold = 0.05
DoF = 2, threshold = 0.1
DoF = 3, threshold = 0.1
Combo 2:
DoF = 2, threshold = 0.05
DoF = 3, threshold = 0.05
DoF = 2, threshold = 0.1
DoF = 3, threshold = 0.1
Combo 3: …
This allows you to systematically explore different combinations and select the optimal hyperparameters for your analysis.
Below is an example for our proteomics data:
data1 <- data
meta1 <- meta
# Remove the "outliers"
data2 <- data[, !(colnames(data) %in% c(
"E12_TP05_Exponential",
"E10_TP10_Stationary"
)
)]
# Adjust meta so that it matches data2
meta2 <- meta[!meta$Sample.ID %in% c(
"E12_TP05_Exponential",
"E10_TP10_Stationary"
), ]
# As mentioned above, all the values of one hyperparameter are stored
# and provided as a list.
datas <- list(data1, data2)
# This will be used to describe the versions of the data.
datas_descr <- c(
"full_data",
"outliers_removed"
)
metas <- list(meta1, meta2)
# Test two different limma designs
designs <- c(
"~ 1 + Phase*X + Reactor",
"~ 1 + X + Reactor"
)
# 'Integrated means' limma will use the full dataset to generate the results for
# each condition. 'Isolated' means limma will use only the respective part of
# the dataset for each condition. Designs that contain the condition column
# (here Phase) must have mode 'integrated', because the full data is needed to
# include the different conditions into the design formula.
modes <- c(
"integrated",
"isolated"
)
# Specify the meta "level" column
condition <- "Phase"
report_dir <- here::here(
"results",
"hyperparams_screen_reports"
)
# To remove the batch effect
meta_batch_column = "Reactor"
# Test out two different p-value thresholds (inner hyperparameter)
adj_pthresholds <- c(
0.05,
0.1
)
# Create a dataframe with combinations of spline parameters to test
# (every row a combo to test)
spline_test_configs <- data.frame(
# 'n' stands for natural cubic splines, b for B-splines.
spline_type = c("n", "n", "b", "b"),
# Degree is not applicable (NA) for natural splines.
degree = c(NA, NA, 2L, 4L),
# Degrees of freedom (DoF) to test.
# Higher dof means spline can fit more complex patterns.
dof = c(2L, 3L, 3L, 4L)
)
print(spline_test_configs)
#> spline_type degree dof
#> 1 n NA 2
#> 2 n NA 3
#> 3 b 2 3
#> 4 b 4 4
Now that we specified all the values for each hyperparameter that we
want to test, we can run the screen_limma_hyperparams()
function.
SplineOmics::screen_limma_hyperparams(
splineomics = splineomics,
datas = datas,
datas_descr = datas_descr,
metas = metas,
designs = designs,
modes = modes,
spline_test_configs = spline_test_configs,
report_dir = report_dir,
adj_pthresholds = adj_pthresholds,
)
As mentioned, this function generates a report for each comparison of the outer hyperparameters, which are too many to show here. You can view an example report here
This report contains the results for the comparison of the “outer” hyperparameters data 1 and design (formula) 1 compared against data 1 and design 2. For both of those, all combinations of the “inner” hyperparameters are generated (every possible combination of all specified adj. p-value thresholds and spline configs).
The encoding used in the reports and the titles is here (This is part of the output of the screen_limma_hyperparams function).
Run limma spline analysis
Once we identified the hyperparameters that are likely the best ones, we can run the limma spline analysis with them and get the results.
Lets just assume for now that the new parameters, with which the SplineOmics object is updated, are the best for this analysis. The choice depends on the analysis. For example, for this analysis, natural cubic splines (n) with a dof of two seemed to fit the data best (not overfitting, but also not underfitting), which was the reason those spline parameters were chosen.
For the design formula, you must specify either ‘isolated’ or ‘integrated’. Isolated means limma determines the results for each level using only the data from that level. Integrated means limma determines the results for all levels using the full dataset (from all levels). For the integrated mode, the condition column (here Phase) must be included in the design. Isolated means that limma only uses the part of the dataset that belongs to a level to obtain the results for that level.
To generate the limma result categories 2 and 3 ()
splineomics <- SplineOmics::update_splineomics(
splineomics = splineomics,
design = "~ 1 + Phase*X + Reactor", # best design formula
mode = "integrated", # means limma uses the full data for each condition.
data = data2, # data without "outliers" was better
meta = meta2,
spline_params = list(
spline_type = c("n"), # natural cubic splines (take these if unsure)
dof = c(2L) # If you are unsure about which dof, start with 2 and increase
)
)
Run the run_limma_splines()
function with the updated
SplineOmics object:
splineomics <- SplineOmics::run_limma_splines(
splineomics = splineomics
)
#> Column 'Reactor' of meta will be used to remove the batch effect for the plotting
#> Info limma spline analysis completed successfully
The output of the function run_limma_splines() is a named list, where each element is a specific “category” of results. Refer to this document for an explanation of the different result categories. Each of those elements is a list, containing as elements the respective limma topTables, either for each level or each comparison between two levels.
The element “time_effect” is a list, where each element is the topTable where the p-value for each feature for the respective level are reported.
The element “avrg_diff_conditions” is a list that contains as elements the topTables, that represent the comparison of the average differences of the levels.
The element “interaction_condition_time” is a list that contains as elements the topTables, that represent the interaction between the levels (which includes both time and the average differences)
Build limma report
The topTables of all three limma result categories can be used to generate p-value histograms an volcano plots.
report_dir <- here::here(
"results",
"create_limma_reports"
)
plots <- SplineOmics::create_limma_report(
splineomics = splineomics,
report_dir = report_dir
)
You can view the generated analysis report of the create_limma_report function here.
This report contains p-value histograms for all three limma result categories and a volcano plot for category 2. Embedded in this file are the downloadable limma topTables with the results for category 1 if the mode was ‘isolated’ and also with the results of category 2 and 3 if the mode was ‘integrated’.
Note that in the upcoming cluster_hits() function report, the embedded file will only contain the clustered significant features from result category 1.
Cluster the hits (significant features)
After we obtained the limma spline results, we can cluster the hits based on their temporal pattern (their spline shape). We define what a hit is by setting an adj. p-value threshold for every level. Hits are features (e.g. proteins) that have an adj. p-value below the threshold. Hierarchical clustering is used to place every hit in one of as many clusters as we have specified for that specific level.
adj_pthresholds <- c( # 0.05 for both levels
0.05, # exponential
0.05 # stationary
)
clusters <- c(
6L, # 6 clusters for the exponential phase level
3L # 3 clusters for the stationary phase level
)
report_dir <- here::here(
"results",
"clustering_reports"
)
plot_info = list( # For the spline plots
y_axis_label = "log2 intensity",
time_unit = "min", # our measurements were in minutes
treatment_labels = list("feeding"), # add this for all conditions
treatment_timepoints = list(0) # Feeding was at 0 minutes.
)
# Like this you can add individual treatment labels to your plots:
# treatment_labels = list(
# exponential = "treatment 1", # One treatment in exp
# stationary = c("treatment 2", "treatment 3") # Two treatments in stat
# additional_condition = NA # No treatment in the hypothetical third condition
# )
#
# treatment_timepoints = list(
# exponential = 0,
# stationary = C(100, 140), # Two treatments also need two timepoints
# additional_condition = NA
# )
#
# or set a treatment for ALL conditions (still always make a list):
#
# treatment_labels = list("treatment")
# treatment_timepoints = list(120)
#
# or set multiple treatments for ALL conditions:
#
# treatment_labels = list(c("treatment1", "treatment2"))
# treatment_timepoints = list(c(120, 90))
# Get all the gene names. They are used for generating files
# which contents can be directly used as the input for the Enrichr webtool,
# if you prefer to manually perform the enrichment. Those files are
# embedded in the output HTML report and can be downloaded from there.
gene_column_name <- "Gene_symbol"
genes <- annotation[[gene_column_name]]
plot_options = list(
# When meta_replicate_column is not there, all datapoints are blue.
meta_replicate_column = "Reactor", # Colors the data points based on Reactor
cluster_heatmap_columns = FALSE # Per default FALSE, just for demonstration
)
clustering_results <- SplineOmics::cluster_hits(
splineomics = splineomics,
adj_pthresholds = adj_pthresholds,
clusters = clusters,
genes = genes,
plot_info = plot_info,
plot_options = plot_options,
report_dir = report_dir,
adj_pthresh_avrg_diff_conditions = 0,
adj_pthresh_interaction_condition_time = 0.25
)
You can view the generated analysis report of the cluster_hits function here.
As discussed before, there are three limma result categories. The cluster_hits() report shows the results of all three, if they are present (category 2 and 3 can only be generated when the design formula contains an interaction effect).
Perform gene set enrichment analysis (GSEA)
Usually, the final step in such a bioinformatics analysis is GSEA. To each clustered hit, the respective gene can be assigned and GSEA performed. For this, the Enrichr databases of choice have to be downloaded:
# Specify which databases you want to download from Enrichr
gene_set_lib <- c(
"WikiPathways_2019_Human",
"NCI-Nature_2016",
"TRRUST_Transcription_Factors_2019",
"MSigDB_Hallmark_2020",
"GO_Cellular_Component_2018",
"CORUM",
"KEGG_2019_Human",
"TRANSFAC_and_JASPAR_PWMs",
"ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X",
"GO_Biological_Process_2018",
"GO_Molecular_Function_2018",
"Human_Gene_Atlas"
)
SplineOmics::download_enrichr_databases(
gene_set_lib = gene_set_lib,
filename = "databases.tsv")
Per default the file is placed in the current working directory, which is the root dir of the R project.
To run GSEA, the downloaded database file has to be loaded as a dataframe. Further, optionally, the clusterProfiler parameters and the report dir can be specified. The function create_gsea_report() runs GSEA using clusterProfiler, generates an HTML report and returns the GSEA dotplots in R.
# Specify the filepath of the TSV file with the database info
downloaded_dbs_filepath <- here::here("databases.tsv")
# Load the file
databases <- read.delim(
downloaded_dbs_filepath,
sep = "\t",
stringsAsFactors = FALSE
)
# Specify the clusterProfiler parameters
clusterProfiler_params <- list(
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
minGSSize = 10,
maxGSSize = 500,
qvalueCutoff = 0.2
)
report_dir <- here::here(
"results",
"gsea_reports"
)
The function below runs the clusterProfiler for all clusters and all levels, and generates the HTML report:
result <- SplineOmics::run_gsea(
# A dataframe with three columns: feature, cluster, and gene. Feature contains
# the integer index of the feature, cluster the integer specifying the cluster
# number, and gene the string of the gene, such as "CLSTN2".
levels_clustered_hits = clustering_results$clustered_hits_levels,
databases = databases,
clusterProfiler_params = clusterProfiler_params,
report_info = report_info,
report_dir = report_dir
)
You can view the generated analysis report of the cluster_hits function here.
This report first shows all enrichment results, where more than 2 genes supported a term, in a tabular format. The table with all the terms with < 2 genes supporting it can be downloaded by clicking on a button below that table.
For the dotplots below that, every row is a term from a specific database, and the columns are the respective clusters. The color scale contains the info about the odds ratio and the size the -log10 adj. p-value. Only terms that have > 2 genes as support are included in the plot. Further, for each cluster, just maximally 5 terms are shown (the terms with the highest odds ratios). Note that when for example cluster 1 already has 5 terms, and cluster 2 does not, and gets a term which was also found for cluster 1, than this term would be included as the sixth term for cluster 1, so this is a way the maximum of 5 can be exceeded.
If a phase, like stationary here, does not lead to any enrichment results, that is stated with a red message.
Conclusion
This example showed most functionalities of the SplineOmics package. You can also run other datatypes with it, including timeseries RNA-seq and glycan data (for those, refer to the documentation in the README file on the GitHub page under Usage/RNA-seq and Glycan Data).
To get an interactive version of this example, download the
SplineOmics package and run the function open_tutorial()
which opens an R Markdown file, where you can run the different code
blocks and if your are working in R Studio (which is recommendet) you
can easily check out the values of the individual variables and generate
the output reports yourself.
When you run the function open_template()
you get a
minimal R Markdown file, where the code is written so that you can use
it as a skeleton to plug in your own data and run it.
We hope that the SplineOmics package makes your scientific data analysis easier. If you face any problems (bugs in the code) or are not satisfied with the documentation, open an issue on GitHub or check out the other options under the Feedback section of the README on GitHub. Thank you!