get-started

About this tutorial

This tutorial intends to showcase and explain the capabilities of the SplineOmics package by walking through a real and complete example, from start to finish.

Example Overview

The example involves a time-series proteomics experiment, where CHO (chinese hamster ovary) cells were cultivated in three bioreactors (three biological replicates). The experiment includes the following setup:

Samples were taken both during the exponential and stationary growth phases.
Samples were collected in triplicates from each reactor at defined timepoints relative to cell feeding:
- 60 minutes before feeding
- 15, 60, 90, 120, and 240 minutes after feeding

Analysis Goals

The main goals of this analysis are:

Identify proteins with significant temporal changes: Out of 7162 cellular proteins, the objective is to detect which proteins show a significant change over time after the CHO cells were fed (i.e., the impact of the feeding).
Cluster hits based on temporal patterns: The proteins (hits) with significant temporal changes will be clustered according to their time-based patterns.
Perform gene set enrichment analysis: For each cluster, a gene set enrichment analysis will be performed to determine if specific biological processes are up- or downregulated after feeding.

Note

The documentation of all the SplineOmics package functions can be viewed here

Load the packages

library(SplineOmics)
library(here) # For managing filepaths
#> here() starts at /home/thomas/Documents/PhD/projects/DGTX/SplineOmics_hub/SplineOmics
library(dplyr) # For data manipulation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(knitr) # For Showing the head of the data and the meta tables.

Load the files

In this example, the proteomics_data.rds file contains the numeric values (the intensities) and also the feature descriptions, such as gene and protein name (= annotation part). Usually, you would load the data from for example an Excel file, but the .rds file is more compressed, which is the reason this format was chosen here to limit the size of the SplineOmics package.

The file meta.xlsx contains the meta information, which are the descriptions of the columns of the numeric values of data.

(These example files are part of the package and don’t have to be present on your system).

Please note that this dataset is an actual experimental dataset, but the annotation information, such as gene names, has been removed since it was not yet published at the time of making the SplineOmics package public. Instead, the dataset includes randomly generated gene symbols and gene names corresponding to Cricetulus griseus (Chinese Hamster) for each row. This is intended to demonstrate the functionality of the package.

The left part of data contains the numeric values, and the right part the annotation info, which can be copied in a separate dataframe, as shown below.

data <- readRDS(xzfile(system.file(
  "extdata",
  "proteomics_data.rds.xz",
  package = "SplineOmics"
)))

meta <- read.csv(
  system.file(
    "extdata",
    "proteomics_meta.csv",
    package = "SplineOmics"
  ),
  stringsAsFactors = FALSE
)

# Extract the annotation part from the dataframe.
first_na_col <- which(is.na(data[1, ]))[1]
annotation <- data |>
  dplyr::select((first_na_col + 1):ncol(data)) |>
  dplyr::slice(-c(1:3))

Show top rows of data

kable(
  head(data),
  format = "markdown"
  )

Sample ID	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	…38	Gene_symbol	Gene_name
Reactor	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	E09	E10	E12	NA	NA	NA
Time Point	TP01	TP01	TP01	TP02	TP02	TP02	TP03	TP03	TP03	TP04	TP04	TP04	TP05	TP05	TP05	TP06	TP06	TP06	TP07	TP07	TP07	TP08	TP08	TP08	TP09	TP09	TP09	TP10	TP10	TP10	TP11	TP11	TP11	TP12	TP12	TP12	NA	NA	NA
Phase of Fermentation	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Exponential	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	Stationary	NA	NA	NA
NA	15.2908248901367	15.2148580551147	15.3249082565308	15.1468906402588	15.2205858230591	15.0987968444824	15.20960521698	15.2530574798584	15.2426490783691	15.1146211624146	15.2669916152954	15.3227806091309	15.2889413833618	15.2406797409058	15.3296279907227	15.1914215087891	15.2090301513672	15.2645998001099	16.4583435058594	16.5065593719482	16.7934131622314	16.60471534729	16.516918182373	16.7601490020752	16.7143936157227	16.4811534881592	16.9271640777588	16.6487922668457	16.5887832641602	16.817533493042	16.7007789611816	16.4686756134033	16.8468856811523	16.7374858856201	16.4771633148193	16.8459930419922	NA	LOC113838844	cone-rod homeobox protein-like
NA	15.0816164016724	15.1488542556763	15.2205448150635	15.1780500411987	15.1131563186646	15.2813320159912	15.2288188934326	15.3091497421265	15.2797374725342	15.1128349304199	15.1366987228394	15.1883420944214	15.2453184127808	15.2480907440186	15.1772556304932	15.0709342956543	14.9833717346191	15.1247396469116	15.523494720459	15.4470376968384	15.3960628509521	15.4655313491821	15.4535474777222	15.4290637969971	15.2166357040405	15.3904008865356	15.2344093322754	15.4606895446777	15.4035310745239	15.338812828064	15.3438587188721	15.4040069580078	15.3590221405029	15.2574281692505	15.4244060516357	15.3422203063965	NA	Wdr83os	WD repeat domain 83 opposite strand
NA	14.5862588882446	14.7493143081665	14.6296472549438	14.5930347442627	14.6145257949829	14.6926259994507	14.5930233001709	14.6085433959961	14.7053565979004	14.5293817520142	14.6346235275269	14.7601327896118	14.5953159332275	14.6270151138306	14.7492513656616	14.6475257873535	14.6990842819214	14.6165771484375	14.9039011001587	14.8870449066162	15.0393772125244	14.8328456878662	14.9858465194702	15.0382928848267	14.8647422790527	14.7847366333008	15.1070947647095	14.8750152587891	15.0755500793457	15.021595954895	14.9314804077148	14.8654098510742	15.1668844223022	14.8610773086548	14.909330368042	14.9585390090942	NA	Cubn	cubilin

Show top rows of meta

kable(
  head(meta),
  format = "markdown"
  )

Sample.ID	Reactor	Time.Point	Phase	Time
E09_TP01_Exponential	E09	TP01	Exponential	-60
E10_TP01_Exponential	E10	TP01	Exponential	-60
E12_TP01_Exponential	E12	TP01	Exponential	-60
E09_TP02_Exponential	E09	TP02	Exponential	15
E10_TP02_Exponential	E10	TP02	Exponential	15
E12_TP02_Exponential	E12	TP02	Exponential	15

Show top rows of annotation

kable(
  head(annotation),
  format = "markdown"
  )

Gene_symbol	Gene_name
LOC113838844	cone-rod homeobox protein-like
Wdr83os	WD repeat domain 83 opposite strand
Cubn	cubilin
Dynlt1	dynein light chain Tctex-type 1
Ostc	oligosaccharyltransferase complex non-catalytic subunit
Unc5cl	unc-5 family C-terminal like

Three comments about the characteristics the input data should have:

The data must not contain any NA values or other special values, and must consist only of numbers. For example, the original proteomics data contained some NA values, which were resolved in this case by imputation (replacing NA values with numbers).
All features of the data should ideally be normally distributed when analyzed with limma, which fits a linear model to each feature. These models rely on statistical tests that assume normality. Although limma can still function if the data is not normally distributed, the resulting p-values may become less reliable. For this reason, it is strongly recommended to transform your data using techniques such as log2 transformation when the features deviate from normality. Proper transformation helps ensure that the assumptions underlying the statistical tests are met, leading to more accurate and trustworthy results.
The samples in the data should be independent of each other. Linear models,
such as those used in limma, assume that the observations (samples) are
independent. If there is a dependency between samples (e.g., repeated
measurements of the same subject), this assumption is violated, which can lead to incorrect statistical inferences.

Bring the Inputs into the Standardized Format

Since data is not in the format required by the SplineOmics package, it needs some processing. The SplineOmics package requires data to be a numeric matrix, so no element is allowed to be anything else than a number. This can be done with a few commands in R, but if your file has a specific structure, the function extract_data() can handle this more easily

Usage of the extract_data() function

extract_data() can:

Extract the data matrix field by specifying the location of the corners of the matrix.
Create column headers from the information written in the cells above the respective columns of the data matrix field.
Assign rowheaders:
- If no annotation columns are specified, rowheaders will be increasing numbers.
- If annotation columns are specified (like "First.Protein.Description" and "ID" in this example), these will be combined to form the rowheaders (feature names).

Usage in Plotting

The generated rowheaders will be used to label any plots where a feature is shown individually, such as:

Spline plots with the datapoints from an individual feature.

data <- SplineOmics::extract_data(
  # The dataframe with the numbers on the left and info on the right.
  data = data,
  # Use this annotation column for the feature names.
  feature_name_columns = c("Gene_name"),
  top_row = 4,
  bottom_row = 4165,
  right_col = 37,
  left_col = 2
)

Perform EDA (exploratory data analysis)

Now that we have the data in the required format (numeric matrix) we can go on.

The first step in analyzing data is typically Exploratory Data Analysis (EDA). EDA involves summarizing the main characteristics of the data, often through visualizations.

Common EDA Plots

Some common types of EDA plots include:

Density distributions
Boxplots
PCA (Principal Component Analysis)
Correlation heatmaps

Again, you can generate those plots yourself with a few lines of R code. However, if you prefer, for convenience, the explore_data() function can handle this for you.

Using `explore_data()` for EDA

The SplineOmics package provides the function explore_data() to perform EDA. This function requires the following arguments:

data: The numeric data matrix.
meta: The metadata table.
condition: The name of the column in the metadata that contains the levels of the experiment (e.g., “Exponential” and “Stationary”).
report_info: A list that contains general information about the analysis, such as the name of the analyst and the datatype (e.g. proteomics)

Optional Arguments

In addition to the required arguments, explore_data() offers several optional arguments:

meta_batch_column: The name of the column that contains the first batch effect.
meta_batch2_column: The name of the column that contains the second batch effect.

If at least one batch column is provided, the function will:
- Use the removeBatchEffect() function from limma to remove the batch effect from the data before plotting.
- Generate two EDA HTML reports: one for the uncorrected data and one for the batch-corrected data.

Output and Report Options

By default, the reports are saved in the current working directory, but this location can be changed using the report_dir argument.
The function also returns all plots generated during the analysis, so that you can modify them according to your own needs.
If you do not want a report to be generated, you can set the report argument to FALSE (when you for example just want the figures in the R environment)

# Those fields are mandatory, because we believe that when such a report is
# opened after half a year, those infos can be very helpful.
report_info <- list(
  omics_data_type = "PTX",
  data_description = "Proteomics data of CHO cells",
  data_collection_date = "February 2024",
  analyst_name = "Thomas Rauter",
  contact_info = "thomas.rauter@plus.ac.at",
  project_name = "DGTX"
)

report_dir <- here::here(
  "results",
  "explore_data"
)

SplineOmics Object

In the SplineOmics package, multiple functions take the same arguments as input. To make this easier and to avoid errors, we decided that those arguments are not provided individually to the functions, but are all stored in an R6 object (which is of type ‘SplineOmics’) and then this object is passed to the functions. Additionally, some functions generate intermediate output, which is just necessary for the next function in the workflow, which is then also just passed along by updating the SplineOmics object. But you don’t have to worry about this.

Functionality

The SplineOmics object can be seen as a container where all necessary arguments are stored. Each function retrieves the required arguments from the object and potentially adds new data or results back into it.

Documentation

The documentation of the function that creates the SplineOmics object can be found here and the documentation of the function that updates it [here

The documentation for each function that takes the SplineOmics object as input specifies which arguments must be present in the SplineOmics object when it is passed to the respective function.

Required Arguments `create_splineomics()`

data: A matrix with the data
meta: Metadata associated with the data.
condition: Meta column name of the levels (e.g., Exponential and Stationary).

Optional Arguments `create_splineomics()`

rna_seq_data: An object containing the preprocessed RNA-seq data, such as the output from limma::voom function.
annotation: A dataframe with the feature descriptions of data.
report_info: A list containing general information about the analysis.
meta_batch_column: Column for meta batch information.
meta_batch2_column: Column for secondary meta batch information.
design: A limma design formula
spline_params: Parameters for the spline functions.

# splineomics now contains the SplineOmics object.
splineomics <- SplineOmics::create_splineomics(
  data = data,
  meta = meta,
  annotation = annotation,
  report_info = report_info,
  condition = "Phase", # Column of meta that contains the levels.
  meta_batch_column = "Reactor" # For batch effect removal
)

# Special print.SplineOmics function leads to selective printing
print(splineomics)
#> data:SplineOmics Object
#> -------------------
#> Number of features (rows): 4162 
#> Number of samples (columns): 36 
#> Meta data columns: 5 
#> First few meta columns:
#>              Sample.ID Reactor Time.Point       Phase Time
#> 1 E09_TP01_Exponential     E09       TP01 Exponential  -60
#> 2 E10_TP01_Exponential     E10       TP01 Exponential  -60
#> 3 E12_TP01_Exponential     E12       TP01 Exponential  -60
#> Condition: Phase 
#> No RNA-seq data provided.
#> Annotation provided with 4162 entries.
#> No spline parameters set.
#> P-value adjustment method: BH

Now that we have the SplineOmics object defined, we can perform our exploratory data analysis.

plots <- SplineOmics::explore_data(
  splineomics = splineomics, # SplineOmics object
  report_dir = report_dir
)

Here you can see the HTML report of the explore_data() function with the NOT batch-corrected data, and here the report for the batch-corrected data.

The EDA plots can tell you a range of things. The plots in the HTML report are grouped into three categories: Distribution and Variability Analysis, Time Series Analysis, and Dimensionality Reduction and Clustering.

If you look at the correlation heatmaps in the HTML report, you can see that the samples E12_TP05_Exponential and E10_TP10_Stationary stick out. Seeing this, you might want to remove them from the data. You can test out what happens when you do this, along with testing how other hyperparameter choices influence the results, with the package function screen_limma_hyperparams().

Finding the Best Hyperparameters

Before running the limma spline analysis, it is important to find the best “hyperparameters”. In this context, hyperparameters include:

Degree of freedom (DoF)
Different versions of the data (e.g., outlier removed vs. not removed)
Different limma design formulas

Challenge of Hyperparameter Selection

Rationally determining the best combination of hyperparameters can be very challenging. By rationally, I mean deciding upon the final hyperparameters without ever testing any, just by scientific reasoning. It is much easier just testing a few and seeing how they actually behave. However, manually selecting combinations can be tedious, and you have to work very systematically, which can be challenging. To solve this problem, the screen_limma_hyperparams() function was written.

Using `screen_limma_hyperparams()`

The function screen_limma_hyperparams() automates the process of testing different combinations of hyperparameters. Here’s how it works:

Specify values: For each hyperparameter, you can specify all the values you want to test.
Run combinations: The function runs the limma spline analysis with combinations formed from the hyperparameters you’ve provided in a semi combinatorial way.

Inner vs. Outer Hyperparameters

Semi combinatorial here means that not every possible combination is generated. Instead, there are inner and outer hyperparameters:

Outer hyperparameters: These include things like different versions of the dataset (e.g., full dataset vs. dataset with outliers removed).
- All possible combinations of outer hyperparameters are generated.
Inner hyperparameters: These include adjusted p-value thresholds and spline parameters (e.g., degree of freedom).
- For each version of the data (outer hyperparameter), all combinations of inner hyperparameters are tested.

This approach is neccessary, because otherwise the amount of combos would explode.

Example

For example, if you have two versions of a dataset (one full dataset, and one with some outliers removed), these versions are considered outer hyperparameters. Additionaly, lets say, you want to test two different limma design formulas, formula 1 and 2. The function will test out all combinations of those outer hyperparameters and compare them with each other, which results in a total of 6 combinations here:

Full Dataset Formula 1 vs Full Dataset Formula 2
Full Dataset Formula 1 vs Outliers Removed Dataset Formula 1
Full Dataset Formula 1 vs Outliers Removed Dataset Formula 2
Full Dataset Formula 2 vs Outliers Removed Dataset Formula 1
Full Dataset Formula 2 vs Outliers Removed Dataset Formula 2
Outliers Removed Dataset Formula 1 vs Outliers Removed Dataset Formula 2

Let’s say you specified the following inner hyperparameters:

Spline parameters: Natural cubic splines with a degree of freedom of either 2 or 3.
Adjusted p-value threshold: 0.05 or 0.1.

The function will generate and test all combinations of the spline parameters and p-value thresholds for all 4 combos:

Combo 1:

DoF = 2, threshold = 0.05
DoF = 3, threshold = 0.05
DoF = 2, threshold = 0.1
DoF = 3, threshold = 0.1

Combo 2:

DoF = 2, threshold = 0.05
DoF = 3, threshold = 0.05
DoF = 2, threshold = 0.1
DoF = 3, threshold = 0.1

Combo 3: …

This allows you to systematically explore different combinations and select the optimal hyperparameters for your analysis.

Below is an example for our proteomics data:

data1 <- data
meta1 <- meta

# Remove the "outliers"
data2 <- data[, !(colnames(data) %in% c(
  "E12_TP05_Exponential",
  "E10_TP10_Stationary"
)
)]

# Adjust meta so that it matches data2
meta2 <- meta[!meta$Sample.ID %in% c(
  "E12_TP05_Exponential",
  "E10_TP10_Stationary"
), ]

# As mentioned above, all the values of one hyperparameter are stored
# and provided as a list.
datas <- list(data1, data2)

# This will be used to describe the versions of the data.
datas_descr <- c(
  "full_data",
  "outliers_removed"
)

metas <- list(meta1, meta2)

# Test two different limma designs
designs <- c(
  "~ 1 + Phase*Time + Reactor",
  "~ 1 + Time + Reactor"
)

# 'Integrated means' limma will use the full dataset to generate the results for
# each condition. 'Isolated' means limma will use only the respective part of
# the dataset for each condition. Designs that contain the condition column
# (here Phase) must have mode 'integrated', because the full data is needed to
# include the different conditions into the design formula.
modes <- c(
  "integrated",
  "isolated"
)

# Specify the meta "level" column
condition <- "Phase"

report_dir <- here::here(
  "results",
  "hyperparams_screen_reports"
)

# To remove the batch effect
meta_batch_column <- "Reactor"

# Test out two different p-value thresholds (inner hyperparameter)
adj_pthresholds <- c(
  0.05,
  0.1
)

# Create a dataframe with combinations of spline parameters to test
# (every row a combo to test)
spline_test_configs <- data.frame(
  # 'n' stands for natural cubic splines, b for B-splines.
  spline_type = c("n", "n", "b", "b"),
  # Degree is not applicable (NA) for natural splines.
  degree = c(NA, NA, 2L, 4L),
  # Degrees of freedom (DoF) to test.
  # Higher dof means spline can fit more complex patterns.
  dof = c(2L, 3L, 3L, 4L)
)

print(spline_test_configs)
#>   spline_type degree dof
#> 1           n     NA   2
#> 2           n     NA   3
#> 3           b      2   3
#> 4           b      4   4

Now that we specified all the values for each hyperparameter that we want to test, we can run the screen_limma_hyperparams() function.

SplineOmics::screen_limma_hyperparams(
  splineomics = splineomics,
  datas = datas,
  datas_descr = datas_descr,
  metas = metas,
  designs = designs,
  modes = modes,
  spline_test_configs = spline_test_configs,
  report_dir = report_dir,
  adj_pthresholds = adj_pthresholds,
)

As mentioned, this function generates a report for each comparison of the outer hyperparameters, which are too many to show here. You can view an example report here

This report contains the results for the comparison of the “outer” hyperparameters data 1 and design (formula) 1 compared against data 1 and design 2. For both of those, all combinations of the “inner” hyperparameters are generated (every possible combination of all specified adj. p-value thresholds and spline configs).

The encoding used in the reports and the titles is here (This is part of the output of the screen_limma_hyperparams function).

Run limma spline analysis

Once we identified the hyperparameters that are likely the best ones, we can run the limma spline analysis with them and get the results.

Lets just assume for now that the new parameters, with which the SplineOmics object is updated, are the best for this analysis. The choice depends on the analysis. For example, for this analysis, natural cubic splines (n) with a dof of two seemed to fit the data best (not overfitting, but also not underfitting), which was the reason those spline parameters were chosen.

For the design formula, you must specify either ‘isolated’ or ‘integrated’. Isolated means limma determines the results for each level using only the data from that level. Integrated means limma determines the results for all levels using the full dataset (from all levels). For the integrated mode, the condition column (here Phase) must be included in the design. Isolated means that limma only uses the part of the dataset that belongs to a level to obtain the results for that level.

To generate the limma result categories 2 and 3 ()

splineomics <- SplineOmics::update_splineomics(
  splineomics = splineomics,
  design = "~ 1 + Phase*Time + Reactor", # best design formula
  mode = "integrated", # means limma uses the full data for each condition.
  data = data2, # data without "outliers" was better
  meta = meta2,
  # States explicitly that there is no problem of heteroscedasticity and 
  # therefore, this does not need to be adressed. Setting it to TRUE would mean
  # the opposite, and when setting it to NULL, it means it should be handled 
  # implicitly. For details, see Reference
  # documentation of the create_splineomics() function.
  use_array_weights = FALSE,
  spline_params = list(
    spline_type = c("n"), # natural cubic splines (take these if unsure)
    dof = c(2L) # If you are unsure about which dof, start with 2 and increase
  )
)

Run the run_limma_splines() function with the updated SplineOmics object:

splineomics <- SplineOmics::run_limma_splines(
  splineomics = splineomics
)
#> Column 'Reactor' of meta will be used to remove the batch effect for the plotting
#> Make sure that the design formula contains no interaction between the condition and time for mode == isolated, and that it contains an interaction for mode == integrated. Otherwise, you will get an uncaught error of 'coefficients not estimable' or 'subscript out of bounds'.
#> Info limma spline analysis completed successfully

The output of the function run_limma_splines() is a named list, where each element is a specific “category” of results. Refer to this document for an explanation of the different result categories. Each of those elements is a list, containing as elements the respective limma topTables, either for each level or each comparison between two levels.

The element “time_effect” is a list, where each element is the topTable where the p-value for each feature for the respective level are reported.

The element “avrg_diff_conditions” is a list that contains as elements the topTables, that represent the comparison of the average differences of the levels.

The element “interaction_condition_time” is a list that contains as elements the topTables, that represent the interaction between the levels (which includes both time and the average differences)

Build limma report

The topTables of all three limma result categories can be used to generate p-value histograms an volcano plots.

report_dir <- here::here(
  "results",
  "create_limma_reports"
)

plots <- SplineOmics::create_limma_report(
  splineomics = splineomics,
  report_dir = report_dir
)

You can view the generated analysis report of the create_limma_report function here.

This report contains p-value histograms for all three limma result categories and a volcano plot for category 2. Embedded in this file are the downloadable limma topTables with the results for category 1 if the mode was ‘isolated’ and also with the results of category 2 and 3 if the mode was ‘integrated’.

Note that in the upcoming cluster_hits() function report, the embedded file will only contain the clustered significant features from result category 1.

Cluster the hits (significant features)

After we obtained the limma spline results, we can cluster the hits based on their temporal pattern (their spline shape). We define what a hit is by setting an adj. p-value threshold for every level. Hits are features (e.g. proteins) that have an adj. p-value below the threshold. Hierarchical clustering is used to place every hit in one of as many clusters as we have specified for that specific level.

adj_pthresholds <- c( # 0.05 for both levels
  0.05, # exponential
  0.05 # stationary
)

clusters <- c(
  6L, # 6 clusters for the exponential phase level
  3L # 3 clusters for the stationary phase level
)

report_dir <- here::here(
  "results",
  "clustering_reports"
)

plot_info <- list( # For the spline plots
  y_axis_label = "log2 intensity",
  time_unit = "min", # our measurements were in minutes
  treatment_labels = list("feeding"), # add this for all conditions
  treatment_timepoints = list(0) # Feeding was at 0 minutes.
)

# Like this you can add individual treatment labels to your plots:
# treatment_labels = list(
#   exponential = "treatment 1", # One treatment in exp
#   stationary = c("treatment 2", "treatment 3"),  # Two treatments in stat
#   additional_condition = NA  # No treatment in the hypothetical third condition
#   )
#
# treatment_timepoints = list(
#   exponential = 0,
#   stationary = c(100, 140),  # Two treatments also need two timepoints
#   additional_condition = NA
#   )
#
# or set a treatment for ALL conditions (still always make a list):
#
# treatment_labels = list("treatment")
# treatment_timepoints = list(120)
#
# or set multiple treatments for ALL conditions:
#
# treatment_labels = list(c("treatment1", "treatment2"))
# treatment_timepoints = list(c(120, 90))


# Get all the gene names. They are used for generating files
# which contents can be directly used as the input for the Enrichr webtool,
# if you prefer to manually perform the enrichment. Those files are
# embedded in the output HTML report and can be downloaded from there.
gene_column_name <- "Gene_symbol"
genes <- annotation[[gene_column_name]]

plot_options <- list(
  # When meta_replicate_column is not there, all datapoints are blue.
  meta_replicate_column = "Reactor", # Colors the data points based on Reactor
  cluster_heatmap_columns = FALSE # Per default FALSE, just for demonstration
)

clustering_results <- SplineOmics::cluster_hits(
  splineomics = splineomics,
  adj_pthresholds = adj_pthresholds,
  clusters = clusters,
  genes = genes,
  plot_info = plot_info,
  plot_options = plot_options,
  report_dir = report_dir,
  adj_pthresh_avrg_diff_conditions = 0,
  adj_pthresh_interaction_condition_time = 0.25
)

You can view the generated analysis report of the cluster_hits function here.

As discussed before, there are three limma result categories. The cluster_hits() report shows the results of all three, if they are present (category 2 and 3 can only be generated when the design formula contains an interaction effect).

Perform gene set enrichment analysis (GSEA)

Once the clustered hits are identified, a subsequent step to gain biological insights is to perform GSEA. For this, the respective genes can be assigned to each clustered hit, and GSEA can be carried out. To proceed, the Enrichr databases of choice need to be downloaded:

# Specify which databases you want to download from Enrichr
gene_set_lib <- c(
  "WikiPathways_2019_Human",
  "NCI-Nature_2016",
  "TRRUST_Transcription_Factors_2019",
  "MSigDB_Hallmark_2020",
  "GO_Cellular_Component_2018",
  "CORUM",
  "KEGG_2019_Human",
  "TRANSFAC_and_JASPAR_PWMs",
  "ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X",
  "GO_Biological_Process_2018",
  "GO_Molecular_Function_2018",
  "Human_Gene_Atlas"
)

SplineOmics::download_enrichr_databases(
  gene_set_lib = gene_set_lib,
  filename = "databases.tsv"
)

Per default the file is placed in the current working directory, which is the root dir of the R project.

To run GSEA, the downloaded database file has to be loaded as a dataframe. Further, optionally, the clusterProfiler parameters and the report dir can be specified. The function create_gsea_report() runs GSEA using clusterProfiler, generates an HTML report and returns the GSEA dotplots in R.

# Specify the filepath of the TSV file with the database info
downloaded_dbs_filepath <- here::here("databases.tsv")

# Load the file
databases <- read.delim(
  downloaded_dbs_filepath,
  sep = "\t",
  stringsAsFactors = FALSE
)

# Specify the clusterProfiler parameters
clusterProfiler_params <- list(
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  minGSSize = 10,
  maxGSSize = 500,
  qvalueCutoff = 0.2
)

report_dir <- here::here(
  "results",
  "gsea_reports"
)

The function below runs the clusterProfiler for all clusters and all levels, and generates the HTML report:

result <- SplineOmics::run_gsea(
  # A dataframe with three columns: feature, cluster, and gene. Feature contains
  # the integer index of the feature, cluster the integer specifying the cluster
  # number, and gene the string of the gene, such as "CLSTN2".
  levels_clustered_hits = clustering_results$clustered_hits_levels,
  databases = databases,
  clusterProfiler_params = clusterProfiler_params,
  report_info = report_info,
  report_dir = report_dir
)

You can view the generated analysis report of the run_gsea function here.

This report first shows all enrichment results, where more than 2 genes supported a term, in a tabular format. The table with all the terms with < 2 genes supporting it can be downloaded by clicking on a button below that table.

For the dotplots below that, every row is a term from a specific database, and the columns are the respective clusters. The color scale contains the info about the odds ratio and the size the -log10 adj. p-value. Only terms that have > 2 genes as support are included in the plot. Further, for each cluster, just maximally 5 terms are shown (the terms with the highest odds ratios). Note that when for example cluster 1 already has 5 terms, and cluster 2 does not, and gets a term which was also found for cluster 1, than this term would be included as the sixth term for cluster 1, so this is a way the maximum of 5 can be exceeded.

If a phase, like stationary here, does not lead to any enrichment results, that is stated with a red message.

Conclusion

This example showed most functionalities of the SplineOmics package. You can also run other datatypes with it, including timeseries RNA-seq and glycan data (for those, refer to the documentation in the README file on the GitHub page under Usage/RNA-seq and Glycan Data).

To get an interactive version of this example, download the SplineOmics package and run the function open_tutorial() which opens an R Markdown file, where you can run the different code blocks and if your are working in R Studio (which is recommendet) you can easily check out the values of the individual variables and generate the output reports yourself.

We hope that the SplineOmics package makes your scientific data analysis easier. If you face any problems (bugs in the code) or are not satisfied with the documentation, open an issue on GitHub or check out the other options under the Feedback section of the README on GitHub. Thank you!

Session Info

#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/local/R-4.3.3/lib/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=de_AT.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Vienna
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices datasets  utils     methods   base     
#> 
#> other attached packages:
#> [1] knitr_1.50        dplyr_1.1.4       here_1.0.1        SplineOmics_0.2.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rdpack_2.6.4             bitops_1.0-9             gridExtra_2.3           
#>   [4] rlang_1.1.6              magrittr_2.0.3           clue_0.3-66             
#>   [7] GetoptLong_1.0.5         matrixStats_1.5.0        compiler_4.3.3          
#>  [10] reshape2_1.4.4           png_0.1-8                systemfonts_1.2.3       
#>  [13] vctrs_0.6.5              stringr_1.5.1            pkgconfig_2.0.3         
#>  [16] shape_1.4.6.1            crayon_1.5.3             fastmap_1.2.0           
#>  [19] backports_1.5.0          caTools_1.18.3           rmarkdown_2.29          
#>  [22] nloptr_2.2.1             ragg_1.4.0               purrr_1.0.4             
#>  [25] xfun_0.52                cachem_1.1.0             jsonlite_2.0.0          
#>  [28] progress_1.2.3           EnvStats_3.1.0           remaCor_0.0.18          
#>  [31] BiocParallel_1.36.0      broom_1.0.8              parallel_4.3.3          
#>  [34] prettyunits_1.2.0        cluster_2.1.8.1          R6_2.6.1                
#>  [37] bslib_0.9.0              stringi_1.8.7            RColorBrewer_1.1-3      
#>  [40] limma_3.58.1             boot_1.3-31              car_3.1-3               
#>  [43] numDeriv_2016.8-1.1      jquerylib_0.1.4          Rcpp_1.0.14             
#>  [46] iterators_1.0.14         base64enc_0.1-3          IRanges_2.36.0          
#>  [49] Matrix_1.6-5             splines_4.3.3            tidyselect_1.2.1        
#>  [52] rstudioapi_0.17.1        abind_1.4-8              yaml_2.3.10             
#>  [55] viridis_0.6.5            doParallel_1.0.17        gplots_3.2.0            
#>  [58] codetools_0.2-20         plyr_1.8.9               lmerTest_3.1-3          
#>  [61] lattice_0.22-7           tibble_3.2.1             withr_3.0.2             
#>  [64] Biobase_2.62.0           evaluate_1.0.3           desc_1.4.3              
#>  [67] zip_2.3.3                circlize_0.4.16          pillar_1.10.2           
#>  [70] BiocManager_1.30.25      carData_3.0-5            KernSmooth_2.23-26      
#>  [73] renv_1.1.4               foreach_1.5.2            stats4_4.3.3            
#>  [76] reformulas_0.4.1         generics_0.1.4           rprojroot_2.0.4         
#>  [79] S4Vectors_0.40.2         hms_1.1.3                ggplot2_3.5.2           
#>  [82] scales_1.4.0             aod_1.3.3                minqa_1.2.8             
#>  [85] gtools_3.9.5             RhpcBLASctl_0.23-42      glue_1.8.0              
#>  [88] pheatmap_1.0.12          tools_4.3.3              fANCOVA_0.6-1           
#>  [91] dendextend_1.19.0        variancePartition_1.32.5 lme4_1.1-37             
#>  [94] openxlsx_4.2.8           mvtnorm_1.3-3            fs_1.6.6                
#>  [97] grid_4.3.3               tidyr_1.3.1              rbibutils_2.3           
#> [100] colorspace_2.1-1         nlme_3.1-168             patchwork_1.3.0         
#> [103] Formula_1.2-5            cli_3.6.5                textshaping_1.0.1       
#> [106] viridisLite_0.4.2        svglite_2.2.1            ComplexHeatmap_2.18.0   
#> [109] corpcor_1.6.10           gtable_0.3.6             sass_0.4.10             
#> [112] digest_0.6.37            BiocGenerics_0.48.1      pbkrtest_0.5.4          
#> [115] ggrepel_0.9.6            rjson_0.2.23             htmlwidgets_1.6.4       
#> [118] farver_2.1.2             htmltools_0.5.8.1        pkgdown_2.1.3           
#> [121] lifecycle_1.0.4          GlobalOptions_0.1.2      statmod_1.5.0           
#> [124] MASS_7.3-60.0.1

Thomas Rauter

20 September, 2024