Skip to contents

About this tutorial

This tutorial intends to showcase and explain the capabilities of the SplineOmics package by walking through a real and complete example, from start to finish.

Example Overview

The example involves a time-series proteomics experiment, where CHO (chinese hamster ovary) cells were cultivated in three bioreactors (three biological replicates). The experiment includes the following setup:

  • Samples were taken both during the exponential and stationary growth phases.
  • Samples were collected in triplicates from each reactor at defined timepoints relative to cell feeding:
    • 60 minutes before feeding
    • 15, 60, 90, 120, and 240 minutes after feeding

Analysis Goals

The main goals of this analysis are:

  • Identify proteins with significant temporal changes: Out of 7162 cellular proteins, the objective is to detect which proteins show a significant change over time after the CHO cells were fed (i.e., the impact of the feeding).
  • Cluster hits based on temporal patterns: The proteins (hits) with significant temporal changes will be clustered according to their time-based patterns.
  • Perform gene set enrichment analysis: For each cluster, a gene set enrichment analysis will be performed to determine if specific biological processes are up- or downregulated after feeding.

Note

The documentation of all the SplineOmics package functions can be viewed here

Load the packages

library(SplineOmics)
library(readxl)  # for loading Excel files
library(here)    # For managing filepaths
#> here() starts at /home/thomas/Documents/PhD/projects/DGTX/R_packages/SplineOmics
library(dplyr)   # For data manipulation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Load the files

In this example, the proteomics_data.rds file contains the numeric values (the intensities) and also the feature descriptions, such as gene and protein name (= annotation part). Usually, you would load the data from for example an Excel file, but the .rds file is more compressed, which is the reason this format was chosen here to limit the size of the SplineOmics package.

The file meta.xlsx contains the meta information, which are the descriptions of the columns of the numeric values of data.

(These example files are part of the package and don’t have to be present on your system).

Please note that this dataset is an actual experimental dataset, but the annotation information, such as gene names, has been removed since it was not yet published at the time of making the SplineOmics package public. Instead, the dataset includes randomly generated gene symbols and gene names corresponding to Cricetulus griseus (Chinese Hamster) for each row. This is intended to demonstrate the functionality of the package.

The left part of data contains the numeric values, and the right part the annotation info, which can be copied in a separate dataframe, as shown below.


data <- readRDS(system.file(
  "extdata",
  "proteomics_data.rds",
  package = "SplineOmics"
  ))


meta <- read_excel(
  system.file(
    "extdata",
    "proteomics_meta.xlsx",
    package = "SplineOmics"
    )
  )

# Extract the annotation part from the dataframe.
first_na_col <- which(is.na(data[1,]))[1]
annotation <- data |>
  dplyr::select((first_na_col + 1):ncol(data)) |>
  dplyr::slice(-c(1:3))

print(head(data))
#> # A tibble: 6 × 40
#>   `Sample ID`  `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`   `10`  `11` 
#>   <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Reactor      E09   E10   E12   E09   E10   E12   E09   E10   E12   E09   E10  
#> 2 Time Point   TP01  TP01  TP01  TP02  TP02  TP02  TP03  TP03  TP03  TP04  TP04 
#> 3 Phase of Fe… Expo… Expo… Expo… Expo… Expo… Expo… Expo… Expo… Expo… Expo… Expo…
#> 4 NA           15.2… 15.2… 15.3… 15.1… 15.2… 15.0… 15.2… 15.2… 15.2… 15.1… 15.2…
#> 5 NA           15.0… 15.1… 15.2… 15.1… 15.1… 15.2… 15.2… 15.3… 15.2… 15.1… 15.1…
#> 6 NA           14.5… 14.7… 14.6… 14.5… 14.6… 14.6… 14.5… 14.6… 14.7… 14.5… 14.6…
#> # ℹ 28 more variables: `12` <chr>, `13` <chr>, `14` <chr>, `15` <chr>,
#> #   `16` <chr>, `17` <chr>, `18` <chr>, `19` <chr>, `20` <chr>, `21` <chr>,
#> #   `22` <chr>, `23` <chr>, `24` <chr>, `25` <chr>, `26` <chr>, `27` <chr>,
#> #   `28` <chr>, `29` <chr>, `30` <chr>, `31` <chr>, `32` <chr>, `33` <chr>,
#> #   `34` <chr>, `35` <chr>, `36` <chr>, ...38 <lgl>, Gene_symbol <chr>,
#> #   Gene_name <chr>
print(meta)
#> # A tibble: 36 × 5
#>    Sample.ID            Reactor Time.Point Phase        Time
#>    <chr>                <chr>   <chr>      <chr>       <dbl>
#>  1 E09_TP01_Exponential E09     TP01       Exponential   -60
#>  2 E10_TP01_Exponential E10     TP01       Exponential   -60
#>  3 E12_TP01_Exponential E12     TP01       Exponential   -60
#>  4 E09_TP02_Exponential E09     TP02       Exponential    15
#>  5 E10_TP02_Exponential E10     TP02       Exponential    15
#>  6 E12_TP02_Exponential E12     TP02       Exponential    15
#>  7 E09_TP03_Exponential E09     TP03       Exponential    60
#>  8 E10_TP03_Exponential E10     TP03       Exponential    60
#>  9 E12_TP03_Exponential E12     TP03       Exponential    60
#> 10 E09_TP04_Exponential E09     TP04       Exponential    90
#> # ℹ 26 more rows
print(annotation)
#> # A tibble: 4,162 × 2
#>    Gene_symbol  Gene_name                                              
#>    <chr>        <chr>                                                  
#>  1 LOC113838844 cone-rod homeobox protein-like                         
#>  2 Wdr83os      WD repeat domain 83 opposite strand                    
#>  3 Cubn         cubilin                                                
#>  4 Dynlt1       dynein light chain Tctex-type 1                        
#>  5 Ostc         oligosaccharyltransferase complex non-catalytic subunit
#>  6 Unc5cl       unc-5 family C-terminal like                           
#>  7 Cfl1         cofilin 1                                              
#>  8 LOC100752202 HEN methyltransferase 1                                
#>  9 LOC100755162 acyl-coenzyme A synthetase ACSM5, mitochondrial        
#> 10 LOC100768921 40S ribosomal protein S21                              
#> # ℹ 4,152 more rows

Two comments about the characteristics the input data should have:

  • The data must not contain any NA values or other special values, and must consist only of numbers. For example, the original proteomics data contained some NA values, which were resolved in this case by imputation (replacing NA values with numbers).

  • All features of the data should ideally be normally distributed when analyzed with limma, which fits a linear model to each feature. These models rely on statistical tests that assume normality. Although limma can still function if the data is not normally distributed, the resulting p-values may become less reliable. For this reason, it is strongly recommended to transform your data using techniques such as log2 transformation when the features deviate from normality. Proper transformation helps ensure that the assumptions underlying the statistical tests are met, leading to more accurate and trustworthy results.

Bring the Inputs into the Standardized Format

Since data is not in the format required by the SplineOmics package, it needs some processing. The SplineOmics package requires data to be a numeric matrix, so no element is allowed to be anything else than a number. This can be done with a few commands in R, but if your file has a specific structure, the function extract_data() can handle this automatically.

File Structure Requirements

If your file looks like the one used here, where:

  • The data matrix field is on the left
  • The annotation info is on the right
  • These fields are separated by one empty column

Usage of the extract_data() function

Then, extract_data() can:

  • Identify the data matrix field and return it as a numeric matrix.
  • Create column headers from the information written in the cells above the respective columns of the data matrix field.
  • Assign rowheaders:
    • If no annotation columns are specified, rowheaders will be increasing numbers.
    • If annotation columns are specified (like "First.Protein.Description" and "ID" in this example), these will be combined to form the rowheaders (feature names).

Usage in Plotting

The generated rowheaders will be used to label any plots where a feature is shown individually, such as:

  • Spline plots with the datapoints from an individual feature.
data <- SplineOmics::extract_data(
  # The dataframe with the numbers on the left and info on the right.
  data = data,
  # Use this annotation column for the feature names.
  feature_name_columns = c("Gene_name"),  
  # When TRUE, you must confirm that data is in the required format.
  user_prompt = FALSE   
  )

Perform EDA (exploratory data analysis)

Now that we have the data in the required format (numeric matrix) we can go on.

The first step in analyzing data is typically Exploratory Data Analysis (EDA). EDA involves summarizing the main characteristics of the data, often through visualizations.

Common EDA Plots

Some common types of EDA plots include:

  • Density distributions
  • Boxplots
  • PCA (Principal Component Analysis)
  • Correlation heatmaps

Again, you can generate those plots yourself with a few lines of R code. However, if you prefer, for convenience, the explore_data() function can handle this for you.

Using explore_data() for EDA

The SplineOmics package provides the function explore_data() to perform EDA. This function requires the following arguments:

  • data: The numeric data matrix.
  • meta: The metadata table.
  • condition: The name of the column in the metadata that contains the levels of the experiment (e.g., “Exponential” and “Stationary”).
  • report_info: A list that contains general information about the analysis, such as the name of the analyst and the datatype (e.g. proteomics)

Optional Arguments

In addition to the required arguments, explore_data() offers several optional arguments:

  • meta_batch_column: The name of the column that contains the first batch effect.

  • meta_batch2_column: The name of the column that contains the second batch effect.

    If at least one batch column is provided, the function will:

    • Use the removeBatchEffect() function from limma to remove the batch effect from the data before plotting.
    • Generate two EDA HTML reports: one for the uncorrected data and one for the batch-corrected data.

Output and Report Options

  • By default, the reports are saved in the current working directory, but this location can be changed using the report_dir argument.
  • The function also returns all plots generated during the analysis, so that you can modify them according to your own needs.
  • If you do not want a report to be generated, you can set the report argument to FALSE (when you for example just want the figures in the R environment)
# Those fields are mandatory, because we believe that when such a report is
# opened after half a year, those infos can be very helpful. 
report_info <- list(
  omics_data_type = "PTX",
  data_description = "Proteomics data of CHO cells",
  data_collection_date = "February 2024",
  analyst_name = "Thomas Rauter",
  contact_info = "thomas.rauter@plus.ac.at",
  project_name = "DGTX"
  )

report_dir <- here::here(
  "results",
  "explore_data"
  )

SplineOmics Object

In the SplineOmics package, multiple functions take the same arguments as input. To make this easier and to avoid errors, we decided that those arguments are not provided individually to the functions, but are all stored in an R6 object (which is of type ‘SplineOmics’) and then this object is passed to the functions. Additionally, some functions generate intermediate output, which is just necessary for the next function in the workflow, which is then also just passed along by updating the SplineOmics object. But you don’t have to worry about this.

Functionality

The SplineOmics object can be seen as a container where all necessary arguments are stored. Each function retrieves the required arguments from the object and potentially adds new data or results back into it.

Documentation

The documentation of the function that creates the SplineOmics object can be found here and the documentation of the function that updates it [here

The documentation for each function that takes the SplineOmics object as input specifies which arguments must be present in the SplineOmics object when it is passed to the respective function.

Required Arguments create_splineomics()

  • data: A matrix with the data
  • meta: Metadata associated with the data.
  • condition: Meta column name of the levels (e.g., Exponential and Stationary).

Optional Arguments create_splineomics()

  • rna_seq_data: An object containing the preprocessed RNA-seq data, such as the output from limma::voom function.
  • annotation: A dataframe with the feature descriptions of data.
  • report_info: A list containing general information about the analysis.
  • meta_batch_column: Column for meta batch information.
  • meta_batch2_column: Column for secondary meta batch information.
  • design: A limma design formula
  • spline_params: Parameters for the spline functions.
# splineomics now contains the SplineOmics object.
splineomics <- SplineOmics::create_splineomics(
  data = data,
  meta = meta,
  annotation = annotation,
  report_info = report_info,
  condition = "Phase",  # Column of meta that contains the levels.
  meta_batch_column = "Reactor"  # For batch effect removal
)

# Special print.SplineOmics function leads to selective printing
print(splineomics)   
#> data:SplineOmics Object
#> -------------------
#> Number of features (rows): 4162 
#> Number of samples (columns): 36 
#> Meta data columns: 5 
#> First few meta columns:
#> # A tibble: 3 × 5
#>   Sample.ID            Reactor Time.Point Phase        Time
#>   <chr>                <chr>   <chr>      <chr>       <dbl>
#> 1 E09_TP01_Exponential E09     TP01       Exponential   -60
#> 2 E10_TP01_Exponential E10     TP01       Exponential   -60
#> 3 E12_TP01_Exponential E12     TP01       Exponential   -60
#> Condition: Phase 
#> No RNA-seq data provided.
#> Annotation provided with 4162 entries.
#> No spline parameters set.
#> P-value adjustment method: BH

Now that we have the SplineOmics object defined, we can perform our exploratory data analysis.

plots <- SplineOmics::explore_data(
  splineomics = splineomics,   # SplineOmics object
  report_dir = report_dir
  )

Here you can see the HTML report of the explore_data() function with the NOT batch-corrected data, and here the report for the batch-corrected data.

The EDA plots can tell you a range of things. The plots in the HTML report are grouped into three categories: Distribution and Variability Analysis, Time Series Analysis, and Dimensionality Reduction and Clustering.

If you look at the correlation heatmaps in the HTML report, you can see that the samples E12_TP05_Exponential and E10_TP10_Stationary stick out. Seeing this, you might want to remove them from the data. You can test out what happens when you do this, along with testing how other hyperparameter choices influence the results, with the package function screen_limma_hyperparams().

Finding the Best Hyperparameters

Before running the limma spline analysis, it is important to find the best “hyperparameters”. In this context, hyperparameters include:

  • Degree of freedom (DoF)
  • Different versions of the data (e.g., outlier removed vs. not removed)
  • Different limma design formulas

Challenge of Hyperparameter Selection

Rationally determining the best combination of hyperparameters can be very challenging. By rationally, I mean deciding upon the final hyperparameters without ever testing any, just by scientific reasoning. It is much easier just testing a few and seeing how they actually behave. However, manually selecting combinations can be tedious, and you have to work very systematically, which can be challenging. To solve this problem, the screen_limma_hyperparams() function was written.

Using screen_limma_hyperparams()

The function screen_limma_hyperparams() automates the process of testing different combinations of hyperparameters. Here’s how it works:

  • Specify values: For each hyperparameter, you can specify all the values you want to test.
  • Run combinations: The function runs the limma spline analysis with combinations formed from the hyperparameters you’ve provided in a semi combinatorial way.

Inner vs. Outer Hyperparameters

Semi combinatorial here means that not every possible combination is generated. Instead, there are inner and outer hyperparameters:

  • Outer hyperparameters: These include things like different versions of the dataset (e.g., full dataset vs. dataset with outliers removed).
    • All possible combinations of outer hyperparameters are generated.
  • Inner hyperparameters: These include adjusted p-value thresholds and spline parameters (e.g., degree of freedom).
    • For each version of the data (outer hyperparameter), all combinations of inner hyperparameters are tested.

This approach is neccessary, because otherwise the amount of combos would explode.

Example

For example, if you have two versions of a dataset (one full dataset, and one with some outliers removed), these versions are considered outer hyperparameters. Additionaly, lets say, you want to test two different limma design formulas, formula 1 and 2. The function will test out all combinations of those outer hyperparameters and compare them with each other, which results in a total of 6 combinations here:

  • Full Dataset Formula 1 vs Full Dataset Formula 2

  • Full Dataset Formula 1 vs Outliers Removed Dataset Formula 1

  • Full Dataset Formula 1 vs Outliers Removed Dataset Formula 2

  • Full Dataset Formula 2 vs Outliers Removed Dataset Formula 1

  • Full Dataset Formula 2 vs Outliers Removed Dataset Formula 2

  • Outliers Removed Dataset Formula 1 vs Outliers Removed Dataset Formula 2

Let’s say you specified the following inner hyperparameters:

  • Spline parameters: Natural cubic splines with a degree of freedom of either 2 or 3.
  • Adjusted p-value threshold: 0.05 or 0.1.

The function will generate and test all combinations of the spline parameters and p-value thresholds for all 4 combos:

Combo 1:

  • DoF = 2, threshold = 0.05

  • DoF = 3, threshold = 0.05

  • DoF = 2, threshold = 0.1

  • DoF = 3, threshold = 0.1

Combo 2:

  • DoF = 2, threshold = 0.05

  • DoF = 3, threshold = 0.05

  • DoF = 2, threshold = 0.1

  • DoF = 3, threshold = 0.1

Combo 3: …

This allows you to systematically explore different combinations and select the optimal hyperparameters for your analysis.

Below is an example for our proteomics data:

data1 <- data 
meta1 <- meta

# Remove the "outliers"
data2 <- data[, !(colnames(data) %in% c(
  "E12_TP05_Exponential", 
  "E10_TP10_Stationary"
  )
  )]

# Adjust meta so that it matches data2
meta2 <- meta[!meta$Sample.ID %in% c(
  "E12_TP05_Exponential", 
  "E10_TP10_Stationary"
  ), ]

# As mentioned above, all the values of one hyperparameter are stored 
# and provided as a list.
datas <- list(data1, data2) 

# This will be used to describe the versions of the data.
datas_descr <- c(
  "full_data",
  "outliers_removed"
  ) 

metas <- list(meta1, meta2) 

# Test two different limma designs
designs <- c(
  "~ 1 + Phase*X + Reactor",
  "~ 1 + X + Reactor"
  ) 

# 'Integrated means' limma will use the full dataset to generate the results for
# each condition. 'Isolated' means limma will use only the respective part of 
# the dataset for each condition. Designs that contain the condition column 
# (here Phase) must have mode 'integrated', because the full data is needed to
# include the different conditions into the design formula.
modes <- c(
  "integrated",
  "isolated"
  )

# Specify the meta "level" column
condition <- "Phase" 

report_dir <- here::here(
  "results",
  "hyperparams_screen_reports"
  ) 

# To remove the batch effect
meta_batch_column = "Reactor" 

# Test out two different p-value thresholds (inner hyperparameter)
adj_pthresholds <- c(
  0.05,
  0.1
  )

# Create a dataframe with combinations of spline parameters to test
# (every row a combo to test)
spline_test_configs <- data.frame(
   # 'n' stands for natural cubic splines, b for B-splines.
  spline_type = c("n", "n", "b", "b"),  
  # Degree is not applicable (NA) for natural splines.
  degree = c(NA, NA, 2L, 4L),           
  # Degrees of freedom (DoF) to test.
  # Higher dof means spline can fit more complex patterns.
  dof = c(2L, 3L, 3L, 4L)         
)

print(spline_test_configs)
#>   spline_type degree dof
#> 1           n     NA   2
#> 2           n     NA   3
#> 3           b      2   3
#> 4           b      4   4

Now that we specified all the values for each hyperparameter that we want to test, we can run the screen_limma_hyperparams() function.

SplineOmics::screen_limma_hyperparams(
  splineomics = splineomics,
  datas = datas,
  datas_descr = datas_descr,
  metas = metas,
  designs = designs,
  modes = modes,
  spline_test_configs = spline_test_configs,
  report_dir = report_dir,
  adj_pthresholds = adj_pthresholds,
  )

As mentioned, this function generates a report for each comparison of the outer hyperparameters, which are too many to show here. You can view an example report here

This report contains the results for the comparison of the “outer” hyperparameters data 1 and design (formula) 1 compared against data 1 and design 2. For both of those, all combinations of the “inner” hyperparameters are generated (every possible combination of all specified adj. p-value thresholds and spline configs).

The encoding used in the reports and the titles is here (This is part of the output of the screen_limma_hyperparams function).

Run limma spline analysis

Once we identified the hyperparameters that are likely the best ones, we can run the limma spline analysis with them and get the results.

Lets just assume for now that the new parameters, with which the SplineOmics object is updated, are the best for this analysis. The choice depends on the analysis. For example, for this analysis, natural cubic splines (n) with a dof of two seemed to fit the data best (not overfitting, but also not underfitting), which was the reason those spline parameters were chosen.

For the design formula, you must specify either ‘isolated’ or ‘integrated’. Isolated means limma determines the results for each level using only the data from that level. Integrated means limma determines the results for all levels using the full dataset (from all levels). For the integrated mode, the condition column (here Phase) must be included in the design. Isolated means that limma only uses the part of the dataset that belongs to a level to obtain the results for that level.

To generate the limma result categories 2 and 3 ()

splineomics <- SplineOmics::update_splineomics(
  splineomics = splineomics,
  design = "~ 1 + Phase*X + Reactor",  # best design formula
  mode = "integrated",  # means limma uses the full data for each condition.
  data = data2,   # data without "outliers" was better
  meta = meta2,  
  spline_params = list(
    spline_type = c("n"),   # natural cubic splines (take these if unsure)
    dof = c(2L)  # If you are unsure about which dof, start with 2 and increase
    )
)

Run the run_limma_splines() function with the updated SplineOmics object:

splineomics <- SplineOmics::run_limma_splines(
  splineomics = splineomics
  )
#> Column 'Reactor' of meta will be used to remove the batch effect for the plotting
#> Info limma spline analysis completed successfully

The output of the function run_limma_splines() is a named list, where each element is a specific “category” of results. Refer to this document for an explanation of the different result categories. Each of those elements is a list, containing as elements the respective limma topTables, either for each level or each comparison between two levels.

The element “time_effect” is a list, where each element is the topTable where the p-value for each feature for the respective level are reported.

The element “avrg_diff_conditions” is a list that contains as elements the topTables, that represent the comparison of the average differences of the levels.

The element “interaction_condition_time” is a list that contains as elements the topTables, that represent the interaction between the levels (which includes both time and the average differences)

Build limma report

The topTables of all three limma result categories can be used to generate p-value histograms an volcano plots.

report_dir <- here::here(
  "results",
  "create_limma_reports"
  )

plots <- SplineOmics::create_limma_report(
  splineomics = splineomics,
  report_dir = report_dir
  )

You can view the generated analysis report of the create_limma_report function here.

This report contains p-value histograms for all three limma result categories and a volcano plot for category 2. Embedded in this file are the downloadable limma topTables with the results for category 1 if the mode was ‘isolated’ and also with the results of category 2 and 3 if the mode was ‘integrated’.

Note that in the upcoming cluster_hits() function report, the embedded file will only contain the clustered significant features from result category 1.

Cluster the hits (significant features)

After we obtained the limma spline results, we can cluster the hits based on their temporal pattern (their spline shape). We define what a hit is by setting an adj. p-value threshold for every level. Hits are features (e.g. proteins) that have an adj. p-value below the threshold. Hierarchical clustering is used to place every hit in one of as many clusters as we have specified for that specific level.

adj_pthresholds <- c(  # 0.05 for both levels
  0.05,  # exponential
  0.05   # stationary
  )

clusters <- c(
  6L,  # 6 clusters for the exponential phase level
  3L   # 3 clusters for the stationary phase level
  )

report_dir <- here::here(
  "results",
  "clustering_reports"
  )

plot_info = list(  # For the spline plots
  y_axis_label = "log2 intensity",
  time_unit = "min",  # our measurements were in minutes
  treatment_labels = list("feeding"),  # add this for all conditions
  treatment_timepoints = list(0)  # Feeding was at 0 minutes.
)

# Like this you can add individual treatment labels to your plots:
# treatment_labels = list(
#   exponential = "treatment 1", # One treatment in exp
#   stationary = c("treatment 2", "treatment 3")  # Two treatments in stat
#   additional_condition = NA  # No treatment in the hypothetical third condition
#   ) 
# 
# treatment_timepoints = list(
#   exponential = 0,
#   stationary = C(100, 140),  # Two treatments also need two timepoints
#   additional_condition = NA
#   )  
# 
# or set a treatment for ALL conditions (still always make a list):
#
# treatment_labels = list("treatment")
# treatment_timepoints = list(120)
#
# or set multiple treatments for ALL conditions:
#
# treatment_labels = list(c("treatment1", "treatment2"))
# treatment_timepoints = list(c(120, 90))


# Get all the gene names. They are used for generating files
# which contents can be directly used as the input for the Enrichr webtool,
# if you prefer to manually perform the enrichment. Those files are
# embedded in the output HTML report and can be downloaded from there.
gene_column_name <- "Gene_symbol"
genes <- annotation[[gene_column_name]]

plot_options = list(
  # When meta_replicate_column is not there, all datapoints are blue.
  meta_replicate_column = "Reactor",  # Colors the data points based on Reactor
  cluster_heatmap_columns = FALSE   # Per default FALSE, just for demonstration
)

clustering_results <- SplineOmics::cluster_hits(
  splineomics = splineomics,
  adj_pthresholds = adj_pthresholds,
  clusters = clusters,
  genes = genes,
  plot_info = plot_info,
  plot_options = plot_options,
  report_dir = report_dir,
  adj_pthresh_avrg_diff_conditions = 0,
  adj_pthresh_interaction_condition_time = 0.25
  )

You can view the generated analysis report of the cluster_hits function here.

As discussed before, there are three limma result categories. The cluster_hits() report shows the results of all three, if they are present (category 2 and 3 can only be generated when the design formula contains an interaction effect).

Perform gene set enrichment analysis (GSEA)

Usually, the final step in such a bioinformatics analysis is GSEA. To each clustered hit, the respective gene can be assigned and GSEA performed. For this, the Enrichr databases of choice have to be downloaded:

# Specify which databases you want to download from Enrichr
gene_set_lib <- c(
  "WikiPathways_2019_Human",
  "NCI-Nature_2016",
  "TRRUST_Transcription_Factors_2019",
  "MSigDB_Hallmark_2020",
  "GO_Cellular_Component_2018",
  "CORUM",
  "KEGG_2019_Human",
  "TRANSFAC_and_JASPAR_PWMs",
  "ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X",
  "GO_Biological_Process_2018",
  "GO_Molecular_Function_2018",
  "Human_Gene_Atlas"
  )

SplineOmics::download_enrichr_databases(
  gene_set_lib = gene_set_lib,
  filename = "databases.tsv")

Per default the file is placed in the current working directory, which is the root dir of the R project.

To run GSEA, the downloaded database file has to be loaded as a dataframe. Further, optionally, the clusterProfiler parameters and the report dir can be specified. The function create_gsea_report() runs GSEA using clusterProfiler, generates an HTML report and returns the GSEA dotplots in R.

# Specify the filepath of the TSV file with the database info
downloaded_dbs_filepath <- here::here("databases.tsv")

# Load the file
databases <- read.delim(
  downloaded_dbs_filepath,
  sep = "\t",
  stringsAsFactors = FALSE
)

# Specify the clusterProfiler parameters
clusterProfiler_params <- list(
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  minGSSize = 10,
  maxGSSize = 500,
  qvalueCutoff = 0.2
  )

report_dir <- here::here(
  "results",
  "gsea_reports"
  )

The function below runs the clusterProfiler for all clusters and all levels, and generates the HTML report:

result <- SplineOmics::run_gsea(
  # A dataframe with three columns: feature, cluster, and gene. Feature contains
  # the integer index of the feature, cluster the integer specifying the cluster
  # number, and gene the string of the gene, such as "CLSTN2".
  levels_clustered_hits = clustering_results$clustered_hits_levels,
  databases = databases,
  clusterProfiler_params = clusterProfiler_params,
  report_info = report_info,
  report_dir = report_dir
  )

You can view the generated analysis report of the cluster_hits function here.

This report first shows all enrichment results, where more than 2 genes supported a term, in a tabular format. The table with all the terms with < 2 genes supporting it can be downloaded by clicking on a button below that table.

For the dotplots below that, every row is a term from a specific database, and the columns are the respective clusters. The color scale contains the info about the odds ratio and the size the -log10 adj. p-value. Only terms that have > 2 genes as support are included in the plot. Further, for each cluster, just maximally 5 terms are shown (the terms with the highest odds ratios). Note that when for example cluster 1 already has 5 terms, and cluster 2 does not, and gets a term which was also found for cluster 1, than this term would be included as the sixth term for cluster 1, so this is a way the maximum of 5 can be exceeded.

If a phase, like stationary here, does not lead to any enrichment results, that is stated with a red message.

Conclusion

This example showed most functionalities of the SplineOmics package. You can also run other datatypes with it, including timeseries RNA-seq and glycan data (for those, refer to the documentation in the README file on the GitHub page under Usage/RNA-seq and Glycan Data).

To get an interactive version of this example, download the SplineOmics package and run the function open_tutorial() which opens an R Markdown file, where you can run the different code blocks and if your are working in R Studio (which is recommendet) you can easily check out the values of the individual variables and generate the output reports yourself.

When you run the function open_template() you get a minimal R Markdown file, where the code is written so that you can use it as a skeleton to plug in your own data and run it.

We hope that the SplineOmics package makes your scientific data analysis easier. If you face any problems (bugs in the code) or are not satisfied with the documentation, open an issue on GitHub or check out the other options under the Feedback section of the README on GitHub. Thank you!