Performs clustering on hits from top tables generated by differential expression analysis. This function filters hits based on adjusted p-value thresholds, extracts spline coefficients for significant features, normalizes these coefficients, and applies hierarchical clustering. The results, including clustering assignments and normalized spline curves, are saved in a specified directory and compiled into an HTML report.
Usage
cluster_hits(
splineomics,
nr_clusters,
adj_pthresholds = c(0.05),
adj_pthresh_avrg_diff_conditions = 0.05,
adj_pthresh_interaction_condition_time = 0.05,
min_effect_size = list(time_effect = 0, avg_diff_cond = 0, interaction_cond_time = 0),
genes = NULL,
plot_info = list(y_axis_label = "Value", time_unit = "min", treatment_labels = NA,
treatment_timepoints = NA),
plot_options = list(cluster_heatmap_columns = FALSE, meta_replicate_column = NULL),
raw_data = NULL,
report_dir = NULL,
max_hit_number = 25,
verbose = TRUE
)Arguments
- splineomics
An S3 object of class
SplineOmicsthat contains all the necessary data and parameters for the analysis, including:data: The data matrix with the values. The columns are the samples (timepoint + replicate combo) and the rows are the features (e.g. genes or proteins).meta: A dataframe containing metadata corresponding to thedata, must include a 'Time' column and any columns specified byconditions. In general, the columns of meta correspond to the different types of metadata, and each row corresponds to a column of data (contains the metadata for that sample).annotation: A dataframe that maps the rows ofdatato annotation info, such as the gene name or database identifiers.report_info: A named list describing the experiment. Must include the following fields: -"omics_data_type"-"data_description"-"data_collection_date"-"analyst_name"-"contact_info"-"project_name"May also include the following optional fields: -
"method_description"-"results_summary"-"conclusions"design: A character of length 1 representing the limma design formula.mode: Specifies how the design formula is constructed: either"isolated"or"integrated"."isolated": Each level is analyzed independently, using only the subset of data corresponding to that level. The design formula does not include the condition variable, since only one condition is present in each subset."integrated": All levels are analyzed together in a single model, using the full dataset. The design formula includes the condition variable (and optionally interaction terms with it) so that results are estimated jointly across all levels.condition: Character vector of length 1 specifying the column name inmetaused to define groups for analysis.spline_params: A list of spline parameters for the analysis.meta_batch_column: A character string specifying the column name in the metadata used for batch effect removal.meta_batch2_column: A character string specifying the second column name in the metadata used for batch effect removal.limma_splines_result: A list of data frames, each representing a top table from differential expression analysis, containing at least 'adj.P.Val' and expression data columns.feature_name_columns: Character vector of strings that each specify a column of the original data dataframe which were used to automatically build the feature names with theextract_datafunction.
- nr_clusters
Named list specifying the number of clusters per condition level. The list must have one element per condition level, and each element must be named exactly with the corresponding condition name (e.g.,
"condition1","condition2").Each element's value controls the
kused by k-means for that level:Single integer (
k > 0, e.g.3): use exactly that many clusters.Integer range (written with
:, e.g.2:6): choose thekwithin the range that minimizes the Bayesian Information Criterion (BIC) computed from the k-means fit over that range (lower is better). Ties are broken by the first minimum encountered.
Notes
All condition levels must be present exactly once as names.
Values must be positive integers; ranges must be increasing (e.g.
2:6).BIC is computed from k-means using Euclidean distance. A common form is \(\mathrm{BIC} = n \log(\mathrm{WCSS}/n) + k \log(n)\, p\), where \(n\) is the number of series, \(p\) the number of timepoints (features), and \(\mathrm{WCSS}\) the total within-cluster sum of squares for the fit.
Clustering will fail if the requested
kis not strictly less than the number of available series for that level.
Example Fixed k for condition1, BIC-selected k for condition2 nr_clusters <- list( condition1 = 4, condition2 = 2:6 )
- adj_pthresholds
Numeric vector of p-value thresholds for filtering hits in each top table. The order of the elements determines which adj.p-value threshold is assigned to which condition (the first element gets assigned to the first condition, the second to the second, etc.).
- adj_pthresh_avrg_diff_conditions
p-value threshold for the results from the average difference of the condition limma result.
- adj_pthresh_interaction_condition_time
p-value threshold for the results from the interaction of condition and time limma result.
- min_effect_size
A named list that specifies the minimum effect size thresholds to consider a feature as biologically meaningful, in addition to statistical significance. This allows users to filter out "trivial" hits that pass adjusted p-value cutoffs but show negligible effect sizes.
The list must contain the following elements:
time_effect: Minimum cumulative travel for time effects (Category 1). Features with a smaller travel will be ignored even if significant.avg_diff_cond: Minimum absolute effect size for average differences between conditions (Category 2). Ensures that only contrasts with a relevant magnitude are reported.interaction_cond_time: Minimum effect size for the interaction between condition and time (Category 3). This controls how large the differential curve travel must be across conditions to count as a hit.
Values should be numeric scalars (typically >0). For example:
min_effect_size = list(time_effect = 1, avg_diff_cond = 1, interaction_cond_time = 2)will only keep features with cumulative travels or condition-time differences above those cutoffs. Use smaller values (e.g., 0.1) for permissive filtering, or larger values for more conservative thresholds.The default is 0 for all three elements.
- genes
A character vector of gene names corresponding to the features to be analyzed. The order of entries must match the feature order in
data. Gene names should be standardized (cleaned) to ensure compatibility with downstream databases used for overrepresentation analysis after clustering.- plot_info
List with optional elements used to annotate spline plots:
y_axis_label: single string for the y-axis label.time_unit: single string used in the x-axis label.treatment_labels: named list of single strings.treatment_timepoints: named list of single numeric values.
If any treatment list is present, both must be present. The two lists must have identical name sets. Allowed names are the values of
meta[[condition]]and the special name"double_spline_plots", which generates a treatment line for the plots of limma category 2 and 3 (average difference between conditions and the interaction between condition and time).Vertical dashed lines are drawn at the given timepoints for facets whose level name matches a list name, and labeled with the corresponding string (e.g., feeding, temperature shift).
Example:
- plot_options
A named list controlling optional plot customization. The list can include one or both of the following entries (any not supplied will fall back to their default values):
cluster_heatmap_columns(logical, default =FALSE): Whether to cluster the columns in the heatmap.meta_replicate_column(character(1), default =NULL): Name of the column inmetathat encodes replicate information. If supplied, spline plot data points are colored by replicate, allowing replicate-level variation to be assessed.
- raw_data
Data matrix with the raw (unimputed) data, still containing NA values. When provided, it highlights the datapoints in the spline plots that originally where NA and that were imputed.
- report_dir
Character string specifying the directory path where the HTML report and any other output files should be saved. When no path is specified, then the function runs but no HTML report is generated.
- max_hit_number
Maximum number of hits which are plotted within each cluster. This can be used to limit the computation time and size of the HTML report in the case of many hits.
- verbose
Boolean flag controlling the display of messages.
Value
A named list with three elements:
cluster_tableA tibble containing one row per
feature_nrwith metadata and cluster assignments across the analysis categories. The structure is:feature_nr- Numeric feature identifier.feature_name- Preferred feature name, prioritizing values from the limma tables, then from the cluster table row names, and falling back to the numeric feature ID.gene- Preferred gene symbol from theannotationtable if available, otherwise taken from the cluster tables.cluster_<cond1>/cluster_<cond2>- Cluster assignments for each time-effect condition, named according to the elements ofclustered_hits_levels.cluster_cat2- Present only if category 2 results are available; a combined cluster label in the form"<cluster_<cond1>>_<cluster_<cond2>>"for features that are significant in category 2. If this value isNA, the feature was not a category 2 hit.cluster_cat3- Present only if category 3 results are available; a combined cluster label in the form"<cluster_<cond1>>_<cluster_<cond2>>"for features that are significant in category 3. If this value isNA, the feature was not a category 3 hit.
For any category-specific cluster column (
cluster_<cond1>,cluster_<cond2>,cluster_cat2,cluster_cat3), a value ofNAindicates that the feature was not significant (not a hit) in that category.spline_resultsA named list summarizing the fitted spline trajectories, their shared time grid, and effect-size based significance flags. Structure:
time_gridNumeric vector of length \(T\) giving the common time points (e.g., hours since cultivation start) on which all splines were predicted.
predictionsNamed list by condition (e.g.,
constant,temp_shift). Each entry is a numeric matrix of size \(N \times T\) with rows corresponding to features and columns totime_grid. Values are the predicted spline trajectories on the absolute scale used in the analysis (e.g., log2-CPM aftervoom). Row order matches the feature order used throughout the analysis.time_effect_effect_sizeNamed list by condition with a numeric vector (length \(N\)) per condition giving the cumulative travel (integrated temporal change) of each feature`s spline across
time_grid. Larger values indicate stronger within-condition temporal modulation.time_effect_passed_thresholdNamed list by condition with a logical vector (length \(N\)) per condition indicating whether the corresponding
time_effect_effect_sizeexceeds the user-defined effect-size threshold (i.e., time-effect hits).interaction_effect_sizeNumeric vector (length \(N\)) giving the differential cumulative travel between the two condition-specific splines of each feature, computed on the same
time_grid. Larger values indicate stronger differences in temporal behaviour between conditions (condition-time interaction).interaction_passed_thresholdLogical vector (length \(N\)) indicating whether
interaction_effect_sizeexceeds the interaction effect-size threshold (i.e., features with significantly different temporal profiles across conditions).
Unless stated otherwise, vectors are aligned to the same feature order used in the prediction matrices; condition names match
levels(meta[[condition]]).plotsA list of all plots generated during the run, corresponding to the visualizations shown in the HTML report produced by this function. Additionally, this plots list also contains the plots showing the consensus clusters of the potential clustering of the interaction of condition and time (category 3) hits.
Examples
# Toy data: 4 features x 6 samples (two conditions, three time points)
toy_data <- matrix(
c(
3, 5, 8, 12, 17, 23, # f1
23, 17, 13, 9, 6, 4, # f2
5, 3, 2, 2, 3, 5, # f3
1, 4, 9, 8, 4, 1, # f4
10, 10, 10, 10, 10, 10, # f5
2, 2, 2, 9, 12, 15, # f6
4, 5, 7, 10, 14, 19, # f7
12, 11, 9, 8, 9, 12 # f8
),
nrow = 8, ncol = 6, byrow = TRUE,
dimnames = list(paste0("f", 1:8), paste0("s", 1:6))
)
toy_meta <- data.frame(
Time = c(0, 1, 2, 0, 1, 2),
condition = rep(c("WT", "KO"), each = 3),
Replicate = rep(c("R1", "R2"), each = 3),
row.names = colnames(toy_data),
stringsAsFactors = FALSE
)
toy_annot <- data.frame(
feature_nr = 1:8,
gene = c("G1", "G2", "G3", "G4"),
stringsAsFactors = FALSE
)
# Stub limma "top tables" with minimal required fields
# (feature_nr + adj.P.Val)
tt_wt <- data.frame(feature_nr = 1:4, adj.P.Val = c(0.01, 0.20, 0.04, 0.60))
tt_ko <- data.frame(feature_nr = 1:4, adj.P.Val = c(0.50, 0.03, 0.70, 0.02))
tt_c2 <- data.frame(feature_nr = 1:4, adj.P.Val = c(0.04, 0.70, 0.80, 0.90))
tt_c3 <- data.frame(feature_nr = 1:4, adj.P.Val = c(0.20, 0.90, 0.03, 0.80))
design_str <- "~ 1 + Time*condition"
# Minimal spline parameters required by spline machinery
spline_params <- list(
spline_type = "n", # natural cubic splines
dof = 1L # degrees of freedom for the spline basis
)
toy_splineomics <- list(
data = toy_data,
meta = toy_meta,
annotation = toy_annot,
report_info = list(
omics_data_type = "RNA-seq",
data_description = "toy example",
data_collection_date = "2025-01-01",
analyst_name = "Example",
contact_info = "example@example.org",
project_name = "ToyProject"
),
design = design_str,
mode = "integrated",
condition = "condition",
spline_params = spline_params,
meta_batch_column = NULL,
meta_batch2_column = NULL,
limma_splines_result = list(
time_effect = list(WT = tt_wt, KO = tt_ko),
avrg_diff_conditions = tt_c2,
interaction_condition_time = tt_c3
),
feature_name_columns = "gene"
)
class(toy_splineomics) <- "SplineOmics"
toy_splineomics <- run_limma_splines(toy_splineomics)
#>
#> Ensure the design has no Condition*Time interaction for mode == 'isolated', and includes it for mode == 'integrated'.
#> Running Levene's test both feature wise and sample wise to implicitly decide whether to use the limma array weights or not.
#> No random effects: fitting model with lmFit()...
#> Running feature wise Levene's test...
#>
#> ------------------------------------------------------------
#> Fraction of features violating homoscedasticity
#> (p < 0.050): 0.00% (0/8 features)
#> No violating features found.
#> ------------------------------------------------------------
#> Running Levene's test across samples to detect inter-sample variance differences...
#> Levene's test p-value (sample-level): 0.1425
#> ✅ No strong evidence of inter-sample variance differences.
#> ------------------------------------------------------------
#> ✅ No strong evidence for heteroscedasticity.
#> Proceeding WITHOUT using robust strategy.
#> ------------------------------------------------------------
#>
#> Fitting global model...
#> Info Finished limma spline analysis in 0.0 min
# Clustering configuration: fixed k per condition
nr_k <- list(WT = 2L, KO = 2L)
# Keep outputs light and write into a temporary directory
out <- cluster_hits(
splineomics = toy_splineomics,
nr_clusters = nr_k,
adj_pthresholds = c(0.05, 0.05),
adj_pthresh_avrg_diff_conditions = 0.05,
adj_pthresh_interaction_condition_time = 0.05,
min_effect_size = list(
time_effect = 0,
avg_diff_cond = 0,
interaction_cond_time = 0
),
genes = toy_annot$gene,
plot_info = list(y_axis_label = "log2 expression", time_unit = "h"),
plot_options = list(
cluster_heatmap_columns = FALSE,
meta_replicate_column = "Replicate"
),
raw_data = toy_data,
report_dir = tempdir(),
max_hit_number = 2
)
#>
#> Ensure the design has no Condition*Time interaction for mode == 'isolated', and includes it for mode == 'integrated'.
#>
#> Performing the clustering...
#> For the level: KO
#> For the level: WT
#> Generating heatmap...
#> Generating cluster mean splines for level: WT
#> Generating spline plots...
#> Generating cluster mean splines for level: KO
#> Generating spline plots...
#> Generating report. This takes a few seconds.
#>
#> Info Clustering the hits completed successfully.
#> Your HTML reports are located in the directory: /tmp/RtmpKGGksk .
#> Please note that due to embedded files, the reports might be flagged as
#> harmful by other software. Rest assured that they provide no harm.
#> 5 clusters for the condition effect
#> (interaction between condition and time)
#> Running this function took 0.2 min