Cluster the significant features based on spline similarity
Source:R/cluster_hits.R
cluster_hits.RdPerforms kmeans clustering of significant features from top tables generated by limma.
Usage
cluster_hits(
splineomics,
nr_clusters,
adj_pthresh_time_effect = 0.05,
adj_pthresh_avrg_diff_conditions = 0.05,
adj_pthresh_interaction_condition_time = 0.05,
min_effect_size = list(time_effect = 0, avg_diff_cond = 0, interaction_cond_time = 0),
min_cluster_r2 = 0,
genes = NULL,
verbose = FALSE
)Arguments
- splineomics
SplineOmics: An S3 object of classSplineOmicsthat contains all the necessary data and parameters for the analysis, including:data:matrixThe data matrix with the values. The columns are the samples (timepoint + replicate combo) and the rows are the features (e.g. genes or proteins).meta:data.frameA dataframe containing metadata corresponding to thedata, must include a 'Time' column and any columns specified byconditions. In general, the columns of meta correspond to the different types of metadata, and each row corresponds to a column of data (contains the metadata for that sample).annotation:data.frameA dataframe that maps the rows ofdatato annotation info, such as the gene name or database identifiers.report_info:listA named list describing the experiment. Must include the following fields (character(1)): -"omics_data_type"-"data_description"-"data_collection_date"-"analyst_name"-"contact_info"-"project_name"May also include the following optional fields (
character(1)): -"method_description"-"results_summary"-"conclusions"design:character(1)A character of length 1 representing the limma design formula.mode:character(1)Specifies how the design formula is constructed: either"isolated"or"integrated"."isolated": Each level is analyzed independently, using only the subset of data corresponding to that level. The design formula does not include the condition variable, since only one condition is present in each subset."integrated": All levels are analyzed together in a single model, using the full dataset. The design formula includes the condition variable (and optionally interaction terms with it) so that results are estimated jointly across all levels.condition:character(1)Character vector of length 1 specifying the column name inmetaused to define groups for analysis.spline_params:listA list of spline parameters for the analysis.meta_batch_column:character(1)A character string specifying the column name in the metadata used for batch effect removal.meta_batch2_column:character(1)A character string specifying the second column name in the metadata used for batch effect removal.limma_splines_result:list(data.frame)andlist(list(data.frame))A list containing dataframes or lists of dataframes. Each df represents a top table from the limma spline analysis of the run_limma_splines() function that has to be run before.feature_name_columns:character()Character vector of strings that each specify a column of the original data dataframe which were used to automatically build the feature names with theextract_datafunction.
- nr_clusters
list: Named list specifying the number of clusters per condition level. The list must have one element per condition level, and each element must be named exactly with the corresponding condition name (e.g.,"condition1","condition2").Each element's value controls the
kused by k-means for that level:Single integer (
integer(1),k > 0, e.g.3): use exactly that many clusters.Integer range (
integer(), written with:, e.g.2:6): choose thekwithin the range that minimizes the Bayesian Information Criterion (BIC) computed from the k-means fit over that range (lower is better). Ties are broken by the first minimum encountered.
Notes
All condition levels must be present exactly once as names.
Values must be positive integers; ranges must be increasing (e.g.
2:6).BIC is computed from k-means using Euclidean distance. A common form is \(\mathrm{BIC} = n \log(\mathrm{WCSS}/n) + k \log(n)\, p\), where \(n\) is the number of series, \(p\) the number of timepoints (features), and \(\mathrm{WCSS}\) the total within-cluster sum of squares for the fit.
Clustering will fail if the requested
kis not strictly less than the number of available series for that level.
Example Fixed k for condition1, BIC-selected k for condition2 nr_clusters <- list( condition1 = 4, condition2 = 2:6 )
- adj_pthresh_time_effect
numeric(1): adj. p-value threshold for the limma time effect results (category 1).- adj_pthresh_avrg_diff_conditions
numeric(1): adj. p-value threshold for the limma average difference between conditions results (category 2).- adj_pthresh_interaction_condition_time
numeric(1): adj. p-value threshold for the limma interaction of condition and time results (category 3).- min_effect_size
list: A named list that specifies the minimum effect size thresholds to consider a feature as biologically meaningful, in addition to statistical significance. This allows users to filter out "trivial" hits that pass adjusted p-value cutoffs but show negligible effect sizes.The list must contain the following elements:
time_effect:numeric(1)Minimum cumulative travel for time effects (Category 1). Features with a smaller travel will be ignored even if significant.avg_diff_cond:numeric(1)Minimum absolute effect size for average differences between conditions (Category 2). Ensures that only contrasts with a relevant magnitude are reported.interaction_cond_time:numeric(1)Minimum effect size for the interaction between condition and time (Category 3). This controls how large the differential curve travel must be across conditions to count as a hit.
Values should be numeric scalars (typically >0). For example:
min_effect_size = list(time_effect = 1, avg_diff_cond = 1, interaction_cond_time = 2)will only keep features with cumulative travels or condition-time differences above those cutoffs. Use smaller values (e.g., 0.1) for permissive filtering, or larger values for more conservative thresholds.The default is 0 for all three elements.
- min_cluster_r2
numeric(1): A value >=0 and <1 specifying the minimum allowed squared Pearson correlation (r2) of any cluster member to the centroid of its assigned cluster.After k-means clustering, each cluster is iteratively pruned. Members whose r2 to the current cluster centroid falls below
min_cluster_r2are removed from that cluster and reassigned to cluster0(the "other" cluster).Pruning is repeated until all remaining members satisfy the constraint, a maximum of 10 iterations is reached, or the cluster would shrink below a minimum size of 10 members. If a cluster cannot satisfy the constraint under these conditions, all of its remaining members are reassigned to cluster
0.Cluster
0therefore contains all features that do not meet the minimum centroid similarity criterion. These features are retained in the output but are not considered part of any valid cluster.- genes
character(): A character vector of gene names corresponding to the features to be analyzed. The order of entries must match the feature order indata. Gene names should be standardized (cleaned) to ensure compatibility with downstream databases used for overrepresentation analysis after clustering.- verbose
logical(1): Boolean flag controlling the display of messages.
Value
A named list with three elements:
cluster_tableA tibble containing one row per
feature_nrwith metadata and cluster assignments across the analysis categories. The structure is:feature_nr- Numeric feature identifier.feature_name- Preferred feature name, prioritizing values from the limma tables, then from the cluster table row names, and falling back to the numeric feature ID.gene- Preferred gene symbol from theannotationtable if available, otherwise taken from the cluster tables.cluster_<cond1>/cluster_<cond2>- Cluster assignments for each time-effect condition, named according to the elements ofclustered_hits_levels.cluster_cat2- Present only if category 2 results are available; a combined cluster label in the form"<cluster_<cond1>>_<cluster_<cond2>>"for features that are significant in category 2. If this value isNA, the feature was not a category 2 hit.cluster_cat3- Present only if category 3 results are available; a combined cluster label in the form"<cluster_<cond1>>_<cluster_<cond2>>"for features that are significant in category 3. If this value isNA, the feature was not a category 3 hit.
For any category-specific cluster column (
cluster_<cond1>,cluster_<cond2>,cluster_cat2,cluster_cat3), a value ofNAindicates that the feature was not significant (not a hit) in that category.spline_resultsA named list summarizing the fitted spline trajectories, their shared time grid, and effect-size based significance flags. Structure:
time_gridNumeric vector of length \(T\) giving the common time points (e.g., hours since cultivation start) on which all splines were predicted.
predictionsNamed list by condition (e.g.,
constant,temp_shift). Each entry is a numeric matrix of size \(N \times T\) with rows corresponding to features and columns totime_grid. Values are the predicted spline trajectories on the absolute scale used in the analysis (e.g., log2-CPM aftervoom). Row order matches the feature order used throughout the analysis.time_effect_effect_sizeNamed list by condition with a numeric vector (length \(N\)) per condition giving the cumulative travel (integrated temporal change) of each feature
s spline across \code{time_grid}. Larger values indicate stronger within-condition temporal modulation.} \item{\code{time_effect_passed_threshold}}{Named list by condition with a logical vector (length \eqn{N}) per condition indicating whether the corresponding \code{time_effect_effect_size} exceeds the user-defined effect-size threshold (i.e., time-effect hits).} \item{\code{interaction_effect_size}}{Numeric vector (length \eqn{N}) giving the \emph{differential cumulative travel} between the two condition-specific splines of each feature, computed on the same \code{time_grid}. Larger values indicate stronger differences in temporal behaviour between conditions (condition-time interaction).} \item{\code{interaction_passed_threshold}}{Logical vector (length \eqn{N}) indicating whether \code{interaction_effect_size} exceeds the interaction effect-size threshold (i.e., features with significantly different temporal profiles across conditions).} } Unless stated otherwise, vectors are aligned to the same feature order used in the prediction matrices; condition names match \code{levels(meta[[condition]])}. } \item{\code{report_payload}}{ A structured list containing all information required to generate downstream clustering reports and visualizations, without recomputing any statistical results. This payload is designed to be consumed by dedicated reporting functions (e.g., \code{create_clustering_report()}) and enables a strict separation between computation and reporting. The contents of \code{report_payload} are intended to be treated as read-only and should not be modified by the user. Any changes to visualization or reporting behavior should be performed by supplying appropriate arguments to the reporting functions rather than altering the payload itself. The payload includes: \describe{ \item{\code{splineomics}}{ The originalSplineOmicsobject used for the analysis, providing access to the input data, metadata, design, spline parameters, and analysis settings. } \item{\code{all_levels_clustering}}{ Clustering results for each condition level (Category 1) and, when available, clustering metadata for Categories 2/3. } \item{\code{genes}}{character()` Vector of gene or feature names corresponding to the analyzed features.adj_pthresh_time_effectAdjusted p-value threshold used to define significant time-effect hits (Category 1).
adj_pthresh_avrg_diff_conditionsAdjusted p-value threshold used for average differences between conditions (Category 2).
adj_pthresh_interaction_condition_timeAdjusted p-value threshold used for condition–time interaction effects (Category 3).
category_2_and_3_hitsData structure describing significant features from Categories 2 and 3 (average differences and interactions), including effect-size annotations.
min_effect_sizeNamed list of effect-size thresholds applied in addition to statistical significance.
predicted_timecurvesPredicted spline trajectories and associated summaries (time grid, predictions per condition, and effect-size metrics).
Examples
# Toy data: 4 features x 6 samples (two conditions, three time points)
toy_data <- matrix(
c(
3, 5, 8, 12, 17, 23, # f1
23, 17, 13, 9, 6, 4, # f2
5, 3, 2, 2, 3, 5, # f3
1, 4, 9, 8, 4, 1, # f4
10, 10, 10, 10, 10, 10, # f5
2, 2, 2, 9, 12, 15, # f6
4, 5, 7, 10, 14, 19, # f7
12, 11, 9, 8, 9, 12 # f8
),
nrow = 8, ncol = 6, byrow = TRUE,
dimnames = list(paste0("f", 1:8), paste0("s", 1:6))
)
toy_meta <- data.frame(
Time = c(0, 1, 2, 0, 1, 2),
condition = rep(c("WT", "KO"), each = 3),
Replicate = rep(c("R1", "R2"), each = 3),
row.names = colnames(toy_data),
stringsAsFactors = FALSE
)
toy_annot <- data.frame(
feature_nr = 1:8,
gene = c("G1", "G2", "G3", "G4"),
stringsAsFactors = FALSE
)
# Stub limma "top tables" with minimal required fields
# (feature_nr + adj.P.Val)
tt_wt <- data.frame(feature_nr = 1:4, adj.P.Val = c(0.01, 0.20, 0.04, 0.60))
tt_ko <- data.frame(feature_nr = 1:4, adj.P.Val = c(0.50, 0.03, 0.70, 0.02))
tt_c2 <- data.frame(feature_nr = 1:4, adj.P.Val = c(0.04, 0.70, 0.80, 0.90))
tt_c3 <- data.frame(feature_nr = 1:4, adj.P.Val = c(0.20, 0.90, 0.03, 0.80))
design_str <- "~ 1 + Time*condition"
# Minimal spline parameters required by spline machinery
spline_params <- list(
spline_type = "n", # natural cubic splines
dof = 1L # degrees of freedom for the spline basis
)
toy_splineomics <- list(
data = toy_data,
meta = toy_meta,
annotation = toy_annot,
report_info = list(
omics_data_type = "RNA-seq",
data_description = "toy example",
data_collection_date = "2025-01-01",
analyst_name = "Example",
contact_info = "example@example.org",
project_name = "ToyProject"
),
design = design_str,
mode = "integrated",
condition = "condition",
spline_params = spline_params,
meta_batch_column = NULL,
meta_batch2_column = NULL,
limma_splines_result = list(
time_effect = list(WT = tt_wt, KO = tt_ko),
avrg_diff_conditions = tt_c2,
interaction_condition_time = tt_c3
),
feature_name_columns = "gene"
)
class(toy_splineomics) <- "SplineOmics"
toy_splineomics <- run_limma_splines(toy_splineomics)
#> Running Levene's test both feature wise and sample wise to implicitly
#> • decide whether to use the limma array weights or not.
#> No random effects: fitting model with lmFit()...
#> Running feature wise Levene's test...
#>
#> ------------------------------------------------------------
#> Fraction of features violating homoscedasticity
#> (p < 0.050): 0.00% (0/8 features)
#> No violating features found.
#> ------------------------------------------------------------
#> Running Levene's test across samples to detect
#> • inter-sample variance differences...
#> Levene's test p-value (sample-level): 0.1425
#> ✅ No strong evidence of inter-sample variance differences.
#> ------------------------------------------------------------
#> ✅ No strong evidence for heteroscedasticity.
#> • Proceeding WITHOUT using robust strategy.
#> ------------------------------------------------------------
# Clustering configuration: fixed k per condition
nr_k <- list(WT = 2L, KO = 2L)
# Keep outputs light and write into a temporary directory
out <- cluster_hits(
splineomics = toy_splineomics,
nr_clusters = nr_k,
adj_pthresh_time_effect = 0.05,
adj_pthresh_avrg_diff_conditions = 0.05,
adj_pthresh_interaction_condition_time = 0.05,
min_effect_size = list(
time_effect = 0,
avg_diff_cond = 0,
interaction_cond_time = 0
),
genes = toy_annot$gene,
)