Performs clustering on hits from top tables generated by differential expression analysis. This function filters hits based on adjusted p-value thresholds, extracts spline coefficients for significant features, normalizes these coefficients, and applies hierarchical clustering. The results, including clustering assignments and normalized spline curves, are saved in a specified directory and compiled into an HTML report.
Usage
cluster_hits(
splineomics,
nr_clusters,
adj_pthresholds = c(0.05),
adj_pthresh_avrg_diff_conditions = 0.05,
adj_pthresh_interaction_condition_time = 0.05,
min_effect_size = list(time_effect = 0, avg_diff_cond = 0, interaction_cond_time = 0),
genes = NULL,
plot_info = list(y_axis_label = "Value", time_unit = "min", treatment_labels = NA,
treatment_timepoints = NA),
plot_options = list(cluster_heatmap_columns = FALSE, meta_replicate_column = NULL),
raw_data = NULL,
report_dir = NULL,
max_hit_number = 25
)
Arguments
- splineomics
An S3 object of class `SplineOmics` that contains all the necessary data and parameters for the analysis, including:
data
: The data matrix with the values. The columns are the samples (timepoint + replicate combo) and the rows are the features (e.g. genes or proteins).meta
: A dataframe containing metadata corresponding to thedata
, must include a 'Time' column and any columns specified byconditions
. In general, the columns of meta correspond to the different types of metadata, and each row corresponds to a column of data (contains the metadata for that sample).annotation
: A dataframe that maps the rows ofdata
to annotation info, such as the gene name or database identifiers.report_info
: A named list describing the experiment. Must include the following fields: -"omics_data_type"
-"data_description"
-"data_collection_date"
-"analyst_name"
-"contact_info"
-"project_name"
May also include the following optional fields: -
"method_description"
-"results_summary"
-"conclusions"
design
: A character of length 1 representing the limma design formula.mode
: Specifies how the design formula is constructed: either `"isolated"` or `"integrated"`.- `"isolated"`: Each level is analyzed independently, using only the subset of data corresponding to that level. The design formula does not include the condition variable, since only one condition is present in each subset.
- `"integrated"`: All levels are analyzed together in a single model, using the full dataset. The design formula includes the condition variable (and optionally interaction terms with it) so that results are estimated jointly across all levels.
condition
: Character vector of length 1 specifying the column name inmeta
used to define groups for analysis.spline_params
: A list of spline parameters for the analysis.meta_batch_column
: A character string specifying the column name in the metadata used for batch effect removal.meta_batch2_column
: A character string specifying the second column name in the metadata used for batch effect removal.limma_splines_result
: A list of data frames, each representing a top table from differential expression analysis, containing at least 'adj.P.Val' and expression data columns.feature_name_columns
: Character vector of strings that each specify a column of the original data dataframe which were used to automatically build the feature names with theextract_data
function.
- nr_clusters
Named list specifying the number of clusters per condition level. The list must have one element per condition level, and each element must be named exactly with the corresponding condition name (e.g.,
"condition1"
,"condition2"
).Each element's value controls the
k
used by k-means for that level:- Single integer (
k > 0
, e.g.3
): use exactly that many clusters. - Integer range (written with:
, e.g.2:6
): choose thek
within the range that minimizes the Bayesian Information Criterion (BIC) computed from the k-means fit over that range (lower is better). Ties are broken by the first minimum encountered.Notes
All condition levels must be present exactly once as names.
Values must be positive integers; ranges must be increasing (e.g.
2:6
).BIC is computed from k-means using Euclidean distance. A common form is \(\mathrm{BIC} = n \log(\mathrm{WCSS}/n) + k \log(n)\, p\), where \(n\) is the number of series, \(p\) the number of timepoints (features), and \(\mathrm{WCSS}\) the total within-cluster sum of squares for the fit.
Clustering will fail if the requested
k
is not strictly less than the number of available series for that level.
Example # Fixed k for condition1, BIC-selected k for condition2 nr_clusters <- list( condition1 = 4, condition2 = 2:6 )
- adj_pthresholds
Numeric vector of p-value thresholds for filtering hits in each top table. The order of the elements determines which adj.p-value threshold is assigned to which condition (the first element gets assigned to the first condition, the second to the second, etc.).
- adj_pthresh_avrg_diff_conditions
p-value threshold for the results from the average difference of the condition limma result.
- adj_pthresh_interaction_condition_time
p-value threshold for the results from the interaction of condition and time limma result.
- min_effect_size
A named list that specifies the minimum effect size thresholds to consider a feature as biologically meaningful, in addition to statistical significance. This allows users to filter out "trivial" hits that pass adjusted p-value cutoffs but show negligible effect sizes.
The list must contain the following elements: - `time_effect`: Minimum cumulative travel for time effects (Category 1). Features with a smaller travel will be ignored even if significant. - `avg_diff_cond`: Minimum absolute effect size for average differences between conditions (Category 2). Ensures that only contrasts with a relevant magnitude are reported. - `interaction_cond_time`: Minimum effect size for the interaction between condition and time (Category 3). This controls how large the differential curve travel must be across conditions to count as a hit.
Values should be numeric scalars (typically >0). For example: `min_effect_size = list(time_effect = 1, avg_diff_cond = 1, interaction_cond_time = 2)` will only keep features with cumulative travels or condition-time differences above those cutoffs. Use smaller values (e.g., 0.1) for permissive filtering, or larger values for more conservative thresholds.
The default is 0 for all three elements.
- genes
A character vector of gene names corresponding to the features to be analyzed. The order of entries must match the feature order in
data
. Gene names should be standardized (cleaned) to ensure compatibility with downstream databases used for overrepresentation analysis after clustering.- plot_info
List with optional elements used to annotate spline plots: - y_axis_label: single string for the y-axis label. - time_unit: single string used in the x-axis label. - treatment_labels: named list of single strings. - treatment_timepoints: named list of single numeric values.
If any treatment list is present, both must be present. The two lists must have identical name sets. Allowed names are the values of `meta[[condition]]` and the special name "double_spline_plots", which generates a treatment line for the plots of limma category 2 and 3, so for the average difference between the conditions and the interaction between condition and time.
Vertical dashed lines are drawn at the given timepoints for facets whose level name matches a list name, and labeled with the corresponding string (e.g., feeding, temperature shift).
- plot_options
A named list controlling optional plot customization. The list can include one or both of the following entries (any not supplied will fall back to their default values):
cluster_heatmap_columns
(`logical`, default =FALSE
): Whether to cluster the columns in the heatmap.meta_replicate_column
(`character(1)`, default =NULL
): Name of the column inmeta
that encodes replicate information. If supplied, spline plot data points are colored by replicate, allowing replicate-level variation to be assessed.
- raw_data
Data matrix with the raw (unimputed) data, still containing NA values. When provided, it highlights the datapoints in the spline plots that originally where NA and that were imputed.
- report_dir
Character string specifying the directory path where the HTML report and any other output files should be saved. When no path is specified, then the function runs but no HTML report is generated.
- max_hit_number
Maximum number of hits which are plotted within each cluster. This can be used to limit the computation time and size of the HTML report in the case of many hits.
Value
A named list with three elements:
cluster_table
A tibble containing one row per
feature_nr
with metadata and cluster assignments across the analysis categories. The structure is:feature_nr
- Numeric feature identifier.feature_name
- Preferred feature name, prioritizing values from the limma tables, then from the cluster table row names, and falling back to the numeric feature ID.gene
- Preferred gene symbol from theannotation
table if available, otherwise taken from the cluster tables.cluster_<cond1>
/cluster_<cond2>
- Cluster assignments for each time-effect condition, named according to the elements ofclustered_hits_levels
.cluster_cat2
- Present only if category 2 results are available; a combined cluster label in the form"<cluster_<cond1>>_<cluster_<cond2>>"
for features that are significant in category 2. If this value isNA
, the feature was not a category 2 hit.cluster_cat3
- Present only if category 3 results are available; a combined cluster label in the form"<cluster_<cond1>>_<cluster_<cond2>>"
for features that are significant in category 3. If this value isNA
, the feature was not a category 3 hit.
For any category-specific cluster column (
cluster_<cond1>
,cluster_<cond2>
,cluster_cat2
,cluster_cat3
), a value ofNA
indicates that the feature was not significant (not a hit) in that category.spline_results
A named list summarizing the fitted spline trajectories, their shared time grid, and effect-size based significance flags. Structure:
time_grid
Numeric vector of length \(T\) giving the common time points (e.g., hours since cultivation start) on which all splines were predicted.
predictions
Named list by condition (e.g.,
constant
,temp_shift
). Each entry is a numeric matrix of size \(N \times T\) with rows corresponding to features and columns totime_grid
. Values are the predicted spline trajectories on the absolute scale used in the analysis (e.g., log2-CPM aftervoom
). Row order matches the feature order used throughout the analysis.time_effect_effect_size
Named list by condition with a numeric vector (length \(N\)) per condition giving the cumulative travel (integrated temporal change) of each feature`s spline across
time_grid
. Larger values indicate stronger within-condition temporal modulation.time_effect_passed_threshold
Named list by condition with a logical vector (length \(N\)) per condition indicating whether the corresponding
time_effect_effect_size
exceeds the user-defined effect-size threshold (i.e., time-effect hits).interaction_effect_size
Numeric vector (length \(N\)) giving the differential cumulative travel between the two condition-specific splines of each feature, computed on the same
time_grid
. Larger values indicate stronger differences in temporal behaviour between conditions (condition-time interaction).interaction_passed_threshold
Logical vector (length \(N\)) indicating whether
interaction_effect_size
exceeds the interaction effect-size threshold (i.e., features with significantly different temporal profiles across conditions).
Unless stated otherwise, vectors are aligned to the same feature order used in the prediction matrices; condition names match
levels(meta[[condition]])
.plots
A list of all plots generated during the run, corresponding to the visualizations shown in the HTML report produced by this function. Additionally, this plots list also contains the plots showing the consensus clusters of the potential clustering of the interaction of condition and time (category 3) hits.