Performs clustering on hits from top tables generated by differential expression analysis. This function filters hits based on adjusted p-value thresholds, extracts spline coefficients for significant features, normalizes these coefficients, and applies hierarchical clustering. The results, including clustering assignments and normalized spline curves, are saved in a specified directory and compiled into an HTML report.
Usage
cluster_hits(
splineomics,
nr_clusters,
nr_clusters_interaction = 1,
adj_pthresholds = c(0.05),
adj_pthresh_avrg_diff_conditions = 0.05,
adj_pthresh_interaction_condition_time = 0.05,
min_effect_size = list(time_effect = 0, avg_diff_cond = 0, interaction_cond_time = 0),
genes = NULL,
plot_info = list(y_axis_label = "Value", time_unit = "min", treatment_labels = NA,
treatment_timepoints = NA),
plot_options = list(cluster_heatmap_columns = FALSE, meta_replicate_column = NULL),
raw_data = NULL,
report_dir = here::here(),
report = TRUE,
max_hit_number = 25
)
Arguments
- splineomics
An S3 object of class `SplineOmics` that contains all the necessary data and parameters for the analysis, including:
data
: The data matrix with the values. The columns are the samples (timepoint + replicate combo) and the rows are the features (e.g. genes or proteins).meta
: A dataframe containing metadata corresponding to thedata
, must include a 'Time' column and any columns specified byconditions
. In general, the columns of meta correspond to the different types of metadata, and each row corresponds to a column of data (contains the metadata for that sample).annotation
: A dataframe that maps the rows ofdata
to annotation info, such as the gene name or database identifiers.report_info
: A named list describing the experiment. Must include the following fields: -"omics_data_type"
-"data_description"
-"data_collection_date"
-"analyst_name"
-"contact_info"
-"project_name"
May also include the following optional fields: -
"method_description"
-"results_summary"
-"conclusions"
design
: A character of length 1 representing the limma design formula.mode
: Specifies how the design formula is constructed: either `"isolated"` or `"integrated"`.- `"isolated"`: Each level is analyzed independently, using only the subset of data corresponding to that level. The design formula does not include the condition variable, since only one condition is present in each subset.
- `"integrated"`: All levels are analyzed together in a single model, using the full dataset. The design formula includes the condition variable (and optionally interaction terms with it) so that results are estimated jointly across all levels.
condition
: Character vector of length 1 specifying the column name inmeta
used to define groups for analysis.spline_params
: A list of spline parameters for the analysis.meta_batch_column
: A character string specifying the column name in the metadata used for batch effect removal.meta_batch2_column
: A character string specifying the second column name in the metadata used for batch effect removal.limma_splines_result
: A list of data frames, each representing a top table from differential expression analysis, containing at least 'adj.P.Val' and expression data columns.feature_name_columns
: Character vector of strings that each specify a column of the original data dataframe which were used to automatically build the feature names with theextract_data
function.
- nr_clusters
A list whose length matches `top_tables`; each element is a numeric vector of positive integers (e.g. `1:1`, `2:8`) giving the candidate number(s) of clusters for the corresponding condition level.
- nr_clusters_interaction
An integer or a numeric vector of integers (> 0) specifying how many clusters should be used for the clustering of the interaction of condition and time hits. When a vector is provided, then all cluster amounts in that range are tested and the best one automatically taken.
- adj_pthresholds
Numeric vector of p-value thresholds for filtering hits in each top table.
- adj_pthresh_avrg_diff_conditions
p-value threshold for the results from the average difference of the condition limma result. Per default 0 ( turned off).
- adj_pthresh_interaction_condition_time
p-value threshold for the results from the interaction of condition and time limma result. Per default 0 (turned off).
- min_effect_size
A named list that specifies the minimum effect size thresholds to consider a feature as biologically meaningful, in addition to statistical significance. This allows users to filter out "trivial" hits that pass adjusted p-value cutoffs but show negligible effect sizes.
The list must contain the following elements: - `time_effect`: Minimum cumulative travel for time effects (Category 1). Features with a smaller travel will be ignored even if significant. - `avg_diff_cond`: Minimum absolute effect size for average differences between conditions (Category 2). Ensures that only contrasts with a relevant magnitude are reported. - `interaction_cond_time`: Minimum effect size for the interaction between condition and time (Category 3). This controls how large the differential curve travel must be across conditions to count as a hit.
Values should be numeric scalars (typically >0). For example: `min_effect_size = list(time_effect = 1, avg_diff_cond = 1, interaction_cond_time = 2)` will only keep features with cumulative travels or condition-time differences above those cutoffs. Use smaller values (e.g., 0.1) for permissive filtering, or larger values for more conservative thresholds.
The default is 0 for all three elements.
- genes
A character vector containing the gene names of the features to be analyzed. The entries must be in the same order as they appear in data.
- plot_info
List containing the elements y_axis_label (string), time_unit (string), treatment_labels (character vector), treatment_timepoints (integer vector). All can also be NA. This list is used to add this info to the spline plots. time_unit is used to label the x-axis, and treatment_labels and -timepoints are used to create vertical dashed lines, indicating the positions of the treatments (such as feeding, temperature shift, etc.).
- plot_options
A named list controlling optional plot customization. The list can include one or both of the following entries (any not supplied will fall back to their default values):
cluster_heatmap_columns
(`logical`, default =FALSE
): Whether to cluster the columns in the heatmap.meta_replicate_column
(`character(1)`, default =NULL
): Name of the column inmeta
that encodes replicate information. If supplied, spline plot data points are colored by replicate, allowing replicate-level variation to be assessed.
- raw_data
Optional. Data matrix with the raw (unimputed) data, still containing NA values. When provided, it highlights the datapoints in the spline plots that originally where NA and that were imputed.
- report_dir
Character string specifying the directory path where the HTML report and any other output files should be saved. Default is the current working dir.
- report
Boolean TRUE or FALSE value specifying if a report should be generated.
- max_hit_number
Maximum number of hits which are plotted within each cluster. This can be used to limit the computation time and size of the HTML report in the case of many hits. Default is 25.
Value
A named list with three elements:
cluster_table
A tibble containing one row per
feature_nr
with metadata and cluster assignments across the analysis categories. The structure is:feature_nr
– Numeric feature identifier.feature_name
– Preferred feature name, prioritizing values from the limma tables, then from the cluster table row names, and falling back to the numeric feature ID.gene
– Preferred gene symbol from theannotation
table if available, otherwise taken from the cluster tables.cluster_<cond1>
/cluster_<cond2>
– Cluster assignments for each time-effect condition, named according to the elements ofclustered_hits_levels
.cluster_cat2
– Present only if category 2 results are available; a combined cluster label in the form"<cluster_<cond1>>_<cluster_<cond2>>"
for features that are significant in category 2. If this value isNA
, the feature was not a category 2 hit.cluster_cat3
– Present only if category 3 results are available; a combined cluster label in the form"<cluster_<cond1>>_<cluster_<cond2>>"
for features that are significant in category 3. If this value isNA
, the feature was not a category 3 hit.
For any category-specific cluster column (
cluster_<cond1>
,cluster_<cond2>
,cluster_cat2
,cluster_cat3
), a value ofNA
indicates that the feature was not significant (not a hit) in that category.spline_results
A named list summarizing the fitted spline trajectories, their shared time grid, and effect-size based significance flags. Structure:
time_grid
Numeric vector of length \(T\) giving the common time points (e.g., hours since cultivation start) on which all splines were predicted.
predictions
Named list by condition (e.g.,
constant
,temp_shift
). Each entry is a numeric matrix of size \(N \times T\) with rows corresponding to features and columns totime_grid
. Values are the predicted spline trajectories on the absolute scale used in the analysis (e.g., log2-CPM aftervoom
). Row order matches the feature order used throughout the analysis.time_effect_effect_size
Named list by condition with a numeric vector (length \(N\)) per condition giving the cumulative travel (integrated temporal change) of each feature’s spline across
time_grid
. Larger values indicate stronger within-condition temporal modulation.time_effect_passed_threshold
Named list by condition with a logical vector (length \(N\)) per condition indicating whether the corresponding
time_effect_effect_size
exceeds the user-defined effect-size threshold (i.e., time-effect hits).interaction_effect_size
Numeric vector (length \(N\)) giving the differential cumulative travel between the two condition-specific splines of each feature, computed on the same
time_grid
. Larger values indicate stronger differences in temporal behaviour between conditions (condition–time interaction).interaction_passed_threshold
Logical vector (length \(N\)) indicating whether
interaction_effect_size
exceeds the interaction effect-size threshold (i.e., features with significantly different temporal profiles across conditions).
Unless stated otherwise, vectors are aligned to the same feature order used in the prediction matrices; condition names match
levels(meta[[condition]])
.plots
A list of all plots generated during the run, corresponding to the visualizations shown in the HTML report produced by this function. Additionally, this plots list also contains the plots showing the consensus clusters of the potential clustering of the interaction of condition and time (category 3) hits.