Gene-Centric Multi-Omics Clustering Across Blocks and Layers
Source: R/cluster_genes_multiomics.R
Performs gene-centric clustering of multi-omics time-series data across multiple blocks (e.g., time effect, interaction effect) and data modalities (e.g., transcript, protein, feature-level modalities with many-to-one gene mapping). The function integrates multiple modalities within each block by computing modality-specific gene–gene distances, combining them via user-defined weights, and clustering genes based on the resulting unified distance matrix.
The function is flexible with respect to how spline trajectories,
interaction representations, or feature-level signatures are constructed:
these are precomputed outside the function, and supplied as matrices inside
the blocks structure. Internally, the function simply performs
harmonization of genes, distance computation, weighted integration, and
clustering.
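The four internal steps can be sketched for a single block as follows. This is a minimal illustration only: the Euclidean distance and the average-linkage hierarchical clustering are assumptions made for this sketch, since the text above does not state which metric or algorithm the function uses, and the simulated matrices and weights are made up.

```r
# Illustrative sketch only: Euclidean distance and average-linkage
# hierarchical clustering are assumptions, not the package's documented method.
set.seed(1)
genes <- paste0("gene", 1:6)

# Two gene-level modalities of one block (rows = genes, columns = spline points).
rna <- matrix(rnorm(6 * 5), nrow = 6, dimnames = list(genes, NULL))
protein <- matrix(rnorm(6 * 5), nrow = 6, dimnames = list(genes, NULL))
layers <- list(rna = rna, protein = protein)

# Relative weights (as in layer_w), normalized so that they sum to 1.
w <- c(rna = 1, protein = 2)
w <- w / sum(w)

# Harmonize genes ("intersection"): keep genes present in every modality.
shared <- Reduce(intersect, lapply(layers, rownames))

# Modality-specific gene-gene distance matrices on the shared genes.
dists <- lapply(layers, function(m) as.matrix(dist(m[shared, , drop = FALSE])))

# Weighted combination into one unified distance matrix.
combined <- Reduce(`+`, Map(function(d, wi) wi * d, dists, w[names(dists)]))

# Cluster genes on the unified distances (k = 2).
fit <- hclust(as.dist(combined), method = "average")
clusters <- cutree(fit, k = 2)
```

Here `clusters` is a named integer vector with one cluster label per shared gene. Swapping in a different metric or clustering method only changes the `dist()` and `hclust()` calls; the weighting step stays the same.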
Usage
cluster_genes_multiomics(
blocks,
block_clusters,
modality_meta,
gene_mode = "intersection",
verbose = TRUE
)
Arguments
- blocks
A named nested list specifying all data used for clustering. The outer list corresponds to analytical blocks (e.g., time_Ctrl, interaction_Ctrl_vs_Treat). Each element of the outer list is itself a named list whose elements are modalities (e.g., rna, protein, phospho), each being a numeric matrix of dimension features x spline_points.
For one-to-one (gene-level) modalities, rows represent genes directly. In this case, row names must be the gene identifiers themselves and must follow the pattern <gene_id>. The angle brackets are shown for illustration only and must not be included in the actual row names. Gene identifiers must be consistent across all one-to-one layers; otherwise, genes cannot be matched across omics layers during distance computation and clustering.
For many-to-one modalities (e.g., phospho sites, probes), rows represent features that map to genes and are summarized into gene-level pattern signatures based on the metadata tables. For these modalities, row names must follow the pattern <gene_id>_<feature_id>, where the gene identifier precedes the first underscore. Again, the angle brackets are for illustration only and must not be included in the actual row names.
This row-naming convention is critical, as it defines how features are associated with genes and how genes are aligned across modalities prior to signature construction and downstream clustering.
- block_clusters
A named list specifying the number of clusters per block. The list names must match the names of blocks. Each element specifies the number of gene clusters (k) to compute for the corresponding block.
- modality_meta
A data frame containing modality-level metadata, with one row per (block × modality) combination. Must include the columns:
block: Block identifier, matching a name in blocks.
layer: Modality name within the block.
layer_k: Number of pattern clusters to use for building pattern signatures for many-to-one modalities. NA for modalities that are already gene-level.
layer_w: Relative weight of this modality when combining modality-specific gene–gene distances within the block. Values are treated as relative weights and are normalized internally, so they do not need to sum to 1 (e.g., 1, 1, 2 means the third modality has twice the weight of each of the others).
- gene_mode
Character string specifying how genes should be harmonized across modalities within each block prior to clustering.
"intersection": Retain only genes present in all modalities of the block. Produces the most interpretable multi-omics clusters.
"union": Retain genes present in any modality of the block. Gene–gene distances are computed using only shared modalities per gene pair, with weights renormalized accordingly. Increases coverage but results in heterogeneous information across genes.
- verbose
Logical flag indicating whether informational messages are shown.
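The row-naming convention and the two gene_mode settings described above can be illustrated with a short sketch. The row names, the weights, and all variable names below are hypothetical; the final step shows only one plausible reading of "weights renormalized accordingly" for a single gene pair and is not taken from the package source.

```r
# Hypothetical row names; the identifiers are made up for illustration.
rna_ids <- c("gene1", "gene2", "gene3")
phospho_ids <- c("gene2_S15", "gene2_T40", "gene3_Y7", "gene4_S2")

# Many-to-one modality: the gene identifier precedes the first underscore.
phospho_genes <- sub("_.*$", "", phospho_ids)

genes_by_layer <- list(rna = rna_ids, phospho = unique(phospho_genes))

# gene_mode = "intersection": genes present in all modalities.
genes_intersection <- Reduce(intersect, genes_by_layer)

# gene_mode = "union": genes present in any modality.
genes_union <- Reduce(union, genes_by_layer)

# Assumed reading of the union-mode renormalization for one gene pair:
# only modalities that contain both genes contribute, and their relative
# weights are rescaled to sum to 1.
w <- c(rna = 1, phospho = 2)
pair <- c("gene1", "gene2")
has_both <- sapply(genes_by_layer, function(g) all(pair %in% g))
w_pair <- w[has_both] / sum(w[has_both])
```

In this sketch the intersection keeps gene2 and gene3, the union keeps gene1 through gene4, and the pair (gene1, gene2) is compared on rna alone with weight 1. Under the stated convention, a gene identifier that itself contains an underscore would be cut at that underscore in a many-to-one modality.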
Value
A named list with two tibbles:
cluster_table: A tibble with one row per gene containing block-specific cluster assignments suitable for downstream enrichment analyses. Columns include the gene identifier and one clustering column per analytical block (e.g., cluster_<cond> or cluster_cat3_<cond1>_vs_<cond2>). Genes not included in a given block are assigned NA.
centroid_info: A tibble with one row per block, modality, and cluster, summarizing modality-specific cluster centroid trajectories and within-cluster coherence. Columns include the block and layer identifiers, cluster label, gene coverage statistics, mean and standard deviation of per-gene R^2 values, optional per-gene R^2 vectors, and the centroid trajectory stored as a list-column.
Examples
set.seed(1)
genes <- paste0("gene", 1:6)
rna_time_ctrl <- matrix(
  rnorm(6 * 5),
  nrow = 6,
  ncol = 5,
  dimnames = list(genes, NULL)
)
rna_time_treat <- matrix(
  rnorm(6 * 5),
  nrow = 6,
  ncol = 5,
  dimnames = list(genes, NULL)
)
blocks <- list(
  time_Ctrl = list(rna = rna_time_ctrl),
  time_Treat = list(rna = rna_time_treat)
)
block_clusters <- list(
  time_Ctrl = 2L,
  time_Treat = 2L
)
modality_meta <- data.frame(
  block = c("time_Ctrl", "time_Treat"),
  layer = c("rna", "rna"),
  layer_k = c(NA_real_, NA_real_),
  layer_w = c(1, 1),
  stringsAsFactors = FALSE
)
cluster_table <- cluster_genes_multiomics(
  blocks = blocks,
  block_clusters = block_clusters,
  modality_meta = modality_meta,
  gene_mode = "intersection"
)
#> [cluster_genes_multiomics] Block 1/2: 'time_Ctrl'
#> [block 'time_Ctrl'] building gene-level layer matrices...
#> [block 'time_Ctrl'] computing and combining layer-wise distance matrices...
#> [block 'time_Ctrl'] clustering genes (k = 2)...
#> [cluster_genes_multiomics] Block 2/2: 'time_Treat'
#> [block 'time_Treat'] building gene-level layer matrices...
#> [block 'time_Treat'] computing and combining layer-wise distance matrices...
#> [block 'time_Treat'] clustering genes (k = 2)...
#> [cluster_genes_multiomics] total runtime: 0.0081 secs
cluster_table
#> $cluster_table
#> # A tibble: 6 × 3
#> gene time_Ctrl time_Treat
#> <chr> <int> <int>
#> 1 gene1 1 2
#> 2 gene2 1 2
#> 3 gene3 1 1
#> 4 gene4 2 1
#> 5 gene5 1 2
#> 6 gene6 1 2
#>
#> $centroid_info
#> # A tibble: 4 × 11
#> block modality modality_type cluster n_genes_cluster n_genes_used coverage
#> <chr> <chr> <chr> <int> <int> <int> <dbl>
#> 1 time_Ctrl rna one_to_one 1 5 5 1
#> 2 time_Ctrl rna one_to_one 2 1 1 1
#> 3 time_Tre… rna one_to_one 1 2 2 1
#> 4 time_Tre… rna one_to_one 2 4 4 1
#> # ℹ 4 more variables: mean_R2 <dbl>, sd_R2 <dbl>, r2_member <I<list>>,
#> # centroid <I<list>>
#>
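The per-gene assignments can be turned into gene sets for downstream enrichment analyses. The sketch below does not call the package: it rebuilds the printed $cluster_table by hand as a plain data.frame (a stand-in for the tibble), and the names cluster_tbl and gene_sets_ctrl are hypothetical.

```r
# Stand-in for the $cluster_table element, copied from the printed output
# (a plain data.frame is used here instead of a tibble).
cluster_tbl <- data.frame(
  gene = paste0("gene", 1:6),
  time_Ctrl = c(1L, 1L, 1L, 2L, 1L, 1L),
  time_Treat = c(2L, 2L, 1L, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Gene sets per cluster for one block, e.g. as input to an enrichment tool.
gene_sets_ctrl <- split(cluster_tbl$gene, cluster_tbl$time_Ctrl)

# Cross-tabulate the assignments of the two blocks.
overlap <- table(ctrl = cluster_tbl$time_Ctrl, treat = cluster_tbl$time_Treat)
```

With the real return value, the same calls would apply to the cluster_table element of the returned list.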