Perform over-representation analysis with the results from cluster_hits()
Source: R/run_ora.R
This function generates an over-representation analysis (ORA) report based on clustered hit levels, gene data, and specified databases, using the R package clusterProfiler. It returns a list of the plot objects it generated and writes an HTML report with embedded files containing the enrichment results and dot plots visualizing the enrichment.
Usage
run_ora(
  cluster_table,
  databases,
  report_info,
  cluster_hits_report_name,
  clusterProfiler_params = NA,
  mapping_cfg = list(method = "none", from_species = NULL, to_species = NULL),
  enrichGO_cfg = NULL,
  universe = NULL,
  report_dir = NULL,
  verbose = TRUE
)
Arguments
- cluster_table
tibble: A tibble containing one row per feature_nr with metadata and cluster assignments across the analysis categories. It includes:
- feature_nr
numeric(1) Numeric feature identifier.
- feature_name
character(1) Preferred feature name from the source data, falling back to the numeric ID if none is available.
- gene
character(1) Preferred gene symbol from the annotation or cluster data.
- cluster_<cond1> / cluster_<cond2>
integer(1) Cluster assignments for each time-effect condition.
- cluster_cat2
character(1) (Optional) Combined cluster label for category 2 hits in the form "<cluster_<cond1>>_<cluster_<cond2>>"; NA if the feature was not a category 2 hit.
- cluster_cat3
character(1) (Optional) Combined cluster label for category 3 hits in the form "<cluster_<cond1>>_<cluster_<cond2>>"; NA if the feature was not a category 3 hit.
For any category-specific cluster column, a value of NA indicates that the feature was not significant (not a hit) in that category.
- databases
data.frame: A data.frame that defines the gene set collections to be tested in the over-representation analysis. Must contain exactly three columns:
- DB
character(1) The database identifier (e.g., KEGG, GO_BP, Reactome).
- Geneset
character(1) The name of the gene set or pathway within the database.
- Gene
character(1) A gene identifier belonging to the gene set (e.g., gene symbol, Ensembl ID).
Each row corresponds to one (database, geneset, gene) association. The same gene may appear in multiple gene sets.
- report_info
list: A list containing information for the report generation, such as omics_data_type and data_description (this is the list used for all report-generating functions of this package).
- cluster_hits_report_name
character(1): Single character string specifying the name of the cluster_hits() report that contains the results used for this over-representation analysis. It must be specified so that the connection between the two reports is documented.
- clusterProfiler_params
list|NULL: A named list of arguments passed directly to the corresponding functions in the clusterProfiler package. Typical entries include pvalueCutoff, pAdjustMethod, minGSSize, maxGSSize, and qvalueCutoff. The names must match the argument names in clusterProfiler; see the clusterProfiler documentation for details. If NULL (default), the standard clusterProfiler defaults are used.
- mapping_cfg
list|NULL: A named list that controls the optional behavior of automatically mapping gene symbols across species. This is useful when your input gene symbols (e.g., from CHO cells) do not match the species used by the enrichment databases (e.g., human or mouse). By default, no mapping is performed and gene symbols are used as-is. If mapping is desired, this list must contain the following three elements:
- method
character(1): Mapping method to use. One of none (default; no mapping), gprofiler (online, via the g:Profiler API), or orthogene (offline, if installed).
- from_species
character(1): Source species code, e.g. cgriseus for CHO. Must match the expected format for the selected tool.
- to_species
character(1): Target species code, e.g. hsapiens for human. This must be the species used in your ORA database and must also match the expected format for the selected tool.
- enrichGO_cfg
list|NULL: A named list specifying the configuration for running GO enrichment with Bioconductor's enrichGO. This is only needed when you want to perform GO Biological Process (BP), Molecular Function (MF), or Cellular Component (CC) enrichment using Bioconductor's organism databases (e.g., org.Mm.eg.db for mouse).
The list must be named according to the GO ontology, e.g., "GO_BP", "GO_MF", "GO_CC". Each entry must provide:
- OrgDb
character(1) The organism database, e.g., org.Mm.eg.db.
- keyType
character(1) The gene identifier type, e.g., "SYMBOL".
- ontology
character(1) One of "BP", "MF", or "CC".
If enrichGO_cfg is NULL (default), no Bioconductor-based GO enrichment is performed. All enrichment runs through enricher with the provided TERM2GENE mappings.
- universe
character()|NULL: Enrichment background data. This is a parameter of clusterProfiler; for details, please check the documentation of the clusterProfiler R package.
- report_dir
character(1): Character string specifying the directory path where the HTML report and any other output files should be saved. When no path is specified, the function runs but no HTML report is generated.
- verbose
logical(1): Boolean flag controlling the display of messages.
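The argument formats described above can be sketched in base R as follows. This is only an illustration: the `gene_sets` list, the gene names, and the database name "ToyDB" are hypothetical, while the species codes and `org.Mm.eg.db` are the examples given in the argument descriptions. The two configuration lists are shown independently of each other.

```r
# Hypothetical gene sets, stored as a named list (set name -> member genes).
gene_sets <- list(
  SetA = c("G1", "G2", "G7"),
  SetB = c("G3", "G4", "G6")
)

# Long format expected by `databases`: one row per (DB, Geneset, Gene).
databases <- data.frame(
  DB      = "ToyDB",
  Geneset = rep(names(gene_sets), times = lengths(gene_sets)),
  Gene    = unlist(gene_sets, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Optional cross-species mapping, e.g. CHO symbols to human symbols.
mapping_cfg <- list(
  method       = "gprofiler",
  from_species = "cgriseus",
  to_species   = "hsapiens"
)

# Optional Bioconductor-based GO enrichment (independent of mapping_cfg
# above); each entry is named after the GO ontology it configures.
enrichGO_cfg <- list(
  GO_BP = list(
    OrgDb    = "org.Mm.eg.db",
    keyType  = "SYMBOL",
    ontology = "BP"
  )
)
```

Building `databases` from a named list keeps the gene set definitions readable while still producing the required three-column long format.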
Value
A nested, named list whose top-level elements correspond to the limma result categories. The exact set of elements depends on mode:
- mode == "isolated"
Two elements are returned, one per condition level: time_effect_condition_<level1> and time_effect_condition_<level2>.
- mode == "integrated"
The two time-effect elements above, plus (only if there are significant hits at the chosen thresholds) up to two additional elements: avrg_diff_conditions and interaction_condition_time. Note that the clusters of interaction_condition_time are "combo-clusters" made of the cluster membership of the feature in condition 1 and the membership of the same feature in condition 2 (see also the documentation of cluster_hits() regarding the cluster_table). For example, if the report generated by this function has the entries 'time_effect_condition_control' and 'time_effect_condition_treatment', and the section 'interaction_condition_time' contains entries such as 'cluster_4_2', then the first number (4) is the cluster of condition control, and the second number (2) is the cluster of condition treatment.
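A combo-cluster label can be split back into its two per-condition memberships. The following is a minimal sketch in base R, assuming labels follow the 'cluster_<n>_<m>' pattern from the example above exactly; the labels and the column names `cluster_c1` and `cluster_c2` are hypothetical.

```r
# Hypothetical combo-cluster labels as they might appear under
# interaction_condition_time.
combo_labels <- c("cluster_4_2", "cluster_1_3")

# Strip the "cluster_" prefix, then split on "_" into the two memberships.
parts <- strsplit(sub("^cluster_", "", combo_labels), "_", fixed = TRUE)

# First number: cluster in condition 1; second number: cluster in condition 2.
combo_clusters <- data.frame(
  label      = combo_labels,
  cluster_c1 = as.integer(vapply(parts, `[`, character(1), 1)),
  cluster_c2 = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
```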
Each top-level result category element is a list with the fields:
- dotplot
A ggplot object: the dot plot of over-representation results (clusterProfiler) for that category.
- dotplot_nrows
Numeric scalar giving a suggested plot height (in rows / relative units) that prints nicely for the number of enriched terms shown.
- ora_results
A nested list of the raw enrichment results, structured as:
- cluster → database
For each cluster in the category, there is a sublist with one entry per database used in the enrichment. The value of each entry is either NA (no terms enriched for that cluster-database pair) or a data.frame as returned by clusterProfiler::enricher() for the enriched terms.
In summary, the full shape is:
list(
  time_effect_condition_<level1> = list(
    dotplot = ggplot,
    dotplot_nrows = numeric(1),
    ora_results = list(
      <cluster_1> = list(<database_1> = NA|data.frame, ...),
      <cluster_2> = list(<database_1> = NA|data.frame, ...),
      ...
    )
  ),
  time_effect_condition_<level2> = list(...),
  avrg_diff_conditions = list(...),
  interaction_condition_time = list(...)
)
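Given this shape, the enrichment tables can be collected into one long data.frame by walking category, cluster, and database in turn. The sketch below uses a hand-built mock object rather than real run_ora() output: the `dotplot` field is omitted, and the columns `ID` and `p.adjust` are just a small stand-in for the columns of an enricher result.

```r
# Mock object following the documented shape (dotplot omitted); not real output.
res <- list(
  time_effect_condition_A = list(
    dotplot_nrows = 2,
    ora_results = list(
      cluster_1 = list(ToyDB = data.frame(ID = "SetA", p.adjust = 0.01)),
      cluster_2 = list(ToyDB = NA)
    )
  )
)

# Flatten category -> cluster -> database into one long data.frame,
# skipping cluster-database pairs with no enriched terms (NA).
rows <- list()
for (category in names(res)) {
  ora <- res[[category]]$ora_results
  for (cluster in names(ora)) {
    for (db in names(ora[[cluster]])) {
      hit <- ora[[cluster]][[db]]
      if (is.data.frame(hit)) {
        rows[[length(rows) + 1]] <- cbind(
          category = category, cluster = cluster, database = db, hit
        )
      }
    }
  }
}

# NULL if nothing was enriched anywhere; otherwise one row per enriched term.
enriched <- do.call(rbind, rows)
```

Testing with `is.data.frame()` rather than `is.na()` avoids the vectorized result that `is.na()` would give when applied to a data.frame.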
Examples
{
  set.seed(1)
  # --- toy cluster table (two "conditions") ------------------------------
  toy_genes <- paste0("G", 1:8)
  cluster_table <- tibble::tibble(
    feature_nr = 1:8,
    feature_name = paste0("feat_", 1:8),
    gene = toy_genes,
    cluster_condA = c(1, 1, 2, 2, NA, NA, 1, 2),
    cluster_condB = c(NA, 1, NA, 2, 1, 2, 1, NA)
  )
  # --- toy TERM2GENE database -------------------------------------------
  databases <- data.frame(
    DB = rep("ToyDB", 6),
    Geneset = c(rep("SetA", 3), rep("SetB", 3)),
    Gene = c("G1", "G2", "G7", "G3", "G4", "G6"),
    stringsAsFactors = FALSE
  )
  # --- minimal report info ----------------------------------------------
  report_info <- list(
    omics_data_type = "TOY",
    data_description = "Toy dataset for run_ora() example",
    data_collection_date = "2025",
    analyst_name = "Example Analyst",
    contact_info = "analyst@example.org",
    project_name = "ToyProject"
  )
  # --- output directory (temp) -------------------------------------------
  report_dir <- file.path(tempdir(), "run_ora_demo")
  dir.create(report_dir, showWarnings = FALSE, recursive = TRUE)
  # --- permissive params for tiny example --------------------------------
  clusterProfiler_params <- list(
    pvalueCutoff = 1,
    qvalueCutoff = 1,
    minGSSize = 1,
    maxGSSize = 500
  )
  # --- run ORA -----------------------------------------------------------
  res <- run_ora(
    cluster_table = cluster_table,
    databases = databases,
    report_info = report_info,
    cluster_hits_report_name = "cluster_hits_demo",
    clusterProfiler_params = clusterProfiler_params,
    report_dir = report_dir,
    verbose = TRUE
  )
  # see sections and files written
  names(res)
  list.files(report_dir, recursive = TRUE)
}
#>
#>
#>
#> Running clusterProfiler for: cluster_condA
#>
#> Cluster: cluster_1
#> Database: ToyDB
#> Foreground genes:3
#> Foreground genes overlapping with database: 3 (100%)
#>
#> Cluster: cluster_2
#> Database: ToyDB
#> Foreground genes:3
#> Foreground genes overlapping with database: 2 (66.7%)
#>
#>
#>
#> Running clusterProfiler for: cluster_condB
#>
#> Cluster: cluster_1
#> Database: ToyDB
#> Foreground genes:3
#> Foreground genes overlapping with database: 2 (66.7%)
#>
#> Cluster: cluster_2
#> Database: ToyDB
#> Foreground genes:2
#> Foreground genes overlapping with database: 2 (100%)
#>
#> Info ORA analysis completed successfully.
#> Your HTML reports are located in the directory: /tmp/Rtmpyq9FLq/run_ora_demo .
#> Please note that due to embedded files, the reports might be flagged as
#> harmful by other software. Rest assured that they provide no harm.
#> [1] "run_ora_report_TOY_30_10_2025-11_25_17.html"