This function generates an overrepresentation analysis (ORA) report based on clustered hit levels, gene data, and specified databases, using the R package clusterProfiler. It returns a list of the plot objects it generated and writes an HTML report with embedded files containing the enrichment results and dotplots visualizing the enrichment.
Usage
run_ora(
cluster_table,
databases,
report_info,
cluster_hits_report_name,
clusterProfiler_params = NA,
mapping_cfg = list(method = "none", from_species = NULL, to_species = NULL),
enrichGO_cfg = NULL,
universe = NULL,
report_dir = NULL
)
Arguments
- cluster_table
A tibble containing one row per feature_nr with metadata and cluster assignments across the analysis categories. It includes:
feature_nr – Numeric feature identifier.
feature_name – Preferred feature name from the source data, falling back to the numeric ID if none is available.
gene – Preferred gene symbol from the annotation or cluster data.
cluster_<cond1> / cluster_<cond2> – Cluster assignments for each time-effect condition.
cluster_cat2 – (Optional) Combined cluster label for category 2 hits in the form "<cluster_<cond1>>_<cluster_<cond2>>"; NA if the feature was not a category 2 hit.
cluster_cat3 – (Optional) Combined cluster label for category 3 hits in the form "<cluster_<cond1>>_<cluster_<cond2>>"; NA if the feature was not a category 3 hit.
For any category-specific cluster column, a value of NA indicates that the feature was not significant (not a hit) in that category.
- databases
A data.frame that defines the gene set collections to be tested in the overrepresentation analysis. Must contain exactly three columns:
- DB
Character. The database identifier (e.g., KEGG, GO_BP, Reactome).
- Geneset
Character. The name of the gene set or pathway within the database.
- Gene
Character. A gene identifier belonging to the gene set (e.g., gene symbol, Ensembl ID).
Each row corresponds to one `(database, geneset, gene)` association. The same gene may appear in multiple gene sets.
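As a minimal sketch of the databases layout described above, the following builds a small long-format table. The gene sets and gene symbols are made up for illustration and do not come from any real database export.

```r
# Hypothetical gene set collections (illustrative values only).
# One row per (database, geneset, gene) association.
databases <- data.frame(
  DB      = c("KEGG", "KEGG", "GO_BP", "GO_BP", "GO_BP"),
  Geneset = c("Glycolysis", "Glycolysis", "Cell cycle", "Cell cycle", "Apoptosis"),
  Gene    = c("HK1", "PFKM", "CDK1", "TP53", "TP53"),
  stringsAsFactors = FALSE
)

# Shape checks matching the description: exactly three character columns.
stopifnot(
  identical(colnames(databases), c("DB", "Geneset", "Gene")),
  all(vapply(databases, is.character, logical(1)))
)
```

Note that TP53 appears in two gene sets here, which is allowed.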
- report_info
A list containing information for report generation, such as omics_data_type and data_description (the same list used by all report-generating functions of this package).
- cluster_hits_report_name
Single character string specifying the name of the cluster_hits() report that contains the results used for this overrepresentation analysis. It must be specified so that the connection between the two reports is documented.
- clusterProfiler_params
A named list of arguments passed directly to the corresponding functions in the clusterProfiler package. Typical entries include pvalueCutoff, pAdjustMethod, minGSSize, maxGSSize, and qvalueCutoff. The names must match the argument names in clusterProfiler; see the clusterProfiler documentation for details. If left at its default, the standard clusterProfiler defaults are used.
- mapping_cfg
A named list that controls the optional behavior of automatically mapping gene symbols across species. This is useful when your input gene symbols (e.g., from CHO cells) do not match the species used by the enrichment databases (e.g., human or mouse). By default, no mapping is performed and gene symbols are used as-is. If mapping is desired, this list must contain the following three elements:
- method
Mapping method to use. One of none (default; no mapping), gprofiler (online, via the g:Profiler API), or orthogene (offline, if installed).
- from_species
Source species code, e.g. cgriseus for CHO. Must match the expected format for the selected tool.
- to_species
Target species code, e.g. hsapiens for human. This must be the species used in your ORA database and must also match the expected format for the selected tool.
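The two typical mapping_cfg configurations described above might look as follows. This is a sketch only: the species codes are the examples given in the text, and the variable names are hypothetical.

```r
# Default behavior: no cross-species mapping, gene symbols are used as-is.
mapping_none <- list(
  method       = "none",
  from_species = NULL,
  to_species   = NULL
)

# Mapping CHO gene symbols to human via g:Profiler (requires internet access).
mapping_cho_to_human <- list(
  method       = "gprofiler",
  from_species = "cgriseus",
  to_species   = "hsapiens"
)

# Both lists carry the same three named elements.
stopifnot(
  identical(names(mapping_none), names(mapping_cho_to_human))
)
```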
- enrichGO_cfg
A named list specifying the configuration for running GO enrichment with Bioconductor's enrichGO. This is only needed when you want to perform GO Biological Process (BP), Molecular Function (MF), or Cellular Component (CC) enrichment using Bioconductor's organism databases (e.g., org.Mm.eg.db for mouse).
The list must be named according to the GO ontology, e.g., "GO_BP", "GO_MF", "GO_CC". Each entry must provide:
OrgDb: The organism database, e.g., org.Mm.eg.db.
keyType: The gene identifier type, e.g., "SYMBOL".
ontology: One of "BP", "MF", or "CC".
If enrichGO_cfg is NULL (default), no Bioconductor-based GO enrichment is performed, and all enrichment runs through enricher with the provided TERM2GENE mappings.
- universe
Background gene set for the enrichment. This is a clusterProfiler parameter; see the documentation of the clusterProfiler R package for details.
- report_dir
Character string specifying the directory path where the HTML report and any other output files should be saved. If no path is specified, the function still runs but no HTML report is generated.
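The clusterProfiler_params and enrichGO_cfg arguments described above could be assembled as in the sketch below. The parameter values are illustrative. Passing the OrgDb object itself (rather than its name) is an assumption, and the GO configuration is only built if the org.Mm.eg.db package happens to be installed.

```r
# Named list whose names match clusterProfiler argument names.
# Values are illustrative, not recommendations.
clusterProfiler_params <- list(
  pvalueCutoff  = 0.05,
  pAdjustMethod = "BH",
  minGSSize     = 10,
  maxGSSize     = 500,
  qvalueCutoff  = 0.2
)

# Optional GO enrichment config, keyed by GO ontology name.
# Assumption: the OrgDb entry is the organism database object.
enrichGO_cfg <- NULL
if (requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
  enrichGO_cfg <- list(
    GO_BP = list(
      OrgDb    = org.Mm.eg.db::org.Mm.eg.db,
      keyType  = "SYMBOL",
      ontology = "BP"
    )
  )
}
```

Both objects would then be passed to run_ora() through the arguments of the same name. Leaving enrichGO_cfg as NULL keeps all enrichment on the databases table.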
Value
A nested, named list whose top-level elements correspond to the limma result categories. The exact set of elements depends on mode:
mode == "isolated"
Two elements are returned, one per condition level: time_effect_condition_<level1> and time_effect_condition_<level2>.
mode == "integrated"
The two time-effect elements above, plus (only if there are significant hits at the chosen thresholds) up to two additional elements: avrg_diff_conditions and interaction_condition_time.
Each top-level result category element is a list with the fields:
dotplot
A ggplot object: the dot plot of overrepresentation results (clusterProfiler) for that category.
dotplot_nrows
Numeric scalar giving a suggested plot height (in rows / relative units) that prints nicely for the number of enriched terms shown.
ora_results
A nested list of the raw enrichment results, structured as:
- cluster → database
For each cluster in the category, there is a sublist with one entry per database used in the enrichment. The value of each entry is either NA (no terms enriched for that cluster-database pair) or a data.frame as returned by clusterProfiler::enricher() for the enriched terms.
In summary, the full shape is:
list(
  time_effect_condition_<level1> = list(
    dotplot = ggplot,
    dotplot_nrows = numeric(1),
    ora_results = list(
      <cluster_1> = list(<database_1> = NA|data.frame, ...),
      <cluster_2> = list(<database_1> = NA|data.frame, ...),
      ...
    )
  ),
  time_effect_condition_<level2> = list(...),
  avrg_diff_conditions = list(...),        # only if hits (integrated mode)
  interaction_condition_time = list(...)   # only if hits (integrated mode)
)
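To illustrate how the nested return value might be consumed, the sketch below walks a mock result of the documented shape and stacks all non-NA enrichment tables into one data.frame. The mock values, the condition level name, and the helper collect_ora() are hypothetical, and the mock tables contain only two of the columns a real enrichment result would have.

```r
# Mock result following the documented shape (made-up values).
mock_result <- list(
  time_effect_condition_control = list(
    dotplot       = NULL,  # a ggplot object in a real result
    dotplot_nrows = 2,
    ora_results   = list(
      cluster_1 = list(
        KEGG  = data.frame(ID = "Glycolysis", p.adjust = 0.01),
        GO_BP = NA
      ),
      cluster_2 = list(KEGG = NA, GO_BP = NA)
    )
  )
)

# Hypothetical helper: flatten category -> cluster -> database into one table.
collect_ora <- function(result) {
  rows <- list()
  for (category in names(result)) {
    ora <- result[[category]]$ora_results
    for (cluster in names(ora)) {
      for (db in names(ora[[cluster]])) {
        entry <- ora[[cluster]][[db]]
        # NA marks "no enriched terms"; keep only data.frame entries.
        if (is.data.frame(entry)) {
          rows[[length(rows) + 1]] <- data.frame(
            category = category,
            cluster  = cluster,
            database = db,
            entry,
            stringsAsFactors = FALSE
          )
        }
      }
    }
  }
  if (length(rows) == 0) return(NULL)
  do.call(rbind, rows)
}

enriched <- collect_ora(mock_result)
```

Checking is.data.frame() rather than testing for NA avoids warnings, because is.na() on a data.frame returns a matrix instead of a single logical value.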