This function generates an overrepresentation analysis (ORA) report based on clustered hit levels, gene data, and the specified databases, using the R package clusterProfiler. It returns a list of the plot objects it generated and writes an HTML report with embedded files containing the enrichment results and dot plots visualizing the enrichment.

Usage

run_ora(
  cluster_table,
  databases,
  report_info,
  cluster_hits_report_name,
  clusterProfiler_params = NA,
  mapping_cfg = list(method = "none", from_species = NULL, to_species = NULL),
  enrichGO_cfg = NULL,
  universe = NULL,
  report_dir = NULL
)

Arguments

cluster_table

A tibble containing one row per feature_nr with metadata and cluster assignments across the analysis categories. It includes:

  • feature_nr – Numeric feature identifier.

  • feature_name – Preferred feature name from the source data, falling back to the numeric ID if none is available.

  • gene – Preferred gene symbol from the annotation or cluster data.

  • cluster_<cond1> / cluster_<cond2> – Cluster assignments for each time-effect condition.

  • cluster_cat2 – (Optional) Combined cluster label for category 2 hits in the form "<cluster_<cond1>>_<cluster_<cond2>>"; NA if the feature was not a category 2 hit.

  • cluster_cat3 – (Optional) Combined cluster label for category 3 hits in the form "<cluster_<cond1>>_<cluster_<cond2>>"; NA if the feature was not a category 3 hit.

For any category-specific cluster column, a value of NA indicates that the feature was not significant (not a hit) in that category.
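As an illustration of the layout described above, here is a minimal hand-built sketch. The condition names `control` and `treated`, the gene symbols, and the integer cluster labels are hypothetical stand-ins; a real table comes from the upstream clustering step. A base `data.frame` is used to avoid extra dependencies, and `tibble::as_tibble()` would convert it to a tibble.

```r
# Hypothetical toy cluster table; "control" and "treated" stand in
# for <cond1> and <cond2>, and all values are made up.
cluster_table <- data.frame(
  feature_nr      = c(1, 2, 3, 4),
  feature_name    = c("Actb", "Gapdh", "Eno1", "Tp53"),
  gene            = c("Actb", "Gapdh", "Eno1", "Tp53"),
  cluster_control = c(1L, 2L, NA, 1L),  # NA = not a hit in this condition
  cluster_treated = c(1L, NA, NA, 3L),
  stringsAsFactors = FALSE
)

# Keep only features that are hits (non-NA cluster) for "control"
hits_control <- cluster_table[!is.na(cluster_table$cluster_control), ]

# Group gene symbols by cluster: one foreground gene set per cluster
genes_by_cluster <- split(hits_control$gene, hits_control$cluster_control)
```

Here `genes_by_cluster` holds one character vector of genes per cluster, which is the kind of foreground set an overrepresentation test compares against each gene set.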

databases

A data.frame that defines the gene set collections to be tested in the overrepresentation analysis. Must contain exactly three columns:

DB

Character. The database identifier (e.g., KEGG, GO_BP, Reactome).

Geneset

Character. The name of the gene set or pathway within the database.

Gene

Character. A gene identifier belonging to the gene set (e.g., gene symbol, Ensembl ID).

Each row corresponds to one `(database, geneset, gene)` association. The same gene may appear in multiple gene sets.
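A small sketch of a table in this long format may help. The gene set names and gene symbols below are invented for illustration. Splitting by `DB` into two-column term-to-gene tables is only an assumption about how such a table could be fed to clusterProfiler, not a description of what `run_ora()` does internally.

```r
# Toy gene set table: one row per (database, geneset, gene) association.
# All entries are illustrative, not real pathway annotations.
databases <- data.frame(
  DB      = c("KEGG", "KEGG", "KEGG", "GO_BP", "GO_BP"),
  Geneset = c("Glycolysis", "Glycolysis", "Apoptosis",
              "cell cycle", "cell cycle"),
  Gene    = c("Gapdh", "Eno1", "Tp53", "Tp53", "Ccnd1"),
  stringsAsFactors = FALSE
)

# Check the documented three-column layout
stopifnot(identical(colnames(databases), c("DB", "Geneset", "Gene")))

# One (Geneset, Gene) table per database
term2gene_by_db <- split(databases[, c("Geneset", "Gene")], databases$DB)
```

Note that `Tp53` appears in two gene sets, which the format explicitly allows.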

report_info

A list containing information for the report generation, such as omics_data_type and data_description (the same list is used by all report-generating functions of this package).

cluster_hits_report_name

A single character string specifying the name of the cluster_hits() report that contains the results used for this overrepresentation analysis. It must be specified, because otherwise the connection between the two reports is not documented.

clusterProfiler_params

A named list of arguments passed directly to the corresponding functions in the clusterProfiler package. Typical entries include pvalueCutoff, pAdjustMethod, minGSSize, maxGSSize, and qvalueCutoff. The names must match the argument names in clusterProfiler; see the clusterProfiler documentation for details. If the argument is left at its default (NA, as shown in the usage above), the standard clusterProfiler defaults are used.
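For example, such a list might look like the sketch below. The values shown are the defaults of `clusterProfiler::enricher()` as far as I am aware, so treat them as a starting point rather than a recommendation. The optional name check is an extra safeguard added here for illustration and is not part of `run_ora()`.

```r
# Example parameter list; names mirror arguments of clusterProfiler::enricher()
clusterProfiler_params <- list(
  pvalueCutoff  = 0.05,   # adjusted p-value threshold
  pAdjustMethod = "BH",   # Benjamini-Hochberg correction
  minGSSize     = 10,     # smallest gene set size to test
  maxGSSize     = 500,    # largest gene set size to test
  qvalueCutoff  = 0.2     # q-value threshold
)

# Optional sanity check (only runs if clusterProfiler is installed):
# list any names that are not arguments of enricher()
unknown <- character(0)
if (requireNamespace("clusterProfiler", quietly = TRUE)) {
  unknown <- setdiff(
    names(clusterProfiler_params),
    names(formals(clusterProfiler::enricher))
  )
}
```

A non-empty `unknown` would point to a misspelled argument name before the analysis is started.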

mapping_cfg

A named list that controls the optional behavior of automatically mapping gene symbols across species. This is useful when your input gene symbols (e.g., from CHO cells) do not match the species used by the enrichment databases (e.g., human or mouse). By default, no mapping is performed and gene symbols are used as-is. If mapping is desired, this list must contain the following three elements:

method

Mapping method to use. One of none (default; no mapping), gprofiler (online, via the g:Profiler API), or orthogene (offline, if installed).

from_species

Source species code, e.g. cgriseus for CHO. Must match the expected format for the selected tool.

to_species

Target species code, e.g. hsapiens for human. This must be the species used in your ORA database and must also match the expected format for the selected tool.
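The two configurations described above can be sketched as follows. `check_mapping_cfg()` is a hypothetical helper written for this example only; it mirrors the documented requirements and is not part of the package.

```r
# Default: no mapping, gene symbols are used as-is
mapping_cfg_none <- list(
  method = "none", from_species = NULL, to_species = NULL
)

# CHO symbols mapped to human via g:Profiler (species codes from the text)
mapping_cfg_cho <- list(
  method       = "gprofiler",
  from_species = "cgriseus",
  to_species   = "hsapiens"
)

# Hypothetical validator reflecting the documented requirements
check_mapping_cfg <- function(cfg) {
  stopifnot(
    is.list(cfg),
    all(c("method", "from_species", "to_species") %in% names(cfg)),
    cfg$method %in% c("none", "gprofiler", "orthogene")
  )
  # Species codes are only needed when a mapping is requested
  if (cfg$method != "none") {
    stopifnot(is.character(cfg$from_species), is.character(cfg$to_species))
  }
  invisible(TRUE)
}

check_mapping_cfg(mapping_cfg_none)
check_mapping_cfg(mapping_cfg_cho)
```

Keeping all three names in the list, even when the species are `NULL`, matches the default shown in the usage section.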

enrichGO_cfg

A named list specifying the configuration for running GO enrichment with Bioconductor's enrichGO. This is only needed when you want to perform GO Biological Process (BP), Molecular Function (MF), or Cellular Component (CC) enrichment using Bioconductor's organism databases (e.g., org.Mm.eg.db for mouse).

The list must be named according to the GO ontology, e.g., "GO_BP", "GO_MF", "GO_CC". Each entry must provide:

  • OrgDb: The organism database, e.g., org.Mm.eg.db.

  • keyType: The gene identifier type, e.g., "SYMBOL".

  • ontology: One of "BP", "MF", or "CC".

If enrichGO_cfg is NULL (default), no Bioconductor-based GO enrichment is performed. All enrichment runs through enricher with the provided TERM2GENE mappings.
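A configuration for all three mouse GO ontologies might be sketched as below. Passing the OrgDb object itself (rather than its name as a string) is an assumption here; the text above only says that each entry provides "the organism database". The guard keeps the snippet runnable when `org.Mm.eg.db` is not installed.

```r
# Build the configuration only if the mouse annotation package is available
if (requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
  mouse_db <- org.Mm.eg.db::org.Mm.eg.db  # assumed: pass the OrgDb object

  enrichGO_cfg <- list(
    GO_BP = list(OrgDb = mouse_db, keyType = "SYMBOL", ontology = "BP"),
    GO_MF = list(OrgDb = mouse_db, keyType = "SYMBOL", ontology = "MF"),
    GO_CC = list(OrgDb = mouse_db, keyType = "SYMBOL", ontology = "CC")
  )
} else {
  # Fall back to the default: no Bioconductor-based GO enrichment
  enrichGO_cfg <- NULL
}
```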

universe

Enrichment background data. This is a clusterProfiler parameter; see the documentation of the clusterProfiler R package for details.

report_dir

Character string specifying the directory path where the HTML report and any other output files should be saved. If no path is specified, the function still runs, but no HTML report is generated.

Value

A nested, named list whose top-level elements correspond to the limma result categories. The exact set of elements depends on mode:

mode == "isolated"

Two elements are returned, one per condition level: time_effect_condition_<level1> and time_effect_condition_<level2>.

mode == "integrated"

The two time–effect elements above, plus (only if there are significant hits at the chosen thresholds) up to two additional elements: avrg_diff_conditions and interaction_condition_time.

Each top‑level result category element is a list with the fields:

dotplot

A ggplot object: the dot plot of over-representation results (clusterProfiler) for that category.

dotplot_nrows

Numeric scalar giving a suggested plot height (in rows / relative units) that prints nicely for the number of enriched terms shown.

ora_results

A nested list of the raw enrichment results, structured as:

cluster → database

For each cluster in the category, there is a sublist with one entry per database used in the enrichment. The value of each entry is either NA (no terms enriched for that cluster–database) or a data.frame as returned by clusterProfiler::enricher() for the enriched terms.

In summary, the full shape is:


list(
  time_effect_condition_<level1> = list(
    dotplot        = ggplot,
    dotplot_nrows  = numeric(1),
    ora_results    = list(
      <cluster_1> = list(<database_1> = NA|data.frame, ...),
      <cluster_2> = list(<database_1> = NA|data.frame, ...),
      ...
    )
  ),
  time_effect_condition_<level2> = list(...),
  avrg_diff_conditions          = list(...), # only if hits (integrated mode)
  interaction_condition_time    = list(...)  # only if hits (integrated mode)
)
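
To work with this structure, one option is to flatten all enriched entries into a single table. The sketch below runs on a mock result that imitates the documented shape; the category name, cluster names, and values are made up, and `collect_enriched()` is a hypothetical helper, not a function of the package.

```r
# Mock result imitating the documented shape (all values are invented)
mock_result <- list(
  time_effect_condition_control = list(
    dotplot       = NULL,  # a ggplot object in a real result
    dotplot_nrows = 2,
    ora_results   = list(
      cluster_1 = list(
        KEGG  = data.frame(ID = "Glycolysis", p.adjust = 0.01),
        GO_BP = NA
      ),
      cluster_2 = list(KEGG = NA, GO_BP = NA)
    )
  )
)

# Hypothetical helper: stack every data.frame entry, labeled by its origin
collect_enriched <- function(result) {
  rows <- list()
  for (category in names(result)) {
    ora <- result[[category]]$ora_results
    for (cluster in names(ora)) {
      for (db in names(ora[[cluster]])) {
        entry <- ora[[cluster]][[db]]
        # NA marks "no enriched terms"; only data.frames carry results
        if (is.data.frame(entry)) {
          rows[[length(rows) + 1]] <- data.frame(
            category = category, cluster = cluster, database = db,
            entry, stringsAsFactors = FALSE
          )
        }
      }
    }
  }
  if (length(rows) == 0) return(NULL)
  do.call(rbind, rows)
}

enriched <- collect_enriched(mock_result)
```

Testing with `is.data.frame()` rather than `is.na()` avoids warnings, because `is.na()` applied to a data.frame returns a matrix instead of a single logical value.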