The `preprocess_rna_seq_data()` function performs essential preprocessing steps for raw RNA-seq counts. This includes creating a `DGEList` object, normalizing the counts using the default TMM (Trimmed Mean of M-values) normalization via the `edgeR::calcNormFactors` function, and applying the `voom` transformation from the `limma` package to obtain log-transformed counts per million (logCPM) with associated precision weights. If you require a different normalization method, you can supply your own custom normalization function.
Arguments
- splineomics
An S3 object of class `SplineOmics` that must contain the following elements:
data
: A matrix of the omics dataset, with feature names optionally as row headers (genes as rows, samples as columns).meta
: A dataframe containing metadata corresponding to thedata
. The dataframe must include a 'Time' column and a column specified by thecondition
.design
: A character string representing the design formula for the limma analysis (e.g.,'~ 1 + Phase*X + Reactor'
).spline_params
: A list of spline parameters used in the analysis. This can include:spline_type
: A character string specifying the type of spline. Must be either'n'
for natural cubic splines or'b'
for B-splines.dof
: An integer specifying the degrees of freedom. Required for both natural cubic splines and B-splines.degree
: An integer specifying the degree of the spline (for B-splines only).knots
: Positions of the internal knots (for B-splines).bknots
: Boundary knots (for B-splines).
use_array_weights
: Boolean flag indicating if the robust fit strategy to deal with heteroscedasticity should be used or not. If set to NULL, then this is handeled implicitly based on the result of the Levene test. If this test is significant for at least 10 then the robust strategy is used. The robust strategy uses the function voomWithQualityWeights for RNA-seq data instead of the normal voom function. For other, non-count-based data, the function limma::arrayWeights is used instead, combined with setting the robust argument to TRUE for the limma::eBayes function. In summary, the strategy employed by those functions is to downweights samples with higher variance. This can be neccessary, because linear models have the assumption of homoscedasticity, which means that the variance is (approx.) the same across all datapoints where the linear model is fitted. If this is violated, then the resulting p-values cannot be trusted (common statistical wisdom).bp_cfg
: A named numeric vector specifying the parallelization configuration, with expected names `"n_cores"` and `"blas_threads"`.This controls how many **R worker processes** (`n_cores`) and how many **BLAS/OpenBLAS threads per process** (`blas_threads`) should be used during parallel computation.
If `bp_cfg` is `NULL`, missing, or any of its required fields is `NA`, both `n_cores` and `blas_threads` default to `1`. This effectively disables parallelization and avoids oversubscription of CPU threads.
- normalize_func
An optional normalization function. If provided, this function will be used to normalize the `DGEList` object. If not provided, TMM normalization (via `edgeR::calcNormFactors`) will be used by default. Must take as input the y of: y <- edgeR::DGEList(counts = raw_counts) and output the y with the normalized counts.