R6 class that encapsulates the mlr3 pipeline for biomarker discovery. It prevents data leakage by running all preprocessing, feature selection, and model training inside the cross-validation folds.
Details
OmicPipeline is the central class for OmicSelector 2.0. It replaces the legacy script-based approach with a rigorous, composable, and reproducible architecture.
Key features:
- All preprocessing (imputation, scaling) occurs inside CV folds
- Feature selection is embedded in the inner loop of nested CV
- Oversampling (SMOTE/ROSE) is applied only to training data per fold
- Factory methods generate configured GraphLearners
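To illustrate what "inside CV folds" means for a step such as scaling, here is a minimal base R sketch. It is not OmicPipeline's internal code: the data, the fold assignment, and the helper `scale_within_fold()` are all hypothetical. The scaling parameters are estimated on the training rows of a fold and then reused unchanged on the held-out rows.

```r
# Illustrative sketch in base R; not OmicSelector's implementation.
set.seed(1)
x <- matrix(rnorm(60 * 3), nrow = 60, ncol = 3)  # hypothetical feature matrix
folds <- sample(rep(1:5, length.out = nrow(x)))  # hypothetical fold assignment

# Hypothetical helper: scale fold k without looking at its test rows
scale_within_fold <- function(x, folds, k) {
  train <- x[folds != k, , drop = FALSE]
  test  <- x[folds == k, , drop = FALSE]
  # Estimate centering and scaling parameters on the training rows only
  mu    <- colMeans(train)
  sigma <- apply(train, 2, sd)
  list(
    train = sweep(sweep(train, 2, mu, "-"), 2, sigma, "/"),
    # The test rows reuse the training parameters
    test  = sweep(sweep(test, 2, mu, "-"), 2, sigma, "/")
  )
}

fold1 <- scale_within_fold(x, folds, k = 1)
```

The held-out rows never contribute to `mu` or `sigma`, so their scaled values are generally not exactly centered. That is expected and is the point of fitting per fold.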
Methods
Method new()
Create a new OmicPipeline object
Usage
OmicPipeline$new(
  data,
  target,
  positive = NULL,
  patient_id = NULL,
  batch = NULL,
  id = "omic_task"
)
Arguments
data
Either a data.frame or a named list of data.frames for multi-omics. For multi-omics, use a named list: list(rna = rna_data, mirna = mirna_data). Features will be namespaced: rna::gene1, mirna::hsa-miR-21.
target
Name of the target column
positive
Positive class label (for binary classification)
patient_id
Optional column name for patient grouping (prevents leakage)
batch
Optional column name for batch information
id
Unique identifier for this pipeline
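The namespacing described for the data argument can be pictured with a small base R sketch. This is an assumption about the naming scheme only, not the class's actual code: the toy data frames and the helper `namespace_features()` are hypothetical, and the rows are assumed to be aligned across modalities.

```r
# Hypothetical toy inputs; column names and values are made up.
rna_data   <- data.frame(gene1 = c(1.2, 0.7), gene2 = c(3.1, 2.9))
mirna_data <- data.frame(`hsa-miR-21` = c(5.0, 4.4), check.names = FALSE)
omics <- list(rna = rna_data, mirna = mirna_data)

# Hypothetical helper: prefix each column with "<modality>::"
namespace_features <- function(omics) {
  prefixed <- lapply(names(omics), function(modality) {
    df <- omics[[modality]]
    names(df) <- paste0(modality, "::", names(df))
    df
  })
  # cbind() on data frames leaves names such as "rna::gene1" unmodified
  do.call(cbind, prefixed)
}

combined <- namespace_features(omics)
names(combined)
```

Prefixing by modality keeps features with the same name in different modalities distinct after the tables are combined.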
Method create_graph_learner()
Create a GraphLearner with proper leakage prevention
Usage
OmicPipeline$create_graph_learner(
  filter = "anova",
  model = "ranger",
  n_features = 20,
  impute_method = "median",
  scale = TRUE,
  oversample = NULL,
  batch_correct = FALSE
)
Arguments
filter
Filter method name (e.g., "anova", "mrmr", "correlation")
model
Model type (e.g., "ranger", "glmnet", "svm")
n_features
Number of features to select (or proportion if < 1)
impute_method
Imputation method ("median", "mean", "sample")
scale
Logical, whether to scale features
oversample
Oversampling method (NULL, "smote", "rose")
batch_correct
Logical or character. If TRUE, adds FrozenComBat batch correction using the batch column specified in pipeline creation. If a character string, uses that as the batch column name. Default: FALSE.
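The dual meaning of n_features (an absolute count, or a proportion when the value is below 1) can be sketched as follows. The helper `resolve_n_features()` is hypothetical and not part of the documented API. The rounding rule (floor, with a minimum of one feature) and the cap at the number of available features are assumptions, since the documentation does not state them.

```r
# Hypothetical helper; rounding and capping rules are assumptions.
resolve_n_features <- function(n_features, n_available) {
  stopifnot(n_features > 0, n_available >= 1)
  if (n_features < 1) {
    # Interpreted as a proportion: keep at least one feature
    k <- max(1L, as.integer(floor(n_features * n_available)))
  } else {
    # Interpreted as an absolute count
    k <- as.integer(n_features)
  }
  # Never request more features than are available
  min(k, as.integer(n_available))
}

n_abs  <- resolve_n_features(20, 500)   # absolute count
n_prop <- resolve_n_features(0.1, 500)  # proportion of available features
```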
Method create_auto_fselector()
Create an AutoFSelector for inner-loop feature selection tuning
Usage
OmicPipeline$create_auto_fselector(
  learner,
  filter_values = c(5, 10, 20, 50),
  inner_resampling = NULL,
  measure = NULL
)
Method benchmark()
Run benchmark with proper nested cross-validation
Method get_modality_info()
Get modality information for multi-omics data
Examples
if (FALSE) { # \dontrun{
  # Create pipeline from data
  pipeline <- OmicPipeline$new(
    data = my_data,
    target = "outcome",
    positive = "Case"
  )
  # Create a graph learner with feature selection
  learner <- pipeline$create_graph_learner(
    filter = "anova",
    model = "ranger",
    n_features = 20
  )
  # Run nested cross-validation
  result <- pipeline$benchmark(learner, outer_folds = 5, inner_folds = 3)
} # }