## Introduction
OmicSelector 2.0 is a toolkit for high-dimensional biomarker discovery, built on rigorous machine learning methodology (leakage-free preprocessing, nested cross-validation, and feature-selection stability analysis) to support scientifically valid results.
## Quick Start (5 minutes)

### Installation
```r
# Install the remotes helper if needed, then install from GitHub
install.packages("remotes")
remotes::install_github("kstawiski/OmicSelector")

# Load the package
library(OmicSelector)
```

### Basic Workflow
#### Step 1: Create Pipeline

```r
# Load your data (samples in rows, with one column holding the outcome)
data <- read.csv("expression.csv")

# Create the pipeline
pipeline <- OmicPipeline$new(
  data = data,
  target = "outcome",
  positive = "Case"
)
```

#### Step 2: Build Graph Learner
The GraphLearner encapsulates all preprocessing within the CV folds, so imputation, scaling, and feature filtering are re-fit on each training fold:
```r
# Impute → Scale → Filter → Model (all inside CV)
learner <- pipeline$create_graph_learner(
  filter = "anova",  # feature selection method
  model = "ranger",  # random forest
  n_features = 20,   # select the top 20 features
  scale = TRUE       # standardize features
)
```

#### Step 3: Run Nested Cross-Validation
```r
# Create the benchmark service with nested CV
service <- BenchmarkService$new(
  task = pipeline,
  outer_folds = 5,  # evaluation folds
  inner_folds = 3   # feature selection folds
)

# Add the learner and run
service$add_learner(learner)
result <- service$run()
```
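The object returned by `run()` wraps an mlr3 `BenchmarkResult` (Step 4 below accesses it as `result$benchmark_result`). Assuming that structure and probability predictions, outer-fold performance can be inspected with standard mlr3 measures; this is a sketch, not guaranteed OmicSelector API:

```r
library(mlr3)

# Mean AUC across the 5 outer folds (assumes the learner predicts probabilities)
result$benchmark_result$aggregate(msr("classif.auc"))

# Per-fold scores, useful for spotting unstable outer folds
result$benchmark_result$score(msr("classif.auc"))
```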
#### Step 4: Check Stability

```r
# Compute feature selection stability from the benchmark result
stability <- compute_stability_from_resample(
  result$benchmark_result,
  all_features = pipeline$get_feature_names()
)
print(stability)
#> Nogueira Stability Index: 0.78
#> Interpretation: Good - Reasonably stable feature selection
```
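For context, the index reported above is the stability measure of Nogueira et al. (JMLR 2018). It compares the observed across-fold variance of each feature's selection indicator with the variance expected by chance at the same average subset size:

$$\hat{\Phi} = 1 - \frac{\frac{1}{d}\sum_{f=1}^{d} s_f^2}{\frac{\bar{k}}{d}\left(1 - \frac{\bar{k}}{d}\right)}, \qquad s_f^2 = \frac{M}{M-1}\,\hat{p}_f\left(1 - \hat{p}_f\right),$$

where $d$ is the total number of features, $M$ the number of selection runs (folds), $\hat{p}_f$ the fraction of runs that selected feature $f$, and $\bar{k}$ the mean number of selected features. Values near 1 indicate stable selection, values near 0 are consistent with chance, and negative values indicate systematic instability.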
#### Step 5: Generate Report

```r
# Create a TRIPOD+AI-compliant report
generate_tripod_report(
  results = result,
  output_file = "my_analysis_report.html"
)
```

## Multi-Omics Analysis
OmicSelector 2.0 supports multi-omics data with automatic feature namespacing:
```r
# Prepare multi-omics data as a named list
multi_data <- list(
  rna = rna_expression,     # gene expression
  mirna = mirna_expression, # miRNA expression
  prot = protein_levels     # proteomics
)

# The target column should live in exactly ONE modality
multi_data$rna$outcome <- clinical_outcome

# Create the pipeline - features are namespaced (rna::gene1, mirna::mir21)
pipeline <- OmicPipeline$new(
  data = multi_data,
  target = "outcome",
  positive = "Case"
)

# Check modalities
pipeline$get_modality_info()
#>   modality n_features n_samples has_target
#> 1      rna       5000       200       TRUE
#> 2    mirna        800       200      FALSE
#> 3     prot       1500       200      FALSE
```
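Namespaced names flow through the rest of the workflow, so selected biomarkers remain traceable to their source modality. For example, the `get_feature_names()` accessor used in Step 4 returns the prefixed identifiers (output illustrative):

```r
# Feature names carry their modality prefix (illustrative output)
head(pipeline$get_feature_names(), 3)
#> [1] "rna::gene1"   "rna::gene2"   "mirna::mir21"
```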
## Key Concepts

### Why Nested Cross-Validation?
Standard CV leaks information when feature selection happens before the split:
```
WRONG (leaky):
1. Select features on ALL data   ← Leakage!
2. Split into train/test
3. Train model
4. Evaluate (overly optimistic)

RIGHT (OmicSelector):
1. Split into outer folds
   For each outer fold:
2.   Split the training data into inner folds
     For each inner fold:
3.     Select features on the inner training data only
4.     Train model
5.   Aggregate the best features
6.   Fit the final model on the outer training data
7.   Evaluate on the outer test fold (unbiased)
```
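Steps 2 and 3 of the workflow above implement exactly this pattern on top of mlr3. As a standalone illustration of the same idea in plain mlr3 (a sketch, not OmicSelector API; `tsk("sonar")` is a built-in demo task, and the mlr3verse and ranger packages are assumed to be installed):

```r
library(mlr3verse)

task <- tsk("sonar")  # demo stand-in for your omics data

# Preprocessing lives inside the learner, so it is re-fit on every training fold
graph <- po("scale") %>>%
  po("filter", flt("anova"), filter.nfeat = to_tune(5, 30)) %>>%
  lrn("classif.ranger", predict_type = "prob")

# Inner CV chooses the number of selected features on training data only
at <- auto_tuner(
  tuner = tnr("grid_search"),
  learner = as_learner(graph),
  resampling = rsmp("cv", folds = 3),  # inner folds
  measure = msr("classif.auc"),
  term_evals = 10
)

# Outer CV yields the unbiased performance estimate
rr <- resample(task, at, rsmp("cv", folds = 5))  # outer folds
rr$aggregate(msr("classif.auc"))
```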
## Next Steps
- Phase 5 Features: try GOF filters, AutoXAI, and stability ensembles (see the Phase 5 Advanced Features vignette)
- Advanced Usage: explore custom filter combinations
- External Validation: use frozen preprocessing to score new data (see the sketch below)
- TRIPOD+AI Compliance: review the generated reports for publication
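Frozen preprocessing means that parameters learned on the training data (scaling means/SDs, the selected feature set) are reused, never re-estimated, when scoring an external cohort. A minimal sketch in plain mlr3 (not OmicSelector-specific API; it reuses the demo objects from the nested-CV sketch above):

```r
library(mlr3verse)

task <- tsk("sonar")  # demo stand-in for the training cohort
glrn <- as_learner(
  po("scale") %>>%
    po("filter", flt("anova"), filter.nfeat = 20) %>>%
    lrn("classif.ranger", predict_type = "prob")
)
glrn$train(task)  # scaling parameters and the selected feature set are now frozen

external <- task$data(rows = 1:10)  # stand-in for an external cohort
external$Class <- NULL              # the outcome is unknown at prediction time
glrn$predict_newdata(external)      # applies the frozen pipeline to the new data
```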
## Phase 5 Quick Preview

```r
# GOF filters for sparse/zero-inflated data
learner <- pipeline$create_graph_learner(
  filter = "gof_ks",  # Kolmogorov-Smirnov filter
  model = "ranger",
  n_features = 30
)

# AutoXAI with correlation warnings
xai <- xai_pipeline(learner, task, cor_threshold = 0.7)
plot_xai_importance(xai)

# Bootstrap stability ensemble
ensemble <- create_stability_ensemble(preset = "default", n_bootstrap = 100)
ensemble$fit(task, seed = 42)
stable_features <- ensemble$get_feature_importance(30)

# Bayesian hyperparameter tuning
autotuner <- make_autotuner_glmnet(task, n_evals = 20)
autotuner$train(task)
```

## Getting Help
- GitHub Issues: https://github.com/kstawiski/OmicSelector/issues
- Documentation: https://biostat.umed.pl/OmicSelector/