
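Applies FrozenComBat batch correction within a cross-validation fold: correction parameters are fitted on the training indices only and then applied to the test indices. Because nothing is estimated from the test samples, this avoids the data leakage that occurs when batch correction is run on the full dataset before splitting, and it is the correct way to apply batch correction in ML pipelines.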

Usage

apply_frozen_combat_cv(
  data,
  batch,
  train_indices,
  test_indices = NULL,
  covariates = NULL,
  parametric = TRUE
)

Arguments

data

Full data matrix (samples x features)

batch

Full batch vector

train_indices

Row indices for training set

test_indices

Row indices for test set (optional)

covariates

Optional covariates data.frame (will be subset by indices)

parametric

Use parametric empirical Bayes (default TRUE)

Value

List with:

- corrected_train: Batch-corrected training data
- corrected_test: Batch-corrected test data (if test_indices provided)
- frozen_combat: The fitted FrozenComBat object (for external validation)

Details

## IMPORTANT: Proper Usage in Nested CV

For nested cross-validation, use either this function or the PipeOp:

```r
# Option 1: Use PipeOp in mlr3pipelines (recommended)
po_combat <- create_frozen_combat_pipeop(batch_col = "batch")
graph <- po_combat

# Option 2: Manual application in custom CV loop
for (fold in folds) {
  result <- apply_frozen_combat_cv(
    data = features,
    batch = batch_vector,
    train_indices = fold$train,
    test_indices = fold$test
  )
  # Use result$corrected_train and result$corrected_test
}
```
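The manual loop in Option 2 expects `folds` to be a list whose elements carry `train` and `test` index vectors, but it does not show how to build one. A minimal base-R sketch follows. `make_batch_folds()` is a hypothetical helper, not part of this package, and stratifying by batch is an assumption made here so that every training set contains samples from every batch, which frozen parameters presumably need in order to correct the matching test samples.

```r
# Hypothetical helper (not part of the package): build k folds, stratified by
# batch, in the list(train = ..., test = ...) shape used by the manual loop.
make_batch_folds <- function(batch, k = 5, seed = 42) {
  set.seed(seed)
  n <- length(batch)
  fold_id <- integer(n)
  # Within each batch, shuffle the samples and deal them out across the folds
  for (b in unique(batch)) {
    idx <- which(batch == b)
    idx <- idx[sample.int(length(idx))]
    fold_id[idx] <- rep_len(seq_len(k), length(idx))
  }
  lapply(seq_len(k), function(i) {
    list(train = which(fold_id != i), test = which(fold_id == i))
  })
}

batch_vector <- rep(c("A", "B"), each = 20)
folds <- make_batch_folds(batch_vector, k = 5)

# Check that every training set contains every batch
all_batches_in_train <- vapply(
  folds,
  function(fold) setequal(unique(batch_vector[fold$train]), unique(batch_vector)),
  logical(1)
)
```

If a batch has fewer samples than `k`, some test folds will contain no samples from that batch. That is harmless for the correction step but worth keeping in mind when interpreting per-fold results.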

## WRONG: Do NOT do this!

```r
# WRONG: Applying ComBat to all data before CV causes leakage!
corrected_all <- sva::ComBat(all_data, batch)  # LEAKAGE!
cv_result <- run_cv(corrected_all)             # Inflated performance
```
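To see why fitting on all data leaks information, the following base-R sketch may help. It does not use ComBat itself: a per-batch mean stands in for the fitted correction parameters, and the data, indices, and the `fit_batch_means()` helper are all illustrative assumptions. The point is only that parameters estimated on the full dataset change when the test samples change, whereas parameters estimated on the training rows do not.

```r
set.seed(1)
n <- 40
batch <- rep(c("A", "B"), each = 20)
x <- rnorm(n) + ifelse(batch == "B", 2, 0)  # one feature with a batch shift
test_idx  <- c(1:4, 21:24)
train_idx <- setdiff(seq_len(n), test_idx)

# Simplified stand-in for a batch-correction fit: one mean per batch
fit_batch_means <- function(values, batches) tapply(values, batches, mean)

means_train_only <- fit_batch_means(x[train_idx], batch[train_idx])
means_all_data   <- fit_batch_means(x, batch)

# Perturb only the test samples, then refit both ways
x_shifted <- x
x_shifted[test_idx] <- x_shifted[test_idx] + 10

means_train_only_shifted <- fit_batch_means(x_shifted[train_idx], batch[train_idx])
means_all_data_shifted   <- fit_batch_means(x_shifted, batch)

# Train-only parameters ignore the test samples; all-data parameters do not
train_only_unchanged <- isTRUE(all.equal(means_train_only, means_train_only_shifted))
all_data_changed     <- !isTRUE(all.equal(means_all_data, means_all_data_shifted))
```

Here `train_only_unchanged` and `all_data_changed` should both be `TRUE`: the test samples influence the all-data estimates, which is the route by which test information reaches the model.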

Examples

if (FALSE) { # \dontrun{
set.seed(42)
data <- matrix(rnorm(200), nrow = 40, ncol = 5)
batch <- rep(c("A", "B"), each = 20)

# 5-fold CV
folds <- split(1:40, rep(1:5, each = 8))

for (i in seq_along(folds)) {
  test_idx <- folds[[i]]
  train_idx <- setdiff(1:40, test_idx)

  result <- apply_frozen_combat_cv(
    data = data,
    batch = batch,
    train_indices = train_idx,
    test_indices = test_idx
  )

  # Train model on result$corrected_train
  # Evaluate on result$corrected_test
}
} # }