Main function of the package. It performs feature selection using multiple methods and creates formulas for benchmarking. Data are loaded from the working directory, and most of the output is written to files in the working directory; log and temporary files are placed in a `temp` subfolder created there. The package offers about 60 feature selection methods; the `m` parameter defines which of them this function runs. Brief notes on the methods are given in Details.

Usage

OmicSelector_OmicSelector(
  wd = getwd(),
  m = c(1:70),
  max_iterations = 10,
  code_path = system.file("extdata", "", package = "OmicSelector"),
  register_parallel = T,
  clx = NULL,
  stamp = as.numeric(Sys.time()),
  prefer_no_features = 11,
  conda_path = "/home/konrad/anaconda3/bin/conda",
  debug = F,
  timeout_sec = 172800,
  type = "auto"
)

Arguments

wd

Working directory with data (`mixed_train.csv`, `mixed_test.csv` and `mixed_validation.csv` as created by `OmicSelector_prepare_split` have to be present).

m

Methods of feature selection to be performed. This has to be a vector of integers between 1 and 70. For the meaning of each number, please see the vignette.

max_iterations

Maximum number of iterations in selected methods. Setting this too high may result in very long computing time.

code_path

The folder where the external Python scripts are placed (especially for the WxNet method). By default, the additional code shipped with the package is used.

register_parallel

Whether to use parallel processing to speed up computation. Setting it to FALSE may aid in debugging.

clx

This parameter may be used to pass an already registered computing cluster (created and registered with `doParallel` tools). This may lower the computing time by avoiding the need to register a new cluster.

stamp

A character vector or timestamp used for marking the output files.

prefer_no_features

Maximum number of miRNAs that can be selected by the tools if the method allows for that.

conda_path

Path to the `conda` binary used for executing Python scripts.

debug

Gives additional debug information (saves .rdata after feature selection is completed and prints formulas to the log).

timeout_sec

Timeout in seconds after which a method is terminated if it has not finished (the default, 172800, is 48 hours). It is useful for limiting long-running methods, so that you do not wait an eternity for the results.

type

Value forwarded as the 'mode' parameter to OmicSelector_differential_expression_ttest, which many feature selection methods depend on. Note that if a 'var_type.txt' file exists in the working directory, it takes precedence over the value passed to the function (this is by design, for the GUI). Please refer to the OmicSelector_differential_expression_ttest manual to understand how 'mode' works.
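Several of the arguments above imply preconditions that are cheap to verify before a long run. The following is a minimal base R sketch of such a check, not part of OmicSelector: the helper name `preflight_check` is hypothetical, and it assumes that `var_type.txt` stores the mode on its first line, which should be confirmed against the package source.

```r
# Hypothetical pre-flight helper (not an OmicSelector function).
preflight_check <- function(wd = getwd(), m = 1:70, type = "auto") {
  # The three split files have to be present in the working directory.
  required <- c("mixed_train.csv", "mixed_test.csv", "mixed_validation.csv")
  missing_files <- required[!file.exists(file.path(wd, required))]

  # `m` has to be a vector of whole numbers between 1 and 70.
  m_ok <- is.numeric(m) && length(m) > 0 && !anyNA(m) &&
    all(m == round(m)) && all(m >= 1 & m <= 70)

  # Assumption: var_type.txt holds the mode on its first line. If the file
  # exists, it takes precedence over the `type` argument.
  var_type_file <- file.path(wd, "var_type.txt")
  effective_type <- if (file.exists(var_type_file)) {
    trimws(readLines(var_type_file, n = 1, warn = FALSE))
  } else {
    type
  }

  list(missing_files = missing_files, m_ok = m_ok,
       effective_type = effective_type)
}

# Usage on a throwaway directory with one split file deliberately missing.
demo_wd <- file.path(tempdir(), "omicselector_demo")
dir.create(demo_wd, showWarnings = FALSE)
file.create(file.path(demo_wd, c("mixed_train.csv", "mixed_test.csv")))
writeLines("placeholder_mode", file.path(demo_wd, "var_type.txt"))  # placeholder value

res <- preflight_check(demo_wd, m = c(1, 5, 70), type = "auto")
print(res)
```

Here `res$missing_files` reports the absent validation file, and `res$effective_type` shows the file-based value overriding `type = "auto"`.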

Value

The list of selected formulas. Note that, given the purpose of this package, `OmicSelector_merge_formulas` may be a better way to collect the output of the processes run by this function.
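To illustrate what a selected formula might look like, the sketch below builds one from a set of feature names using `reformulate` from the `stats` package. The feature names and the response name `Class` are illustrative assumptions, and the exact structure of the returned list should be checked against the package itself.

```r
# Hypothetical feature names and response variable, for illustration only.
selected <- c("miR_a", "miR_b", "miR_c")

# Build a model formula with the class label on the left-hand side.
f <- reformulate(termlabels = selected, response = "Class")
print(f)

# A formula like this can be passed to a modelling function,
# e.g. glm(f, family = binomial, data = train_data) with your own data.
```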

Details

- Sig = miRNAs with p-value <0.05 after BH correction (DE using t-test)
- Fcsig = sig + absolute log2FC filter (included if abs. log2FC>1)
- Cfs = Correlation-based Feature Selection for Machine Learning (more: https://www.cs.waikato.ac.nz/~mhall/thesis.pdf)
- Classloop = classification using different classification algorithms (classifiers) with embedded feature selection and different schemes for performance validation (more: https://rdrr.io/cran/Biocomb/man/classifier.loop.html)
- Fcfs = CFS algorithm with forward search (https://rdrr.io/cran/Biocomb/man/select.forward.Corr.html)
- MDL methods = minimal description length (MDL) discretization algorithm with different methods of feature ranking or feature selection (AUC, SU, CorrSF) (more: https://rdrr.io/cran/Biocomb/man/select.process.html)
- bounceR = genetic algorithm with componentwise boosting (more: https://www.statworx.com/ch/blog/automated-feature-selection-using-bouncer/)
- RandomForestRFE = recursive feature elimination using random forest, with resampling to assess the performance (more: https://topepo.github.io/caret/recursive-feature-elimination.html#resampling-and-external-validation)
- GeneticAlgorithmRF (more: https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html)
- SimulatedAnnealing = makes small random changes (i.e. perturbations) to an initial candidate solution (more: https://topepo.github.io/caret/feature-selection-using-simulated-annealing.html)
- Boruta (more: https://www.jstatsoft.org/article/view/v036i11/v36i11.pdf)
- spFSR = simultaneous perturbation stochastic approximation (SPSA-FSR) (more: https://arxiv.org/abs/1804.05589)
- varSelRF = variable elimination from random forest using the out-of-bag error as the minimization criterion, successively eliminating the least important variables (with importance as returned by random forest) (more: https://www.ncbi.nlm.nih.gov/pubmed/16398926)
- WxNet = a neural network-based feature selection algorithm for transcriptomic data (more: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642261/)
- Step = backward stepwise feature selection based on logistic regression (GLM, family = binomial) using the AIC criterion (stepAIC) and functions from the My.stepwise package (https://cran.r-project.org/web/packages/My.stepwise/index.html)

For more detailed definitions please see the tutorial vignette.
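The Sig and Fcsig definitions given in this section can be sketched in base R as follows. This is an illustration on simulated data, not the package's own implementation: it assumes the values are already on a log2 scale (so that log2FC is a difference of group means), and all object names are hypothetical.

```r
set.seed(1)

# Simulated data: 40 samples, 20 features, values treated as log2 expression.
n <- 40
group <- rep(c("Control", "Case"), each = n / 2)
x <- matrix(rnorm(n * 20), nrow = n,
            dimnames = list(NULL, paste0("feat", 1:20)))
# Plant one strongly differential feature.
x[group == "Case", "feat1"] <- x[group == "Case", "feat1"] + 3

# Per-feature Welch t-test between the two groups.
pvals <- apply(x, 2, function(v) {
  t.test(v[group == "Case"], v[group == "Control"])$p.value
})

# Benjamini-Hochberg correction.
padj <- p.adjust(pvals, method = "BH")

# log2 fold change as the difference of group means on the log2 scale.
log2fc <- colMeans(x[group == "Case", , drop = FALSE]) -
  colMeans(x[group == "Control", , drop = FALSE])

# Sig: adjusted p-value < 0.05; Fcsig: Sig plus |log2FC| > 1.
sig <- names(padj)[padj < 0.05]
fcsig <- names(padj)[padj < 0.05 & abs(log2fc) > 1]

print(sig)
print(fcsig)
```

By construction, Fcsig is always a subset of Sig, since it only adds the fold-change filter on top of the significance filter.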

Examples

# NOT RUN (skipped to speed up package checks, but this is a valid example for real projects):
# suppressMessages(library(foreach))
# suppressMessages(library(doParallel))
# suppressMessages(library(parallel))
# m <- 1:56 # which methods to run?
# cl <- makePSOCKcluster(5, useXDR = TRUE) # 5 worker processes
# doParallel::registerDoParallel(cl)
# iterations <- length(m)
# pb <- txtProgressBar(max = iterations, style = 3)
# progress <- function(n) setTxtProgressBar(pb, n)
# opts <- list(progress = progress) # progress callbacks are a doSNOW feature; doParallel may ignore this option
# foreach(i = m, .verbose = TRUE, .options.snow = opts) %dopar%
# {
# suppressMessages(library(OmicSelector))
# setwd("~/public/Projekty/KS/OmicSelector/vignettes") # change this to your working directory
# OmicSelector_OmicSelector(m = i, max_iterations = 1, stamp = "tutorial", debug = TRUE) # debug = TRUE gives more output
# }
# stopCluster(cl)