
The main purpose of OmicSelector is to give you a set of candidate features for further validation in a biomarker study. The package performs feature selection first. In the next step the sets of features are tested in a process called "benchmarking". In benchmarking we test all of those sets of features (biomarkers) using various data-mining (machine learning) methods. Based on the average performance of the sets in cross-validation or holdout validation (testing on the test set and/or validation set) we can suggest which of the signatures (sets of features) has the greatest potential in further validation.

Please note that the methods presented below are those available in the GUI (via web browser). They can be easily extended by users with intermediate R knowledge. Please refer to our extension manuals (coming soon).

Feature selection methods

ID Description
No: 1
all
Get all features (all features starting with 'hsa').
No: 2
sig, sigtop, sigtopBonf, sigtopHolm, topFC, sigSMOTE, sigtopSMOTE, sigtopBonfSMOTE, sigtopHolmSMOTE, topFCSMOTE
Selects features significantly differentially expressed between classes by performing an unpaired t-test, with and without correction for multiple testing. We get: sig - all significant (adjusted p-value less than or equal to 0.05) miRNAs compared using an unpaired t-test after the Benjamini-Hochberg procedure (BH, false discovery rate); sigtop - sig but limited to your preferred number of features (the most significant features, sorted by p-value); sigtopBonf - uses the Bonferroni instead of the BH correction; sigtopHolm - uses the Holm-Bonferroni instead of the BH correction; topFC - selects the preferred number of features based on the decreasing absolute value of fold change in the differential analysis. An illustrative R sketch of this filter is given below the table.
All the methods are also checked on the dataset balanced with SMOTE (Synthetic Minority Oversampling Technique) - the methods whose names are appended with SMOTE. A SMOTE balancing sketch is also given below the table.
No: 3
fcsig, fcsigSMOTE
Features significant in the DE analysis using an unpaired t-test and whose absolute log2FC is greater than 1, i.e. features that are both significant and up- or down-regulated with a high magnitude. FC - fold change, DE - differential expression analysis.
No: 4
cfs, cfsSMOTE, cfs_sig, cfsSMOTE_sig
Correlation-based feature selection (CFS) - a heuristic algorithm selecting features that are highly correlated with the (binary) class and lowly correlated with one another. It explores the search space in a best-first manner until the stopping criteria are met.
No: 5
classloop
Classifier loop - performs multiple classification procedures using various algorithms (with embedded feature ranking) and various performance metrics. The final feature selection is done by combining the results. Modeling methods used: support vector machines, linear discriminant analysis, random forest and nearest shrunken centroid. Features are selected based on ROC AUC and assessed in k-fold cross-validation according to the documentation. As this is time-consuming, we do not perform it on the SMOTE-balanced dataset.
No: 6
classloopSMOTE
Application of classloop on balanced dataset (with SMOTE).
No: 7
classloop_sig
Application of classloop but only on the features which are significant in DE.
No: 8
classloopSMOTE_sig
Application of classloop on balanced training set and only on the features which are significant in DE (after balancing).
No: 9
fcfs
An algorithm similar to CFS, though it explores the search space in a greedy forward-search manner (adding the single most attractive feature at a time, until such an addition no longer improves the set's overall quality). Based on Wang et al. 2005 and documented here.
No: 10
fcfsSMOTE
Application of fcfs on balanced training set.
No: 11
fcfs_sig
Application of fcfs on features significant in DE.
No: 12
fcfsSMOTE_sig
Application of fcfs on balanced dataset and on features significant in DE (after balancing).
No: 13
fwrap
A decision tree algorithm and forward search strategy documented here.
No: 14
fwrapSMOTE
Application of fwrap on balanced training set.
No: 15
fwrap_sig
Application of fwrap on features significant in DE.
No: 16
fwrapSMOTE_sig
Application of fwrap on balanced dataset and on features significant in DE (after balancing).
No: 17
AUC_MDL
Feature ranking based on ROC AUC and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below.
No: 18
SU_MDL
Feature ranking based on symmetrical uncertainty and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below.
No: 19
CorrSF_MDL
Feature ranking based on the CFS algorithm with forward search and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below.
No: 20
AUC_MDLSMOTE
Feature ranking based on ROC AUC and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Performed on the training set balanced with SMOTE.
No: 21
SU_MDLSMOTE
Feature ranking based on symmetrical uncertainty and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Performed on the training set balanced with SMOTE.
No: 22
CorrSF_MDLSMOTE
Feature ranking based on the CFS algorithm with forward search and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Performed on the training set balanced with SMOTE.
No: 23
AUC_MDL_sig
Feature ranking based on ROC AUC and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Only features significant in DE are allowed.
No: 24
SU_MDL_sig
Feature ranking based on symmetrical uncertainty and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Only features significant in DE are allowed.
No: 25
CorrSF_MDL_sig
Feature ranking based on the CFS algorithm with forward search and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Only features significant in DE are allowed.
No: 26
AUC_MDLSMOTE_sig
Feature ranking based on ROC AUC and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Performed on the training set balanced with SMOTE. Only features significant in DE are allowed.
No: 27
SU_MDLSMOTE_sig
Feature ranking based on symmetrical uncertainty and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Performed on the training set balanced with SMOTE. Only features significant in DE are allowed.
No: 28
CorrSF_MDLSMOTE_sig
Feature ranking based on the CFS algorithm with forward search and the minimal description length (MDL) discretization algorithm documented here. After the ranking, the number of features is limited as set in the options below. Performed on the training set balanced with SMOTE. Only features significant in DE are allowed.
No: 29
bounceR-full, bounceR-stability
A component-wise-boosting-based algorithm that selects optimal features over multiple iterations of single-feature model construction. See the source here. bounceR-stability keeps the most stable features. The wrapper methods implemented here leverage component-wise boosting as the weak learner.
No: 30
bounceR-full_SMOTE, bounceR-stability_SMOTE
A component-wise-boosting-based algorithm that selects optimal features over multiple iterations of single-feature model construction. See the source here. bounceR-stability keeps the most stable features. The wrapper methods implemented here leverage component-wise boosting as the weak learner. Performed on the training set balanced with SMOTE.
No: 31
bounceR-full_SIG, bounceR-stability_SIG
A component-wise-boosting-based algorithm that selects optimal features over multiple iterations of single-feature model construction. See the source here. bounceR-stability keeps the most stable features. The wrapper methods implemented here leverage component-wise boosting as the weak learner. Only features significant in DE are allowed.
No: 32
bounceR-full_SIGSMOTE, bounceR-stability_SIGSMOTE
A component-wise-boosting-based algorithm that selects optimal features over multiple iterations of single-feature model construction. See the source here. bounceR-stability keeps the most stable features. The wrapper methods implemented here leverage component-wise boosting as the weak learner. Only features significant in DE are allowed. Performed on the training set balanced with SMOTE.
No: 33
RandomForestRFE
Recursively eliminates features from the feature space based on the ranking from a random forest classifier (retrained with resampling after each elimination). Details are available here.
No: 34
RandomForestRFESMOTE
Recursively eliminates features from the feature space based on the ranking from a random forest classifier (retrained with resampling after each elimination). Details are available here. Performed on the training set balanced with SMOTE.
No: 35
RandomForestRFE_sig
Recursively eliminates features from the feature space based on the ranking from a random forest classifier (retrained with resampling after each elimination). Details are available here. Only features significant in DE are allowed.
No: 36
RandomForestRFESMOTE_sig
Recursively eliminates features from the feature space based on the ranking from a random forest classifier (retrained with resampling after each elimination). Details are available here. Only features significant in DE are allowed. Performed on the training set balanced with SMOTE.
No: 37
GeneticAlgorithmRF
Uses the genetic algorithm principle to search for an optimal subset of the feature space. It uses an internally implemented random forest model and 10-fold cross-validation to assess the performance of the "chromosomes" in each generation. Details are available here.
No: 38
GeneticAlgorithmRFSMOTE
Uses the genetic algorithm principle to search for an optimal subset of the feature space. It uses an internally implemented random forest model and 10-fold cross-validation to assess the performance of the "chromosomes" in each generation. Details are available here. Performed on the training set balanced with SMOTE.
No: 39
GeneticAlgorithmRF_sig
Uses the genetic algorithm principle to search for an optimal subset of the feature space. It uses an internally implemented random forest model and 10-fold cross-validation to assess the performance of the "chromosomes" in each generation. Details are available here. Only features significant in DE are allowed.
No: 40
GeneticAlgorithmRFSMOTE_sig
Uses the genetic algorithm principle to search for an optimal subset of the feature space. It uses an internally implemented random forest model and 10-fold cross-validation to assess the performance of the "chromosomes" in each generation. Details are available here. Only features significant in DE are allowed. Performed on the training set balanced with SMOTE.
No: 41
SimulatedAnnealingRF
Simulated Annealing - explores the feature space by randomly modifying a given feature subset and evaluating the classification performance with the new attributes to check whether the changes were beneficial. It is a global search method that makes small random changes (i.e. perturbations) to an initial candidate solution. A random forest is used as the evaluation model in this method as well. Details are available here.
No: 42
SimulatedAnnealingRFSMOTE
Simulated Annealing - explores the feature space by randomly modifying a given feature subset and evaluating the classification performance with the new attributes to check whether the changes were beneficial. It is a global search method that makes small random changes (i.e. perturbations) to an initial candidate solution. A random forest is used as the evaluation model in this method as well. Details are available here. Performed on the training set balanced with SMOTE.
No: 43
SimulatedAnnealingRF_sig
Simulated Annealing - explores the feature space by randomly modifying a given feature subset and evaluating the classification performance with the new attributes to check whether the changes were beneficial. It is a global search method that makes small random changes (i.e. perturbations) to an initial candidate solution. A random forest is used as the evaluation model in this method as well. Details are available here. Only features significant in DE are allowed.
No: 44
SimulatedAnnealingRFSMOTE_sig
Simulated Annealing - explores the feature space by randomly modifying a given feature subset and evaluating the classification performance with the new attributes to check whether the changes were beneficial. It is a global search method that makes small random changes (i.e. perturbations) to an initial candidate solution. A random forest is used as the evaluation model in this method as well. Details are available here. Only features significant in DE are allowed. Performed on the training set balanced with SMOTE.
No: 45
Boruta
Boruta - utilizes the random forest algorithm to iteratively remove features proved to be less relevant than random variables. Details are available in the paper by Kursa et al. 2010 or this blog post. An illustrative call is sketched below the table.
No: 46
BorutaSMOTE
Boruta - utilizes the random forest algorithm to iteratively remove features proved to be less relevant than random variables. Details are available in the paper by Kursa et al. 2010 or this blog post. Performed on the training set balanced with SMOTE.
No: 47
spFSR
spFSR - feature selection and ranking by simultaneous perturbation stochastic approximation. This is an algorithm based on pseudo-gradient-descent stochastic optimisation with the Barzilai-Borwein method for step-size and gradient-estimation optimisation. Details are available in the paper by Zeren et al. 2018.
No: 48
spFSRSMOTE
spFSR - feature selection and ranking by simultaneous perturbation stochastic approximation. This is an algorithm based on pseudo-gradient-descent stochastic optimisation with the Barzilai-Borwein method for step-size and gradient-estimation optimisation. Details are available in the paper by Zeren et al. 2018. Performed on the training set balanced with SMOTE.
No: 49
varSelRF, varSelRFSMOTE
varSelRF - recursively eliminates features using random forest feature scores, seeking to minimize the out-of-bag classification error. Details are available in the paper by Díaz-Uriarte et al. 2006. Performed on the unbalanced training set as well as on the set balanced with SMOTE.
No: 50
Wx, WxSMOTE
Wx - a deep neural network-based (deep learning) feature (gene) selection algorithm. We use 2 hidden layers with 16 hidden neurons. Details are available in the paper by Park et al. 2019. Performed on the unbalanced training set as well as on the set balanced with SMOTE.
No: 51
Mystepwise_glm_binomial, Mystepwise_sig_glm_binomial
Stepwise variable selection procedure (with iterations between the 'forward' and 'backward' steps) for generalized linear models with logit link function (i.e. logistic regression). We use p = 0.05 as the threshold for both entry (SLE) and stay (SLS). Details are available here. Performed on all features of the training set as well as on the features initially selected in DE (significant in DE).
No: 52
Mystepwise_glm_binomialSMOTE, Mystepwise_sig_glm_binomialSMOTE
Stepwise variable selection procedure (with iterations between the 'forward' and 'backward' steps) for generalized linear models with logit link function (i.e. logistic regression). We use p = 0.05 as the threshold for both entry (SLE) and stay (SLS). Details are available here. Performed on all features of the training set as well as on the features initially selected in DE (significant in DE), after balancing the training set with SMOTE.
No: 53
stepAIC, stepAICsig
Here we perform stepwise model selection by AIC (Akaike Information Criterion) based on logistic regression. Details are available here. Performed on all features of the training set as well as on the features initially selected in DE (significant in DE).
No: 54
stepAIC_SMOTE, stepAICsig_SMOTE
Here we perform stepwise model selection by AIC (Akaike Information Criterion) based on logistic regression. Details are available here. Performed on all features of the training set as well as on the features initially selected in DE (significant in DE), after balancing the training set with SMOTE.
No: 55
iteratedRFECV, iteratedRFETest
Iterated RFE tested in cross-validation and on the test set (watch out for bias!). See the source here.
No: 56
iteratedRFECV_SMOTE, iteratedRFETest_SMOTE
Iterated RFE tested in cross-validation and on the test set (watch out for bias!). See the source here. Performed after balancing the training set with SMOTE.
No: 57
LASSO, LASSO_SMOTE
Feature selection based on a LASSO (Least Absolute Shrinkage and Selection Operator) model with alpha = 1, which penalizes with the L1-norm; with 10-fold cross-validation. See the source here. Performed on the original training set and after balancing the training set with SMOTE. An illustrative glmnet sketch is given below the table.
No: 58
ElasticNet, ElasticNet_SMOTE
Feature selection based on elastic net, with the value of alpha tuned through a line search. See the source here. Performed on the original training set and after balancing the training set with SMOTE.
No: 59
stepLDA, stepLDA_SMOTE
Forward/backward variable selection (both directions) for linear discriminant analysis. See the source here. Performed on the original training set and after balancing the training set with SMOTE.
No: 60
feseR_filter.corr, feseR_gain.inf, feseR_matrix.corr, feseR_combineFS_RF, feseR_filter.corr_SMOTE, feseR_gain.inf_SMOTE, feseR_matrix.corr_SMOTE, feseR_combineFS_RF_SMOTE
A set of feature selection methods embedded in the feseR package published by Perez-Riverol et al. All default parameters are used, but mincorr is set to 0.2. See the paper here. Performed on the original training set and after balancing the training set with SMOTE.

Those methods can be applied via the GUI or via the OmicSelector_OmicSelector() function in the R package.
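For users extending OmicSelector in R, a minimal sketch of the 'sig'/'sigtop' style filter from row 2 is shown below. It is written in base R and is purely illustrative: the data frame train_set, its Class column and the helper name select_sig are assumptions made for this example, not OmicSelector's internal code.

```r
# Illustrative "sig"/"sigtop"-style filter (row 2): unpaired t-test per feature,
# Benjamini-Hochberg adjustment, optional cut to the top-k most significant features.
# `train_set` is an assumed data frame with miRNA columns starting with "hsa"
# and a binary factor column `Class`.
select_sig <- function(train_set, top_k = NULL, adjust = "BH") {
  features <- grep("^hsa", colnames(train_set), value = TRUE)
  pvals <- sapply(features, function(f) {
    t.test(train_set[[f]] ~ train_set$Class)$p.value   # unpaired t-test between classes
  })
  padj <- p.adjust(pvals, method = adjust)              # "BH", "bonferroni" or "holm"
  sig <- names(sort(padj[padj <= 0.05]))                # all significant features ("sig")
  if (!is.null(top_k)) sig <- head(sig, top_k)          # limit to top-k ("sigtop")
  sig
}
```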
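The SMOTE-balanced variants used throughout the table can be reproduced with any SMOTE implementation; the sketch below uses the smotefamily package as an example (an assumption; OmicSelector's internal balancing code may differ).

```r
# Illustrative balancing of the training set with SMOTE before feature selection.
# smotefamily is used only as an example implementation; `train_set` and `Class`
# are the same assumed inputs as in the sketch above.
library(smotefamily)

balance_with_smote <- function(train_set) {
  X <- train_set[, grep("^hsa", colnames(train_set))]   # numeric predictors only
  sm <- SMOTE(X = X, target = train_set$Class, K = 5)   # synthesize minority-class cases
  balanced <- sm$data                                    # predictors plus a "class" column
  balanced$class <- factor(balanced$class)
  balanced
}
```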
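Boruta-based selection (rows 45-46) maps directly onto the Boruta package; a minimal sketch, again using the assumed example data frame train_set:

```r
# Illustrative Boruta run (rows 45-46): features are repeatedly compared against
# shuffled "shadow" copies and dropped when less relevant than random variables.
library(Boruta)

set.seed(1)
bor <- Boruta(Class ~ ., data = train_set, maxRuns = 100)
getSelectedAttributes(bor, withTentative = FALSE)        # confirmed features only
```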
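LASSO selection (row 57) can be reproduced with glmnet; the sketch below picks lambda by 10-fold cross-validation and keeps the features with non-zero coefficients. It is an illustrative equivalent under the same assumed inputs, not OmicSelector's exact call.

```r
# Illustrative LASSO selection (row 57): alpha = 1 gives the L1 penalty,
# lambda is chosen by 10-fold cross-validation.
library(glmnet)

x <- as.matrix(train_set[, grep("^hsa", colnames(train_set))])
y <- train_set$Class
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)
coefs  <- coef(cv_fit, s = "lambda.min")                          # coefficients at the best lambda
setdiff(rownames(coefs)[as.vector(coefs) != 0], "(Intercept)")    # selected features
```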

Benchmarking (data-mining modelling methods)

The GUI offers several data-mining algorithms which can be used in benchmarking:

ID Description
glm Logistic regression (generalized linear model with binomial family and logit link).
mlp Multilayer perceptron (MLP) - fully connected feedforward neural network with 1 hidden layer and logistic activation function. Details: code, package.
mlpML Multilayer perceptron (MLP) - fully connected feedforward neural network with up to 3 hidden layers and logistic activation function. Details: code, package.
svmRadial Support vector machines with radial basis function kernel. Details: code, package.
svmLinear Support vector machines with linear kernel. Details: code, package.
rf Random forest. Details: code, package.
C5.0 C5.0 decision trees and rule-based models. Details: code, package.
rpart CART decision trees with modulation of complexity parameter. Details: code, package.
rpart2 CART decision trees with modulation of max tree depth. Details: code, package.
ctree Conditional inference trees. Details: code, package.
xgbTree eXtreme gradient boosting (note: this is a time-consuming method). Details: code, package.

However, the OmicSelector_benchmark() function works using caret, meaning that every model from the [caret list of methods](https://topepo.github.io/caret/available-models.html) can be applied (assuming that the required packages are installed; see the reference of OmicSelector_benchmark() for more details).

Note that the package performs a random search of hyperparameters. The best set of hyperparameters is chosen based on performance on the test set (holdout validation) or strictly on the training set (using cross-validation).
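A minimal caret sketch of this benchmarking step is shown below; train_set, its Class column and the three example feature names are assumptions for illustration, not OmicSelector defaults.

```r
# Illustrative benchmarking of one candidate signature with caret:
# random hyperparameter search evaluated in 10-fold cross-validation.
library(caret)

selected <- c("hsa.miR.21", "hsa.miR.155", "hsa.miR.192")  # example signature

ctrl <- trainControl(method = "cv", number = 10,
                     search = "random",                    # random hyperparameter search
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(1)
fit <- train(x = train_set[, selected],
             y = train_set$Class,                          # factor with valid level names
             method = "rf",                                # any model from caret's list
             metric = "ROC",
             tuneLength = 10,                              # number of random candidates
             trControl = ctrl)
fit$bestTune                                               # hyperparameters picked in CV
```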