OmicSelector: Available methods of feature selection and benchmarking.
Konrad Stawiski
Department of Biostatistics and Translational Research, Medical University of Lodz, Lodz, Poland (https://biostat.umed.pl), konrad.stawiski@umed.lodz.pl
Marcin Kaszkowiak
Department of Biostatistics and Translational Research, Medical University of Lodz, Lodz, Poland (https://biostat.umed.pl)

Source: vignettes/metody.Rmd
The main purpose of OmicSelector is to provide a set of candidate features (biomarkers) for further validation in a biomarker study. The package performs feature selection first. In the next step, the selected sets of features are tested in a process called "benchmarking". During benchmarking, every set of features is evaluated with various data-mining (machine learning) methods. Based on the average performance of the sets in cross-validation or holdout validation (testing on the test set and/or validation set), we can suggest which of the signatures (sets of features) has the greatest potential in further validation.
Please note that the methods presented below are those available in the GUI (via web browser). Users with intermediate R knowledge can extend them with ease. Please refer to our extension manuals (coming soon).
Feature selection methods
ID | Description |
---|---|
No. 1: all | Get all features (all features starting with 'hsa'). |
No. 2: sig, sigtop, sigtopBonf, sigtopHolm, topFC, sigSMOTE, sigtopSMOTE, sigtopBonfSMOTE, sigtopHolmSMOTE, topFCSMOTE | Selects features significantly differentially expressed between classes using an unpaired t-test, with and without correction for multiple testing. We get: sig - all significant (adjusted p-value less than or equal to 0.05) miRNAs compared using an unpaired t-test with the Benjamini-Hochberg procedure (BH, false discovery rate); sigtop - sig limited to your preferred number of features (the most significant features sorted by p-value); sigtopBonf - uses the Bonferroni correction instead of BH; sigtopHolm - uses the Holm-Bonferroni correction instead of BH; topFC - selects the preferred number of features by decreasing absolute value of the fold change in differential analysis (see the sketch below the table). All these methods are also run on the dataset balanced with SMOTE (Synthetic Minority Oversampling TEchnique) - the variants whose names are appended with SMOTE. |
No. 3: fcsig, fcsigSMOTE | Features significant in DE analysis using an unpaired t-test and with an absolute log2FC greater than 1, i.e. features that are both significant and up- or down-regulated with a higher magnitude. FC - fold change, DE - differential expression analysis. |
No. 4: cfs, cfsSMOTE, cfs_sig, cfsSMOTE_sig | Correlation-based feature selection (CFS) - a heuristic algorithm selecting features that are highly correlated with the class (binary) and lowly correlated with one another. It explores the search space in a best-first manner until the stopping criteria are met. |
No. 5: classloop | Classifier loop - performs multiple classification procedures using various algorithms (with embedded feature ranking) and various performance metrics. The final feature selection is done by combining the results. Modeling methods used: support vector machines, linear discriminant analysis, random forest and nearest shrunken centroid. Features are selected based on the ROC AUC and assessed in k-fold cross-validation according to the documentation. As this takes time, we do not perform it on the SMOTEd dataset. |
No. 6: classloopSMOTE | Application of classloop on the balanced dataset (with SMOTE). |
No. 7: classloop_sig | Application of classloop, but only on the features significant in DE. |
No. 8: classloopSMOTE_sig | Application of classloop on the balanced training set and only on the features significant in DE (after balancing). |
No. 9: fcfs | An algorithm similar to CFS, but exploring the search space in a greedy forward-search manner (adding one, most attractive, feature at a time, until such an addition no longer improves the set's overall quality). Based on Wang et al. 2005 and documented here. |
No. 10: fcfsSMOTE | Application of fcfs on the balanced training set. |
No. 11: fcfs_sig | Application of fcfs on features significant in DE. |
No. 12: fcfsSMOTE_sig | Application of fcfs on the balanced dataset and on features significant in DE (after balancing). |
No. 13: fwrap | A decision tree algorithm with a forward search strategy, documented here. |
No. 14: fwrapSMOTE | Application of fwrap on the balanced training set. |
No. 15: fwrap_sig | Application of fwrap on features significant in DE. |
No. 16: fwrapSMOTE_sig | Application of fwrap on the balanced dataset and on features significant in DE (after balancing). |
No. 17: AUC_MDL | Feature ranking based on ROC AUC and the minimal description length (MDL) discretization algorithm, documented here. After the ranking, the number of features is limited as set in the options below. |
No. 18: SU_MDL | Feature ranking based on symmetrical uncertainty and the minimal description length (MDL) discretization algorithm, documented here. After the ranking, the number of features is limited as set in the options below. |
No. 19: CorrSF_MDL | Feature ranking based on the CFS algorithm with forward search and the minimal description length (MDL) discretization algorithm, documented here. After the ranking, the number of features is limited as set in the options below. |
No. 20: AUC_MDLSMOTE | Same as AUC_MDL, but performed on the training set balanced with SMOTE. |
No. 21: SU_MDLSMOTE | Same as SU_MDL, but performed on the training set balanced with SMOTE. |
No. 22: CorrSF_MDLSMOTE | Same as CorrSF_MDL, but performed on the training set balanced with SMOTE. |
No. 23: AUC_MDL_sig | Same as AUC_MDL, but only features significant in DE are allowed. |
No. 24: SU_MDL_sig | Same as SU_MDL, but only features significant in DE are allowed. |
No. 25: CorrSF_MDL_sig | Same as CorrSF_MDL, but only features significant in DE are allowed. |
No. 26: AUC_MDLSMOTE_sig | Same as AUC_MDL, but performed on the training set balanced with SMOTE and only on features significant in DE. |
No. 27: SU_MDLSMOTE_sig | Same as SU_MDL, but performed on the training set balanced with SMOTE and only on features significant in DE. |
No. 28: CorrSF_MDLSMOTE_sig | Same as CorrSF_MDL, but performed on the training set balanced with SMOTE and only on features significant in DE. |
No. 29: bounceR-full, bounceR-stability | A component-wise-boosting-based algorithm selecting optimal features over multiple iterations of single-feature model construction. See the source here. bounceR-stability keeps the most stable features. The wrapper methods implemented here leverage component-wise boosting as a weak learner. |
No. 30: bounceR-full_SMOTE, bounceR-stability_SMOTE | Same as No. 29, but performed on the training set balanced with SMOTE. |
No. 31: bounceR-full_SIG, bounceR-stability_SIG | Same as No. 29, but only features significant in DE are allowed. |
No. 32: bounceR-full_SIGSMOTE, bounceR-stability_SIGSMOTE | Same as No. 29, but only features significant in DE are allowed and the training set is balanced with SMOTE. |
No. 33: RandomForestRFE | Recursively eliminates features from the feature space based on the ranking from a random forest classifier (retrained with resampling after each elimination). Details are available here. |
No. 34: RandomForestRFESMOTE | Same as RandomForestRFE, but performed on the training set balanced with SMOTE. |
No. 35: RandomForestRFE_sig | Same as RandomForestRFE, but only features significant in DE are allowed. |
No. 36: RandomForestRFESMOTE_sig | Same as RandomForestRFE, but only features significant in DE are allowed and the training set is balanced with SMOTE. |
No. 37: GeneticAlgorithmRF | Uses the genetic algorithm principle to search for an optimal subset of the feature space. It uses an internally implemented random forest model and 10-fold cross-validation to assess the performance of the "chromosomes" in each generation. Details are available here. |
No. 38: GeneticAlgorithmRFSMOTE | Same as GeneticAlgorithmRF, but performed on the training set balanced with SMOTE. |
No. 39: GeneticAlgorithmRF_sig | Same as GeneticAlgorithmRF, but only features significant in DE are allowed. |
No. 40: GeneticAlgorithmRFSMOTE_sig | Same as GeneticAlgorithmRF, but only features significant in DE are allowed and the training set is balanced with SMOTE. |
No. 41: SimulatedAnnealingRF | Simulated annealing - explores the feature space by randomly modifying a given feature subset and evaluating classification performance with the new attributes to check whether the changes were beneficial. It is a global search method that makes small random changes (i.e. perturbations) to an initial candidate solution. Here, a random forest is used as the evaluation model. Details are available here. |
No. 42: SimulatedAnnealingRFSMOTE | Same as SimulatedAnnealingRF, but performed on the training set balanced with SMOTE. |
No. 43: SimulatedAnnealingRF_sig | Same as SimulatedAnnealingRF, but only features significant in DE are allowed. |
No. 44: SimulatedAnnealingRFSMOTE_sig | Same as SimulatedAnnealingRF, but only features significant in DE are allowed and the training set is balanced with SMOTE. |
No. 45: Boruta | Boruta - uses the random forest algorithm to iteratively remove features shown to be less relevant than random (shadow) variables. Details are available in the paper by Kursa et al. 2010 or this blog post (see also the sketch below). |
No. 46: BorutaSMOTE | Same as Boruta, but performed on the training set balanced with SMOTE. |
No. 47: spFSR | spFSR - feature selection and ranking by simultaneous perturbation stochastic approximation (SPSA). The algorithm is based on pseudo-gradient descent stochastic optimization with the Barzilai-Borwein method for step-size and gradient estimation. Details are available in the paper by Yenice et al. 2018. |
No. 48: spFSRSMOTE | Same as spFSR, but performed on the training set balanced with SMOTE. |
No. 49: varSelRF, varSelRFSMOTE | varSelRF - recursively eliminates features using random forest feature scores, seeking to minimize the out-of-bag classification error. Details are available in the paper by Díaz-Uriarte et al. 2006. Performed on the unbalanced training set as well as on the set balanced with SMOTE. |
No. 50: Wx, WxSMOTE | Wx - a deep neural network-based (deep learning) feature (gene) selection algorithm. We use 2 hidden layers with 16 hidden neurons. Details are available in the paper by Park et al. 2019. Performed on the unbalanced training set as well as on the set balanced with SMOTE. |
No. 51: Mystepwise_glm_binomial, Mystepwise_sig_glm_binomial | Stepwise variable selection (iterating between 'forward' and 'backward' steps) for generalized linear models with a logit link function (i.e. logistic regression). We use p = 0.05 as the threshold for both entry (SLE) and stay (SLS). Details are available here. Performed on all features of the training set as well as only on features initially selected in DE (significant in DE). |
No. 52: Mystepwise_glm_binomialSMOTE, Mystepwise_sig_glm_binomialSMOTE | Same as No. 51, but after balancing the training set with SMOTE. |
No. 53: stepAIC, stepAICsig | Stepwise model selection by AIC (Akaike Information Criterion) based on logistic regression. Details are available here. Performed on all features of the training set as well as only on features initially selected in DE (significant in DE). |
No. 54: stepAIC_SMOTE, stepAICsig_SMOTE | Same as No. 53, but after balancing the training set with SMOTE. |
No. 55: iteratedRFECV, iteratedRFETest | Iterated RFE assessed in cross-validation and on the test set (watch out for bias!). See the source here. |
No. 56: iteratedRFECV_SMOTE, iteratedRFETest_SMOTE | Same as No. 55, but performed after balancing the training set with SMOTE. |
No. 57: LASSO, LASSO_SMOTE | Feature selection based on a LASSO (Least Absolute Shrinkage and Selection Operator) model with alpha = 1 (L1-norm penalty) and 10-fold cross-validation (see the sketch below). See the source here. Performed on the original training set and after balancing the training set with SMOTE. |
No. 58: ElasticNet, ElasticNet_SMOTE | Feature selection based on an elastic net with the value of alpha tuned through a line search. See the source here. Performed on the original training set and after balancing the training set with SMOTE. |
No. 59: stepLDA, stepLDA_SMOTE | Forward/backward variable selection (both directions) for linear discriminant analysis. See the source here. Performed on the original training set and after balancing the training set with SMOTE. |
No. 60: feseR_filter.corr, feseR_gain.inf, feseR_matrix.corr, feseR_combineFS_RF, feseR_filter.corr_SMOTE, feseR_gain.inf_SMOTE, feseR_matrix.corr_SMOTE, feseR_combineFS_RF_SMOTE | A set of feature selection methods embedded in the feseR package published by Perez-Riverol et al. All default parameters are used, except that mincorr is set to 0.2. See the paper here. Performed on the original training set and after balancing the training set with SMOTE. |
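To make the filter-type entries above more concrete, below is a minimal R sketch of the univariate approach behind methods no. 2 and no. 3. It is an illustration only, not OmicSelector's exact implementation; it assumes a data frame `train` with log2-scale expression columns whose names start with 'hsa' and a binary factor column `Class` (all object names are hypothetical).

```r
# Illustration only (assumed data layout; not OmicSelector's internal code).
mirnas <- grep("^hsa", colnames(train), value = TRUE)

# Unpaired t-test and log2 fold change for every feature.
de <- do.call(rbind, lapply(mirnas, function(feat) {
  tt  <- t.test(train[[feat]] ~ train$Class)            # unpaired (Welch) t-test
  grp <- split(train[[feat]], train$Class)
  data.frame(feature = feat,
             p       = tt$p.value,
             log2FC  = mean(grp[[2]]) - mean(grp[[1]]))  # assumes log2-scale expression
}))

de$p_BH   <- p.adjust(de$p, method = "BH")          # Benjamini-Hochberg: sig, sigtop
de$p_Bonf <- p.adjust(de$p, method = "bonferroni")  # sigtopBonf
de$p_Holm <- p.adjust(de$p, method = "holm")        # sigtopHolm

n_top     <- 10                                     # the "preferred number of features"
de_sorted <- de[order(de$p), ]

sig    <- de$feature[de$p_BH <= 0.05]                              # "sig"
sigtop <- head(de_sorted$feature[de_sorted$p_BH <= 0.05], n_top)   # "sigtop"
topFC  <- head(de$feature[order(-abs(de$log2FC))], n_top)          # "topFC"
fcsig  <- de$feature[de$p_BH <= 0.05 & abs(de$log2FC) > 1]         # "fcsig" (method no. 3)
```

The SMOTE variants apply the same steps after oversampling the minority class of the training set (e.g. with an implementation such as smotefamily::SMOTE).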
These methods can be applied via the GUI or via the OmicSelector_OmicSelector() function in the R package.
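For readers who want to prototype individual approaches from the table outside the GUI, below are hedged sketches of two of them, Boruta (no. 45) and LASSO (no. 57), using the underlying packages directly. The objects `train`, `mirnas` and `Class` are the same assumed inputs as in the previous example; this is not the code used internally by OmicSelector_OmicSelector().

```r
# Illustration only - standalone use of the underlying packages, not OmicSelector internals.
library(Boruta)
library(glmnet)

set.seed(1)

# Boruta (no. 45): iteratively drops features that perform worse than their
# permuted "shadow" copies in a random forest.
bor <- Boruta(x = train[, mirnas], y = train$Class)
boruta_features <- getSelectedAttributes(bor, withTentative = FALSE)

# LASSO (no. 57): L1-penalized logistic regression (alpha = 1) with 10-fold
# cross-validation; features with non-zero coefficients at lambda.min are kept.
x <- as.matrix(train[, mirnas])
y <- train$Class
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)
coefs <- coef(cvfit, s = "lambda.min")
lasso_features <- setdiff(rownames(coefs)[which(coefs[, 1] != 0)], "(Intercept)")
```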
Benchmarking (data-mining modelling methods)
The GUI offers several data-mining algorithms that can be used in benchmarking:
ID | Description |
---|---|
glm | Logistic regression (generalized linear model with a binomial link function). |
mlp | Multilayer perceptron (MLP) - a fully connected feedforward neural network with 1 hidden layer and a logistic activation function. Details: code, package. |
mlpML | Multilayer perceptron (MLP) - a fully connected feedforward neural network with up to 3 hidden layers and a logistic activation function. Details: code, package. |
svmRadial | Support vector machine with a radial basis function kernel. Details: code, package. |
svmLinear | Support vector machine with a linear kernel. Details: code, package. |
rf | Random forest. Details: code, package. |
C5.0 | C5.0 decision trees and rule-based models. Details: code, package. |
rpart | CART decision trees with tuning of the complexity parameter. Details: code, package. |
rpart2 | CART decision trees with tuning of the maximum tree depth. Details: code, package. |
ctree | Conditional inference trees. Details: code, package. |
xgbTree | eXtreme gradient boosting (note: this is a time-consuming method). Details: code, package. |
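As an illustration of how one of the learners listed above can be fitted and evaluated with caret (the engine behind the benchmark), here is a minimal sketch for svmRadial. The data objects (`train`, `Class`, a candidate signature `sigtop`) are assumed as before, and the class labels are assumed to be valid R names (e.g. "Case"/"Control"); this is not the OmicSelector_benchmark() code itself.

```r
# Minimal caret sketch for one benchmarked learner (illustration only).
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit_svm <- train(x = train[, sigtop], y = train$Class,
                 method     = "svmRadial",          # any method ID from the table above works
                 metric     = "ROC",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl  = ctrl)

fit_svm$results   # cross-validated ROC AUC, sensitivity and specificity per candidate tuning
```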
However, the OmicSelector_benchmark() function works using caret, meaning that every model from the [caret list of available models](https://topepo.github.io/caret/available-models.html) can be applied (assuming that the required packages are installed; see the reference of OmicSelector_benchmark() for more details).
Note that the package performs a random search of hyperparameters. The best set of hyperparameters is chosen based on the performance on the test set (holdout validation) or strictly on the training set (using cross-validation).
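A minimal sketch of such a random hyperparameter search with caret is shown below (assumed objects as in the previous example; the number of sampled hyperparameter sets is arbitrary here).

```r
# Random hyperparameter search in caret (illustration of the idea, not OmicSelector internals).
library(caret)

set.seed(1)
rand_ctrl <- trainControl(method = "cv", number = 10, search = "random",
                          classProbs = TRUE, summaryFunction = twoClassSummary)

fit_xgb <- train(x = train[, sigtop], y = train$Class,
                 method     = "xgbTree",
                 metric     = "ROC",
                 tuneLength = 30,          # number of randomly sampled hyperparameter sets
                 trControl  = rand_ctrl)

fit_xgb$bestTune   # hyperparameter combination with the best cross-validated ROC AUC
```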