Chemometric modelling: 10 specific feature selection

Process flow - specificFeatureSelection

Specific feature selection (specificFeatureSelection) offers the most fine-tuned and advanced methods for selecting covariates. This also means that these methods require more computing power. It is thus advisable to reduce the number of covariates entering this step in the process flow.

There are three (3) specific feature selection methods available: univariate selection, permutation selection and recursive feature elimination (RFE). Only one of them can be applied in each model formulation.

If a specific feature selection is applied, it is the last step before the regression modelling.

|____SpectralData
| |____filter
| | |____singlefilter
| | |____multiFilter
| |____dataSetSplit
| | |____spectralInfoEnhancement
| | | |____scatterCorrection
| | | |____standardisation
| | | |____derivatives
| | | |____decompose
| | |____generalFeatureSelection
| | | |____varianceThreshold
| | |____targetFeatureExtract
| | | |____removeOutliers
| | | |____regressorExtract
| | | | |____specificFeatureAgglomeration
| | | | | |____wardClustering
| | | | |____specificFeatureSelection
| | | | | |____univariateSelection
| | | | | |____permutationSelector
| | | | | |____RFE

Introduction

Univariate Feature Selection

Univariate feature selection uses an F-test as default for calculating p-values (univariate scores) for each covariate. Model fitting for calculating the scores is done against the target feature. You have to define the number of covariate features to retain a-priori (parameter n_features). To apply the univariate feature selection as part of the process flow, edit the command file thus:

"specificFeatureSelection": {
    "apply": true,
    "univariateSelection": {
      "apply": false,
      "SelectKBest": {
        "apply": true,
        "n_features": 4
      }
    }
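
Under the hood, this step maps to scikit-learn's SelectKBest with F-test scoring (the SelectKBest key in the command file is the scikit-learn class name). The sketch below shows the equivalent stand-alone call; the data in X and y are illustrative placeholders, not part of the framework:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Illustrative placeholder data: 100 samples with 20 spectral covariates
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.random(100)  # target feature, e.g. total nitrogen [N]

# The F-test scores each covariate against the target; retain the 4 best
selector = SelectKBest(score_func=f_regression, k=4)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of retained covariates
print(selector.pvalues_)                   # univariate p-values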

Figure 1 shows the outcomes of selecting 4 covariates from meancentred spectra (left), derivatives (middle) and PCA decomposed bands (right). In all cases total nitrogen [N] was set as the target feature. The top row shows the selection of covariates for the Ordinary Least Square (OLS) regressor; the bottom row for the Random Forest regressor.

Figure 1. Univariate selection of 4 covariates from meancentred spectra (left), derivatives (middle) and PCA decomposed bands (right); the top row shows the selection for the Ordinary Least Square (OLS) regressor and the bottom row for the Random Forest regressor.

Permutation Selection

Permutation feature importance measures the strength of the contribution of each covariate to a fitted model's statistical performance. Each covariate is randomly shuffled in turn, and the resulting change in model statistical performance defines its strength. This generic method of evaluating covariates can be applied multiple times with any regressor and can thus be used for covariate selection for any combination of target feature and regressor.
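
The sketch below illustrates the mechanics using scikit-learn's permutation_importance function; the Random Forest regressor and the data in X and y are illustrative assumptions, not the framework's internals:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Illustrative placeholder data (not part of the framework)
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.random(100)

# Fit any regressor; permutation importance is model-agnostic
regressor = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each covariate n_repeats times and record the performance change
result = permutation_importance(regressor, X, y, n_repeats=10, random_state=0)

# Retain the 4 covariates whose shuffling degrades performance the most
n_features = 4
selected = np.argsort(result.importances_mean)[::-1][:n_features]
print(selected)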

Figure 2 shows the outcomes of selecting 4 covariates from meancentred spectra (left), derivatives (middle) and PCA decomposed bands (right). In all cases total nitrogen [N] was set as the target feature. The top row shows the selection of covariates for the Ordinary Least Square (OLS) regressor; the bottom row for the Random Forest regressor.

Figure 2. Permutation selection of 4 covariates from meancentred spectra (left), derivatives (middle) and PCA decomposed bands (right); the top row shows the selection for the Ordinary Least Square (OLS) regressor and the bottom row for the Random Forest regressor.

RFE

Feature ranking with recursive feature elimination.

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained, either through a specific attribute (such as coef_ or feature_importances_) or through a callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
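
The sketch below illustrates the procedure with scikit-learn's RFE class; the OLS estimator and the data in X and y are illustrative assumptions:

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Illustrative placeholder data (not part of the framework)
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.random(100)

# Drop one covariate per iteration (step=1) until 4 remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=4, step=1)
selector.fit(X, y)

print(selector.support_)  # boolean mask of retained covariates
print(selector.ranking_)  # rank 1 = selected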

Figure 3. Covariate selection using recursive feature elimination (RFE).

RFECV

RFECV combines recursive feature elimination with cross-validation: instead of defining the number of features to retain a-priori, the optimal number of covariates is tuned by cross-validated selection.
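
A minimal sketch of the cross-validated variant using scikit-learn's RFECV class; the estimator and data are again illustrative assumptions:

import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Illustrative placeholder data (not part of the framework)
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.random(100)

# No n_features_to_select: 5-fold cross-validation decides how many to keep
selector = RFECV(estimator=LinearRegression(), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)  # cross-validated optimal number of covariates
print(selector.support_)     # boolean mask of retained covariates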

Figure 4. Covariate selection using RFECV.