Process flow - varianceThreshold
The process General feature selection (generalFeatureSelection) includes a single process, the variance threshold selector (varianceThreshold). This process for selecting covariates relates to neither the target feature nor the regressor. It is more general than the other feature selection functions, and is therefore positioned before the covariates (x variables) are linked to any target feature (y variable). The position of the process in the chain is indicated in the schematic flow chart below.
|____SpectralData
| |____filter
| | |____singlefilter
| | |____multiFilter
| |____dataSetSplit
| | |____spectralInfoEnhancement
| | | |____scatterCorrection
| | | |____standardisation
| | | |____derivatives
| | | |____decompose
| | |____generalFeatureSelection
| | | |____varianceThreshold
Introduction
A flat (constant signal) spectrum carries a minimum of information, while a spectrum with distinct troughs and peaks carries abundant information. Ignoring the chemometric target, the variance of a spectral signal can be used as a quick method for reducing the number of bands (covariates) by discarding those that carry the least information. This is exactly what the variance threshold selector method implemented in the process flow does. Because the method requires neither a target feature nor a regressor, it is more general (but also cruder) than other covariate selection methods. It is thus placed in its own sub-category, generalFeatureSelection. The other covariate (or feature) selection methods, which require either a target or both a target and a regressor for selecting covariates, cannot be applied in combination with each other. The generalFeatureSelection method of variance threshold can, however, always be applied as an initial feature selection, also when applying a more specific selector (under specificFeatureSelection in the json commands).
Compared to the original scikit-learn (sklearn) variance threshold selector, the threshold in the process flow can be parameterised in three different ways:
- by giving a fraction that directly relates to the variation (where the variation depends on the original spectra themselves as well as on the preprocessing) (e.g. 0.02),
- by giving an integer for discarding a fixed number of bands (e.g. 5),
- by giving an integer followed by a percent sign [%] for discarding a percentage of the input bands (e.g. 50%).
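The three parameterisations can be sketched in Python as follows. This is a minimal illustration, not the actual code of the process flow: the helper name bands_to_keep, the band labels and variance values, the rounding down of the percentage, and the rule that a band is kept only when its variance is strictly above a fractional threshold are all assumptions.

```python
from typing import Dict, List, Union


def bands_to_keep(variances: Dict[str, float],
                  threshold: Union[str, int, float]) -> List[str]:
    """Hypothetical helper: return the bands that survive the threshold."""
    # Rank the bands from lowest to highest variance.
    ranked = sorted(variances, key=variances.get)

    if isinstance(threshold, str) and threshold.strip().endswith("%"):
        # "50%": discard that percentage of the input bands (rounded down).
        percent = float(threshold.strip().rstrip("%"))
        n_discard = int(len(ranked) * percent / 100)
        return ranked[n_discard:]

    if isinstance(threshold, int):
        # 5: discard a fixed number of the lowest-variance bands.
        return ranked[threshold:]

    # 0.02: keep the bands whose variance exceeds the given value.
    limit = float(threshold)
    return [band for band in ranked if variances[band] > limit]


# Illustrative variances only; not taken from any real dataset.
example = {"b450": 0.010, "b550": 0.030, "b650": 0.020, "b750": 0.040}

print(bands_to_keep(example, "50%"))  # percentage of bands
print(bands_to_keep(example, 1))      # fixed number of bands
print(bands_to_keep(example, 0.02))   # direct variance limit
```

Dispatching on the type of the value mirrors how the setting would arrive from a json command file, where 5 is read as an integer, 0.02 as a float and "50%" as a string.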
The most common approach when applying variance thresholds for reducing the number of covariates is to first standardise the covariates, e.g. by autoscaling or with a MinMaxScaler. In the process flow the covariates can be autoscaled in the standardisation step. In the variance threshold process you can either use no scaler (None) or apply a MinMaxScaler. But as discussed in the standardisation step, this kind of rescaling tends to amplify the noise, and thus also the risk of selecting the noisiest bands rather than those carrying relevant information (see illustration in Figure 1).
To help you choose a threshold, set the argument onlyShowVarianceList to true; the process flow then stops after listing the variance of all covariates:
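The effect of min-max rescaling on band variance can be illustrated with a small, self-contained sketch. The two bands and their values are invented for the illustration, and the helper min_max_scale is a hypothetical stand-in that rescales one band at a time to the range 0 to 1; it is not the scaler used by the process flow.

```python
from statistics import pvariance


def min_max_scale(values):
    """Rescale one band to the range [0, 1] (hypothetical stand-in)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Invented values for two bands across five samples.
signal_band = [0.10, 0.30, 0.50, 0.70, 0.90]      # large, real variation
noise_band = [0.200, 0.202, 0.199, 0.201, 0.198]  # tiny fluctuations only

raw = {
    "signal": pvariance(signal_band),
    "noise": pvariance(noise_band),
}
scaled = {
    "signal": pvariance(min_max_scale(signal_band)),
    "noise": pvariance(min_max_scale(noise_band)),
}

print("raw variances:   ", raw)
print("scaled variances:", scaled)
```

Without rescaling, the variance of the signal band is several orders of magnitude larger than that of the noise band. After rescaling, both bands span the full range from 0 to 1 and end up with practically the same variance, so a variance threshold can no longer tell them apart.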
"generalFeatureSelection": {
  "apply": true,
  "varianceThreshold": {
    "apply": true,
    "onlyShowVarianceList": true,
    "scaler": "MinMaxScaler",
    "threshold": "50%"
  }
}
The response in this example lists covariates denoted with a d followed by the wavelength in nanometers, which is how the process flow labels derivatives:
band (variance)
d870 (0.011)
d430 (0.012)
d710 (0.015)
d670 (0.018)
d750 (0.019)
d790 (0.020)
d630 (0.022)
d830 (0.023)
d590 (0.027)
d470 (0.029)
d550 (0.034)
d510 (0.037)
The response is sorted by the variance of each covariate. Inspect the variances, set the threshold (50% in the example below) and change onlyShowVarianceList to false to run the thresholding of covariates:
"generalFeatureSelection": {
  "apply": true,
  "varianceThreshold": {
    "apply": true,
    "onlyShowVarianceList": false,
    "scaler": "MinMaxScaler",
    "threshold": "50%"
  }
}
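As a worked example, the 50% threshold can be applied by hand to the variance list shown earlier. The sketch below assumes that the percentage is counted over the input bands, rounded down, and that the bands with the lowest variance are the ones discarded; the actual rounding and tie handling in the process flow may differ.

```python
# Variances copied from the onlyShowVarianceList response (lowest first).
listing = [
    ("d870", 0.011), ("d430", 0.012), ("d710", 0.015), ("d670", 0.018),
    ("d750", 0.019), ("d790", 0.020), ("d630", 0.022), ("d830", 0.023),
    ("d590", 0.027), ("d470", 0.029), ("d550", 0.034), ("d510", 0.037),
]

percent = 50  # the "50%" threshold from the example

# Assumption: the number of bands to discard is rounded down.
n_discard = len(listing) * percent // 100

discarded = [band for band, _ in listing[:n_discard]]
kept = [band for band, _ in listing[n_discard:]]

print("discarded:", discarded)
print("kept:     ", kept)
```

Under these assumptions, half of the twelve bands are discarded and the six bands from d630 upwards in the list are retained.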
To skip the MinMaxScaler set scaler to None:
"generalFeatureSelection": {
  "apply": true,
  "varianceThreshold": {
    "apply": true,
    "onlyShowVarianceList": false,
    "scaler": "None",
    "threshold": "50%"
  }
}
The results of the two argument settings above are illustrated in the middle row of Figure 1, which also illustrates variance selection from the original spectral signals (top row) and from a PCA decomposition (bottom row). Each selection is done both without (left columns) and with (right columns) the MinMaxScaler.