This post is part of a series on organising and analysing data from the Open Soil Spectral Library (OSSL). It is also the first post in a sub-series on how to apply Machine Learning (ML) for predicting soil properties from spectral data. The following posts deal with different aspects of data mining applying ML:
To run the scripts used in this post you need to set up a Python environment and clone or download the Python scripts from a GitHub repository (repo), as explained in the post Clone the OSSL python package.
Introduction
As the Open Soil Spectral Library (OSSL) contains thousands of individual samples, each with thousands of recorded reflectances in different wavelengths, plus additional physical and chemical soil properties, it is a suitable dataset for applying Machine Learning (ML) modelling. This post both illustrates how to apply ML for modelling soil properties from soil spectral data and serves as a hands-on manual for using the Python script OSSL_mlmodel.py.
Prerequisites
To follow the hands-on instructions, this post requires that you have completed the processing outlined in the posts on downloading and importing the OSSL spectral data. You must also have access to a Python interpreter with the packages matplotlib and scikit learn (sklearn) installed.
This post is more of a stand-alone explanation of how to set up processing using the python module OSSL_mlmodel; the post Run ossl-xspectre modules instead starts from the structure of a prepared example of OSSL-data, also accessible from GitHub.
Machine Learning
Machine Learning (ML) includes a range of methods for preparing, refining, selecting, aggregating, evaluating and modelling different dependent properties from a set of independent features (covariates). For the ML modelling of soil properties from spectral data I have built a predefined process-flow structure. The process-flow is governed by a command file using json syntax. By editing the json file you can choose which target features, regressors and steps to include in your own ML modelling.
Python Module OSSL_mlmodel.py
Running the OSSL_mlmodel.py script is similar to running the import script, and requires specification of the paths and names of 1) the OSSL data and 2) the command files that define what you want to model:
- rootpath: full path to folder with a downloaded OSSL zip file; parent folder to “sourcedatafolder”, “arrangeddatafolder”, and “jsonfolder”
- sourcedatafolder: subfolder under “rootpath” with the exploded content of the OSSL zip file (default = “data”)
- arrangeddatafolder: subfolder under “rootpath” where the imported (rearranged) OSSL data will be stored
- jsonfolder: subfolder under “rootpath” where the json model parameter files must be located
- projFN: the name of an existing json file that sequentially lists json model parameter files to run, must be directly under the “arrangeddatafolder”
- targetfeaturesymbols: the name of an existing json file that defines the symbolisation of the target features to plot
- targetfeaturetransforms: the name of an existing json file that defines the transformations of the target features
- createjsonparams: if set to true the script will create a template json file and exit
NOTE that in earlier versions (before November 2023), the projFN argument was a text (.txt) file; it is now a json (.json) file.
json specification file
All of the paths and names listed above must be specified in a json file, and the local path to this json file is the only parameter that is required when running the OSSL_mlmodel.py script. The json specification file for modelling the data over Sweden that were downloaded and then imported looks like this:
{
"rootpath": "/path/to/OSSL/Sweden/LUCAS",
"sourcedatafolder": "data",
"arrangeddatafolder": "arranged-data",
"jsonfolder": "json-ml-modeling",
"projFN": [
"ml-model.json"
],
"targetfeaturesymbols": "/path/to/targetfeaturesymbols.json",
"targetfeaturetransforms": "/path/to/targetfeaturetransforms.json",
"createjsonparams": false
}
The paths/names of the OSSL data are those that you set when you downloaded and exploded the data in the download post. Before you can model any data you must create 1) a json command file (or files) defining how to model the OSSL data, and 2) a json project file that specifies the name of this json command file (or files). The reason that the direct link to the command file(s) is not given is that the intermediate project file can link to any number of json command files. You can thus run multiple models for one and the same dataset, or run models for multiple datasets using a single project file and a single run. This setup also allows you to create a multi-project comparison.
The first time you use the script you must copy or create and then edit the json command file(s). The script can generate a template command file for you, or you can download an example (for the data over Sweden used in the previous posts) from a GitHub repo.
To generate a template set the rootpath and change the parameter createjsonparams to true.
{
"rootpath": "/path/to/OSSL/Sweden/LUCAS",
"sourcedatafolder": "data",
"arrangeddatafolder": "arranged-data",
"jsonfolder": "json-ml-modeling",
"projFN": [
"ml-model.json"
],
"targetfeaturesymbols": "/path/to/targetfeaturesymbols.json",
"targetfeaturetransforms": "/path/to/targetfeaturetransforms.json",
"createjsonparams": true
}
Run script
To run a script, open a terminal window. Change directory (cd) to where you downloaded the OSSL_mlmodel.py script (or give the full path).
Before you can run the script you probably have to give the script execution rights on your local machine. On macOS and Linux you do that with the chmod (change mode) command:
chmod 755 OSSL_mlmodel.py
Then you can run the script with the full, local path to the json file above as the only parameter:
For MacOS and Linux:
python OSSL_mlmodel.py "/local/path/to/docs-local/OSSL/model_ossl.json"
For Windows you need to state the full path to the Python interpreter in your conda virtual environment (not just “python” as on macOS and Linux):
"X:/Local/path/to/anaconda3/envs/ossl_py38a/python.exe" OSSL_mlmodel.py "/local/path/to/model_ossl.json"
With the parameter createjsonparams set to true the script will report that a template file was created:
json parameter file created: /Users/thomasgumbricht/docs-local/OSSL/Sweden/LUCAS/arranged-data/json-ml-model/template_model_ossl-spectra.json
Edit the json file for your project and rename it to reflect the commands.
Add the name of the edited file to your project file (the json file listed under projFN, ml-model.json in the example above).
Then set createjsonparams back to false in the specification file and rerun the script.
json command file structure
The json command file that defines the modelling of the OSSL data is very long and is explained in bits and pieces in the rest of this tutorial. If you want to preview the complete structure it is available under the Hide/Show toggle.
You have to edit the template to correspond to your OSSL dataset and your ideas on how to model the data. Details on how to edit command files are given in the posts on Import OSSL data and Plot OSSL data.
Selecting target features and regressors
The first thing you need to do before running the process-flow is to define the target features to predict and the regressors (ML models) to apply.
Target features
The target features to evaluate must of course correspond to properties that are available in the input data. The physical and chemical soil properties available are listed in the json output from when you arranged (imported) the OSSL data. Each re-arranged data point lists the laboratory observed soil properties under the tag abundances:
"abundances": [
{
"substance": "caco3_usda.a54_w.pct",
"value": 0.1
},
...
...
{
"substance": "silt.tot_usda.c62_w.pct",
"value": 28.0
}
]
}
The target features to choose from are those under the substance tags.
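To quickly list the candidate target features in a re-arranged data point, you can extract the substance tags with a few lines of Python. This is a minimal sketch; the file name is hypothetical and the exact layout of your arranged data may differ:

import json

# Load one re-arranged OSSL data point (hypothetical file name).
with open("arranged-data/datapoint.json") as f:
    datapoint = json.load(f)

# The candidate target features are the substance tags under abundances.
substances = [a["substance"] for a in datapoint["abundances"]]
print(substances)  # e.g. ['caco3_usda.a54_w.pct', ..., 'silt.tot_usda.c62_w.pct']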
In the json command file for the process-flow, list the target features you want to include, for example:
"targetFeatures": [
"cec_usda.a723_cmolc.kg",
"ec_usda.a364_ds.m",
"k.ext_usda.a725_cmolc.kg",
"n.tot_usda.a623_w.pct",
"oc_usda.c729_w.pct"
],
Regression models
You need to include at least one (1) regressor to either run a prediction or evaluate the importance of the covariates (i.e. the band reflectance or the derivatives). The present version of the process-flow implements the following regressors:
- Ordinary Least Square (OLS),
- Theil-Sen Regressor (TheilSen),
- Huber Regressor (Huber),
- k-nearest neighbors regressor (KnnRegr),
- decision tree regressor (DecTreeRegr),
- support vector regressor (SVR),
- random forest regressor (RandForRegr),
- multi-layer perceptron regressor (MLP), and
- Cubist.
MLP is a neural network type of regressor and Cubist is a rule-based regressor.
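Most of these regressors come from scikit learn. As a hedged sketch (my own illustration, not the actual code in OSSL_mlmodel.py), the abbreviations could map to sklearn classes like this:

from sklearn.linear_model import LinearRegression, TheilSenRegressor, HuberRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# Map process-flow abbreviations to scikit learn regressor classes.
REGRESSORS = {
    "OLS": LinearRegression,
    "TheilSen": TheilSenRegressor,
    "Huber": HuberRegressor,
    "KnnRegr": KNeighborsRegressor,
    "DecTreeRegr": DecisionTreeRegressor,
    "SVR": SVR,
    "RandForRegr": RandomForestRegressor,
    "MLP": MLPRegressor,
    # Cubist is not part of scikit learn; a separate package is needed.
}

def build_regressor(abbrev, hyper_params):
    """Instantiate a regressor from its abbreviation and hyperParams dict."""
    return REGRESSORS[abbrev](**hyper_params)

# Example: the RandForRegr entry from the json command file below.
model = build_regressor("RandForRegr", {"n_estimators": 30})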
In the json command file, each regressor must be given with its abbreviation and hyper-parameter (HyperParams) settings. If no hyper-parameters are given you still have to include the empty curly brackets (a dictionary in python terms):
"regressionModels": {
"OLS": {
"apply": true,
"hyperParams": {}
},
...
...
"RandForRegr": {
"apply": true,
"hyperParams": {
"n_estimators": 30
}
}
},
Do not worry about the hyper-parameters yet; just accept the defaults. If you want to try tuning the hyper-parameters, the post on hyper-parameter tuning will guide you.
Setting the model version
Machine Learning modelling in general becomes an iterative process where you change a few options to tweak the covariates and the model parameterisation in each loop. To facilitate comparing different results, without having to import (rearrange) the same data more than once, you can give each model setting a prefix. All outputs from running the script OSSL_mlmodel.py will be saved with this prefix. As the json parameter file is also saved as a copy with the prefix added, you can edit the original json command file, change the prefix and run again. The settings of all previous trials are saved separately, as long as you change the prefix. You can then compare the results before iterating again. The prefix for any model run is set as an output parameter:
"output": {
"prefix": "my1stModel",
"setdate": false
},
Spectral data preprocessing and fitting
In March 2024 I started adding preprocessing steps for improving the spectral input data. The objective of preprocessing is to remove noise, bias and other signal artefacts that are unrelated to the properties we want to predict, for instance:
- light scatter,
- baseline drift, or
- background effects.
In other words, preprocessing refines the spectral signals to the variations that matter for your prediction: getting rid of irrelevant information and making sure the data is represented in a way that is feasible for identifying the properties we are interested in.
Much of the inspiration for the development of the preprocessing steps comes from the youtube series Chemometrics & Machine Learning in Copenhagen:
- Preprocessing 1. Centering & Scaling,
- Preprocessing 2. Normalization, SNV and MSC,
- Preprocessing 3. Derivatives and baseline, and
- Preprocessing 4. Warping data
In the xSpectre process-flow, the spectra preprocessing steps are divided into two broad groups:
- Preparatory (unsupervised) preprocessing; processes that are applied equally during model formulation and independent predictions, and
- Fitted (supervised) preprocessing; processes that require parameter values carried over from the model formulation to the independent predictions.
Preparatory preprocessing
The preparatory preprocessing steps include:
- scatter correction (scattercorrectioncode),
- moving average (movingaverage),
- moving average clusters (movingaveragecluster),
- average clusters (averageclusters), and
- band ranging (bandranging).
The methods can be used in sequence, but only one of the three averaging/clustering processes (movingaverage, movingaveragecluster and averageclusters) can be applied in any particular model.
These processes are not yet implemented in the process-flow.
Fitted preprocessing
The fitted preprocessing steps include:
- splice correction (splicecorrection),
- derivatives, including different smoothing functions (derivatives),
- standardisation including different autoscaling algorithms and meancentring (standardise), and
- principal component analysis (pca).
The fitting steps can be used in sequence.
As of March 2024 only standardise and pca are implemented. Derivatives is part of the modelling process chain (see below) but will be converted to a fitted preprocessing step.
Standardisation
Standardisation can be parameterised to achieve the following data preparations:
- mean centring,
- auto-scaling (z-score normalisation),
- pareto scaling, and
- poisson scaling.
mean centring
In the model formulation, meancentring calculates an average spectrum from all the spectra belonging to a campaign. For each wavelength, the average spectrum signal is then subtracted from each original spectrum. This in effect removes any offset in the original spectra and forces an average signal of zero at each wavelength. This has several advantages, and few, if any, drawbacks, especially if you apply a principal component analysis (pca) as part of the spectra preprocessing.
Meancentring is applied by setting the parameters for standardisation accordingly:
"standardisation": {
"apply": true,
"paretoscaling": false,
"poissonscaling":false,
"meancentring": true,
"unitscaling": false
},
Auto-scaling (Z-score normalization)
Auto-scaling (or z-score normalisation, sometimes also just called standardisation) first applies meancentring and then rescales the standard deviation (std) to unity. To do an ordinary auto-scaling set the standardisation parameters:
"standardisation": {
"apply": true,
"paretoscaling": false,
"poissonscaling":false,
"meancentring": true,
"unitscaling": true
},
For more background on standardisation and normalisation, see https://www.kdnuggets.com/2020/04/data-transformation-standardization-normalization.html.
Pareto scaling
Pareto scaling divides each variable by the square root of its standard deviation, without applying any meancentring.
To do a Pareto scaling set the standardisation parameters:
"standardisation": {
"apply": true,
"paretoscaling": true,
"poissonscaling":false,
"meancentring": false,
"unitscaling": false
},
Note that Pareto scaling has precedence over the other options: if paretoscaling is set to true, the other alternatives are forced to false.
Poisson scaling
Poisson scaling (also known as square root mean scaling or “sqrt mean scale”) divides each variable by the square root of its mean, without applying meancentring. An offset is sometimes used for adjusting variables with near-zero values; the offset option is not implemented in the present version of the xSpectre process-flow.
To do a Poisson scaling set the standardisation parameters:
"standardisation": {
"apply": true,
"paretoscaling": false,
"poissonscaling": true,
"meancentring": false,
"unitscaling": false
},
Poisson scaling has precedence over meancentring and unitscaling: if poissonscaling is set to true, the latter are forced to false.
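To summarise the four options, here is a minimal numpy sketch (my own illustration, not the OSSL_mlmodel.py implementation) that mirrors the precedence rules described above:

import numpy as np

def standardise(spectra, meancentring=False, unitscaling=False,
                paretoscaling=False, poissonscaling=False):
    """Standardise spectra (rows = samples, columns = wavelengths)."""
    X = np.asarray(spectra, dtype=float)
    if paretoscaling:
        # Pareto: divide each wavelength by the square root of its std.
        return X / np.sqrt(X.std(axis=0))
    if poissonscaling:
        # Poisson ("sqrt mean scale"): divide by the square root of the mean.
        return X / np.sqrt(X.mean(axis=0))
    if meancentring:
        X = X - X.mean(axis=0)      # force a zero average at each wavelength
        if unitscaling:
            X = X / X.std(axis=0)   # auto-scaling (z-score normalisation)
    return X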
pca
Principal component analysis (pca) converts the input bands to a set of components that sequentially capture the remaining variation in the source band data. When applying pca, the user needs to define the number of components to retain.
pca is commonly applied to autoscaled data. In the xSpectre process-flow the user needs to set autoscaling manually in the previous step for the pca to calculate the reprojected components from normalised data.
"pcaPreproc": {
"apply": true,
"n_components": 8
},
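A hedged scikit learn sketch of this step, with autoscaling assumed to have been applied in the previous (standardisation) step and dummy spectra standing in for real OSSL data:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((60, 200))                         # dummy spectra: 60 samples x 200 bands
X_autoscaled = StandardScaler().fit_transform(X)  # autoscaling from the previous step
pca = PCA(n_components=8)                         # n_components as in the json above
X_pca = pca.fit_transform(X_autoscaled)
print(pca.explained_variance_ratio_)              # variation captured per component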
Setting the process steps
The general steps in the process-flow are:
- Target feature modifications
- Global data cleaning and selection
- Target property related feature selection
- Model related feature selection
- Feature importance evaluation
- Hyper-parameter tuning
- Model fitting and evaluation
The next sections go through the general steps outlined above. If you just want to run the OSSL_mlmodel package and do not need the instructions, just jump to the section Complete process-flow for Swedish OSSL data. Or go to the post Run ossl-xspectre modules.
Target feature modifications
Classical regressions, like Ordinary Least Square (OLS), perform best with data that is normally distributed (Gaussian). Many soil properties are far from normally distributed, as evident from figure 2 in the previous post on plotting. Mathematical transformation of datasets with skewed distributions, and statistical standardisation of data with deviating kurtosis, can improve model performance.
The process-flow can apply both mathematical transformation and statistical standardisation of target features as a pre-process. The setting of target feature modifications is done in a separate json file “targetfeaturetransforms”, defined under the input tag in each modelling command file:
"input": {
...
...
"targetfeaturetransforms": "/path/to/targetfeaturetransforms.json",
}
As the modification of target features is likely to be similar across different regions and datasets, the json file defining these transformations is not part of a particular project. You can use the same definitions, and file, across different projects.
Mathematical transformations
The following mathematical transformations of the target features are implemented in the present version of the process-flow:
- logarithmic (log),
- square root (sqrt),
- reciprocal,
- Box-Cox (boxcox), and
- Yeo-Johnson (yeojohnson).
If none of the above transformations is set, the data will be modelled in its original (“linear”) form. The command file lists all the possible mathematical transformations for each target feature, each with a Boolean (true/false) argument:
"targetFeatureTransform": {
"caco3_usda.a54_w.pct": {
"log": false,
"sqrt": false,
"reciprocal": false,
"boxcox": false,
"yeojohnson": false
},
In the above example all transformations are set to false, which means no transformation: the original data will be used as the target to predict in the modelling. If more than one transformation is set to true, the first in the order above (log - sqrt - reciprocal - boxcox - yeojohnson) is applied.
The log and sqrt transformations have the widest application; the reciprocal (1/x) is less used. Box-Cox and Yeo-Johnson are power transformers, where Yeo-Johnson is the more general solution, making Box-Cox largely redundant (Box-Cox only applies to positive data whereas Yeo-Johnson supports both positive and negative data). Box-Cox is still included as it is more widely known.
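As an illustration (a sketch of the behaviour described above, not the process-flow code itself), the five transformations and their precedence could look like this with numpy and scipy:

import numpy as np
from scipy import stats

def transform_target(y, settings):
    """Apply the first transformation set to true, in the documented order."""
    y = np.asarray(y, dtype=float)
    if settings.get("log"):
        return np.log(y)               # strictly positive data only
    if settings.get("sqrt"):
        return np.sqrt(y)              # non-negative data only
    if settings.get("reciprocal"):
        return 1.0 / y
    if settings.get("boxcox"):
        return stats.boxcox(y)[0]      # positive data only
    if settings.get("yeojohnson"):
        return stats.yeojohnson(y)[0]  # handles positive and negative data
    return y                           # no transformation ("linear")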
Statistical standardisation
Z-score standardisation transforms a dataset to zero mean and then assigns each value a score expressing its deviation from the mean in standard deviations. You can apply this statistical standardisation to any data, including data that has been transformed mathematically. In the json command file you apply the Z-score standardisation by setting the Boolean argument for that target feature to true:
"targetFeatureStandardise": {
"caco3_usda.a54_w.pct": true
},
Global data cleaning and selection
The OSSL data contain a large number of samples, each with hundreds or even thousands of recorded wavelength reflectances. The data can contain errors (outliers) and using all the data can lead to over-parameterised models. It is also inevitable that some wavelengths (or bands) contain less information compared to others; a band with no variation (i.e. constant reflection) does not contain any relevant information for ML modelling.
The global data cleaning and selection methods analyse the independent features (the covariates), disregarding both the target and the ML regressor used for estimating the target variations. This is a cruder way of discarding data compared to the methods that relate the covariates to the target and the estimator (the two following steps in the process-flow). The global methods are, however, comparatively fast, and all subsequent processing in the process-flow will use the cleaned and reduced dataset and thus also become faster.
Outlier detection and removal
To remove outliers, the process-flow implements four different outlier detectors available in the package scikit learn (sklearn).
In the json command file you turn the outlier detection and removal on by the following lines:
"removeOutliers": {
"apply": true,
"detector": "LocalOutlierFactor",
"contamination": 0.1
},
The only parameter that can be changed in the present version is contamination (which maps to the parameter nu in the detector OneClassSVM).
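For orientation, a minimal sketch of the same operation with scikit learn's LocalOutlierFactor (the detector named in the example above; the dummy spectra are for illustration only):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.random((100, 50))                         # dummy spectra
detector = LocalOutlierFactor(contamination=0.1)  # as in the json above
labels = detector.fit_predict(X)                  # 1 = inlier, -1 = outlier
X_clean = X[labels == 1]                          # keep the inliers only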
Variance threshold feature selection
Applying the sklearn method VarianceThreshold removes all low-variance features with variances below a stated threshold. The removal is unrelated to the target features and the regressor. To neutralise the ranges of the input features, the sklearn MinMaxScaler is applied as a pre-process. The only parameter that you can set in the process-flow command file is the threshold for retaining or discarding a feature; you will probably have to test different thresholds iteratively. In the json command file you include variance feature selection with these lines:
"globalFeatureSelection": {
"apply": true,
"scaler": "MinMaxScaler",
"varianceThreshold": {
"threshold": 0.025
}
},
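The corresponding scikit learn calls are straightforward; a minimal sketch with dummy data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.random((100, 200))                     # dummy spectra
X_scaled = MinMaxScaler().fit_transform(X)     # neutralise the feature ranges
selector = VarianceThreshold(threshold=0.025)  # threshold as in the json above
X_reduced = selector.fit_transform(X_scaled)   # low-variance bands are dropped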
Target property related feature selection
The global feature selection using variance threshold (the section above) is unrelated to the properties that are to be predicted. To select the features (covariates) that relate to a specific target (soil chemical and physical properties in this example), the process-flow implements univariate feature selection. Scikit learn includes several selectors that can be used with univariate feature selection, but at present only the KBest selector is implemented in the process-flow.
Another route for reducing the number of features is to cluster (agglomerate) them: features that show similar variations in relation to the target feature are collected into a single feature. For the process-flow I have written a Ward clustering routine that also includes a tuning function for determining the optimum number of clusters to form, run as a preprocess to the Ward clustering.
In the process-flow you can in principle invoke both univariate feature selection and clustering, but normally you would only use one in each model building exercise.
Univariate feature selection
There are several methods available for univariate feature selection. In the present version of the process-flow I have included SelectKBest. The only parameter to set is n_features, the number of features to retain from the selection. To invoke the KBest univariate feature selection in the process-flow, edit the json command file like this:
"targetFeatureSelection": {
"apply": true,
"univariateSelection": {
"apply": true,
"SelectKBest": {
"apply": true,
"n_features": 15
}
}
},
The KBest selection applies a univariate linear regression test returning F-statistics and p-values. It is set separately for each target feature (soil property) to model, and the selection of KBest features for one target property does not affect the selection of features for other soil properties.
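In scikit learn terms this corresponds to SelectKBest with the f_regression score function; a hedged sketch with dummy data:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.random((100, 200))                             # dummy spectra
y = rng.random(100)                                    # one soil property
selector = SelectKBest(score_func=f_regression, k=15)  # k = n_features in the json
X_selected = selector.fit_transform(X, y)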
Feature clustering
Scikit learn contains a range of methods for clustering, or feature agglomeration. The Ward clustering implemented in the process-flow has the advantage that you can tune the number of clusters to request from the main agglomeration. To include the Ward clustering in the process-flow, edit the json command file thus:
"featureAgglomeration": {
"apply": true,
"wardClustering": {
"apply": true,
"n_clusters": 0,
"affinity": "euclidean",
"tuneWardClustering": {
"apply": true,
"kfolds": 3,
"clusters": [
4,
5,
6,
7,
8,
9,
10,
11,
12
]
}
}
},
In the example above I have asked the tuning function to evaluate all cluster sizes between 4 and 12, and set the tuning process to a kfold strategy with 3 folds. As the function will seek an optimal number of clusters (for each target feature), I have set the n_clusters parameter for the main wardClustering to 0. If tuneWardClustering is not requested, that number must instead be set to the actual number of clusters requested from wardClustering.
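Scikit learn's FeatureAgglomeration implements Ward clustering of features; a minimal sketch (the process-flow's kfold tuning of the number of clusters is not reproduced here):

import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
X = rng.random((100, 200))                  # dummy spectra
agglo = FeatureAgglomeration(n_clusters=8, linkage="ward")
X_clustered = agglo.fit_transform(X)        # each cluster -> one aggregated feature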
Model related feature selection
The most advanced options for reducing the number of covariates consider both the target property and the applied regressor. The process-flow includes two methods for this kind of model related feature selection: Permutation Importance Selection and Recursive Feature Elimination (RFE). Permutation importance can be applied to all kinds of regressors, whereas RFE is not applicable to, for example, KnnRegr or MLP; if you request RFE for any of these, the process-flow will instead do a permutation importance selection. You can only apply one of the two model related feature selection methods in each modelling exercise. If both are turned on, the process-flow will automatically select permutation importance.
Permutation importance selection
Permutation importance is defined as the variation in a model score when a single feature value is randomly shuffled; the larger the difference in score, the more important the shuffled feature. For the permutation importance selection you can set permutationRepeats, step and n_features_to_select. Edit the json command file to invoke permutation importance selection:
"modelFeatureSelection": {
"apply": true,
"permutationSelector": {
"apply": true,
"permutationRepeats": 6,
"n_features_to_select": 12,
"step": 1
},
}
The implemented permutation importance selection method is also part of the process-flow's Feature importance evaluation, which allows you to graphically examine the relative importance of different features.
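A hedged sketch of how such a selection can be done with sklearn.inspection.permutation_importance (dummy data; the process-flow internals may differ):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.random((100, 50))                                  # dummy spectra
y = rng.random(100)                                        # one soil property
model = RandomForestRegressor(n_estimators=30).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=6)  # permutationRepeats
best = np.argsort(result.importances_mean)[::-1][:12]      # n_features_to_select
X_selected = X[:, best]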
Recursive Feature Elimination
RFE can be applied in two modes, with and without Cross Validation (CV). While RFECV takes a bit longer, it is the recommended mode. As noted above, RFE does not work with all regressors and is only applied if the permutation selector is turned off; if RFE/RFECV is invoked, regressors that cannot apply RFE/RFECV instead use permutation selection. To use RFE/RFECV as part of your process-flow, edit the following lines of the json command file:
"modelFeatureSelection": {
"apply": true,
"permutationSelector": {
"apply": false,
"permutationRepeats": 6,
"n_features_to_select": 12,
"step": 1
},
"RFE": {
"apply": true,
"CV": true,
"n_features_to_select": 12,
"step": 1
}
}
Note that RFE, and even more so RFECV, will take a long time if you have large datasets with many features. It is therefore recommended to first select features using the global variance threshold, univariate feature selection or feature clustering (or a combination of two of these methods). Doing so will reduce the number of features evaluated using RFE/RFECV and considerably speed up the processing.
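A minimal RFECV sketch with scikit learn (dummy data; only regressors exposing coef_ or feature_importances_, unlike KnnRegr and MLP, can be used):

import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 50))                      # dummy spectra
y = rng.random(100)                            # one soil property
selector = RFECV(LinearRegression(), step=1, cv=5,
                 min_features_to_select=12)    # n_features_to_select in the json
X_selected = selector.fit_transform(X, y)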
Feature importance evaluation
After having selected the features to use (or having skipped the selection), you can choose to evaluate the feature importances for each combination of target feature and regressor included in the project. The evaluation is done both for permutation importance (the decrease in a model score when a single feature value is randomly shuffled) and for model coefficients or feature importances (depending on the type of regressor).
To include feature importance evaluation edit the json command file:
"featureImportance": {
"apply": true,
"reportMaxFeatures": 12,
"permutationRepeats": 10
}
Apart from setting the parameter apply to true to invoke the feature importance evaluation, you also have to give the maximum number of features to include in the reporting, reportMaxFeatures (the reporting always shows the highest ranking features), and the number of permutationRepeats to use in the evaluation. Figure 1 shows the feature importance evaluation for the target feature oc_usda.c729_w.pct (organic carbon). The rows show results for the regressors OLS (Ordinary Least Square) and RandForRegr (Random Forest Regression), while the columns show permutation importance and coefficient (OLS) / estimator (RandForRegr) importance. The error bars show the standard deviation.
Permutation importance evaluation
The permutation importance is evaluated with the same methods applied for the Permutation importance selection. It applies to all regressors.
Coefficient importance
The coefficient importance (for linear regressors) or feature importance (for e.g. tree and forest regressors) reflects the relative weights of the features selected for modelling the selected target feature. For the tree/forest based regressors the feature importances can be evaluated statistically (mean and standard deviation), whereas for linear models only a single value is reported; this is reflected in both the text reports and the plots. Note that some regressors do not generate any coefficients or feature importances (e.g. KnnRegr and MLP), and then only the permutation importance is reported. If a multi-column plot includes coefficient importance, that column will be blank for these regressors.
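A hedged sketch of how the two kinds of importance could be read off a fitted model (illustration only, not the process-flow code):

def coefficient_importance(model):
    """Return coefficients or feature importances, or None if unavailable."""
    if hasattr(model, "coef_"):
        return model.coef_                 # linear regressors: single values
    if hasattr(model, "feature_importances_"):
        return model.feature_importances_  # tree/forest regressors: mean and std
    return None                            # e.g. KnnRegr, MLP: permutation only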
Hyper-parameter tuning
Hyper-parameters are parameters that determine how an estimator learns to find patterns and are not directly learnt within estimators. Put differently, each time an estimator is applied for fitting a model, the training depends on the setting of the hyper-parameters. Tweaking the hyper-parameter setting is thus an important part of fitting ML models to generate the best predictions for a particular target feature. It is, however, a rather complex and time-consuming process and is dealt with in a separate post: Model OSSL data: 3 hyper-parameter tuning.
For this post, I suggest that you leave the hyper-parameter tuning off:
"hyperParameterTuning": {
"apply": false,
"fraction": 0.5,
"nIterSearch": 6,
"n_best_report": 3,
"randomTuning": {
"apply": false
},
"exhaustiveTuning": {
"apply": false
}
}
Model fitting and evaluation
The last step in the process-flow is to predict the selected target features using the models defined above. There are two model fitting and testing concepts implemented in the process-flow:
- Dividing the dataset in train and test subsets, and
- Cross validated (Kfold) iterative testing.
You can choose to skip model fitting and evaluation altogether, for instance if your objective is to select which features to use. Or you can choose to run one of the two model test concepts, or both. Editing the json command file follows the same principles as above; to apply both model test methods, edit the command file like this:
"modelTests": {
"apply": true,
"trainTest": {
"apply": true,
"testSize": 0.3,
},
"Kfold": {
"apply": true,
"folds": 10,
}
}
For the trainTest method you need to give the fraction of the input data to use for testing, as the parameter testSize. For the Kfold method you must give the number of folds.
train/test divided prediction
The train/test method for determining model prediction power divides the input dataset into two subsets: one for calibrating the regressor “coefficients”, and a (usually) smaller fraction for testing the calibrated settings. A 70/30 percent (0.7/0.3) division is widely adopted. The train/test method is less comprehensive than the Kfold method.
cross validated prediction
Cross validated prediction splits the data into multiple training and testing subsets, repeated (folded) as stated by the parameter folds. The test size equals the inverse of the number of folds (i.e. if folds = 10 then the test size is 0.1), and each data point is included in the test fraction once and only once.
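As a sketch of the two concepts in scikit learn terms (dummy data; the process-flow's actual evaluation code may differ):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_predict

rng = np.random.default_rng(0)
X = rng.random((100, 50))                    # dummy spectra
y = rng.random(100)                          # one soil property
model = RandomForestRegressor(n_estimators=30)

# trainTest: calibrate on 70 %, test on 30 % (testSize = 0.3).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(model.fit(X_train, y_train).score(X_test, y_test))

# Kfold: each sample is predicted exactly once across 10 folds.
y_pred = cross_val_predict(model, X, y, cv=10)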
Complete process-flow for Swedish OSSL data
The complete process-flow for the Swedish OSSL data (downloaded and arranged in the previous posts) is available under the toggle button.
Running the process-flow will generate three different types of results:
- a json formatted report,
- conserved model settings (“pickle” files) for all combinations of regressor and target features, and
- plots of feature importances and model predictions, also for all combinations of regressor and target features (optional).
All output files will have the prefix defined under the output tag, as stated above.
json report
The json report includes the outcome of each step included in the process-flow. In addition it lists all the covariates selected for each combination of regressor and target feature; the mean square error and score of each combination are also given. The model settings for the reported results are always saved as “pickle” (conserved) model states. These “pickle” files can be loaded and the model settings applied to any other dataset that contains the same features and bands (i.e. OSSL data over other countries/regions or sample dates).
The json result file for the Swedish OSSL data and the process-flow defined above is almost 1000 rows and hidden under the toggle button.
Conserved model settings
The conserved model settings are stored in “pickle” files; with one pickle file for each combination of regressor and target feature. The pickle files are under the subfolder pickle. The “pickle” files are in binary format and not intelligible for a human reader. The actual settings and data used are reported in a human intelligible format in the json report.
Image plots
If requested in the json command file, a range of image plots showing feature importances and model performance are saved as png files. The next post in this series gives the details on how to lay out and save the results as image plots. You can choose to save png images of each individual plot, or save multi-row/multi-column plots where the rows represent either different target features and a single regressor, or different regressors and a single target feature.
If you choose to save the full suite of image plots from the json command file for the Swedish OSSL data above, the process-flow will render 32 individual image plots and 6 multi-plot images.
Single plot images generated from a combination of 4 target features and 2 regressor models:
- 4 targets * 2 models = 8 permutation importance plot images,
- 4 targets * 2 models = 8 feature importance plot images,
- 4 targets * 2 models = 8 train/test model prediction plot images, and
- 4 targets * 2 models = 8 cross-validated model prediction plot images.
Multi plot images generated from a combination of 4 target features and 2 regressor models:
- 4 multi-row target images, and
- 2 multi-row regressor images.
The columns to include in the multi-row plots are defined in the json command file (and explained in the next post). Altogether there are 4 columns that can be included in the multi-row plots:
- permutation importance (permutationImportance),
- feature (coefficient) importance (featureImportance),
- train/test model prediction (trainTest) and
- cross validated model prediction (Kfold).
The columns to include are listed in the json command file using the compact text listed above:
"columns": [
"permutationImportance",
"featureImportance",
"Kfold"
]
The individual panels of the multi-row plots are identical to the single panel plots, thus the examples in figures 3 to 5 illustrate the layout of most plots.
Symbolisation and plot layout
In the plot examples above I have set different colours to each target feature and different markers (symbols) for each regressor. Apart from feature and regressor symbolisation, you can also set height and width, including the space distribution between the sub-plots etc. The topic of the next post in this series is how to alter the symbolisation and layout of the plots.