This post is part of a series on organising and analysing data from the Open Soil Spectral Library (OSSL). It is also the second post in a sub-series on how to apply Machine Learning (ML) for predicting soil properties from spectral data. To run the scripts used in this post you need to set up a Python environment and clone or download the Python scripts from a GitHub repository (repo), as explained in the post Clone the OSSL python package.
Introduction
The previous post covered the overall process flow of the Python script OSSL_mlmodel.py and how to get model results both as json documents and as plots. The topic of this post is how to change the symbolisation and plot layout of the graphical results produced by OSSL_mlmodel.py. The figures with the plots are all generated with the Python library matplotlib.
Prerequisites
To follow the hands-on instructions, you must have completed the processing outlined in the posts on downloading and importing the OSSL spectral data. You must also have access to a Python interpreter with the packages matplotlib and scikit-learn (sklearn) installed.
Matplotlib
matplotlib is a versatile plotting library that can be used for designing and constructing complex plots, including figures with rows and columns of subplots. This post on symbolisation and layout of the results from Machine Learning (ML) modelling uses only a very limited set of matplotlib tools.
Basic layout options
There are 4 basic plot layout options that can be set for displaying the result of the ML modelling in OSSL_mlmodel.py (illustrated in figure 1):
- Single plot showing feature importance for a combination of one target feature and one regressor (bar chart),
- Single plot showing the predictive power for a combination of one target feature and one regressor (scatter plot),
- Multi-plot with rows and columns for a single target feature, where each row shows the results of a different regressor applied to predict that target feature, and
- Multi-plot with rows and columns for a single regressor, where each row shows the results of applying that regressor to a different target feature.
Alternatives 1 (feature importance) and 2 (predictive power) are each available in 2 versions. Feature importance can be either permutation importance or coefficient importance, and predictive power can be derived from either a train/test split prediction or a cross-validated prediction. There are thus in total 4 different kinds of individual plots that can be defined (figure 2).
Because there are 4 kinds of individual plots, a multi-plot has a maximum of 4 columns, one for each kind. The rows in a multi-plot are defined by the input items of the modelling: for single feature multi-plots the number of rows equals the number of regressors to be tested; for single regressor multi-plots the number of rows equals the number of target features to be modelled. The columns to include in the multi-plots are then defined using the following code words:
- permutationImportance
- featureImportance
- trainTest
- Kfold
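The relation between rows, columns and code words described above can be sketched in a few lines of Python. This is only an illustration: the helper name `multiplot_shape` and the error handling are assumptions and not taken from OSSL_mlmodel.py, and the code words are spelled as in the json examples in this post.

```python
# Sketch only: the helper name and the validation are assumptions,
# not part of OSSL_mlmodel.py.
VALID_COLUMNS = ("permutationImportance", "featureImportance", "trainTest", "Kfold")


def multiplot_shape(row_items, columns):
    """Return (n_rows, n_cols) for a multi-plot.

    row_items: regressors (single feature multi-plot) or
               target features (single regressor multi-plot).
    columns:   column code words from the json command file.
    """
    unknown = [c for c in columns if c not in VALID_COLUMNS]
    if unknown:
        raise ValueError(f"Unknown column code word(s): {unknown}")
    return len(row_items), len(columns)


# Two regressors and three columns give a 2 x 3 grid of subplots.
shape = multiplot_shape(
    ["OLS", "RandForRegr"],
    ["permutationImportance", "featureImportance", "Kfold"],
)
print(shape)  # (2, 3)
```

A check like this catches a misspelled code word before any plotting starts, which is easier to debug than a silently missing column.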
Single feature multi-plots
The json coding for creating multi-plots for target features looks like this:
"targetFeatures": {
"apply": true,
"figSize": {
"x": 0,
"y": 0,
"xadd": 0,
"yadd": 0
},
"hwspace": {
"hspace": 0.25,
"wspace": 0.25
},
"columns": [
"permutationImportance",
"featureImportance",
"Kfold"
]
}
The example above creates a 3-column multi-plot for each target feature defined in the json command file, figure 3.
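As a rough illustration of how such settings can translate into a matplotlib figure, the sketch below builds an empty grid with one row per regressor and one column per code word. The regressor list, the fixed figure size and the subplot titles are assumptions made for this example; the actual drawing in OSSL_mlmodel.py may be organised differently.

```python
import json

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# A reduced copy of the "targetFeatures" settings shown above.
settings = json.loads("""
{
  "apply": true,
  "hwspace": {"hspace": 0.25, "wspace": 0.25},
  "columns": ["permutationImportance", "featureImportance", "Kfold"]
}
""")

regressors = ["OLS", "RandForRegr"]  # one row per regressor (example values)

fig, axes = plt.subplots(
    nrows=len(regressors),
    ncols=len(settings["columns"]),
    figsize=(9, 6),   # arbitrary size chosen for this sketch
    squeeze=False,    # always return a 2D array of axes
)

# hspace and wspace are fractions of the average subplot height and width.
fig.subplots_adjust(
    hspace=settings["hwspace"]["hspace"],
    wspace=settings["hwspace"]["wspace"],
)

# Label each empty subplot with its row and column meaning.
for r, regressor in enumerate(regressors):
    for c, column in enumerate(settings["columns"]):
        axes[r, c].set_title(f"{regressor}: {column}", fontsize=8)
```

Calling `fig.savefig("layout_sketch.png")` afterwards writes the empty layout to a file, which is a quick way to test spacing values before running the full modelling.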
Single regressor multi-plots
The json coding for creating multi-plots for regressors looks like this:
"regressionModels": {
"apply": true,
"figSize": {
"x": 0,
"y": 0,
"xadd": 0.25,
"yadd": 0.25
},
"hwspace": {
"hspace": 0.25,
"wspace": 0.25
},
"columns": [
"permutationImportance",
"trainTest",
"Kfold"
]
}
The example above creates a 3-column multi-plot for each regressor defined in the json command file, figure 4.
Target feature symbolisation
The target features (soil properties) are symbolised either as bars (in the bar charts showing the feature importance) or as scatter markers (in the observed versus predicted scatter plots showing model performance). The colour and label for each target feature, applied to both bars and scatter markers, are set in the json command file:
"targetFeatureSymbols": {
"caco3_usda.a54_w.pct": {
"color": "whitesmoke",
"label": "CaCO3"
},
"cec_usda.a723_cmolc.kg": {
"color": "seagreen",
"label": "Cation Exc. Cap."
},
"cf_usda.c236_w.pct": {
"color": "sienna",
"label": "Coarse fraction"
},
"clay.tot_usda.a334_w.pct": {
"color": "tan",
"label": "Clay cont."
},
"ec_usda.a364_ds.m": {
"color": "dodgerblue",
"label": "Electric cond."
},
"k.ext_usda.a725_cmolc.kg": {
"color": "lightcyan",
"label": "Potassium (K)"
},
"n.tot_usda.a623_w.pct": {
"color": "darkcyan",
"label": "Nitrogen (N) [tot]"
},
"oc_usda.c729_w.pct": {
"color": "dimgray",
"label": "Organic carbon (C)"
},
"p.ext_usda.a274_mg.kg": {
"color": "firebrick",
"label": "Phosphorus (P)"
},
"ph.cacl2_usda.a481_index": {
"color": "lemonchiffon",
"label": "pH (CaCl2)"
},
"ph.h2o_usda.a268_index": {
"color": "lightyellow",
"label": "pH (H2O)"
},
"sand.tot_usda.c60_w.pct": {
"color": "orange",
"label": "Sand cont."
},
"silt.tot_usda.c62_w.pct": {
"color": "khaki",
"label": "Silt cont."
}
}
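To show how a colour and label of this kind can be passed on to matplotlib, here is a minimal sketch of a bar chart for one target feature. The band names and importance values are made up for illustration, and the variable names are assumptions; only the colour name and label come from the settings above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# One entry copied from "targetFeatureSymbols" above.
target_feature_symbols = {
    "cec_usda.a723_cmolc.kg": {"color": "seagreen", "label": "Cation Exc. Cap."},
}

target = "cec_usda.a723_cmolc.kg"
symbol = target_feature_symbols[target]

# Made-up band names and importance values, for illustration only.
bands = ["b1", "b2", "b3"]
importances = [0.5, 0.3, 0.2]

fig, ax = plt.subplots()
# The named colour fills the bars and the label ends up in the legend.
bars = ax.bar(bands, importances, color=symbol["color"], label=symbol["label"])
ax.set_ylabel("importance")
ax.legend()
```

The colour names used in the json file (seagreen, sienna, dodgerblue and so on) are standard named colours that matplotlib accepts directly as strings.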
Regressor symbolisation
If you look carefully at the multi-row plots in figure 3 (showing the prediction of organic carbon using OLS and RandForRegr) you can see that the markers have different shapes. The marker shape and size can be set differently for each regressor in the json command file:
"regressionModelSymbols": {
"OLS": {
"marker": ".",
"size": 100
},
"TheilSen": {
"marker": "v",
"size": 50
},
"Huber": {
"marker": "^",
"size": 50
},
"KnnRegr": {
"marker": "s",
"size": 50
},
"DecTreeRegr": {
"marker": "P",
"size": 50
},
"SVR": {
"marker": "*",
"size": 50
},
"RandForRegr": {
"marker": "h",
"size": 50
},
"MLP": {
"marker": "D",
"size": 50
},
"Cubist": {
"marker": "D",
"size": 50
}
},
"modelTests": {
"apply": true,
"trainTest": {
"apply": true,
"testSize": 0.3,
"plot": true,
"marker": "s"
},
"Kfold": {
"apply": true,
"folds": 10,
"plot": true,
"marker": "."
}
}
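The marker and size settings map naturally onto the `marker` and `s` arguments of matplotlib's `scatter`. The sketch below assumes that mapping and uses made-up observed and predicted values; it does not cover how the separate markers under modelTests are combined with the regressor markers in OSSL_mlmodel.py.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Two entries copied from "regressionModelSymbols" above.
regression_model_symbols = {
    "OLS": {"marker": ".", "size": 100},
    "RandForRegr": {"marker": "h", "size": 50},
}

# Made-up observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.2, 1.8, 3.3, 3.9]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(6, 3))
collections = {}
for ax, (name, symbol) in zip(axes, regression_model_symbols.items()):
    # marker sets the shape; s sets the marker area in points squared.
    collections[name] = ax.scatter(
        observed,
        predicted,
        marker=symbol["marker"],
        s=symbol["size"],
        color="dimgray",
    )
    ax.set_title(name)
    ax.set_xlabel("observed")
    ax.set_ylabel("predicted")
```

Because `s` is an area rather than a diameter, a size of 100 gives a marker roughly 1.4 times as wide as a size of 50, not twice as wide.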
Horizontal and vertical spacing
For the multi-row/multi-column plots it is a bit tricky to get the height and width spacing correct, as the axis labels often tend to overlap. The problem is compounded because different target features have different units and thus different numerical ranges, which leads to tick labels of different widths. Matplotlib does not handle that automatically, and thus I have added the possibility of manually setting both the figure height and width and the height and width spacing.
"figSize": {
"x": 0,
"y": 0,
"xadd": 0,
"yadd": 0
},
"hwspace": {
"hspace": 0.25,
"wspace": 0.25
},
If, as in the example above, figSize x, y, xadd and yadd are all set to 0, the script will look for default settings under subFigSize:
"subFigSize": {
"x": 3,
"y": 3,
"xadd": 0.1,
"yadd": 0.1
}
If subFigSize is used, the total figure size will be set as:
- width = “nr of columns” * x + xadd
- height = “nr of rows” * y + yadd
By tuning these numbers it is possible to achieve a plot with good-looking distances between the subplots, both horizontally and vertically.
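The size rule for subFigSize given above can be written as a small function. The function name is an assumption made for this sketch, and it covers only the documented subFigSize case, not what happens when figSize holds non-zero values.

```python
def total_fig_size(n_rows, n_cols, sub_fig_size):
    """Return (width, height) following the subFigSize rule in this post."""
    width = n_cols * sub_fig_size["x"] + sub_fig_size["xadd"]
    height = n_rows * sub_fig_size["y"] + sub_fig_size["yadd"]
    return width, height


# Default values copied from the "subFigSize" example above.
sub_fig_size = {"x": 3, "y": 3, "xadd": 0.1, "yadd": 0.1}

# A multi-plot with 2 rows (e.g. two regressors) and 3 columns.
width, height = total_fig_size(n_rows=2, n_cols=3, sub_fig_size=sub_fig_size)
print(round(width, 2), round(height, 2))  # 9.1 6.1
```

If the result is handed to matplotlib as a `figsize`, the numbers are interpreted as inches, so the example corresponds to a figure about 9.1 inches wide and 6.1 inches high.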