Introduction
This post is an attempt at using Machine Learning to predict soil properties from spectral data.
Prerequisites
The prerequisites are the same as in the previous posts in this series: a Python environment with numpy, pandas, sklearn (scikit-learn) and matplotlib installed.
Module Skeleton
The module skeleton code will be made available.
The complete code of the module created in this post is available at GitHub.
Material and methods
The data used are from the Open Soil Spectral Library (OSSL) and cover south central Sweden.
Results
NIR broadband models
Soil Organic Carbon (SOC)
1 degree train/test
1 degree kfold (n=10)
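A 10-fold evaluation of a 1-degree (plain linear) model can be sketched as below. The data here are synthetic stand-ins for the OSSL spectra (11 NIR bands, one SOC-like target); the model and fold settings are illustrative, not the exact configuration used for the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the OSSL data: 200 samples x 11 NIR bands,
# with a target loosely related to the spectra (illustration only)
rng = np.random.default_rng(42)
X = rng.random((200, 11))
y = X @ rng.random(11) + rng.normal(0, 0.1, 200)

# 1-degree (linear) model evaluated with 10-fold cross-validation
model = LinearRegression()
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")
print(f"mean R2 over 10 folds: {scores.mean():.3f}")
```

Shuffling before splitting matters here: if the samples are ordered geographically, unshuffled folds can make the cross-validation estimate pessimistic.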
2 degree train/test
The 2-degree models, created with sklearn.preprocessing.PolynomialFeatures, expand the original 11 bands into 77 covariates. These models are thus massively overparameterised.
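The expansion from 11 bands to 77 covariates follows directly from a degree-2 polynomial feature transform: 11 linear terms, 11 squared terms, and 55 pairwise interactions. A minimal sketch with random data in place of the spectra:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 5 samples x 11 spectral bands (random stand-in data)
X = np.random.random((5, 11))

# Degree-2 expansion: 11 linear + 11 squared + 55 interaction terms = 77
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape)  # (5, 77)
```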
2 degree kfold
As in the train/test case, these 2-degree models expand the original 11 bands into 77 covariates via sklearn.preprocessing.PolynomialFeatures and are thus massively overparameterised.
1 degree kfold after feature selection
Using 11 consecutive bands probably causes overfitting, as the bands are closely correlated. A simple way to reduce the number of covariates (bands) is to retain only those with a variance above a given threshold. In this example I have used the variance threshold feature selection method, removing 8 bands and retaining only the 3 bands with the highest variance for model development.
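The idea can be sketched as follows. The spectra are synthetic, and the threshold is derived from the sorted band variances so that exactly the 3 highest-variance bands survive; on real data the threshold value would need to be chosen the same way.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Synthetic spectra: 11 bands, the last 3 with clearly larger variance
rng = np.random.default_rng(0)
X = rng.normal(0, 0.1, (100, 11))
X[:, 8:] = rng.normal(0, 1.0, (100, 3))  # high-variance bands

# Place the threshold between the 3rd and 4th largest band variances,
# so only the 3 highest-variance bands are retained
variances = np.sort(X.var(axis=0))[::-1]
threshold = (variances[2] + variances[3]) / 2

selector = VarianceThreshold(threshold=threshold)
X_selected = selector.fit_transform(X)
print(X_selected.shape)  # (100, 3)
```

Note that variance-based selection is unsupervised: it ignores the target entirely, so a high-variance band is not necessarily an informative one.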
2 degree kfold after feature selection
As for the 1-degree case, I have used the variance threshold feature selection method, removing 8 bands and retaining only the 3 bands with the highest variance for model development.
1 degree kfold after KBest feature selection
Here I have used SelectKBest to retain the 2 bands with the highest scores. The retained bands (660 and 700 nm) differ from the bands with the highest variance (940, 980, and 1020 nm).
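Unlike the variance threshold, SelectKBest is supervised: it scores each band against the target, which is why it can pick different bands than the variance criterion. A minimal sketch with synthetic data, using f_regression as an assumed scoring function (the post does not state which one was used):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic spectra: 11 bands; the target depends mainly on bands 3 and 4
rng = np.random.default_rng(1)
X = rng.random((150, 11))
y = 2.0 * X[:, 3] - 1.5 * X[:, 4] + rng.normal(0, 0.05, 150)

# Retain the 2 bands with the highest univariate regression score
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the 2 retained bands
```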
2 degree kfold after KBest feature selection
1 degree kfold after RFE feature selection
1 degree kfold after RFECV feature selection
RFECV includes an internal routine for selecting an optimal set of covariates. The selection depends on the model used, and thus the number of features (covariates) varies.
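The behaviour can be sketched as below: RFECV recursively drops the weakest covariate according to the estimator's coefficients, and uses cross-validation to decide how many to keep. The data are synthetic, and the estimator and fold settings are assumptions for illustration, not the post's exact configuration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic spectra: the target is driven by a subset of the 11 bands
rng = np.random.default_rng(2)
X = rng.random((200, 11))
y = X[:, 0] + 2 * X[:, 5] - X[:, 9] + rng.normal(0, 0.05, 200)

# RFECV picks the number of retained features that maximises the
# cross-validated score; the result depends on the estimator used
selector = RFECV(estimator=LinearRegression(),
                 cv=KFold(n_splits=10, shuffle=True, random_state=2),
                 scoring="r2")
selector.fit(X, y)
print("optimal number of bands:", selector.n_features_)
print("retained band indices:", np.flatnonzero(selector.support_))
```

Because the optimal count is data- and estimator-dependent, two different regressors run through RFECV on the same spectra can legitimately retain different numbers of bands.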
Resources
Polynomial Regression in Python using scikit-learn, by Tamas Ujhelyi.