---
title: Machine learning workflows for predictive soil mapping
author: M.G. Walsh, J. Rutebuka and R. Manners
date: "`r format(Sys.time(), '%d, %B, %Y')`"
output:
  html_document:
    toc: true
    toc_depth: 1
    fig_caption: true
    css: style.css
---
```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
# Introduction
This notebook provides practical guidelines for predictive soil mapping (PSM) using machine learning (ML). The specifics can, and of course should, be modified to fit the particular predictive soil mapping tasks of interest. Our main intent here is to provide a reproducible, generalized mapping framework and the associated ML computing workflows in [R](https://www.r-project.org/). The data and workflows should also be readily transferable to other computing environments if needed. The notebook itself is maintained on [Github](https://github.com/mgwalsh/Soils/blob/master/PSM_guide.Rmd), and you can fork and modify it from there as you see fit.

The first section of the notebook sets up georeferenced soil survey, remote sensing and GIS data of Rwanda for spatial analyses and predictions. We focus on **Rwanda's croplands**, which are the primary [Region of Interest (ROI)](https://en.wikipedia.org/wiki/Region_of_interest) in this context and the target for various land management interventions of the [RwaSIS project](https://minagri.prod.risa.rw/updates/news-details/rwanda-launches-soil-information-service-project-to-improve-agriculture-productivity). Based on a recent high-resolution [GeoSurvey (2019)](https://osf.io/w5e6c/), croplands are currently estimated to occupy ~68% of Rwanda's overall land area (of ~2.37 Mha). The gridded data offer several advantages, including Africa-wide coverage and consistency, and a robust analysis process that ensures high-quality predictions for use in your soil mapping applications.

The subsequent sections illustrate prediction workflows that use ensembles of machine learning algorithms (MLAs) with various remote sensing and GIS training inputs and validation layers; see the review articles collected at [Ensemble models](https://www.sciencedirect.com/topics/computer-science/ensemble-modeling). There are also some very good free-and-open books (e.g. [Lovelace et al, 2021](https://geocompr.robinlovelace.net/) and [Hengl & MacMillan, 2019](https://soilmapper.org/)) available if you'd like to explore the subject matter in greater depth.

While the [legacy soil data](https://en.wikipedia.org/wiki/Legacy_system) used in this example were not collected on a reproducible sampling frame, they may provide an initial indication of where specific soil problems or nutrient deficiencies/imbalances are prevalent in Rwanda. As new data become available over the course of the project, RwaSIS and its partners should update the relevant data sources and soil mapping workflows.

In the last section of the notebook, we produce initial (robust) uncertainty estimates of the soil property predictions generated by our ensemble model workflow via [Quantile regression](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf). These estimates show where in Rwanda the current models fit the ground measurements reasonably well and where they do not. As new data are generated by RwaSIS (and others), these uncertainties are likely to decrease, potentially increasing the level of confidence in any resulting cropland management recommendations (e.g., for fertilizer, lime amendments and/or soil erosion/deposition predictions).
# General data setup
To run this notebook, you will need to load the packages indicated in the chunk directly below. These allow you to assemble the different example dataframes and rasters and provide many options for generating spatial predictions via machine learning and/or geostatistical algorithms. We encourage you to explore those options!
```{r}
# Package names
packages <- c("osfr", "rgdal", "sp", "raster", "quantreg", "leaflet", "DT", "devtools", "caret", "caretEnsemble", "mgcv", "MASS", "randomForest", "xgboost", "nnet", "Cubist", "plyr", "doParallel")
# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
```
## Data downloads
The following chunk downloads the data that we have assembled for this example. Specifically, it retrieves georeferenced field observations of soil measurements (e.g., pH, soil organic carbon and soil nutrient composition) and links these to the remote sensing and GIS images represented by the `grids` feature stack. We focus on predicting topsoil pH here, but the approach is equally applicable to any [georeferenced](https://en.wikipedia.org/wiki/Georeferencing) soil property or measurement under consideration.
```{r, results = 'hide'}
# Set working directory
dir.create("PSM", showWarnings = F)
setwd("./PSM")
dir.create("Results", showWarnings = F)
# Download soil data
osf_retrieve_file("djcfz") %>% osf_download(conflicts = "overwrite")
unzip("RW_wetchem.zip", overwrite = T)
prof <- read.table("Profiles.csv", header = T, sep = ",")
samp <- read.table("Samples.csv", header = T, sep = ",")
geos <- merge(prof, samp, by="pid")
# Download and assemble raster stacks (https://osf.io/h24wd/)
osf_retrieve_file("h24wd") %>% osf_download(conflicts = "overwrite")
unzip("RW_250m_2020.zip", overwrite = T)
glist <- list.files(pattern="tif", full.names = T)
grids <- stack(glist)
# Download figures
osf_retrieve_file("42gry") %>% osf_download(conflicts = "overwrite")
unzip("figures.zip", overwrite = T)
```
The processed Rwanda remote sensing and GIS data (the features in `grids` in the chunk above) were derived and transformed from their primary open sources. You can also download the entire `grids` raster stack directly at [RwaSIS grids](https://osf.io/h24wd/). Short descriptions of the included features and their sources are provided in the table immediately below.
\
```{r, echo=FALSE, results='hide'}
osf_retrieve_file("e2x4s") %>% osf_download(conflicts = "overwrite")
vars <- read.table("Grid variables.csv", header = T, sep = ",")
```
```{r, echo = FALSE, results = 'asis'}
datatable(vars)
```
\
These Rwanda-wide (actually Africa-wide) features will change over time, and we will update them if and when needed. Also note that they are grouped by factors that designate cropland soil condition (*x*) as a function of **f(x) ~ (a,c,o,r,s)**, where (a short sketch after this list shows one way to keep the grouping explicit):
* a - anthropic variables
* c - climatic variables
* o - organismal variables (primarily vegetation/land cover related)
* r - relief and erosion/deposition related terrain variables
* s - parent material and soil related variables
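To see which gridded features fall into each group, you can list the layer names of the `grids` stack and keep the mapping explicit alongside them. The commented group list below is only an illustration: the layer names used there are placeholders, not the actual grid names, so check the table above for the real feature names and sources.
```{r, eval = FALSE}
# minimal sketch: list the gridded feature names and compare them against the table above
names(grids)
# one (hypothetical) way to keep the a, c, o, r, s grouping explicit for later feature
# selection -- the layer names below are placeholders only:
# covgroups <- list(a = c("dist_roads", "nightlights"),
#                   c = c("mean_annual_rainfall"),
#                   o = c("ndvi"),
#                   r = c("elevation", "slope"),
#                   s = c("parent_material"))
# gsdat[, covgroups$r]  ## e.g., select only the relief covariates once gsdat is assembled below
```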
## Data assembly
The next chunk sets up the soil and site properties in a dataframe that will generate the training and validation subsets for the different MLAs we shall apply to the spatial prediction of soil properties. The soil property data presented here are courtesy of [CROPNUTS](https://cropnuts.com/).
```{r}
# Project legacy data coords to grid CRS
geos.proj <- as.data.frame(project(cbind(geos$lon, geos$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(geos.proj) <- c("x","y")
geos <- cbind(geos, geos.proj)
coordinates(geos) <- ~x+y
projection(geos) <- projection(grids)
# Extract gridded variables at survey locations
geosgrid <- extract(grids, geos)
gsdat <- as.data.frame(cbind(geos, geosgrid))
# Randomize dataframe
set.seed(1235813)
gsdat <- gsdat[sample(1:nrow(gsdat)), ]
```
The following chunk writes the randomized dataframe out as `RW_soil_data.csv` in your `./PSM/Results` directory, in case you'd like to process those outputs in software other than R. It also generates a location map of where the soil samples were obtained.
```{r}
# Write data frame
write.csv(gsdat, "./PSM/Results/RW_soil_data.csv", row.names = F)
# Soil sample locations
w <- leaflet() %>%
setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(gsdat$lon, gsdat$lat, clusterOptions = markerClusterOptions())
w ## plot widget
```
# Machine-learning-based predictive mapping with `caret` and `caretEnsemble` {.tabset .tabset-pills}
The following chunks predict topsoil pH values using contrasting machine learning algorithms (MLAs) with varying remote sensing and GIS (feature) inputs, using the [`caret`](https://topepo.github.io/caret/) and [`caretEnsemble`](https://www.rdocumentation.org/packages/caretEnsemble/versions/2.0.1) packages. The generalized MLA prediction workflow is shown in the figure below. This approach has won many data science competitions, e.g. at [Kaggle](https://www.kaggle.com/), which also offers some excellent, openly available data science resources, courses and challenges.
```{r training_validation_approach, echo=FALSE, fig.align="center", fig.cap="MLA training, validation and prediction workflow.", out.width = '80%'}
knitr::include_graphics("./PSM/training_validation.png")
```
The main idea is to train a number of potentially contrasting models with [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). At the end of the model training process, the various models are ensembled (combined/stacked) on an *independent* validation dataset. When repeated over time and space as new field data, lab data and MLAs become available, this iterative updating (loosely analogous to [Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)) should produce increasingly accurate predictions.

The following chunk scrubs some of the objects in memory, removes incomplete cases (if there are any), sets up the labels and features, and creates a randomized partition between the training and validation dataframes. Everything is parallelized to facilitate efficient use of either local or cloud-based computing resources. Note that there are other parallelization options available (e.g. [snowfall](https://cran.r-project.org/web/packages/snowfall/snowfall.pdf)).
```{r}
rm(list=setdiff(ls(), c("gsdat","grids","glist"))) ## scrub extraneous objects in memory
gsdat <- gsdat[complete.cases(gsdat[ ,c(24:75)]),] ## removes incomplete cases
# set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
# split data into calibration and validation sets
gsIndex <- createDataPartition(gsdat$pH, p = 4/5, list = F, times = 1)
gs_cal <- gsdat[ gsIndex,]
gs_val <- gsdat[-gsIndex,]
# Soil calibration labels
labs <- c("pH") ## insert other labels (e.g. "C","N","P","K", "eCEC", ...) here!
lcal <- as.vector(t(gs_cal[labs]))
# raster calibration features
fcal <- gs_cal[c(24:43,47:72)]
```
Note that we are using topsoil (0-20 cm) pH as an example here. You can substitute other soil properties (e.g., SOC, P, K) as labels in the `labs` variable. You may want to transform those `labs` variables prior to model fitting, and perhaps also tune the respective models, to better represent the distributional and/or compositional attributes of the individual soil properties, should the need arise. Those additional variables are included in the `RW_soil_data.csv` data file for you to try; a short sketch of a label transform follows below. Click on the blue tabs further down to see how the individual models were trained with k-fold cross-validation and subsequently stacked using the validation (unseen) data.
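For example, a strongly right-skewed label such as soil organic carbon is often modeled on a log scale. The chunk below is a minimal sketch of that idea and is not run here; it assumes a carbon column named `C` in `gs_cal` (see the `labs` comment in the chunk above), and it would replace the pH labels used in the rest of this notebook.
```{r, eval = FALSE}
# minimal sketch (not run): log-transform a right-skewed label before model fitting
labs <- c("C")                               ## assumed soil organic carbon column in gs_cal
lcal <- log(as.vector(t(gs_cal[labs])) + 1)  ## log(x + 1) transform of the calibration labels
# ... fit and stack the models exactly as in the tabs below, then back-transform any
# predictions to the original scale with, e.g.:
# c.pred <- exp(log.c.pred) - 1
```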
## Spatial trend
This is a simple spatially smoothed *generalized additive model*, applying the `gam` method to the pH values at the different sampling locations in Rwanda, based only on their georeference. It is similar to ordinary kriging with cross-validation, but simpler and much faster to compute in this context.
```{r}
# select locational covariates
gf_cpv <- gs_cal[,44:46]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(seed)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
gm <- train(gf_cpv, lcal,
method = "gam",
preProc = c("center","scale"),
metric = "RMSE",
trControl = tc)
gm.pred <- predict(grids, gm) ## spatial predictions
stopCluster(mc)
fname <- paste("./PSM/Results/", labs, "_gm.rds", sep = "")
saveRDS(gm, fname)
```
## Central places
Central places capture where human impacts on soils are concentrated. They are correlated with both the extraction and deposition of soil nutrients and toxic elements, soil erosion and deposition, acidification and many other soil disturbance processes. The model below focuses on central place indicators such as distances to roads and settlements, surface water sources, cell towers, night lights, and electricity networks, among other anthropic factors (a short sketch for inspecting the retained covariates follows the chunk).
```{r, results='hide'}
# select central place covariates
gf_cpv <- gs_cal[c(30:43,62)]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(seed)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
cp <- train(gf_cpv, lcal,
method = "glmStepAIC",
preProc = c("center","scale"),
trControl = tc,
metric = "RMSE")
cp.pred <- predict(grids, cp) ## spatial predictions
stopCluster(mc)
fname <- paste("./PSM/Results/", labs, "_cp.rds", sep = "")
saveRDS(cp, fname)
```
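If you want to see which central place covariates the stepwise GLM actually retained, and how they rank, the sketch below uses the fitted `caret` object from the chunk above. These are standard `caret`/`stats` calls, nothing specific to this dataset, and the chunk is not evaluated here.
```{r, eval = FALSE}
# minimal sketch (not run): inspect the retained central place covariates
summary(cp$finalModel)   ## coefficients of the covariates kept by stepwise AIC selection
plot(varImp(cp))         ## caret variable importance plot for the fitted GLM
```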
## Ensemble MLAs
This chunk fits 4 additional models that use the gridded calibration data with 10-fold cross-validation. You can learn more about how these algorithms work by following the links at: [MASS](https://cran.r-project.org/web/packages/MASS/index.html), [randomForest](https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest), [xgboost](https://www.kaggle.com/rtatman/machine-learning-with-xgboost-in-r/) and [Cubist](https://www.rdocumentation.org/packages/Cubist/versions/0.3.0/topics/cubist.default).

You can use `caretEnsemble` instead of `caret` as long as the feature variables (`grids` in this case) and the `trainControl` methods are the same for each model in the `caretList` function. This shortens the script length of this notebook but does not otherwise affect the overall `caret` functionality. Note however that the calculations take a bit of time to run on a normal 8-core, 16 GB memory computer. This is not a big problem for a relatively small ROI like Rwanda, but it might be computationally challenging for larger countries like the DRC, Tanzania or Ethiopia. I fit these models with 10-fold cross-validation and default tuning of the relevant [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)); a sketch of explicit tuning follows the chunk below.
```{r, warning = FALSE, results='hide'}
# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")
# Fit 4 additional calibration models using all of the gridded features
clist <- caretList(fcal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("glmStepAIC", "rf", "xgbTree", "cubist"),
preProcess = c("center","scale"),
metric = "RMSE")
gl.pred <- predict(grids, clist$glmStepAIC)
rf.pred <- predict(grids, clist$rf)
xt.pred <- predict(grids, clist$xgbTree)
cu.pred <- predict(grids, clist$cubist)
stopCluster(mc)
fname <- paste("./PSM/Results/", labs, "_clist.rds", sep = "")
saveRDS(clist, fname)
```
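If default tuning is not sufficient for a given soil property, `caretList` also accepts explicit tuning grids through its `tuneList` argument. The chunk below is a minimal sketch of that option and is not run here; the grid values are illustrative only and would need to be adapted to the label being modeled.
```{r, eval = FALSE}
# minimal sketch (not run): explicit hyperparameter grids via caretEnsemble::caretModelSpec
tlist <- list(
  rf = caretModelSpec(method = "rf",
                      tuneGrid = expand.grid(mtry = c(5, 10, 20))),
  cubist = caretModelSpec(method = "cubist",
                          tuneGrid = expand.grid(committees = c(5, 10, 20),
                                                 neighbors = c(0, 5, 9)))
)
clist_tuned <- caretList(fcal, lcal,
                         trControl = tc,
                         tuneList = tlist,
                         preProcess = c("center", "scale"),
                         metric = "RMSE")
```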
## Stacked predictions
The main point here is not to select a *best individual model* but rather to evaluate the combination of the 6 previously fitted models against a 20% [hold-out](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) validation dataset. This provides robust statistical estimates of how the different models should be weighted against one another (a quick check of each model's hold-out skill follows the next chunk).
```{r}
# Stacking setup
preds <- stack(gm.pred, cp.pred, gl.pred, rf.pred, xt.pred, cu.pred)
names(preds) <- c("gm","cp","gl","rf","xt","cu")
# Extract model predictions
coordinates(gs_val) <- ~x+y
projection(gs_val) <- projection(preds)
gspred <- extract(preds, gs_val)
gspred <- as.data.frame(cbind(gs_val, gspred))
# Stacking model validation labels and features
gs_val <- as.data.frame(gs_val)
lval <- as.vector(t(gs_val[labs]))
fval <- gspred[,74:79] ## subset validation features
```
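Before fitting the stacking model, it can be useful to see how each of the 6 component models performs on the hold-out data on its own. The sketch below uses `caret`'s `postResample` to report RMSE, R-squared and MAE for each model; it assumes the extracted prediction columns retain the layer names `gm` through `cu` (otherwise index them as for `fval` above).
```{r, eval = FALSE}
# minimal sketch: hold-out skill (RMSE, R-squared, MAE) of each component model
round(sapply(c("gm", "cp", "gl", "rf", "xt", "cu"),
             function(m) postResample(pred = gspred[, m], obs = lval)), 3)
```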
This next chunk fits the stacked model with the `glmStepAIC` function from the `MASS` library using the **validation dataframe**. You could explore other options here, but I find that this provides a reasonable combination and weighting of the 6 models that were produced in the model training steps presented above.
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# model setup
set.seed(seed)
tc <- trainControl(method="repeatedcv", number=10, repeats=3, allowParallel=T)
st <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
st.pred <- predict(preds, st) ## spatial predictions
stopCluster(mc)
fname <- paste("./PSM/Results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)
```
```{r, echo = FALSE}
summary(st)
```
Topsoil pH generally takes on values between 3 and 9 in Rwanda, and there are extensive areas of very acid soils in the country. The chunk below generates the prediction map of pH across the RwaSIS cropland ROI.
```{r, results = 'hide'}
# Download cropland mask image (https://osf.io/b642y/)
# setwd("./PSM")
osf_retrieve_file("b642y") %>% osf_download(conflicts = "overwrite")
# Apply cropland mask
cpmask <- raster("RW_CP_mask.tif")
stm.pred <- mask(st.pred, cpmask, inverse=FALSE, maskvalue=NA, updatevalue=NA)
# Reproject stm.pred to geographic coordinates (longlat WGS84) for display with leaflet
stll <- projectRaster(stm.pred, crs="+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
```
```{r}
# Set color palette
pal <- colorBin("Reds", domain = 3:9, reverse = TRUE, na.color = "transparent")
# Render map
w <- leaflet() %>%
setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addRasterImage(stll, colors = pal, opacity = 0.5) %>%
addLegend(pal = pal, values = values(stll), title = "Topsoil pH")
w ## plot widget
```
## Geotif outputs
This chunk writes out the 7 prediction rasters as a multi-band GeoTIFF file that you can import into a GIS system of your choice for visualizations, reports and/or further processing (a short sketch for reloading it in R follows the chunk).
```{r}
# Write prediction grids
gspreds <- stack(preds, st.pred)
names(gspreds) <- c("gm","cp","gl","rf","xt","cu","st")
fname <- paste("./PSM/Results/","RW_", labs, "_preds_2020.tif", sep = "")
writeRaster(gspreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)
```
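If you later want to work with these outputs in R rather than a GIS, the sketch below reloads the multi-band file written above and restores the layer names; it assumes `fname` as defined in the chunk above.
```{r, eval = FALSE}
# minimal sketch: reload the multi-band GeoTIFF and restore the layer names
gspreds_in <- stack(fname)  ## fname as written by the chunk above
names(gspreds_in) <- c("gm", "cp", "gl", "rf", "xt", "cu", "st")
```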
## {-}
# Stacked prediction uncertainty estimates
There are many ways to quantify the uncertainty inherent in these predictions. We take a simple but quite robust approach here using quantile regression with the [quantreg](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf) package. We are mainly interested in the initial spread of the ROI-wide predictions (i.e., their 95% prediction intervals). Once new data are collected, we shall revise these estimates and will also point out some additional (more spatially explicit) techniques. A sketch for mapping the resulting 95% prediction interval follows the quantile regression output below.
```{r}
coordinates(gsdat) <- ~x+y
projection(gsdat) <- projection(grids)
gspre <- extract(gspreds, gsdat)
gsout <- as.data.frame(cbind(gsdat, gspre))
stQ <- rq(pH~st, tau=c(0.025,0.5,0.975), data=gsout) ## quantile regression fit
```
```{r, fig.align = "center", fig.cap = "Quantile regression plot of modeled vs measured topsoil (0-20 cm) pH in Rwanda's croplands. The blue lines are the 2.5% and 97.5% quantile regression estimates, the red line is the median (50%) estimate, and the black line is the 1:1 line."}
par(pty="s", mar=c(4,4,1,1))
plot(pH~st, xlab="Ensemble pH prediction", ylab="Measured pH (water)", cex.lab=1.3,
xlim=c(3,9), ylim=c(3,9), gsout)
curve(stQ$coefficients[2]*x+stQ$coefficients[1], add=T, from=3, to=9, col="blue", lwd=2)
curve(stQ$coefficients[4]*x+stQ$coefficients[3], add=T, from=3, to=9, col="red", lwd=2)
curve(stQ$coefficients[6]*x+stQ$coefficients[5], add=T, from=3, to=9, col="blue", lwd=2)
abline(c(0,1), col="black")
```
```{r, echo = FALSE}
print(stQ)
```
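To see how these intervals play out spatially, the sketch below applies the fitted 2.5% and 97.5% quantile regression lines to the masked, stacked prediction raster (`stm.pred` from the mapping section above) to produce lower-bound, upper-bound and interval-width surfaces. This is a simple first pass rather than a spatially explicit uncertainty model, and the chunk is not evaluated here.
```{r, eval = FALSE}
# minimal sketch (not run): map a robust 95% prediction interval for cropland topsoil pH
cf <- coef(stQ)                           ## 2 x 3 matrix: rows = (Intercept, st), cols = tau
pH_lo <- cf[1, 1] + cf[2, 1] * stm.pred   ## 2.5% bound
pH_hi <- cf[1, 3] + cf[2, 3] * stm.pred   ## 97.5% bound
pH_piw <- pH_hi - pH_lo                   ## prediction interval width (uncertainty surface)
writeRaster(stack(pH_lo, pH_hi, pH_piw),
            filename = "./PSM/Results/RW_pH_PI95_2020.tif",
            datatype = "FLT4S", options = "INTERLEAVE=BAND", overwrite = TRUE)
```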
# Main takeaways
* The machine learning workflow presented in this notebook produces precise and accurate predictions of topsoil pH values in Rwanda based on georeferenced legacy data. It can be flexibly updated as new data (e.g., via the [RwaSIS project](https://minagri.prod.risa.rw/updates/news-details/rwanda-launches-soil-information-service-project-to-improve-agriculture-productivity)) become available over the coming year.
* The workflow can be easily changed to model and predict the spatial distributions of other important soil properties, e.g., SOC, eCEC, P, K, Ca, Mg and S concentrations, among others. It can also be rapidly extended to support new geographical regions of interest and their respective soil mapping institutions, operationally and at low cost.
* As a final thought: should you come up with a better set of data sources, analysis suggestions, approaches, algorithms or predictions around these data, we would certainly like to hear about them! Any questions or comments are most welcome via [AFSIS](mailto:mgw.africasoils.info).