---
title: RwaSIS predictive soil mapping guidelines
author: M.G. Walsh, R. Manners, B.A. Walsh, and R.A. MacMillan
date: "`r format(Sys.time(), '%d, %B, %Y')`"
output:
  html_document:
    toc: true
    toc_depth: 1
    css: style.css
---
# Introduction
This notebook provides practical guidelines for predictive soil mapping. The specifics can, and of course should, be modified to fit the particular predictive soil mapping tasks of interest. Our main intent here is to provide a reproducible, generalized mapping framework and the associated computing workflows in [R](https://www.r-project.org/). The data and workflows should also be readily transferable to other computing environments if needed.
The first section of the notebook sets up georeferenced soil and remote sensing data of Rwanda for spatial analyses and predictions. We focus on *Croplands*, which are the primary [Region of Interest (ROI)](https://en.wikipedia.org/wiki/Region_of_interest) and the target for various land management interventions of the [RwaSIS project](https://minagri.prod.risa.rw/updates/news-details/rwanda-launches-soil-information-service-project-to-improve-agriculture-productivity). Based on a recent high-resolution [GeoSurvey (2019)](https://osf.io/w5e6c/), croplands are currently estimated to occupy ~68% of Rwanda's overall land area (of ~2.37 Mha). The gridded data offer several advantages, including Africa-wide coverage and consistency, and a robust analysis process that supports high-quality predictions for use in predictive soil mapping applications.
The subsequent sections illustrate prediction workflows that use ensembles of machine learning algorithms (MLAs; see the review articles at [Ensemble models](https://www.sciencedirect.com/topics/computer-science/ensemble-modeling)) with various remote sensing and GIS training inputs and validation layers. There are also some very good free-and-open books (e.g., [Lovelace et al., 2021](https://geocompr.robinlovelace.net/) and [Hengl & MacMillan, 2019](https://soilmapper.org/)) available if you'd like to explore the subject matter in greater depth.
While the [legacy soil data](https://en.wikipedia.org/wiki/Legacy_system) used in this example were not collected on a replicable geostatistical sampling frame, they may provide an initial indication of where specific soil problems or nutrient deficiencies/imbalances are prevalent in Rwanda. As new data become available over the course of the project, we will update the relevant sources and soil mapping workflows.
In the last section of the notebook, we produce initial (robust) uncertainty estimates of the soil property predictions generated by our ensemble model workflow via [quantile regression](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf). These estimates show where in Rwanda the current models fit the ground measurements reasonably well and where they do not. As new data are generated by RwaSIS (and others), these uncertainties are likely to decrease, potentially increasing the level of confidence in any resulting cropland management recommendations (e.g., for fertilizer, lime amendments and/or soil erosion control).
# General data setup
To run this notebook, you will need to load the packages indicated in the chunk directly below. These packages allow you to assemble the different example dataframes and rasters, and provide many options for generating spatial predictions via machine learning and/or geostatistical algorithms. We encourage you to explore the many options in this context!
```{r}
# Required packages
suppressPackageStartupMessages({
require(downloader)
require(rgdal)
require(raster)
require(DT)
require(leaflet)
require(htmlwidgets)
})
```
The following chunk downloads the data we have assembled for running this particular example. Specifically, it assembles georeferenced field observations of soil measurements (e.g., pH, soil carbon and soil nutrient composition) and links these to the remote sensing and GIS images represented by the `grids` raster stack. We focus on predicting topsoil pH measurements here, but the approach is applicable to any [georeferenced](https://en.wikipedia.org/wiki/Georeferencing) soil property or measurement under consideration.
```{r, message=FALSE}
# Data downloads -----------------------------------------------------------
# Create a data folder in your current working directory
dir.create("Wetchem", showWarnings=F)
setwd("./Wetchem")
# download soil data
download("https://osf.io/djcfz?raw=1", "RW_wetchem.zip", mode = "wb")
unzip("RW_wetchem.zip", overwrite = T)
prof <- read.table("Profiles.csv", header = T, sep = ",")
samp <- read.table("Samples.csv", header = T, sep = ",")
geos <- merge(prof, samp, by="pid")
# download and assemble raster stacks
download("https://osf.io/hp6v7?raw=1", "RW_250m_2020.zip", mode = "wb")
unzip("RW_250m_2020.zip", overwrite = T)
download("https://osf.io/u73pd?raw=1", "RW_GS_preds.zip", mode = "wb")
unzip("RW_GS_preds.zip", overwrite = T)
glist <- list.files(pattern="tif", full.names = T)
grids <- stack(glist)
# download figures
dir.create("Figures", showWarnings = F)
download("https://osf.io/42gry/", "./Figures/figures.zip", mode = "wb")
unzip("figures.zip", overwrite = T)
```
The processed Rwanda remote sensing and GIS data (the covariate `grids` in the chunk above) were derived and transformed from their primary open sources. You can download the entire `grids` raster stack and its description at [RwaSIS grids](https://osf.io/hp6v7/). Short descriptions of the covariates included, and their sources, are provided in the table immediately below.
\
```{r, echo=FALSE, results='asis'}
download("https://osf.io/e2x4s?raw=1", "Grid variables.csv", mode = "wb")
vars <- read.table("Grid variables.csv", header = T, sep = ",")
datatable(vars)
```
\
These Rwanda-wide (actually Africa-wide) covariates will change over time, and we will update them if and when needed. Also note that they are grouped by factor variables that designate cropland condition (*x*) as a function *f(x) ~ (a,c,o,r,s)*, where:
* a - anthropic variables
* c - climatic variables
* o - organismal (primarily vegetative/land cover related)
* r - relief/topographical/geographical variables
* s - soil related variables
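If the layer names in `grids` encode these prefixes (a hypothetical naming scheme for illustration; check `names(grids)` against the variable table above for the actual names), a thematic subset of covariates can be pulled out with a simple pattern match:

```r
# Select covariates by thematic prefix; the layer names below are
# hypothetical examples, not the actual names in the `grids` stack
covs <- c("a_dist_roads", "c_map", "c_mat", "o_ndvi", "r_slope", "s_clay")
clim <- grep("^c_", covs, value = TRUE)  # climate-only subset
clim  # "c_map" "c_mat"
```

The same `grep()` call applied to `names(grids)` should then let you subset the raster stack itself, e.g. with something like `subset(grids, grep("^c_", names(grids)))`.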
The next chunk sets up the properties in a dataframe that will generate the training and validation subsets for the different MLAs we shall apply to the spatial prediction of cropland properties. The soil property data presented here are courtesy of [CROPNUTS](https://cropnuts.com/).
```{r}
# Data setup --------------------------------------------------------------
# project legacy data coords to grid CRS
geos.proj <- as.data.frame(project(cbind(geos$lon, geos$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(geos.proj) <- c("x","y")
geos <- cbind(geos, geos.proj)
coordinates(geos) <- ~x+y
projection(geos) <- projection(grids)
# extract gridded variables at survey locations
geosgrid <- extract(grids, geos)
gsdat <- as.data.frame(cbind(geos, geosgrid))
str(gsdat) ## display the structure of the resulting dataframe
```
The following chunk then writes out the dataframe `RW_soil_data.csv` into your `./Results` directory if you'd like to process those outputs in software other than R.
```{r}
# Write data frame --------------------------------------------------------
dir.create("Results", showWarnings = F)
write.csv(gsdat, "./Results/RW_soil_data.csv", row.names = F)
# Soil sample locations ---------------------------------------------------
w <- leaflet() %>%
setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(gsdat$lon, gsdat$lat, clusterOptions = markerClusterOptions())
saveWidget(w, 'RW_soil_sample_locs.html', selfcontained = T) ## save widget
w ## plot widget
```
# Machine-learning-based predictive mapping
The following chunks predict topsoil pH values using different machine learning algorithms (MLAs) with varying remote sensing and GIS (covariate) inputs. The generalized MLA prediction workflow is shown in the figure below. This general approach has won many data science competitions, e.g., at [Kaggle](https://www.kaggle.com/), which also hosts some fantastic, openly available data science resources, courses and challenges.
```{r training_validation_approach, echo=FALSE, fig.align="center", fig.cap="MLA training, validation and prediction workflow.", out.width='80%'}
knitr::include_graphics("Figures/training_validation.png")
```
The main idea is to train a number of potentially contrasting models with [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). At the end of the model training processes, the various models are ensembled (combined/stacked) on an *independent* validation dataset. When applied over time and space, this is a form of [Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning), which should produce increasingly accurate predictions as new field, lab data and MLAs are obtained and run.
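The stacking step can be illustrated with a self-contained toy example (simulated numbers, not the RwaSIS data): predictions from two hypothetical base models are combined by a linear meta-model fitted against the observed values of a hold-out set, and the resulting ensemble is at least as accurate on that set as either base model alone.

```r
# Toy stacking sketch with simulated data standing in for a hold-out set
set.seed(1)
truth <- runif(100, 4, 8)          # e.g. observed pH-like values
m1 <- truth + rnorm(100, 0, 0.6)   # hypothetical base model 1 predictions
m2 <- truth + rnorm(100, 0, 0.9)   # hypothetical base model 2 predictions
stack_fit <- lm(truth ~ m1 + m2)   # linear meta-model (the "stack")
ens <- fitted(stack_fit)           # ensemble predictions
rmse <- function(o, p) sqrt(mean((o - p)^2))
c(m1 = rmse(truth, m1), m2 = rmse(truth, m2), ens = rmse(truth, ens))
```

In the workflow below the meta-model is `glmStepAIC` rather than a plain `lm`, but the principle is the same.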
To execute this next section of the notebook, you may need to install and load some additional R packages. You can always alter these and/or add other MLAs via the [caret](https://topepo.github.io/caret/index.html) package, as you see fit.
```{r}
# Required packages
# install.packages(c("devtools","caret","mgcv","MASS","randomForest","gbm","Cubist","plyr","doParallel","dismo"), dependencies=T)
suppressPackageStartupMessages({
require(devtools)
require(caret)
require(mgcv)
require(MASS)
require(randomForest)
require(gbm)
require(Cubist)
require(plyr)
require(doParallel)
})
```
The following chunk scrubs some of the objects in memory, removes incomplete cases (if there are any), sets up labels and features, and creates a randomized partition between the training and validation dataframes. Everything is parallelized to facilitate efficient use of either local or cloud-based computing resources. Note that there are other options available for this (e.g., [foreach](https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html), among others).
```{r}
# Data setup --------------------------------------------------------------
rm(list=setdiff(ls(), c("gsdat","grids","glist"))) ## scrub extraneous objects in memory
gsdat <- gsdat[complete.cases(gsdat[ ,c(24:77)]),] ## removes incomplete cases
# set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
# split data into calibration and validation sets
gsIndex <- createDataPartition(gsdat$pH, p = 4/5, list = F, times = 1)
gs_cal <- gsdat[ gsIndex,]
gs_val <- gsdat[-gsIndex,]
# Soil calibration labels
labs <- c("pH") ## insert other labels (e.g. "C","N","P","K" ...) here!
lcal <- as.vector(t(gs_cal[labs]))
# raster calibration features
fcal <- gs_cal[, c(24:48, 52:79)]
```
Note that we are using topsoil (0-20 cm) pH as an example in this context. You can substitute other soil properties (C, N, P, K, etc.) as labels in the `labs` (target) variable. You may want to transform those `labs` variables prior to model fitting, and perhaps also tune the respective models, to better represent the distributional and/or compositional attributes of the individual soil properties, should the need arise.
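A minimal sketch of such a transformation, using made-up values (`soc` is a hypothetical stand-in for a right-skewed property like soil organic carbon): fit on the log scale and back-transform the predictions with `exp()`.

```r
# Log-transform a skewed label before fitting; the values are invented
soc <- c(0.4, 0.7, 1.1, 1.8, 3.0, 5.2, 8.6)   # made-up SOC values (%)
x <- 1:7                                       # a stand-in covariate
fit <- lm(log(soc) ~ x)                        # train on the log scale
pred <- exp(predict(fit, data.frame(x = x)))   # back-transform to original units
```

Note that back-transforming a log-scale prediction slightly underestimates the arithmetic mean on the original scale; for careful work a bias correction (or a GLM with a log link) may be warranted.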
## Spatial trend model ([mgcv](https://cran.r-project.org/web/packages/mgcv/mgcv.pdf))
This is a simple spatially smoothed *generalized additive model* that applies the `gam` function to the pH values at the different sampling locations in Rwanda, based only on their georeference. It is similar to ordinary kriging with cross-validation, but simpler and much faster to compute in this context.
```{r, results='hide'}
# select locational covariates
gf_cpv <- gs_cal[,49:51]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
gm <- train(gf_cpv, lcal,
method = "gam",
preProc = c("center","scale"),
metric = "RMSE",
trControl = tc)
# model outputs & predictions
gm.pred <- predict(grids, gm) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_gm.rds", sep = "")
saveRDS(gm, fname)
```
## Central place model ([MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf))
Central place variables are influential wherever human impacts occur. They are correlated with both the extraction and deposition of soil nutrients and toxic elements, soil erosion and deposition, acidification, and many other soil disturbance processes. The model below focuses on central place indicators such as distances to roads, settlements, surface water sources, cell towers and electricity networks, among others.
```{r, results = 'hide'}
# select central place covariates
gf_cpv <- gs_cal[, c(35:48, 67)]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
gl1 <- train(gf_cpv, lcal,
method = "glmStepAIC",
preProc = c("center","scale"),
trControl = tc,
metric = "RMSE")
# model outputs & predictions
gl1.pred <- predict(grids, gl1) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_gl1.rds", sep = "")
saveRDS(gl1, fname)
```
## GLM with all spatial covariates ([MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf))
This model is very similar to the *Central place model* above, but it initially contains all of the covariates and then backward-selects from them to generate a prediction via a generalized linear model.
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
gl2 <- train(fcal, lcal,
method = "glmStepAIC",
preProc = c("center","scale"),
trControl = tc,
metric ="RMSE")
# model outputs & predictions
gl2.pred <- predict(grids, gl2) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_gl2.rds", sep = "")
saveRDS(gl2, fname)
```
## Random forest ([randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf))
The chunk below fits a *bagging* model using [Breiman & Cutler's](https://link.springer.com/article/10.1023/A:1010933404324) random forest algorithm with all of the covariate data. A good, short reference article in this context is [Barnard et al.](https://www.researchgate.net/publication/331328203_Can't_see_the_random_forest_for_the_decision_trees_selecting_predictive_models_for_restoration_ecology).
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
tg <- expand.grid(mtry = seq(5,10, by=1)) ## model tuning steps
# model training
rf <- train(fcal, lcal,
preProc = c("center","scale"),
method = "rf",
ntree = 501,
metric = "RMSE",
tuneGrid = tg,
trControl = tc)
# model outputs & predictions
print(rf) ## RMSEs across tuning parameters
rf.pred <- predict(grids, rf) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_rf.rds", sep = "")
saveRDS(rf, fname)
```
## Generalized boosting ([gbm](https://cran.r-project.org/web/packages/gbm/gbm.pdf))
This next chunk represents one of the *boosting* techniques that can be used for both regression and classification. It is similar to the `randomForest` model above, but boosting fits trees sequentially, with each new tree focusing on the cases that the previous trees predicted poorly, rather than averaging over independently grown trees as in *bagging*. There is a wide array of literature on these so-called *greedy algorithms*; very good descriptions are provided in [Hastie et al., 2009](https://web.stanford.edu/~hastie/ElemStatLearn/).
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
## for initial <gbm> tuning guidelines see @ https://stats.stackexchange.com/questions/25748/what-are-some-useful-guidelines-for-gbm-parameters
tg <- expand.grid(interaction.depth = seq(2,5, by=1), shrinkage = 0.01, n.trees = seq(101,501, by=50),
n.minobsinnode = 50) ## model tuning steps
# model training
gb <- train(fcal, lcal,
method = "gbm",
preProc = c("center", "scale"),
trControl = tc,
tuneGrid = tg,
metric = "RMSE")
# model outputs & predictions
print(gb) ## RMSEs across tuning parameters
gb.pred <- predict(grids, gb) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_gb.rds", sep = "")
saveRDS(gb, fname)
```
## Cubist ([Cubist](https://cran.r-project.org/web/packages/Cubist/Cubist.pdf))
Cubist is an extension of the [Quinlan, 1992](https://www.scirp.org/(S(i43dyn45teexjx455qlt3d2q))/reference/ReferencesPapers.aspx?ReferenceID=1865452) algorithm. The implementation in R can only be used for regression problems, but in those contexts it tends to be a strong contender (as it is here) in terms of prediction performance.
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(seed)
tc <- trainControl(method="repeatedcv", number=10, repeats=3, allowParallel = T)
# tg <- needs tuning
cu <- train(fcal, lcal,
method = "cubist",
trControl = tc,
metric = "RMSE")
print(cu)
cu.pred <- predict(grids, cu) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_cu.rds", sep = "")
saveRDS(cu, fname)
```
# Model ensemble predictions
The main point here is not to select a *best individual model*, but rather to evaluate the combination of the previously fitted models against a 20% [hold-out](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) validation dataset. This provides robust statistical estimates of how the different models should be weighted against one another.
```{r}
# Stacking setup ------------------------------------------------------------
preds <- stack(gm.pred, gl1.pred, gl2.pred, rf.pred, gb.pred, cu.pred)
names(preds) <- c("gm","gl1","gl2","rf","gb","cu")
# plot(preds, axes = F)
# extract model predictions
coordinates(gs_val) <- ~x+y
projection(gs_val) <- projection(preds)
gspred <- extract(preds, gs_val)
gspred <- as.data.frame(cbind(gs_val, gspred))
# stacking model validation labels and features
gs_val <- as.data.frame(gs_val)
lval <- as.vector(t(gs_val[labs]))
fval <- gspred[,80:85] ## subset validation features
```
This chunk fits the model ensemble with the `glmStepAIC` function from the `MASS` package. You could explore other options here, but we find that this provides a reasonable combination and weighting of the 6 models that were produced in the training steps.
```{r, results='hide'}
# Stacked model -------------------------------------------------------------
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# model setup
set.seed(seed)
tc <- trainControl(method="repeatedcv", number=10, repeats=3, allowParallel=T)
st <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
summary(st)
st.pred <- predict(preds, st) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)
plot(st.pred, axes=F)
```
## Write the prediction grids
This chunk writes out the 7 prediction rasters, which you can import as GeoTIFF files into any GIS system for visualization, reporting and/or further processing.
```{r}
# Write prediction grids --------------------------------------------------
gspreds <- stack(preds, st.pred)
names(gspreds) <- c("gm","gl1","gl2","rf","gb","cu","st")
fname <- paste("./Results/","RW_", labs, "_preds_2020.tif", sep = "")
writeRaster(gspreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)
```
# Ensemble prediction uncertainty estimates
There are numerous ways to quantify the uncertainty inherent in these predictions. We take a simple but quite robust approach here using quantile regression with the [quantreg](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf) package. We are mainly interested in the initial spread of the ROI-wide predictions (i.e., their 90% prediction intervals). Once new data are collected, we shall revise these estimates and also point out some additional (more spatially sensitive) techniques.
```{r, message=FALSE}
# Uncertainty estimates via quantile regression ---------------------------
# note that this is just an example for pH ... generalize & move to a separate script
require(quantreg)
coordinates(gsdat) <- ~x+y
projection(gsdat) <- projection(grids)
gspre <- extract(gspreds, gsdat)
gsout <- as.data.frame(cbind(gsdat, gspre))
# estimate & plot
par(pty="s")
par(mfrow=c(1,1), mar=c(5,5,1,1))
plot(pH~st, xlab="Ensemble pH prediction", ylab="Measured pH (water)", cex.lab=1.3,
xlim=c(3,9), ylim=c(3,9), gsout)
stQ <- rq(pH~st, tau=c(0.05,0.5,0.95), data=gsout)
print(stQ)
curve(stQ$coefficients[2]*x+stQ$coefficients[1], add=T, from=3, to=9, col="blue", lwd=2)
curve(stQ$coefficients[4]*x+stQ$coefficients[3], add=T, from=3, to=9, col="red", lwd=2)
curve(stQ$coefficients[6]*x+stQ$coefficients[5], add=T, from=3, to=9, col="blue", lwd=2)
```
As a final thought: should you come up with a better set of existing data sources, analysis suggestions, approaches, algorithms or predictions around these data, we would certainly like to hear about them! You can always contact us at [AFSIS](mgw.africasoils.info). We shall also provide more guidance around other soil property predictions.