---
title: RwaSIS predictive soil mapping guidelines
author: M.G. Walsh, R. Manners, B.A. Walsh, and R.A. MacMillan
date: "`r format(Sys.time(), '%d, %B, %Y')`"
output:
  html_document:
    toc: true
    toc_depth: 1
    css: style.css
---
# Introduction
This notebook provides practical guidelines for predictive soil mapping. The specifics can, and of course should, be modified to fit the particular predictive soil mapping tasks of interest. Our main intent here is to provide a reproducible, generalized mapping framework and the associated computing workflows in [R](https://www.r-project.org/). The data and workflows should also be readily transferable to other computing environments if needed.
The first section of the notebook sets up georeferenced soil and remote sensing data of Rwanda for spatial analyses and predictions. We focus on *Croplands*, which are the primary [Region of Interest (ROI)](https://en.wikipedia.org/wiki/Region_of_interest) and the target for various land management interventions of the [RwaSIS project](https://minagri.prod.risa.rw/updates/news-details/rwanda-launches-soil-information-service-project-to-improve-agriculture-productivity). Based on a recent high-resolution [GeoSurvey (2019)](https://osf.io/w5e6c/), croplands are currently estimated to occupy ~68% of Rwanda's overall land area (of ~2.37 Mha). The gridded data offer several advantages, including Africa-wide coverage and consistency, and a robust analysis process that supports high-quality predictions for use in predictive soil mapping applications.
The subsequent sections illustrate prediction workflows that use ensembles of machine learning algorithms (MLAs; see the review articles at [Ensemble models](https://www.sciencedirect.com/topics/computer-science/ensemble-modeling)) with various remote sensing and GIS training inputs and validation layers. There are also some very good free-and-open books (e.g., [Lovelace et al., 2021](https://geocompr.robinlovelace.net/) and [Hengl & MacMillan, 2019](https://soilmapper.org/)) available if you'd like to explore the subject matter in greater depth.
While the [legacy soil data](https://en.wikipedia.org/wiki/Legacy_system) used in this example were not collected on a replicable geostatistical sampling frame, they may provide an initial indication of where specific soil problems or nutrient deficiencies/imbalances are prevalent in Rwanda. As new data become available over the course of the project, we will update the relevant sources and soil mapping workflows.
In the last section of the notebook, we produce initial (robust) uncertainty estimates of the soil property predictions generated by our ensemble model workflow via [quantile regression](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf). These estimates show where in Rwanda the current models fit the ground measurements reasonably well and where they do not. As new data are generated by RwaSIS (and others), these uncertainties are likely to decrease, potentially increasing the level of confidence in any resulting cropland management recommendations (e.g., for fertilizer, lime amendments and/or soil erosion control).
# General data setup
To run this notebook, you will need to load the packages indicated in the chunk directly below. These packages allow you to assemble the different example dataframes and rasters, and provide many options for generating spatial predictions via machine learning and/or geostatistical algorithms. We encourage you to explore the many options in this context!
```{r}
# Required packages
suppressPackageStartupMessages({
require(downloader)
require(rgdal)
require(raster)
require(DT)
require(leaflet)
require(htmlwidgets)
})
```
The following chunk downloads the data we have assembled for running this particular example. Specifically, it assembles georeferenced field observations of soil measurements (e.g., pH, soil carbon and soil nutrient composition) and links these to the remote sensing and GIS images represented by the `grids` raster stack. We focus on predicting topsoil pH measurements here, but the approach is applicable to any [georeferenced](https://en.wikipedia.org/wiki/Georeferencing) soil property or measurement under consideration.
```{r, message=FALSE}
# Data downloads -----------------------------------------------------------
# Create a data folder in your current working directory
dir.create("Wetchem", showWarnings=F)
setwd("./Wetchem")
# download soil data
download("https://osf.io/djcfz?raw=1", "RW_wetchem.zip", mode = "wb")
unzip("RW_wetchem.zip", overwrite = T)
prof <- read.table("Profiles.csv", header = T, sep = ",")
samp <- read.table("Samples.csv", header = T, sep = ",")
geos <- merge(prof, samp, by="pid")
# download and assemble raster stacks
download("https://osf.io/hp6v7?raw=1", "RW_250m_2020.zip", mode = "wb")
unzip("RW_250m_2020.zip", overwrite = T)
download("https://osf.io/u73pd?raw=1", "RW_GS_preds.zip", mode = "wb")
unzip("RW_GS_preds.zip", overwrite = T)
glist <- list.files(pattern="tif", full.names = T)
grids <- stack(glist)
# download figures
dir.create("Figures", showWarnings = F)
download("https://osf.io/42gry/", "./Figures/figures.zip", mode = "wb")
unzip("figures.zip", overwrite = T)
```
The processed Rwanda remote sensing and GIS data (the covariate `grids` in the chunk above) were derived and transformed from their primary open sources. You can download the entire `grids` raster stack and its description at [RwaSIS grids](https://osf.io/hp6v7/). Short descriptions of the covariates included, and their sources, are provided in the table immediately below.
\
```{r, echo=FALSE, results='asis'}
download("https://osf.io/e2x4s?raw=1", "Grid variables.csv", mode = "wb")
vars <- read.table("Grid variables.csv", header = T, sep = ",")
datatable(vars)
```
\
These Rwanda-wide (actually Africa-wide) covariates will change over time, and we will update them if and when needed. Also note that they are grouped by factor variables that designate cropland condition (*x*) as a function *f(x) ~ (a,c,o,r,s)*, where:
* a - anthropic variables
* c - climatic variables
* o - organismal (primarily vegetative/land cover related)
* r - relief/topographical/geographical variables
* s - soil related variables
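If the layer names in `grids` encode these prefixes (a hypothetical naming scheme for illustration; check `names(grids)` against the variable table above for the actual names), a thematic subset of covariates can be pulled out with a simple pattern match:

```r
# Select covariates by thematic prefix; the layer names below are
# hypothetical examples, not the actual names in the `grids` stack
covs <- c("a_dist_roads", "c_map", "c_mat", "o_ndvi", "r_slope", "s_clay")
clim <- grep("^c_", covs, value = TRUE)  # climate-only subset
clim  # "c_map" "c_mat"
```

The same `grep()` call applied to `names(grids)` should then let you subset the raster stack itself, e.g. with something like `subset(grids, grep("^c_", names(grids)))`.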
The next chunk sets up the properties in a dataframe that will generate the training and validation subsets for the different MLAs we shall apply to the spatial prediction of cropland properties. The soil property data presented here are courtesy of [CROPNUTS](https://cropnuts.com/).
```{r}
# Data setup --------------------------------------------------------------
# project legacy data coords to grid CRS
geos.proj <- as.data.frame(project(cbind(geos$lon, geos$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(geos.proj) <- c("x","y")
geos <- cbind(geos, geos.proj)
coordinates(geos) <- ~x+y
projection(geos) <- projection(grids)
# extract gridded variables at survey locations
geosgrid <- extract(grids, geos)
gsdat <- as.data.frame(cbind(geos, geosgrid))
str(gsdat) ## display the structure of the resulting dataframe
```
The following chunk then writes out the dataframe `RW_soil_data.csv` into your `./Results` directory if you'd like to process those outputs in software other than R.
```{r}
# Write data frame --------------------------------------------------------
dir.create("Results", showWarnings = F)
write.csv(gsdat, "./Results/RW_soil_data.csv", row.names = F)
# Soil sample locations ---------------------------------------------------
w <- leaflet() %>%
setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(gsdat$lon, gsdat$lat, clusterOptions = markerClusterOptions())
saveWidget(w, 'RW_soil_sample_locs.html', selfcontained = T) ## save widget
w ## plot widget
```
# Machine-learning-based predictive mapping
The following chunks predict topsoil pH values using different machine learning algorithms (MLAs) with varying remote sensing and GIS (covariate) inputs. The generalized MLA prediction workflow is shown in the figure below. This general approach has won many data science competitions, e.g., at [Kaggle](https://www.kaggle.com/), which also hosts some fantastic, openly available data science resources, courses and challenges.
```{r training_validation_approach, echo=FALSE, fig.align="center", fig.cap="MLA training, validation and prediction workflow.", out.width='80%'}
knitr::include_graphics("Figures/training_validation.png")
```
The main idea is to train a number of potentially contrasting models with [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). At the end of the model training processes, the various models are ensembled (combined/stacked) on an *independent* validation dataset. When applied over time and space, this is a form of [Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning), which should produce increasingly accurate predictions as new field, lab data and MLAs are obtained and run.
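The stacking step can be illustrated with a self-contained toy example (simulated numbers, not the RwaSIS data): predictions from two hypothetical base models are combined by a linear meta-model fitted against the observed values of a hold-out set, and the resulting ensemble is at least as accurate on that set as either base model alone.

```r
# Toy stacking sketch with simulated data standing in for a hold-out set
set.seed(1)
truth <- runif(100, 4, 8)          # e.g. observed pH-like values
m1 <- truth + rnorm(100, 0, 0.6)   # hypothetical base model 1 predictions
m2 <- truth + rnorm(100, 0, 0.9)   # hypothetical base model 2 predictions
stack_fit <- lm(truth ~ m1 + m2)   # linear meta-model (the "stack")
ens <- fitted(stack_fit)           # ensemble predictions
rmse <- function(o, p) sqrt(mean((o - p)^2))
c(m1 = rmse(truth, m1), m2 = rmse(truth, m2), ens = rmse(truth, ens))
```

In the workflow below the meta-model is `glmStepAIC` rather than a plain `lm`, but the principle is the same.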
To execute this next section of the notebook, you may need to install and load some additional R packages. You can always alter these and/or add other MLAs via the [caret](https://topepo.github.io/caret/index.html) package, as you see fit.
```{r}
# Required packages
# install.packages(c("devtools","caret","mgcv","MASS","randomForest","gbm","Cubist","plyr","doParallel","dismo"), dependencies=T)
suppressPackageStartupMessages({
require(devtools)
require(caret)
require(mgcv)
require(MASS)
require(randomForest)
require(gbm)
require(Cubist)
require(plyr)
require(doParallel)
})
```
The following chunk scrubs some of the objects in memory, removes incomplete cases (if there are any), sets up labels and features, and creates a randomized partition between the training and validation dataframes. Everything is parallelized to facilitate efficient use of either local or cloud-based computing resources. Note that there are other options available for this (e.g., [foreach](https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html), among others).
```{r}
# Data setup --------------------------------------------------------------
rm(list=setdiff(ls(), c("gsdat","grids","glist"))) ## scrub extraneous objects in memory
gsdat <- gsdat[complete.cases(gsdat[ ,c(24:77)]),] ## removes incomplete cases
# set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
# split data into calibration and validation sets
gsIndex <- createDataPartition(gsdat$pH, p = 4/5, list = F, times = 1)
gs_cal <- gsdat[ gsIndex,]
gs_val <- gsdat[-gsIndex,]
# Soil calibration labels
labs <- c("pH") ## insert other labels (e.g. "C","N","P","K" ...) here!
lcal <- as.vector(t(gs_cal[labs]))
# raster calibration features
fcal <- gs_cal[, c(24:48, 52:79)]
```
Note that we are using topsoil (0-20 cm) pH as an example in this context. You can substitute other soil properties (C, N, P, K, etc.) as labels in the `labs` (target) variable. You may want to transform those `labs` variables prior to model fitting, and perhaps also tune the respective models, to better represent the distributional and/or compositional attributes of the individual soil properties, should the need arise.
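A minimal sketch of such a transformation, using made-up values (`soc` is a hypothetical stand-in for a right-skewed property like soil organic carbon): fit on the log scale and back-transform the predictions with `exp()`.

```r
# Log-transform a skewed label before fitting; the values are invented
soc <- c(0.4, 0.7, 1.1, 1.8, 3.0, 5.2, 8.6)   # made-up SOC values (%)
x <- 1:7                                       # a stand-in covariate
fit <- lm(log(soc) ~ x)                        # train on the log scale
pred <- exp(predict(fit, data.frame(x = x)))   # back-transform to original units
```

Note that back-transforming a log-scale prediction slightly underestimates the arithmetic mean on the original scale; for careful work a bias correction (or a GLM with a log link) may be warranted.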
## Spatial trend model ([mgcv](https://cran.r-project.org/web/packages/mgcv/mgcv.pdf))
This is a simple spatially smoothed *generalized additive model* that applies the `gam` function to the pH values at the different sampling locations in Rwanda, based only on their georeference. It is similar to ordinary kriging with cross-validation, but simpler and much faster to compute in this context.
```{r, results='hide'}
# select locational covariates
gf_cpv <- gs_cal[,49:51]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
gm <- train(gf_cpv, lcal,
method = "gam",
preProc = c("center","scale"),
metric = "RMSE",
trControl = tc)
# model outputs & predictions
gm.pred <- predict(grids, gm) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_gm.rds", sep = "")
saveRDS(gm, fname)
```
## Central place model ([MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf))
Central place variables are influential wherever human impacts occur. They are correlated with both the extraction and deposition of soil nutrients and toxic elements, soil erosion and deposition, acidification, and many other soil disturbance processes. The model below focuses on central place indicators such as distances to roads, settlements, surface water sources, cell towers and electricity networks, among others.
```{r, results = 'hide'}
# select central place covariates
gf_cpv <- gs_cal[, c(35:48, 67)]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
gl1 <- train(gf_cpv, lcal,
method = "glmStepAIC",
preProc = c("center","scale"),
trControl = tc,
metric = "RMSE")
# model outputs & predictions
gl1.pred <- predict(grids, gl1) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_gl1.rds", sep = "")
saveRDS(gl1, fname)
```
## GLM with all spatial covariates ([MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf))
This model is very similar to the *Central place model* above, but it initially contains all of the covariates and then backward-selects from them to generate a prediction via a generalized linear model.
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
gl2 <- train(fcal, lcal,
method = "glmStepAIC",
preProc = c("center","scale"),
trControl = tc,
metric ="RMSE")
# model outputs & predictions
gl2.pred <- predict(grids, gl2) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_gl2.rds", sep = "")
saveRDS(gl2, fname)
```
## Random forest ([randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf))
The chunk below fits a *bagging* model using [Breiman & Cutler's](https://link.springer.com/article/10.1023/A:1010933404324) random forest algorithm with all of the covariate data. A good, short reference article in this context is [Barnard et al.](https://www.researchgate.net/publication/331328203_Can't_see_the_random_forest_for_the_decision_trees_selecting_predictive_models_for_restoration_ecology).
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
tg <- expand.grid(mtry = seq(5,10, by=1)) ## model tuning steps
# model training
rf <- train(fcal, lcal,
preProc = c("center","scale"),
method = "rf",
ntree = 501,
metric = "RMSE",
tuneGrid = tg,
trControl = tc)
# model outputs & predictions
print(rf) ## RMSEs across tuning parameters
rf.pred <- predict(grids, rf) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_rf.rds", sep = "")
saveRDS(rf, fname)
```
## Generalized boosting ([gbm](https://cran.r-project.org/web/packages/gbm/gbm.pdf))
This next chunk represents one of the *boosting* techniques that can be used for both regression and classification. It is similar to the `randomForest` model above, but boosting fits trees sequentially, with each new tree focusing on the cases that the previous trees predicted poorly, rather than averaging over independently grown trees as in *bagging*. There is a wide array of literature on these so-called *greedy algorithms*; very good descriptions are provided in [Hastie et al., 2009](https://web.stanford.edu/~hastie/ElemStatLearn/).
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", allowParallel = T)
## for initial <gbm> tuning guidelines see @ https://stats.stackexchange.com/questions/25748/what-are-some-useful-guidelines-for-gbm-parameters
tg <- expand.grid(interaction.depth = seq(2,5, by=1), shrinkage = 0.01, n.trees = seq(101,501, by=50),
n.minobsinnode = 50) ## model tuning steps
# model training
gb <- train(fcal, lcal,
method = "gbm",
preProc = c("center", "scale"),
trControl = tc,
tuneGrid = tg,
metric = "RMSE")
# model outputs & predictions
print(gb) ## RMSEs across tuning parameters
gb.pred <- predict(grids, gb) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_gb.rds", sep = "")
saveRDS(gb, fname)
```
## Cubist ([Cubist](https://cran.r-project.org/web/packages/Cubist/Cubist.pdf))
Cubist is an extension of the [Quinlan, 1992](https://www.scirp.org/(S(i43dyn45teexjx455qlt3d2q))/reference/ReferencesPapers.aspx?ReferenceID=1865452) algorithm. The implementation in R can only be used for regression problems, but in those contexts it tends to be a strong contender (as it is here) in terms of prediction performance.
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(seed)
tc <- trainControl(method="repeatedcv", number=10, repeats=3, allowParallel = T)
# tg <- needs tuning
cu <- train(fcal, lcal,
method = "cubist",
trControl = tc,
metric = "RMSE")
print(cu)
cu.pred <- predict(grids, cu) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_cu.rds", sep = "")
saveRDS(cu, fname)
```
# Model ensemble predictions
The main point here is not to select a *best individual model*, but rather to evaluate the combination of the previously fitted models against a 20% [hold-out](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) validation dataset. This provides robust statistical estimates of how the different models should be weighted against one another.
```{r}
# Stacking setup ------------------------------------------------------------
preds <- stack(gm.pred, gl1.pred, gl2.pred, rf.pred, gb.pred, cu.pred)
names(preds) <- c("gm","gl1","gl2","rf","gb","cu")
# plot(preds, axes = F)
# extract model predictions
coordinates(gs_val) <- ~x+y
projection(gs_val) <- projection(preds)
gspred <- extract(preds, gs_val)
gspred <- as.data.frame(cbind(gs_val, gspred))
# stacking model validation labels and features
gs_val <- as.data.frame(gs_val)
lval <- as.vector(t(gs_val[labs]))
fval <- gspred[,80:85] ## subset validation features
```
This chunk fits the model ensemble with the `glmStepAIC` function from the `MASS` package. You could explore other options here, but we find that this provides a reasonable combination and weighting of the 6 models that were produced in the training steps.
```{r, results='hide'}
# Stacked model -------------------------------------------------------------
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# model setup
set.seed(seed)
tc <- trainControl(method="repeatedcv", number=10, repeats=3, allowParallel=T)
st <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
summary(st)
st.pred <- predict(preds, st) ## spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)
plot(st.pred, axes=F)
```
## Write the prediction grids
This chunk writes out the 7 prediction rasters, which you can import as GeoTIFF files into any GIS system for visualization, reporting and/or further processing.
```{r}
# Write prediction grids --------------------------------------------------
gspreds <- stack(preds, st.pred)
names(gspreds) <- c("gm","gl1","gl2","rf","gb","cu","st")
fname <- paste("./Results/","RW_", labs, "_preds_2020.tif", sep = "")
writeRaster(gspreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)
```
# Ensemble prediction uncertainty estimates
There are numerous ways to quantify the uncertainty inherent in these predictions. We take a simple but quite robust approach here using quantile regression with the [quantreg](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf) package. We are mainly interested in the initial spread of the ROI-wide predictions (i.e., their 90% prediction intervals). Once new data are collected, we shall revise these estimates and also point out some additional (more spatially sensitive) techniques.
```{r, message=FALSE}
# Uncertainty estimates via quantile regression ---------------------------
# note that this is just an example for pH ... generalize & move to a separate script
require(quantreg)
coordinates(gsdat) <- ~x+y
projection(gsdat) <- projection(grids)
gspre <- extract(gspreds, gsdat)
gsout <- as.data.frame(cbind(gsdat, gspre))
# estimate & plot
par(pty="s")
par(mfrow=c(1,1), mar=c(5,5,1,1))
plot(pH~st, xlab="Ensemble pH prediction", ylab="Measured pH (water)", cex.lab=1.3,
xlim=c(3,9), ylim=c(3,9), gsout)
stQ <- rq(pH~st, tau=c(0.05,0.5,0.95), data=gsout)
print(stQ)
curve(stQ$coefficients[2]*x+stQ$coefficients[1], add=T, from=3, to=9, col="blue", lwd=2)
curve(stQ$coefficients[4]*x+stQ$coefficients[3], add=T, from=3, to=9, col="red", lwd=2)
curve(stQ$coefficients[6]*x+stQ$coefficients[5], add=T, from=3, to=9, col="blue", lwd=2)
```
As a final thought: should you come up with a better set of existing data sources, analysis suggestions, approaches, algorithms or predictions around these data, we would certainly like to hear about them! You can always contact us at [AFSIS](mgw.africasoils.info). We shall also provide more guidance around other soil property predictions.