-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathZn_preds.Rmd
619 lines (468 loc) · 37.7 KB
/
Zn_preds.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
---
title: Chemometric workflows for predicting zinc levels in maize grain
author: M.G. Walsh, P.C. Nalivata, D. Gashu, S. Gameda, E.K. Towett, S.P. MacGrath, R.M. Lark and M.R. Broadley
date: "Last compiled on `r format(Sys.time(), '%d, %B, %Y')`"
output:
html_document:
toc: true
toc_depth: 1
fig_caption: true
css: style.css
---
```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
# Introduction
Micronutrients i.e., the essential vitamins and minerals, are vital for healthy child development as well as for the prevention of morbidity and disease in adults. With the exception of vitamin D, micronutrients are not produced in the body and must be obtained from the diet. Although people only need small amounts of micronutrients, consuming the recommended amounts is important, as *micronutrient deficiencies* (MNDs) can have severe consequences on both health and well-being. However, national surveys containing dietary micronutrient intake data are not routinely collected in most developing countries because they are expensive and time consuming to obtain, both in the field and the laboratory. Also, the databases that are needed to analyze dietary intakes should be geographically explicit to be effective, but they are often completely lacking and/or not up-to-date in Africa. Similarly, agronomic interventions that focus on plant micronutrients for improving cropland and livestock productivity are inadequately supported by spatial and temporal data.
Zinc (Zn) deficiency is among the 5 most widespread MNDs (including iron, vitamin A, iodine, folate, in addition to zinc) that are of global concern ([Bailey et al., 2015](https://doi.org/10.1159/000371618)). Diagnosing zinc deficiency in maize grain is the focus of this notebook. Zn is a trace element found in varying concentrations in all soils, plants and animals. It is essential for the normal growth of higher plants and animals including humans. While Zn is needed in only small quantities, if those amounts are inadequate, plants and/or animals will suffer from physiological stress brought about by the dysfunction of several enzyme systems and other metabolic functions in which Zn plays a key role ([Wessels and Brown, 2012](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0050568)). In Malawi, Zn deficiency prevalence rates of ~60% in people have been reported based on a national survey of serum [biomarker](https://en.wikipedia.org/wiki/Biomarker) Zn concentrations ([Likoswe et al., 2020](https://doi.org/10.3390/nu12061563)). The prevalence may be even higher in rural areas ([Siyame et al., 2013](https://doi.org/10.1024/0300-9831/a000158)). The main causes of Zn deficiencies in food crops are soil-related including: total soil Zn content, redox potential and pH, mineralogy, the activity of soil microorganisms, presence of other nutrients, and climate conditions, see e.g., ([Noulas et al., 2018](https://doi.org/10.1016/j.jtemb.2018.02.009)).
Spectral signatures of soils, plants, animal tissues, and materials generally are defined by their reflectance or absorbance as a function of wavelength in the electromagnetic spectrum. Under laboratory conditions, the signatures result from electronic transitions of atoms and vibrational stretching and bending of structural groups of atoms that form molecules or crystals. Spectroscopy has been shown to provide highly repeatable, rapid and low cost measurements of many different soil, plant and animal tissue properties in numerous studies. The amount of light absorbed by a sample can be measured with minimal sample preparation (primarily drying and fine grinding) across a wide range of ultra-violet (UV), visible (VIS), near (NIR) and mid-infrared (MIR) wavelengths to provide a unique spectral signature of a soil or plant sample. An individual MIR measurement can be performed in about 30 seconds, in contrast to more conventional soil and plant tests, which are typically slow, labor-intensive, expensive and/or use hazardous chemicals, which have to be handled carefully.
In this notebook, we shall explore if Zn concentrations in maize grain can be reliably predicted from cheap, fast, non-destructive and non-hazardous MIR spectral data (features), using [ICP-MS](https://en.wikipedia.org/wiki/Inductively_coupled_plasma_mass_spectrometry) measurements as reference data. The analytical reference methods are described in ([Gashu et al., 2021](https://doi.org/10.1038/s41586-021-03559-3)).
The two main questions posed are:
* Can Zn concentrations in soils be used to predict Zn concentrations / deficiencies in maize grain? While there should be strong bio-kinetic links between soils and plants, the differential uptake and bioavailability of soil Zn in food crop tissues is complex and is not well quantified or predicted.
* Can we use MIR spectra to predict Zn concentrations in maize grain reliably? If so, this would advance and enable localized Zn deficiency diagnostics and risk assessments that could potentially be carried out routinely, at low cost, across large geographical regions ([ROI](https://en.wikipedia.org/wiki/Region_of_interest)) and populations of interest.
The notebook is also intended for self-learning in [R](https://www.r-project.org/). It does not go into the details of the underlying [spectroscopy](https://en.wikipedia.org/wiki/Spectroscopy), which is widely covered in the literature, but instead it focuses on [Chemometrics](https://en.wikipedia.org/wiki/Chemometrics) and the associated computing workflows that are needed to generate useful predictions from a population of spectral signatures of (features) relative to their corresponding reference measurements (labels). Chemometric techniques are used extensively in [analytical chemistry](https://en.wikipedia.org/wiki/Analytical_chemistry) and [metabolomics](https://en.wikipedia.org/wiki/Metabolomics). In this particular example, we shall use maize grain Zn reference data, and MIR spectral (soil and crop) data that were collected as part of the ongoing [GeoNutrition project](http://www.geonutrition.com) in Malawi. The described workflows might also be transferable to diagnosing other MNDs in soils and crops and/or to other geographies via e.g., [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning) and/or other approaches ... tbd.
# General data setup
To actually run the notebook, you will need to load the packages indicated in the chunk directly below. This allows you to assemble the reference wet chemistry and spectral data frames, providing a lot of options that generate spectral predictions of Zn concentrations in maize grain and soils. The notebook itself is maintained on [Github](https://github.com/mgwalsh/Bioavailability/blob/master/Zn_preds.Rmd), and you can fork and modify it from there as you see fit. Also, the html version of this will skip certain repetitive sections, but these are actually included in the markdown script for reference.
```{r}
# Package names
packages <- c("osfr", "caret", "caretEnsemble", "MASS", "pls", "glmnet", "randomForest", "gbm", "xgboost", "Cubist", "quantreg", "leaflet", "htmlwidgets", "plyr", "dplyr", "doParallel")
# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
```
The next chunk downloads the data needed for running this example. It assembles the various soil and crop reference measurements collected across Malawi and links those to [Bruker HTS-XT](https://www.bruker.com/en/products-and-solutions/infrared-and-raman/ft-ir-routine-spectrometer/hts-xt-microplate-reader.html) spectra. Note that this chunk is Mac or Linux specific and so the directory structures would need to be changed slightly to run on Windows machines.
```{r}
# Create a data folder in your current working directory
dir.create("MW_Zn_data", showWarnings=F)
setwd("./MW_Zn_data")
dir.create("./Results")
# Data download
osf_retrieve_file("pdgk6") %>% osf_download(conflicts = "overwrite")
unzip("MW_Zn_data.zip", overwrite = T)
wchem <- read.table("MW_Zn_data.csv", header=T, sep=",")
sspec <- read.table("soil_MIR_spectra.csv", header=T, sep=",")
cspec <- read.table("plant_MIR_spectra.csv", header=T, sep=",")
# Merge the wet chemistry reference data frame with the corresponding spectra
sdata <- merge(wchem, sspec, by = "ssid") ## soil data frame
sdata <- sdata[complete.cases(sdata[ ,c(7:3594)]), ] ## removes incomplete cases
cdata <- merge(wchem, cspec, by = "csid") ## crop data frame
cdata <- cdata[complete.cases(cdata[ ,c(7:3594)]), ]
# Download figures
osf_retrieve_file("42gry") %>% osf_download(conflicts = "overwrite")
unzip("figures.zip", overwrite = T)
```
The following chunk then writes out initial data frames `MW_soil_Zn_data.csv` and `MW_crop_Zn_data.csv` into your `./MW_Zn_data/Results` directory if you'd like to process those outputs in software other than R. They are augmented with other variable and predictions in subsequent sections of the notebook. It also generates an overview map of where in Malawi those soil and plant samples were obtained. The spatial sampling frame that was used for collecting the samples is described in [Gashu et al., (2021)](https://doi.org/10.1038/s41586-021-03559-3).
```{r}
# Write out reference data frame
write.csv(sdata, "./MW_Zn_data/Results/MW_soil_Zn_data.csv", row.names = F)
write.csv(cdata, "./MW_Zn_data/Results/MW_crop_Zn_data.csv", row.names = F)
# Soil sample locations
w <- leaflet() %>%
setView(lng = mean(wchem$lon), lat = mean(wchem$lat), zoom = 7) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(wchem$lon, wchem$lat, clusterOptions = markerClusterOptions())
w ## plot widget
```
# Predicting maize grain Zn levels from soil properties with MLAs
This section introduces the main machine learning workflow of the notebook. It uses topsoil (0-20 cm) reference wet chemistry soil features to predict maize grain Zn concentrations. The heuristics that might be used for diagnosing Zn deficiencies in crop tissues with [gold-standard](https://en.wikipedia.org/wiki/Gold_standard_(test)) reference criteria and measurements for both soils and crops are operationally constrained. Soils are more easily collected and measure in an appropriate laboratory setting over the course of any given year. Whereas, the collection and analysis of grain samples is restricted to very short time periods during or immediately after harvest. Both are expensive and time consuming. So, gold standard laboratory measurements would probably be quite hard to achieve in most operational and micronutrient monitoring settings in Africa, given the associated expense, time constraints and potential lab delays. Nonetheless, they provide a baseline to compare to the spectral models that are developed in the subsequent sections.
```{r training_validation_approach, echo=FALSE, fig.align="center", fig.cap="MLA training, validation and prediction workflow.", out.width = '80%'}
knitr::include_graphics("./MW_Zn_data/training_validation.png")
```
We will be using an *algorithmic* (ML), rather than a *data modeling* based approach ([Breiman, 2001](http://staff.pubhealth.ku.dk/~tag/Teaching/share/material/Breiman-two-cultures.pdf)). See the figure directly above. To start the model fitting processes covered in this section, the next chunk scrubs some of the extraneous objects in memory, sets-up labels and features, and creates a randomized (80 / 20%) partition between the training and validation dataframes.
```{r}
rm(list=setdiff(ls(), c("sdata", "cdata"))) ## scrubs extraneous objects in memory
# Set randomization seed
seed <- 1235813
set.seed(seed)
# Split data into calibration and validation sets
gsIndex <- createDataPartition(sdata$Zn, p = 8/10, list=F, times = 1)
cal <- sdata[ gsIndex,]
val <- sdata[-gsIndex,]
# Set calibration labels
labs <- c("Zn") ## maize grain label
lcal <- as.vector(t(cal[labs]))
# Calibration features
wcal <- cal[ ,7:17]
```
## Model calibration with `caretEnsemble`
The soil features in this initial model are are wet chemistry based and include: pH, cation exchange capacity, soil organic carbon, metal oxalates and various soil Zn extractions see ([Gashu et al., 2021](https://doi.org/10.1038/s41586-021-03559-3)). We calibrate 3 models using the [`caretEnsemble`](https://cran.r-project.org/web/packages/caretEnsemble/index.html) package with [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) and default-tuning of the relevant [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)). Note that if you would like more control over the individual model fits, you can also use their respective `tuneList` arguments. Also note that all of the calculations can be (are) parallelized to make efficient use of either local, GPU or cloud-based computing resources.
```{r, results = 'hide'}
# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")
# Fit 3 calibration models using the soil wet chemistry features
wlist <- caretList(wcal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("rf", "gbm", "cubist"),
preProcess = c("center","scale"),
metric = "RMSE")
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_wlist.rds", sep = "")
saveRDS(wlist, fname)
```
## Model stacking with `caret`
The point here is not to evaluate a *best individual model* but rather to evaluate the combination of the previously fitted models against a 20% [hold-out](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) validation dataset. This provides robust statistical estimates of how the different models should be weighted against one-another. This next chunk fits the model ensemble with the `glmStepAIC` function from the `MASS` library using the *validation dataframe*. You should explore other options here, but the current stacking option provides a reasonable combination and weighting of the 3 models that were produced in the previous ensemble training steps.
```{r, results = 'hide'}
# Individual model predictions on the validation set
val$rf <- predict(wlist$rf, val)
val$gb <- predict(wlist$gbm, val)
val$cu <- predict(wlist$cubist, val)
# Set labels and features
lval <- as.vector(t(val[labs]))
fval <- val[ ,3595:3597] ## validation features
# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number = 10, repeats = 3, allowParallel=T)
sw <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
val$sw <- predict(sw, val) ## stacked predictions
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_sw.rds", sep = "")
saveRDS(sw, fname)
```
```{r, echo = FALSE}
summary(sw)
```
## Ensemble prediction uncertainty estimates
There are many ways to quantify the uncertainty inherent in these predictions. We take a simple but quite robust approach here using quantile regression ([`quantreg`](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf)). The main interest is in the overall spread of the ensemble predictions; that is, the 90% prediction intervals for the entire dataset.
```{r, results = 'hide'}
sdata$rf <- predict(wlist$rf, sdata)
sdata$gb <- predict(wlist$gbm, sdata)
sdata$cu <- predict(wlist$cubist, sdata)
sdata$sw <- predict(sw, sdata)
# Quantile regression fit on whole dataset
swQ <- rq(Zn~sw, tau=c(0.05,0.5,0.95), data = sdata)
```
```{r, echo = FALSE}
summary(swQ)
```
```{r, fig.align = "center", fig.cap = "Relationship between predicted (based on soil wet chemistry properties) and measured maize grain zinc (Zn) levels. The blue lines are the 5% and 95% quantile regression estimates. The red line is the regression at the median,"}
par(pty="s", mar=c(4,4,1,1))
plot(Zn~sw, xlab="Soil wet-chemistry ensemble grain Zn prediction (ppm)", ylab="Measured grain Zn (ppm)", cex.lab=1.2, xlim=c(10,50), ylim=c(10,50), sdata)
curve(swQ$coefficients[2]*x+swQ$coefficients[1], add=T, from=10, to=50, col="blue", lwd=2)
curve(swQ$coefficients[4]*x+swQ$coefficients[3], add=T, from=10, to=50, col="red", lwd=2)
curve(swQ$coefficients[6]*x+swQ$coefficients[5], add=T, from=10, to=50, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
```
The predictions are clearly offset, but we can adjust for that. Note that this does not distract from the notion that the wet chemistry soil properties do indeed appear to be quite useful for predicting maize grain Zn levels in Malawi, plus or minus some expected outlier noise. The next short chunk adjusts for the 50% quantile (median) bias and offset and is used to generate the associated figure below.
```{r}
# Offset adjustment
sdata$sw50 <- swQ$coefficients[4]*sdata$sw+swQ$coefficients[3] ## 50% quantile adjustment
# Refit quantile regression
sw50Q <- rq(Zn~sw50, tau=c(0.05,0.5,0.95), data = sdata)
```
```{r, echo = FALSE, fig.align = "center", fig.cap = "Relationship between bias and offset adjusted predictions from soil data and measured maize grain zinc (Zn) levels."}
par(pty="s", mar=c(4,4,1,1))
plot(Zn~sw50, xlab="Adjusted ensemble grain Zn prediction (ppm)", ylab="Measured grain Zn (ppm)", cex.lab=1.2,
xlim=c(10,50), ylim=c(10,50), sdata)
curve(sw50Q$coefficients[2]*x+sw50Q$coefficients[1], add=T, from=10, to=50, col="blue", lwd=2)
curve(sw50Q$coefficients[4]*x+sw50Q$coefficients[3], add=T, from=10, to=50, col="red", lwd=2)
curve(sw50Q$coefficients[6]*x+sw50Q$coefficients[5], add=T, from=10, to=50, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
```
**The takeaway from this section is that Zn levels in maize grain are reasonably well-predicted from *"gold-standard"*, wet chemistry soil properties.** The residual differences may be due external, environmental factors at the time of sample collection, laboratory measurement variability and/or the inability of the MLAs that were used for calibration and validation to capture the underlying bio-kinetics. The `randomForest` model was a clear winner here, but it is unusual for a single MLA to outperform all other contrasting MLAs i.e., generalized boosting and cubist in this case.
# Predicting maize grain Zn from soil and plant MIR spectra
One potentially problematic factor for many MLAs in chemometric applications is the ["Curse of dimensionality"](https://en.wikipedia.org/wiki/Curse_of_dimensionality). In this particlar case the *curse* manifests in the high dimensionality of the spectral data: *w* = 3,575 individual wavebands for both the soil and plant spectra. Many of those wavebands are strongly correlated, particularly those proximal to one another on the NIR / MIR reflectance / absorbance spectrum. These types of data pose generalization and prediction challenges for MLAs that involve e.g., [Bagging](https://en.wikipedia.org/wiki/Random_forest), [Boosting](https://en.wikipedia.org/wiki/Boosting_%28machine_learning%29), [Bayesian](https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-031219-041110) and/or [Deep learning](https://en.wikipedia.org/wiki/Deep_learning) approaches, and can lead to [overfitting](https://en.wikipedia.org/wiki/Overfitting) and its consequent model validation / generalization problems. Generally, the data would need to be dimensionally reduced and decorrelated to be reliably fit by these types of algorithms.
Other algorithms such as [Partial least squares regression (PLS)](https://en.wikipedia.org/wiki/Partial_least_squares_regression), [Ridge regression](https://en.wikipedia.org/wiki/Ridge_regression) and/or [Lasso regression](https://en.wikipedia.org/wiki/Lasso_(statistics)) handle the high-dimensional data and collinearities reasonably well in a linear context but are frequently not very good at predicting the quite common non-linear and/or threshold relationships between the spectral signatures and their reference measurements. This section of the notebook covers workflows for both types of MLAs and looks at how those might be usefully combined (stacked).
## Spectral feature conversions / pre-processing
There are some ways of mitigating MLA prediction trade-offs in the data wrangling and conversion steps of the workflow described here. The most common approach is to reduce the dimensionality and collinearity of the (spectral) data. There are a number of pre-processing techniques that can be applied in that context including: [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis), [Independent components analysis (ICA)](https://en.wikipedia.org/wiki/Independent_component_analysis) as well as other signal processing techniques such as [Non-negative matrix factorization (NMF)](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization). We use PCA here ... however, feel free to experiment. The next chunk appends the resulting principal component scores to the soil reference and spectral data frame, `sdata`.
```{r, results = 'hide'}
# Soil spectral principal components
sspec.pca <- prcomp(sdata[ ,19:3594], center=T, scale=T) ## centered and scaled soil spectral PCAs
spcas <- predict(sspec.pca, sdata)
spcas <- spcas[ ,1:20] ## save the first 20 components, which explain ~95% of the total soil spectral variability
fname <- paste("./MW_Zn_data/Results/", "spec_spcas.rds", sep = "")
saveRDS(sspec.pca, fname) ## saves soil spectral PCA model
# Merge files
sdata <- cbind(sdata, spcas)
write.csv(sdata, "./MW_Zn_data/Results/MW_soil_Zn_data.csv", row.names = F)
```
## Calibration and prediction with soil spectra
We initially calibrate 3 MLAs to the soil spectral principal components data with 10-fold cross-validation. Note that the calibration code pattern is similar to the wet-chemistry-based pattern that was generated above. We have just substituted the soil spectral principal components for the wet chemistry features but are running the same MLAs for comparison.
```{r, results = 'hide'}
# Set randomization seed
seed <- 1235813
set.seed(seed)
# Split data into calibration and validation sets
gsIndex <- createDataPartition(sdata$Zn, p = 8/10, list=F, times = 1)
cal <- sdata[ gsIndex,]
val <- sdata[-gsIndex,]
# Set calibration labels
labs <- c("Zn") ## maize grain label
lcal <- as.vector(t(cal[labs]))
# Calibration features
scal <- cal[ ,3600:3619] ## soil spectral PCAs
# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")
# Fit 3 calibration models using the spectral principal component features
slist <- caretList(scal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("rf", "gbm", "cubist"),
preProcess = c("center","scale"),
metric = "RMSE")
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_slist.rds", sep = "")
saveRDS(slist, fname)
```
We ensemble the 3 models as before using the fitted `slist` models on the validation data frame ...
```{r, results = 'hide'}
# Individual model predictions on the validation set
val$rf_slist <- predict(slist$rf, val)
val$gb_slist <- predict(slist$gbm, val)
val$cu_slist <- predict(slist$cubist, val)
# Set labels and features
lval <- as.vector(t(val[labs]))
fval <- val[ ,3620:3622] ## validation features
# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number = 10, repeats = 3, allowParallel=T)
ss <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
val$ss <- predict(ss, val) ## stacked predictions
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_sstack.rds", sep = "")
saveRDS(ss, fname)
```
```{r, echo = FALSE}
summary(ss)
```
This is the `slist` prediction plot below. Note that the intermediate chunks are not shown in this notebook; however, they are included in the markdown document should you wish to check.
```{r, echo = FALSE, results = 'hide'}
sdata$rf_slist <- predict(slist$rf, sdata)
sdata$gb_slist <- predict(slist$gbm, sdata)
sdata$cu_slist <- predict(slist$cubist, sdata)
sdata$ss_slist <- predict(ss, sdata)
# Quantile regression fit on whole dataset
ssQ <- rq(Zn~ss_slist, tau=c(0.05,0.5,0.95), data = sdata)
summary(ssQ)
```
```{r, echo = FALSE}
# Gain and offset adjustment
sdata$ss50 <- ssQ$coefficients[4]*sdata$ss_slist+ssQ$coefficients[3] ## 50% quantile adjustment
# Refit quantile regression
ss50Q <- rq(Zn~ss50, tau=c(0.05,0.5,0.95), data = sdata)
```
```{r, echo = FALSE, fig.align = "center", fig.cap = "Relationship between gain and offset adjusted spectral predictions and measured maize grain zinc (Zn) levels."}
par(pty="s", mar=c(4,4,1,1))
plot(Zn~ss50, xlab="Adjusted grain Zn prediction (ppm)", ylab="Measured grain Zn (ppm)", cex.lab=1.2,
xlim=c(10,50), ylim=c(10,50), sdata)
curve(ss50Q$coefficients[2]*x+ss50Q$coefficients[1], add=T, from=10, to=50, col="blue", lwd=2)
curve(ss50Q$coefficients[4]*x+ss50Q$coefficients[3], add=T, from=10, to=50, col="red", lwd=2)
curve(ss50Q$coefficients[6]*x+ss50Q$coefficients[5], add=T, from=10, to=50, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
```
We can now also compare how the spectral predictions perform relative to the much more expensive and time consuming soil wet chemistry based predictions. The figure below shows that relationship.
```{r, echo = FALSE, fig.align = "center", fig.cap = "Relationship between bias and offset adjusted wet chemistry and spectral predictions."}
par(pty="s", mar=c(4,4,1,1))
plot(sw50~ss50, xlab="Spectral prediction (ppm)", ylab="Wet chemistry prediction (ppm)", cex.lab=1.2,
xlim=c(10,50), ylim=c(10,50), sdata)
abline(c(0,1), col="grey", lwd=1)
```
**The takeaway from this section is that maize grain Zn levels are reliably predicted using soil MIR spectra and standard, quite fast MLAs.** In this particular case there was actually no need to pursue much slower and/or computationally more challenging algorithms such as `pls`, `glmnet` and/or `xgbLinear` that are often used for regulariztion and variable selection in the analyses of high-dimensional, chemometric datasets. A simple PCA data reduction step of the raw MIR spectra appears to have been quite sufficient for both ensemble model calibration and validation.
## Calibration and prediction with crop spectra
Again, we initially calibrate 3 MLAs to the soil spectral principal components data with 10-fold cross-validation. Note that the calibration and validation code patterns are similar to both the wet-chemistry based and the soil MIR spectral patterns that were generated above. We have just substituted the crop spectral principal components for the soil principal component features but are running the same MLAs for comparison. Because the script is very similar to that in the above sections, most of it is not shown in the notebook. But of course, you can find everything in the markdown document on [Github](https://github.com/mgwalsh/Bioavailability/blob/master/Zn_preds.Rmd), should you wish to explore more.
```{r, echo = FALSE, results = 'hide'}
# Crop spectral principal components
cspec.pca <- prcomp(sdata[ ,19:3594], center=T, scale=T) ## centered and scaled crop spectral PCAs
cpcas <- predict(cspec.pca, cdata)
cpcas <- cpcas[ ,1:20] ## save the first 20 components, which explain ~95% of the total crop spectral variability
fname <- paste("./MW_Zn_data/Results/", "spec_cpcas.rds", sep = "")
saveRDS(cspec.pca, fname) ## saves crop spectral PCA model
# Merge files
cdata <- cbind(cdata, cpcas)
write.csv(cdata, "./MW_Zn_data/Results/MW_crop_Zn_data.csv", row.names = F)
```
```{r, echo = FALSE, results = 'hide'}
# Set randomization seed
seed <- 1235813
set.seed(seed)
# Split data into calibration and validation sets
gsIndex <- createDataPartition(cdata$Zn, p = 8/10, list=F, times = 1)
cal <- cdata[ gsIndex,]
val <- cdata[-gsIndex,]
# Set calibration labels
labs <- c("Zn") ## maize grain label
lcal <- as.vector(t(cal[labs]))
# Calibration features
ccal <- cal[ ,3595:3614] ## crop spectral PCAs
# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")
# Fit 3 calibration models using the spectral principal component features
clist <- caretList(ccal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("rf", "gbm", "cubist"),
preProcess = c("center","scale"),
metric = "RMSE")
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_clist.rds", sep = "")
saveRDS(clist, fname)
```
```{r, results = 'hide'}
# Individual model predictions on the validation set
val$rf_clist <- predict(clist$rf, val)
val$gb_clist <- predict(clist$gbm, val)
val$cu_clist <- predict(clist$cubist, val)
# Set labels and features
lval <- as.vector(t(val[labs]))
fval <- val[ ,3615:3617] ## validation features
# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number = 10, repeats = 3, allowParallel=T)
sc <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
val$sc <- predict(sc, val) ## stacked predictions
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_cstack.rds", sep = "")
saveRDS(sc, fname)
```
```{r, echo = FALSE, results = 'hide'}
summary(sc)
```
```{r, echo = FALSE, results = 'hide'}
cdata$rf_clist <- predict(clist$rf, cdata)
cdata$gb_clist <- predict(clist$gbm, cdata)
cdata$cu_clist <- predict(clist$cubist, cdata)
cdata$sc_clist <- predict(sc, cdata)
# Quantile regression fit on whole dataset
scQ <- rq(Zn~sc_clist, tau=c(0.05,0.5,0.95), data = cdata)
summary(scQ)
```
```{r, echo = FALSE}
# Gain and offset adjustment
cdata$sc50 <- scQ$coefficients[4]*cdata$sc_clist+scQ$coefficients[3] ## 50% quantile adjustment
# Refit quantile regression
sc50Q <- rq(Zn~sc50, tau=c(0.05,0.5,0.95), data = cdata)
```
The initial grain Zn predictions using crop MIR spectra do not look as promising as those derived from the soil data. This is somewhat counterintuitive given that the spectra were measured on the grain itself. So, we will also try a variable regularization and selection approach here that fits 3 additional MLAs, [`pls`](https://www.rdocumentation.org/packages/mixOmics/versions/6.3.2/topics/pls), [`glmnet`](https://www.rdocumentation.org/packages/glmnet/versions/4.1-2/topics/glmnet) and [`xgbLinear`](https://cran.r-project.org/web/packages/xgboost/xgboost.pdf), and will then re-ensemble all of the available individual predictions. Note that this next calibration step can take up to ~1 hour to run on a normal computer with 8 cores and 16 Gb of RAM ... there are just a lot of spectral features to sort through. Again, you can find the script in the associated markdown document on [Github](https://github.com/mgwalsh/Bioavailability/blob/master/Zn_preds.Rmd).
```{r, echo = FALSE, results = 'hide'}
# Set randomization seed
seed <- 1235813
set.seed(seed)
# Split data into calibration and validation sets
gsIndex <- createDataPartition(cdata$Zn, p = 8/10, list=F, times = 1)
cal <- cdata[ gsIndex,]
val <- cdata[-gsIndex,]
# Set calibration labels
labs <- c("Zn") ## maize grain label
lcal <- as.vector(t(cal[labs]))
# Calibration features
ccal <- cal[ ,19:3594] ## crop spectral wavebands
# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")
# Fit 3 calibration models using the spectral principal component features
rlist <- caretList(ccal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("pls", "glmnet", "xgbLinear"),
preProcess = c("center","scale"),
metric = "RMSE")
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_rlist.rds", sep = "")
saveRDS(rlist, fname)
```
```{r, echo = FALSE, results = 'hide'}
fval <- select(val,19:3594) ## full spectral signatures
pval <- select(val,3595:3614) ## PCA scores
# Individual model predictions on the validation set
val$rf_clist <- predict(clist$rf, pval)
val$gb_clist <- predict(clist$gbm, pval)
val$cu_clist <- predict(clist$cubist, pval)
val$pl_rlist <- predict(rlist$pls, fval)
val$gl_rlist <- predict(rlist$glmnet, fval)
val$xl_rlist <- predict(rlist$xgbLinear, fval)
# Set labels and features
lval <- as.vector(t(val[labs]))
fval <- select(val,3615:3617,3620:3622) ## validation features
# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number = 10, repeats = 3, allowParallel=T)
sc <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_cstack.rds", sep = "")
saveRDS(sc, fname)
```
```{r, echo = FALSE}
summary(sc)
```
```{r}
# All feature stack
fval <- select(cdata,19:3594) ## full spectral signatures
pval <- select(cdata,3595:3614) ## PCA scores
# Individual model predictions on the validation set
cdata$rf_clist <- predict(clist$rf, pval)
cdata$gb_clist <- predict(clist$gbm, pval)
cdata$cu_clist <- predict(clist$cubist, pval)
cdata$pl_rlist <- predict(rlist$pls, fval)
cdata$gl_rlist <- predict(rlist$glmnet, fval)
cdata$xl_rlist <- predict(rlist$xgbLinear, fval)
cdata$sc <- predict(sc, cdata)
# Quantile regression fit on whole dataset
scQ <- rq(Zn~sc, tau=c(0.05,0.5,0.95), data = cdata)
summary(scQ)
```
```{r, echo = FALSE, results = 'hide'}
# Gain and offset adjustment
cdata$sc50 <- scQ$coefficients[4]*cdata$sc+scQ$coefficients[3] ## 50% quantile offset / bias adjustment
# Refit quantile regression
sc50Q <- rq(Zn~sc50, tau=c(0.05,0.5,0.95), data = cdata)
```
```{r, echo = FALSE, fig.align = "center", fig.cap = "Relationship between gain and offset adjusted spectral predictions and measured maize grain zinc (Zn) levels."}
par(pty="s", mar=c(4,4,1,1))
plot(Zn~sc50, xlab="Adjusted grain Zn prediction (ppm)", ylab="Measured grain Zn (ppm)", cex.lab=1.2,
xlim=c(10,50), ylim=c(10,50), cdata)
curve(sc50Q$coefficients[2]*x+sc50Q$coefficients[1], add=T, from=10, to=50, col="blue", lwd=2)
curve(sc50Q$coefficients[4]*x+sc50Q$coefficients[3], add=T, from=10, to=50, col="red", lwd=2)
curve(sc50Q$coefficients[6]*x+sc50Q$coefficients[5], add=T, from=10, to=50, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
```
**The takeaway from this section is that the measured Zn levels in maize grain are currently not well predicted by maize grain MIR spectra.** We are not sure about the causes of this but it does not seem to be an underlying problem of the MLAs that were used. One possibility could be that VIS / NIR spectra might actually be more useful in this context, but this is certainly an open question. Another possibility is that the data are simply misaligned in the database. For the time being we essentially appear to be just fitting noise here. Useful dignostics in this context would be to cross-check the crop spectra against other elements e.g., grain phosphorus, calcium, magnesium, iron levels etc.
# Takeaways
The main takeaways from this notebook are the following:
* Maize grain Zn levels are well predicted from both soil wet-chemistry and MIR data with standard MLAs. The main benefit of using MIR in this case is to increase lab throughputs and to reduce the costs of obtaining risk factor data for micronutrient deficiency monitoring. This might advance localized Zn deficiency diagnostics and risk assessments that could potentially be carried out routinely, at low cost, and across large geographical regions and populations of interest.
* Maize grain Zn levels could not be reliably predicted from the corresponding maize grain MIR spectra. This is a somewhat surprising and counterintuitive result that should be explored further, given the quite reasonable predictions that were obtained using the soil spectra. As a first check, we recommend that the data are examined again to ensure that the crop Zn labels are properly aligned with their corresponding MIR features in the database that was available to us.
* There is scope within the [GeoNutrition project](http://www.geonutrition.com) to also test this approach for other crop mineral micronutrient levels (e.g., calcium, iron, iodine and selenium) that might of concern in particular regions of interest and/or in other crops and crop tissues. Additional surveys of the geographical distributions of [Indicator plants](https://www.biologydiscussion.com/plants/plant-indicators-characteristics-type-and-physiological-changes/6970), other than maize, might also be useful for determining where specific micronutrient deficiencies occur in cropland landscapes.
* Computationally, all of the code presented in this notebook runs fairly fast and probably could be automated, such that when a (soil) MIR measurement is received in a laboratory, a maize grain Zn prediction could be issued in near-real time. Note that for other spectrometer setups the associated [calibration / validation transfer](https://journals.sagepub.com/doi/full/10.1177/0003702817736064) steps would need to be taken into account. This is also a good reason to properly curate any physical soil reference samples.
Any questions or comments about this notebook are most welcome via [AFSIS](mailto:mgw.africasoils.info).