-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathB01_ensembles.qmd
190 lines (134 loc) · 10.7 KB
/
B01_ensembles.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "Ensembles"
format: html
---
> There are no passengers on spaceship earth. We are all crew.
> ~ [Marshall McLuhan](https://en.wikipedia.org/wiki/Marshall_McLuhan)
Modeling starts with a collection of observations (presence and background for us!) and ends up with a collection of coeefficients that can be used with one or more formulas to make a predicition for the past, the present or the future. We are using modeling specifically to make habitat suitability maps for select species under two climate scenarios (RCP45 and RCP85) at two different times (2055 and 2075) in the future.
Models can be used in isolation, or they can be used in combinations, called ensembles. When working with ensembles, we are chosing to create two or more models. When we use that ensemble to make a prediction, each model produces its own prediction and then we combine them (usually something like the average.)
# Setup
As always, we start by running our setup function. Start RStudio/R, and relaod your project with the menu `File > Recent Projects`.
```{r setup}
source("setup.R")
```
# Load data - choose a month and sampling approach
Let's load what we need to build a model for August using the greedy sampling technique. We'll also need the model configuration (which is "g008"). And we'll need the covariate data.
```{r load_data}
model_input = read_model_input(scientificname = "Mola mola",
approach = "greedy",
mon = "Aug")
cfg = read_configuration(version = "g_Aug")
db = brickman_database()
depth = read_brickman(db |> filter(scenario == "STATIC", var == "depth"))
covars = read_brickman(db |> filter(scenario == "PRESENT", interval == "mon"))
```
Of course we need covariates for August only, for this we can use a function we prepared earlier, `prep_model_data()`.
```{r august_data}
variables = prep_model_data(model_input, month = "Aug",
covars = covars, depth = depth)
variables
```
# Set up subsampling
We shall set up spatial cross-folds for our data so that we can have the model run repeatedly (without a lot of effort). Spatial cross-folds are just subsamples of the input data set. They are easy to make, and easy to understand with a map.
```{r crossfold}
crossfolds <- spatial_block_cv(data = variables,
v = 3,
n = 5)
autoplot(crossfolds)
```
You can see that the domain is cut into a 5 x 5 grid with each grid cell assigned to one of three groups ("folds") randomly. Cross-folds can have varying numbers of points.
# Building the ensemble of models
Now we have all we need to build a model. We first make what is called a recipe... it's just a placeholder. Then we build a work flow where we specify the types of models we want along with the crossfold sampling.
## Make a recipe
A recipe at a bare minimum needs to know two things: what is that data it has to work with and what is the relationship among the variables. The latter is expressed as a formula, very similar to how we specify the formula of a line with `y = mx + b` or a parabola `y = ax^2 + bx + c`.
:::{.callout-note}
We often think of formulas as left-hand side (LHS) and right-hand side (RHS) equalities. And usually, the LHS is the outcome while the RHS is about the inputs. For our modeling, the outcome is to predict the across the entire domain. We can generalize the idea with the "is a function of" operator `~` (the tilde). For yhe classic formula for a line it like this... `y ~ x`.
Consider a situation where we have reduced all of the suitabke variables to `Sbtm`, `Tbtm`, `MLD` and`Xbtm`, which we have in a table along with a `class` variable.
In our case we have the outcome is an prediction of `class` it is a function of variables like `Sbtm`, `Tbtm`, `MLD`, `Xbtm`, *etc.* This formula would look like `y ~ Sbtm + Tbtm + MLD + Xbtm`. Unlike the specific equation for a line or parabola, we don't pretend to know what coefficients, powers and that sort of stuff looks like. We are just saying that `class` is a function of all of those variables (somehow).
In the case where table where the outcome (`class`) is a function of all other variables in the table, we have a nice short hand. `class ~ .` where the dot means "every other variable".
:::
Below we make a recipe which is like template. We provide the table of variables (with spatial info) - tehnically we need only need to provide one row as it is building a template, it doesn't store the data (yet). We also provide the formula which we read as "class as a funtion of everything else". The `recipe()` knows detects the geometry column and handles that specially.
```{r recipe}
aug_recipe = recipe(variables, formula = class ~ .)
aug_recipe
```
## Make a workflow set
Here's where we specify which model(s) we want, and that we want them to be treated as an ensemble rather than as a individual models.
```{r workflows}
models <-
# create the workflow_set
workflow_set(
preproc = list(default = aug_recipe),
models = list(
# the standard glm specs - no tuneable parameters available
glm = sdm_spec_glm(),
# rf specs with tuning
rf = sdm_spec_rf(),
# boosted tree model (gbm) specs with tuning
gbm = sdm_spec_boost_tree(),
# maxent specs with tuning
maxent = sdm_spec_maxent()
),
# make all combinations of preproc and models,
cross = TRUE ) |>
# tweak controls to store information needed later to create the ensemble
option_add(control = control_ensemble_grid())
```
We have chosen these 4 models types to be memebers of our ensemble because they are common models types used for species distribution models. We aren't limited to these, but we will use these for the tutorial.
When we print the `models` object we see that it, too, is a table, but some of it's columns are `list` types, and some have empty elements. That's because we still have not actually modeled anything... we are still in the building phase.
```{r print_models}
models
```
## Auto-tuning the ensemble models
Now we can allow the underlying software to run multiple times as it looks to optimize the different parameters that each model has (if the model has tuneable parameters). Here we ask for the software iterate over the `crossfolds` sampling set for each model in the ensemble. We then request the standard diagnostics metrics used in species distribution modeling:[ Boyce's Continuous Index](https://www.sciencedirect.com/science/article/abs/pii/S0304380002002004), [the area under the Receiver Operator Curve (auc)](https://link.springer.com/article/10.1023/A:1010920819831), and [True Skills Statistic (tss)](https://www.sciencedirect.com/science/article/pii/S1470160X22013036).
Depending upon the complexity of your set up this can take a while. We no longer are working with a template, but rather with full-blown datasets.
```{r tune}
models <- models |>
workflow_map("tune_grid",
resamples = crossfolds,
grid = 3,
metrics = sdm_metric_set(),
verbose = TRUE)
```
We get some messages as it processes, but it is hard for the novice to know what they mean, but at least they aren't error messages. But we can print the updated models and compare to what we had before.
```{r print_models_again}
models
```
We can see that there is more information in there than there was before. Note that the `result` for `default_glm` is labeled differently than the others; that's because that particular model doesn't have any tuneable parameters so the software couldn't tune anything.
## Ensemble model metrics
Let's plot the models in the ensemble, which will reveal some of the diagnostic metrics.
```{r plot_models}
autoplot(models)
```
Notice that there are three replicates for each model **except** for `logistic_reg` which is the `glm` that had no tuneable parameters. Also note that the order they are represented in each plot is established by the order of the first plot and then carried through to the others. We can probably agree by examining the left hand plot, that the `maxent` model has the highest Boyce Continuous Index which means it's rank is 1. Confusingly higher rank values means lower performance; maybe we should think of it a "first place".
**But what decides that Boyce Continuous Index should be allowed to establish the order? ** It's just alphabetical order! To be frank, the workflow rank is not all that meaningful.
At this point we can decide of we are satisfied with how the individual models perform with the crossfolds. If we found them lacking we could go back to add or remove models, adjust our sampling scheme, change the covariate selection or anything else we felt might improve the performance. We won't do that now, but we'll accept the tuned parameters "as is" and push ahead.
## Establish the ensemble
Now we come to point we make a leap and build the actual ensemble - this is the thing we shall be able to use to make predictions. We only need to specify which of the metrics to preferentially use when selecting the optimized parameters. Are you ready for this?
```{r build_ensemble}
ensemble <-
simple_ensemble() |> # make an empty ensemble
add_member(models, metric = "boyce_cont") # populate it
```
That's it - tahdah! We're done! We can plot it.
```{r plot_ensemble}
autoplot(ensemble)
```
Notice that we are no longer looking at the crossfold replicates but instead we are looking at summary of the individual models run on the complete data sets. Again, "Workflow Rank" in an arbitrary order. We can see that `maxent` performs well on all of the metrics, but Random Forests and Boosted Regression Trees are pretty good, too. The Generalize Linear Model seems to be the odd one out, which isn't to say that it doesn't work.
We can also pull out a table of these metric results.
```{r ensemble_table}
ensemble |>
collect_metrics()
```
Useful, but prettier as a plot!
# Saving an ensemble for later use
We can save the ensemble to load back in later, we just need to provide the version information.
```{r write_ensemble}
write_ensemble(ensemble, version = "g_Aug")
```
# Recap
We have built an ensemble of 4 models using a recipe and workflow. Models were partially designed using variables we selected in an earlier step. We allowed the model building process to "self-tune" by dividing the sample set into multi-folds which are orderly subsamples of our input data usually call "cross-folds". We then looked at some diagnostic metrics to help us understand model performance, before building an ensemble of models. We saved the ensemble of models for restoring later.
::: {.callout-note appearance="simple"}
# Coding Assignment
Prepare and save an ensemble for each month for both "greedy" and "conservative" sampling approaches.
:::