---
title: "StanCon 2018 Helsinki Intro Workshop"
author: ""
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 2
---
## Setup
```{r setup}
knitr::opts_chunk$set(
echo = TRUE,
dev = "png",
dpi = 150,
fig.align = "center",
comment = NA
)
library(rstan)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
# seed for R's pseudo-RNGs, not Stan's
set.seed(1123)
```
## The problem
### Background
Imagine that you are a statistician or data scientist working as an independent
contractor. One of your clients is a company that owns many residential buildings
throughout New York City. The property manager explains that they are concerned about the number
of cockroach complaints that they receive from their buildings. Previously,
the company offered monthly visits from a pest inspector as a solution to
this problem. While this is the default solution for many property managers in
NYC, the tenants are rarely home when the inspector visits, and so the manager
reasons that this is a relatively expensive solution that is currently not very
effective.
One alternative is to deploy long-term bait stations. In this
alternative, child- and pet-safe bait stations are installed throughout the
apartment building. Cockroaches obtain quick-acting poison from these stations
and distribute it throughout the colony. The manufacturer of these bait stations
provides some indication of the space-to-bait efficacy, but the manager suspects
that this guidance was not calculated with NYC roaches in mind. NYC roaches, the
manager rationalizes, have more hustle than traditional roaches; and NYC
buildings are built differently than other common residential buildings in the
US. This is particularly important as the unit cost for each bait station per
year is quite high.
### The goal
The manager wishes to employ your services to help them to find the optimal
number of roach bait stations they should place in each of their buildings in
order to minimize the number of cockroach complaints while also keeping
expenditure on pest control affordable.
A subset of the company's buildings have been randomly selected for an experiment:
* At the beginning of each month, a pest inspector randomly places a number of
bait stations throughout the building, without knowledge of the current
cockroach levels in the building.
* At the end of the month, the manager records
the total number of cockroach complaints in that building.
* The manager would like to determine the optimal number of traps ($\textrm{traps}$) that
balances the lost revenue ($R$) that complaints ($\textrm{complaints}$) generate
with the all-in cost of maintaining the traps ($\textrm{TC}$).
Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.
Formally, we are interested in finding
$$
\arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[R(\textrm{complaints}(\textrm{traps})) - \textrm{TC}(\textrm{traps})]
$$
The property manager would also, if possible, like to learn how these results
generalize to buildings they haven't treated so they can understand the
potential costs of pest control at buildings they are acquiring as well as for
the rest of their building portfolio.
As the property manager has complete control over the number of traps set, the
random variable contributing to this expectation is the number of complaints
given the number of traps. We will model the number of complaints as a function
of the number of traps.
## The data
The data provided to us is in a file called `pest_data.RDS`. Let's
load the data and see what the structure is:
```{r load-data}
pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
```
We have access to the following fields:
* `complaints`: Number of complaints per building per month
* `building_id`: The unique building identifier
* `traps`: The number of traps used per month per building
* `date`: The date on which the complaints were recorded
* `live_in_super`: An indicator for whether the building has a live-in super
* `age_of_building`: The age of the building
* `total_sq_foot`: The total square footage of the building
* `average_tenant_age`: The average age of the tenants per building
* `monthly_average_rent`: The average monthly rent per building
* `floors`: The number of floors per building
First, let's see how many buildings we have data for:
```{r describe-data}
N_buildings <- length(unique(pest_data$building_id))
N_buildings
```
And make some plots of the raw data:
```{r data-plots}
ggplot(pest_data, aes(x = complaints)) +
geom_bar()
ggplot(pest_data, aes(x = traps, y = complaints, color = live_in_super == TRUE)) +
geom_jitter()
```
```{r, data-plots-ts, fig.width = 6, fig.height = 8}
ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) +
geom_line(aes(linetype = "Number of complaints")) +
geom_point(color = "black") +
geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) +
facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) +
scale_x_date(name = "Month", date_labels = "%b") +
scale_y_continuous(name = "", limits = range(pest_data$complaints)) +
scale_linetype_discrete(name = "") +
scale_color_discrete(name = "Live-in super")
```
The first question we'll look at is just whether the number of complaints per
building per month is associated with the number of bait stations per building
per month, ignoring the temporal and across-building variation (we'll come back
to those sources of variation later in the document). That requires only two
variables, $\textrm{complaints}$ and $\textrm{traps}$. How should we model the
number of complaints?
## Bayesian workflow
See slides
## Modeling count data: Poisson distribution
We already know some rudimentary information about what we should expect. The
number of complaints over a month should be either zero or an integer. The
property manager tells us that it is possible but unlikely that the number of
complaints in a given month is zero. Occasionally there are a very large number
of complaints in a single month. A common way of modeling this sort of skewed,
lower-bounded count data is as a Poisson random variable. One concern about
modeling the outcome variable as Poisson is that the data may be
over-dispersed, but we'll start with the Poisson model and then check
whether over-dispersion is a problem by comparing our model's predictions
to the data.
### Model
Given that we have chosen a Poisson regression, we define the likelihood to be
the Poisson probability mass function for the number of complaints, with a rate
that depends on the number of bait stations placed in the building, denoted
below as `traps`. This model assumes that the mean and variance of the outcome
variable `complaints` (number of complaints) are the same. We'll investigate
whether this is a good assumption after we fit the
model.
For building $b = 1,\dots,10$ at time (month) $t = 1,\dots,12$, we have
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$
Let's encode this probability model in a Stan program.
### Writing our first Stan model
* Write `simple_poisson_regression.stan` together
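For reference, here is a minimal sketch of what this program might look like. The priors are illustrative assumptions (the version we write together may differ); `poisson_log(eta)` is Stan's log-parameterized Poisson, equivalent to `poisson(exp(eta))`.
```
data {
  int<lower=1> N;
  vector<lower=0>[N] traps;     // bait stations per building-month
  int<lower=0> complaints[N];   // outcome: complaints per building-month
}
parameters {
  real alpha;                   // intercept
  real beta;                    // coefficient on traps
}
model {
  // weakly informative priors; an assumption, not the only sensible choice
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  complaints ~ poisson_log(alpha + beta * traps);
}
generated quantities {
  int y_rep[N];                 // in-sample posterior predictive draws
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]);
}
```
The `y_rep` draws in `generated quantities` are what we will use for the posterior predictive checks below.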
### Making sure our code is right
However, before we fit the model to real data, we should check that our
model works well with simulated data. We'll simulate data according to the model
and then check that we can sufficiently recover the parameter values used
in the simulation.
* Write `simple_poisson_regression_dgp.stan` together
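As a sketch, the data-generating program might look like the following. Because we will run it with `algorithm = 'Fixed_param'`, nothing is estimated and all of the simulation happens in `generated quantities`; the distributions used to draw `alpha`, `beta`, and `traps` are illustrative assumptions.
```
data {
  int<lower=1> N;
  real<lower=0> mean_traps;     // average trap count to simulate around
}
model {
  // empty: we only simulate, we don't estimate anything
}
generated quantities {
  // draw "true" parameter values (illustrative choices)
  real alpha = normal_rng(log(4), 0.1);
  real beta = normal_rng(-0.25, 0.1);
  int traps[N];
  int complaints[N];
  for (n in 1:N) {
    traps[n] = poisson_rng(mean_traps);
    complaints[n] = poisson_log_rng(alpha + beta * traps[n]);
  }
}
```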
We can use the `stan()` function to compile and fit the model, but here we will
do the compilation and fitting in two stages to demonstrate what is really
happening under the hood.
First we will compile the Stan program (`simple_poisson_regression_dgp.stan`) that
will generate the fake data.
```{r , cache=TRUE, results="hide", message=FALSE}
comp_dgp_simple <- stan_model('stan_programs/simple_poisson_regression_dgp.stan')
```
Printing this object shows the Stan program:
```{r}
print(comp_dgp_simple)
```
Now we can simulate the data by calling the `sampling()` function.
```{r runpoissondgp}
fitted_model_dgp <- sampling(
comp_dgp_simple,
data = list(N = nrow(pest_data), mean_traps = mean(pest_data$traps)),
chains = 1,
iter = 1,
algorithm = 'Fixed_param',
seed = 123
)
# see http://mc-stan.org/rstan/articles/stanfit_objects.html for various
# ways of extracting the contents of the stanfit object
samps_dgp <- rstan::extract(fitted_model_dgp)
str(samps_dgp)
```
### Fit the model to the fake data
In order to pass the fake data to our Stan program using RStan, we need to
arrange the data into a named list. The names must match the names used
in the `data` block of the Stan program.
```{r}
stan_dat_fake <- list(
N = nrow(pest_data),
traps = samps_dgp$traps[1, ],
complaints = samps_dgp$complaints[1, ]
)
str(stan_dat_fake)
```
Now that we have the simulated data we fit the model to see if we can recover
the `alpha` and `beta` parameters used in the simulation.
```{r , cache=TRUE, results="hide", message=FALSE}
comp_model_P <- stan_model('stan_programs/simple_poisson_regression.stan')
fit_model_P <- sampling(comp_model_P, data = stan_dat_fake, seed = 123)
posterior_alpha_beta <- as.matrix(fit_model_P, pars = c('alpha','beta'))
head(posterior_alpha_beta)
```
### Assess parameter recovery
```{r}
true_alpha_beta <- c(samps_dgp$alpha, samps_dgp$beta)
mcmc_recover_hist(posterior_alpha_beta, true = true_alpha_beta)
```
We don't do a great job recovering the parameters here simply because we're
simulating so few observations that the posterior uncertainty remains rather
large, but it looks at least _plausible_ ($\alpha$ and $\beta$ are contained
within the histograms). If we did the simulation with many more observations the
parameters would be estimated much more precisely.
We should also check if the `y_rep` datasets (in-sample predictions) that we coded
in the `generated quantities` block are similar to the `y` (complaints) values
we conditioned on when fitting the model. (The **bayesplot**
package [vignettes](http://mc-stan.org/bayesplot/articles/graphical-ppcs.html)
are a good resource on this topic.)
Here is a plot of the density estimate of the observed data compared to
200 of the `y_rep` datasets:
```{r}
y_rep <- as.matrix(fit_model_P, pars = "y_rep")
ppc_dens_overlay(y = stan_dat_fake$complaints, yrep = y_rep[1:200, ])
```
In the plot above we have the kernel density estimate of the observed data ($y$,
thicker curve) and 200 simulated data sets ($y_{rep}$, thin curves) from the
posterior predictive distribution. If the model fits the data well, as it does
here, there is little difference between the observed dataset and the simulated
datasets.
Another plot we can make for count data is a rootogram. This is a plot of the
expected counts (continuous line) vs the observed counts (blue histogram). We
can see the model fits well because the observed histogram matches the expected
counts relatively well.
```{r}
ppc_rootogram(stan_dat_fake$complaints, yrep = y_rep)
```
### Fit with real data
To fit the model to the actual observed data we'll first create a list to pass
to Stan using the variables in the `pest_data` data frame:
```{r stan-data}
stan_dat_simple <- list(
N = nrow(pest_data),
complaints = pest_data$complaints,
traps = pest_data$traps
)
```
As we have already compiled the model, we can jump straight to sampling from it.
```{r fit_P_real_data, cache=TRUE}
fit_P_real_data <- sampling(comp_model_P, data = stan_dat_simple)
```
Then we can print the parameters. What do these tell us?
```{r results_simple_P}
print(fit_P_real_data, pars = c('alpha','beta'))
```
We can also plot the posterior distributions:
```{r hist_simple_P}
mcmc_hist(as.matrix(fit_P_real_data, pars = c('alpha','beta')))
```
As we expected, it appears the number of bait stations set in a building is
associated with the number of complaints about cockroaches that were made in the
following month. However, we still need to consider how well the model fits.
### Posterior predictive checking
```{r}
y_rep <- as.matrix(fit_P_real_data, pars = "y_rep")
```
```{r marginal_PPC}
ppc_dens_overlay(y = stan_dat_simple$complaints, y_rep[1:200,])
```
In contrast to when we fit the model to simulated data above, here the simulated
datasets are not as dispersed as the observed data and don't seem to capture the
rate of zeros in the observed data. The Poisson model may not be sufficient for
this data.
Let's explore this further by looking directly at the proportion of zeros in the
real data and predicted data.
```{r}
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero")
```
The plot above shows the observed proportion of zeros (thick vertical line) and
a histogram of the proportion of zeros in each of the simulated data sets. It is
clear that the model does not capture this feature of the data well at all.
Next, let's look at the standardized residuals of the observed data plotted against the average predicted number of complaints.
```{r}
mean_y_rep <- colMeans(y_rep)
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)
```
As you can see here, it looks as though we have more positive residuals than negative,
which indicates that the model tends to underestimate the number of complaints
that will be received.
The rootogram is another useful plot to compare the observed vs expected number
of complaints. This is a plot of the expected counts (continuous line) vs the
observed counts (blue histogram):
```{r}
ppc_rootogram(stan_dat_simple$complaints, yrep = y_rep)
```
If the model were fitting well, these would be relatively similar. However, in this
figure we can see that the number of complaints is underestimated when there are few
complaints, overestimated for medium numbers of complaints, and underestimated again
when there are a large number of complaints.
We can also view how the predicted number of complaints varies with the number
of traps. From this we can see that the model doesn't seem to fully capture the
data.
```{r}
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
```
Specifically, the model doesn't capture the tails of the observed data very
well.
## Expanding the model: multiple predictors
Modeling the relationship between complaints and bait stations is the simplest
model. We can expand the model, however, in a few ways that will be beneficial
for our client. Moreover, the manager has told us that they expect there are a
number of other reasons that one building might have more roach complaints than
another.
### Interpretability
Currently, our model's mean parameter is a rate of complaints per 30 days, but
we're modeling a process that occurs over an area as well as over time. We have
the square footage of each building, so if we add that information into the
model, we can interpret our parameters as a rate of complaints per square foot
per 30 days.
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq\_foot}_b\,\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$
The term $\text{sq\_foot}$ is called an exposure term. If we take its log, we can
move it into $\eta_{b,t}$:
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log\_sq\_foot}_b
\end{align*}
$$
A quick test shows us that there appears to be a relationship between the square
footage of the building and the number of complaints received:
```{r}
ggplot(pest_data, aes(x = log(total_sq_foot), y = log1p(complaints))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
```
Using the property manager's intuition, we include two extra pieces of
information we know about the building - the (log of the) square floor space and
whether there is a live-in super or not - into both the simulated and real data.
```{r}
stan_dat_simple$log_sq_foot <- log(pest_data$total_sq_foot/1e4)
stan_dat_simple$live_in_super <- pest_data$live_in_super
```
### Stan program for Poisson multiple regression
Now we need a new Stan model that uses multiple predictors.
* Write `multiple_poisson_regression.stan` together
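As a sketch (again with illustrative priors), the inferential program might extend the simple regression like this, with `log_sq_foot` entering as a fixed offset:
```
data {
  int<lower=1> N;
  vector<lower=0>[N] traps;
  vector[N] live_in_super;      // 0/1 indicator
  vector[N] log_sq_foot;        // exposure term, coefficient fixed at 1
  int<lower=0> complaints[N];
}
parameters {
  real alpha;
  real beta;
  real beta_super;
}
model {
  alpha ~ normal(log(4), 1);    // illustrative priors
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  complaints ~ poisson_log(alpha + beta * traps
                           + beta_super * live_in_super + log_sq_foot);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]
                               + beta_super * live_in_super[n] + log_sq_foot[n]);
}
```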
### Simulate fake data with multiple predictors
```{r compmultPDGP, cache=TRUE, results="hide", message=FALSE}
comp_dgp_multiple <- stan_model('stan_programs/multiple_poisson_regression_dgp.stan')
```
```{r runpoissondgp2}
fitted_model_dgp <-
sampling(
comp_dgp_multiple,
data = list(N = nrow(pest_data)),
chains = 1,
cores = 1,
iter = 1,
algorithm = 'Fixed_param',
seed = 123
)
samps_dgp <- rstan::extract(fitted_model_dgp)
```
Now pop that simulated data into a list ready for Stan.
```{r}
stan_dat_fake <- list(
N = nrow(pest_data),
log_sq_foot = samps_dgp$log_sq_foot[1, ],
live_in_super = samps_dgp$live_in_super[1, ],
traps = samps_dgp$traps[1, ],
complaints = samps_dgp$complaints[1, ]
)
```
And then compile and fit the model we wrote for the multiple regression.
```{r, cache=TRUE, message=FALSE, warning=FALSE}
comp_model_P_mult <- stan_model('stan_programs/multiple_poisson_regression.stan')
fit_model_P_mult <- sampling(comp_model_P_mult, data = stan_dat_fake, chains = 4, cores = 4)
```
Then compare these parameters to the true parameters:
```{r}
posterior_alpha_beta <- as.matrix(fit_model_P_mult, pars = c('alpha','beta','beta_super'))
true_alpha_beta <- c(samps_dgp$alpha,samps_dgp$beta,samps_dgp$beta_super)
mcmc_recover_hist(posterior_alpha_beta, true = true_alpha_beta)
```
We've recovered the parameters sufficiently well, so we've probably coded the
Stan program correctly and we're ready to fit the real data.
### Fit the real data
Now let's use the real data and explore the fit.
```{r fit_mult_P_real_dat}
fit_model_P_mult_real <- sampling(comp_model_P_mult, data = stan_dat_simple)
y_rep <- as.matrix(fit_model_P_mult_real, pars = "y_rep")
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])
```
This again looks like we haven't captured the smaller counts very well, nor
have we captured the larger counts.
```{r}
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero", binwidth = 0.01)
```
We're still severely underestimating the proportion of zeros in the data. Ideally
this vertical line would fall somewhere within the histogram.
We can also plot uncertainty intervals for the predicted complaints for different
numbers of traps.
```{r}
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
```
We can see that we've increased the tails a bit more at the larger numbers of traps
but we still have some large observed numbers of complaints that the model
would consider extremely unlikely events.
## Modeling count data with the Negative Binomial
When we considered modeling the data using a Poisson distribution, we saw that the
model didn't appear to fit the data as well as we would like. In particular, the
model underpredicted low and high numbers of complaints and overpredicted
medium numbers of complaints. This is one indication of over-dispersion, where
the variance is larger than the mean. A Poisson model doesn't fit over-dispersed
count data very well because the same parameter, $\lambda$, controls both the
expected counts and the variance of these counts. The natural alternative to
this is the negative binomial model:
$$
\begin{align*}
\text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log\_sq\_foot}_{b}
\end{align*}
$$
In Stan, the negative binomial mass function we'll use is called
`neg_binomial_2_log(ints y, reals eta, reals phi)`.
Like the `poisson_log` function, this negative binomial mass function
is parameterized in terms of its log-mean, $\eta$, but it also has a
precision parameter $\phi$ such that
$$
\mathbb{E}[y] \, = \lambda = \exp(\eta)
$$
$$
\text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi.
$$
As $\phi$ gets larger the term $\lambda^2 / \phi$ approaches zero and so
the variance of the negative-binomial approaches $\lambda$, i.e., the
negative-binomial gets closer and closer to the Poisson.
### Stan program for negative-binomial regression
* Write `multiple_NB_regression.stan` together
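A sketch of the inferential program: the changes from the Poisson version are one extra parameter and the distribution. Parameterizing in terms of `inv_phi` $= 1/\phi$ makes it easier to place a prior that includes the Poisson model ($1/\phi = 0$) as a limiting case; the priors are once more illustrative assumptions.
```
data {
  int<lower=1> N;
  vector<lower=0>[N] traps;
  vector[N] live_in_super;
  vector[N] log_sq_foot;
  int<lower=0> complaints[N];
}
parameters {
  real alpha;
  real beta;
  real beta_super;
  real<lower=0> inv_phi;        // 1 / phi
}
transformed parameters {
  real phi = inv(inv_phi);
}
model {
  alpha ~ normal(log(4), 1);    // illustrative priors
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  inv_phi ~ normal(0, 1);       // half-normal, given the lower bound
  complaints ~ neg_binomial_2_log(alpha + beta * traps
                                  + beta_super * live_in_super + log_sq_foot,
                                  phi);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = neg_binomial_2_log_rng(alpha + beta * traps[n]
                                      + beta_super * live_in_super[n]
                                      + log_sq_foot[n], phi);
}
```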
### Fake data fit: Multiple NB regression
```{r , cache=TRUE, results="hide", message=FALSE}
comp_dgp_multiple_NB <- stan_model('stan_programs/multiple_NB_regression_dgp.stan')
```
We're going to generate one draw from the fake data model so we can use the data
to fit our model and compare the known values of the parameters to the posterior
density of the parameters.
```{r fake_data_dgp_NB}
fitted_model_dgp_NB <-
sampling(
comp_dgp_multiple_NB,
data = list(N = nrow(pest_data)),
chains = 1,
cores = 1,
iter = 1,
algorithm = 'Fixed_param',
seed = 123
)
samps_dgp_NB <- rstan::extract(fitted_model_dgp_NB)
```
Create a dataset to feed into the Stan model.
```{r NB_fake_stan_dat}
stan_dat_fake_NB <- list(
N = nrow(pest_data),
log_sq_foot = samps_dgp_NB$log_sq_foot[1, ],
live_in_super = samps_dgp_NB$live_in_super[1, ],
traps = samps_dgp_NB$traps[1, ],
complaints = samps_dgp_NB$complaints[1, ]
)
```
Compile the inferential model.
```{r , cache=TRUE, results="hide", message=FALSE}
comp_model_NB <- stan_model('stan_programs/multiple_NB_regression.stan')
```
Now we run our NB regression over the fake data and extract the samples to
examine posterior predictive checks and to check whether we've sufficiently
recovered our known parameters: `alpha`, `beta`, `beta_super`, and `inv_phi`.
```{r runNBoverfake, message=FALSE, warning=FALSE}
fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_fake_NB,
chains = 4, cores = 4)
posterior_alpha_beta_NB <-
as.matrix(fitted_model_NB,
pars = c('alpha',
'beta',
'beta_super',
'inv_phi')
)
```
Construct the vector of true values from your simulated dataset and compare to
the recovered parameters.
```{r}
true_alpha_beta_NB <-
c(samps_dgp_NB$alpha,
samps_dgp_NB$beta,
samps_dgp_NB$beta_super,
samps_dgp_NB$inv_phi
)
mcmc_recover_hist(posterior_alpha_beta_NB, true = true_alpha_beta_NB)
```
### Fit to real data and check the fit
```{r runNB}
fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_simple)
samps_NB <- rstan::extract(fitted_model_NB)
```
Let's look at our predictions vs. the data.
```{r ppc-full}
y_rep <- samps_NB$y_rep
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])
```
It appears that our model now captures both the small counts and the
tails of the distribution better than before.
Let's check if the negative binomial model does a better job capturing
the number of zeros:
```{r}
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero")
```
These look OK, but let's look at the standardized residual plot.
```{r}
mean_inv_phi <- mean(samps_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)
```
Looks OK, but we still have some very large _standardized_ residuals. This might be
because we are currently ignoring that the data are clustered by building, and
that the probability of roach issues may vary substantially across buildings.
```{r}
ppc_rootogram(stan_dat_simple$complaints, yrep = y_rep)
```
The rootogram now looks much more plausible. We can tell this because the
expected number of complaints matches the observed number of complaints
much more closely. However, we still have some larger counts that appear
to be outliers for the model.
Check predictions by number of traps:
```{r}
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
```
We haven't used the fact that the data are clustered by building yet. A posterior
predictive check might elucidate whether it would be a good idea to add the building
information into the model.
```{r ppc-group_means}
ppc_stat_grouped(
y = stan_dat_simple$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.2
)
```
We're getting plausible predictions for most building means but some are
estimated better than others and some have larger uncertainties than we might
expect. If we explicitly model the variation across buildings we may be able to
get much better estimates.
## Hierarchical modeling
### Modeling varying intercepts for each building
Let's add a hierarchical intercept parameter, $\mu_b$, at the building level
to our model.
$$
\text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\
\lambda_{b,t} = \exp{(\eta_{b,t})} \\
\eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \beta_{\rm super}\, {\rm super}_b + \text{log\_sq\_foot}_b \\
\mu_b \sim \text{Normal}(\alpha, \sigma_{\mu})
$$
In our Stan model, $\mu_b$ is the $b$-th element of the vector
$\texttt{mu}$ which has one element per building.
One of our predictors varies only by building, so we can rewrite the above model
more efficiently like so:
$$
\eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \text{log\_sq\_foot}_b\\
\mu_b \sim \text{Normal}(\alpha + \beta_{\text{super}} \, \text{super}_b , \sigma_{\mu})
$$
We have more information at the building level as well, like the average age of
the residents, the average age of the buildings, and the average per-apartment
monthly rent, so we can add that data into a matrix called
`building_data`, which will have one row per building and four columns:
* `live_in_super`
* `age_of_building`
* `average_tenant_age`
* `monthly_average_rent`
We'll write the Stan model like:
$$
\eta_{b,t} = \alpha_b + \beta \, {\rm traps} + \text{log_sq_foot}\\
\mu \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \,\sigma_{\mu})
$$
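Putting this together, a sketch of the centered-parameterization program `hier_NB_regression.stan` might look like the following. The priors are illustrative assumptions, and the `M` and `mo_idx` entries we add to the data list below are not used by this model yet (they become relevant when we model time effects later).
```
data {
  int<lower=1> N;
  int<lower=0> complaints[N];
  vector<lower=0>[N] traps;
  vector[N] log_sq_foot;
  int<lower=1> J;                          // number of buildings
  int<lower=1> K;                          // number of building-level predictors
  int<lower=1, upper=J> building_idx[N];   // building for each observation
  matrix[J, K] building_data;
}
parameters {
  real alpha;                   // mean of the building intercepts
  vector[K] zeta;               // coefficients on building-level predictors
  vector[J] mu;                 // building intercepts (centered parameterization)
  real<lower=0> sigma_mu;
  real beta;
  real<lower=0> inv_phi;
}
transformed parameters {
  real phi = inv(inv_phi);
}
model {
  alpha ~ normal(log(4), 1);    // illustrative priors
  zeta ~ normal(0, 1);
  sigma_mu ~ normal(0, 1);
  beta ~ normal(-0.25, 1);
  inv_phi ~ normal(0, 1);
  mu ~ normal(alpha + building_data * zeta, sigma_mu);
  complaints ~ neg_binomial_2_log(mu[building_idx] + beta * traps + log_sq_foot,
                                  phi);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = neg_binomial_2_log_rng(mu[building_idx[n]] + beta * traps[n]
                                      + log_sq_foot[n], phi);
}
```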
### Prepare building data for the hierarchical model
We'll need to do some more data prep before we can fit our models. First, to
use the building variable in Stan, we will need to transform it from a factor
variable to an integer variable.
```{r prep-data}
N_months <- length(unique(pest_data$date))
# Add some IDs for building and month
pest_data <- pest_data %>%
mutate(
building_fac = factor(building_id, levels = unique(building_id)),
building_idx = as.integer(building_fac),
ids = rep(1:N_months, N_buildings),
mo_idx = lubridate::month(date)
)
# Center and rescale the building specific data
building_data <- pest_data %>%
select(
building_idx,
live_in_super,
age_of_building,
total_sq_foot,
average_tenant_age,
monthly_average_rent
) %>%
unique() %>%
arrange(building_idx) %>%
select(-building_idx) %>%
scale(scale=FALSE) %>%
as.data.frame() %>%
mutate( # scale by constants
age_of_building = age_of_building / 10,
total_sq_foot = total_sq_foot / 10000,
average_tenant_age = average_tenant_age / 10,
monthly_average_rent = monthly_average_rent / 1000
) %>%
as.matrix()
# Make data list for Stan
stan_dat_hier <-
with(pest_data,
list(complaints = complaints,
traps = traps,
N = length(traps),
J = N_buildings,
M = N_months,
log_sq_foot = log(pest_data$total_sq_foot/1e4),
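# drop total_sq_foot (column 3): it enters the model through the log_sq_foot exposure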
building_data = building_data[,-3],
mo_idx = as.integer(as.factor(date)),
K = 4,
building_idx = building_idx
)
)
```
### Compile and fit the hierarchical model
Let's compile the model.
```{r comp-NB-hier, cache=TRUE, results="hide", message=FALSE}
comp_model_NB_hier <- stan_model('stan_programs/hier_NB_regression.stan')
```
Fit the model to data.
```{r run-NB-hier}
fitted_model_NB_hier <-
sampling(
comp_model_NB_hier,
data = stan_dat_hier,
chains = 4,
cores = 4,
iter = 4000
)
```
### Diagnostics
We get a bunch of warnings from Stan about divergent transitions, which is
an indication that there may be regions of the posterior that have not been
explored by the Markov chains.
Divergences are discussed in more detail in the course slides as well as
the **bayesplot** [MCMC diagnostics vignette](http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html)
and [_A Conceptual Introduction to Hamiltonian Monte Carlo_](https://arxiv.org/abs/1701.02434).
In this
example we will see that we have divergent transitions because we need to
reparameterize our model - i.e., we will retain the overall structure of the
model, but transform some of the parameters so that it is easier for Stan to
sample from the parameter space. Before we go through exactly how to do this
reparameterization, we will first look at the signs that suggest it will
resolve the problem:
1. Examining the fitted parameter values, including the effective sample size
2. Traceplots and scatterplots that reveal particular patterns in locations of the divergences.
First let's extract the fits from the model.
```{r}
samps_hier_NB <- rstan::extract(fitted_model_NB_hier)
```
Then we print the fits for the parameters that are of most interest.
```{r print-NB-hier}
print(fitted_model_NB_hier, pars = c('sigma_mu','beta','alpha','phi','mu'))
```
You can see that the effective sample sizes are quite low for many of the parameters
relative to the total number of draws. This alone isn't indicative of the need
to reparameterize, but it indicates that we should look further at the trace
plots and pairs plots. First let's look at the traceplots to see if the divergent
transitions form a pattern.
```{r}
# use as.array to keep the markov chains separate for trace plots
mcmc_trace(
as.array(fitted_model_NB_hier,pars = 'sigma_mu'),
np = nuts_params(fitted_model_NB_hier),
window = c(500,1000)
)
```
It looks as if the divergent transitions (the little red bars underneath the trace plots)
correspond to samples where the sampler gets stuck at one parameter value for $\sigma_\mu$.
```{r}
# assign to object so we can compare to another plot later
scatter_with_divs <- mcmc_scatter(
as.array(fitted_model_NB_hier),
pars = c("mu[4]", 'sigma_mu'),
transform = list('sigma_mu' = "log"),
np = nuts_params(fitted_model_NB_hier)
)
scatter_with_divs
```
What we have here is a cloud-like shape, with most of the divergences clustering
towards the bottom. We'll see a bit later that we actually want this to look more
like a funnel than a cloud, but the divergences are indicating that the sampler
can't explore the narrowing neck of the funnel.
One way to see why we should expect some version of a funnel is to look at some
simulations from the prior, which we can do without MCMC and thus with no risk
of sampling problems:
```{r}
N_sims <- 1000
log_sigma <- rep(NA, N_sims)
theta <- rep(NA, N_sims)
for (j in 1:N_sims) {
log_sigma[j] <- rnorm(1, mean = 0, sd = 1)
theta[j] <- rnorm(1, mean = 0, sd = exp(log_sigma[j]))
}
draws <- cbind("mu" = theta, "log(sigma_mu)" = log_sigma)
mcmc_scatter(draws)
```
Of course, if the data is at all informative we shouldn't expect the posterior
to look exactly like the prior. But unless the data is incredibly informative
about the parameters and the posterior concentrates away from the narrow neck of
the funnel, the sampler is going to have to confront the funnel geometry. (See
the [Visual MCMC
Diagnostics](http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html)
vignette for more on this.)
Another way to look at the divergences is via a parallel coordinates plot:
```{r}
parcoord_with_divs <- mcmc_parcoord(
as.array(fitted_model_NB_hier, pars = c("sigma_mu", "mu")),
np = nuts_params(fitted_model_NB_hier)
)
parcoord_with_divs
```
Again, we see evidence that our problems concentrate when $\texttt{sigma\_mu}$ is small.
### Reparameterize and recheck diagnostics
We can reparameterize the random intercept $\mu_b$, which is distributed
$$
\mu_b \sim \text{Normal}(\alpha + \texttt{building\_data} \, \zeta, \sigma_{\mu}),
$$
using the non-centered parameterization. We define a vector of auxiliary
variables, $\texttt{mu\_raw}$, in the parameters block, give it a
$\text{Normal}(0, 1)$ prior in the model block, and then make $\texttt{mu}$ a
transformed parameter:
```
transformed parameters {
vector[J] mu;
mu = alpha + building_data * zeta + sigma_mu * mu_raw;
}
```
This gives $\texttt{mu}$ a
$\text{Normal}(\alpha + \texttt{building\_data}\, \zeta, \sigma_\mu)$
distribution, but it decouples the dependence of the density of each element of
$\texttt{mu}$ from $\texttt{sigma\_mu}$ ($\sigma_\mu$).
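For completeness, here is a sketch of how the corresponding declarations and prior fit into the surrounding blocks (the full program is in `hier_NB_regression_ncp.stan`; the remaining parameters and the likelihood are unchanged):
```
parameters {
  vector[J] mu_raw;        // auxiliary variable, one element per building
  real<lower=0> sigma_mu;
  // ... other parameters as before ...
}
model {
  mu_raw ~ normal(0, 1);   // implies mu ~ normal(alpha + building_data * zeta, sigma_mu)
  // ... rest of the model as before ...
}
```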
`hier_NB_regression_ncp.stan` uses the non-centered parameterization for
$\texttt{mu}$. We will examine the effective sample size of the fitted model to