colloq_discretization.rmd

---
title: Nothing is zero in a multifactorial world \newline (and why that matters)
author: "Ben Bolker \\newline McMaster University"
date: "17 September 2021"
bibliography: discrete.bib
csl: reflist2.csl
output:
 beamer_presentation:
  includes:
   in_header: ./header.tex
   after_body: ./suffix.tex
  slide_level: 2
  keep_tex: true
---

<!-- need to rmarkdown::render(), make rule doesn't get the header properly! -->
<!-- 
tex hacks required: remove empty frame at beginning; break line 
  in title (Burnham ref; add \framebreak manually to refs (ugh) -->
<!-- 
apa.csl is a slightly hacked version of APA 
  (modified for "et al" after 2 authors in text)
  -->
<!-- blockquote:
  https://css-tricks.com/snippets/css/simple-and-nice-blockquote-styling/ -->
<!-- center:
    https://www.w3schools.com/howto/howto_css_image_center.asp -->
<!-- .refs is style for reference page (small text) -->
<style>
.refs {
font-size: 10px;
}
.sm_block {
 font-size: 20px;
}
h2 { 
 color: #3399ff;		
}
h3 { 
 color: #3399ff;		
}
.title-slide {
   background-color: #55bbff;
   }
blockquote {
  background: #f9f9f9;
  border-left: 10px solid #ccc;
  margin: 1.5em 10px;
  padding: 0.5em 10px;
  quotes: "\201C""\201D"
}
blockquote:before {
  color: #ccc;
  content: open-quote;
  font-size: 4em;
  line-height: 0.1em;
  margin-right: 0.25em;
  vertical-align: -0.4em;
}
blockquote p {
  display: inline;
}
.center {
    display: block;
    margin-left: auto;
    margin-right: auto;
    width: 50%;
}
</style>
<!--    content: url(https://i.creativecommons.org/l/by-sa/4.0/88x31.png)
>
<!-- Limit image width and height -->
<style type="text/css">
img {     
  max-height: 560px;     
  max-width: 700px; 
}
div#before-column p.forceBreak {
	break-before: column;
}

div#after-column p.forceBreak {
	break-after: column;
}
</style>
```{r setup,echo=FALSE,message=FALSE, include=FALSE}
library(tidyverse)
library("ggplot2"); theme_set(theme_classic())
library("colorspace")
library("reshape2")
library("ggExtra")
library("MASS")
library("knitr")
opts_chunk$set(echo=FALSE,fig.width=4,fig.height=4, out.height="0.8 \\textheight")
## https://www.mattblackwell.org/blog/2021/04/06/rmd-overlays/
hook_plot <- knitr::knit_hooks$get("plot")
knitr::knit_hooks$set(
  plot = function(x, options) {
    if (is.null(options$overlay.plot)) {
      return(hook_plot(x, options))
    } else {
      i <- options$fig.cur
      hand <- as.numeric(ifelse(i == options$overlay.plot, 1, 0))
      bf <- paste0("\\only<", i, "| handout:", hand, ">{")
      paste(c(bf, knitr::hook_plot_tex(x, options), "}"), collapse = "\n")
    }
  }
)
```
## acknowledgements

money: NSERC

ideas: Jonathan Dushoff, Marm Kilpatrick, Brian McGill, Daniel Park, Daniel Turek

## what is a multifactorial system?

- many processes contribute to observed patterns  
(e.g. all of biology/psychology/economics/epidemiology ...)
- quantify *how* each process affects the system,  
rather than testing *whether* we can detect its impact

## what are we trying to do?

- **prediction**: what do we expect in a specified scenario?
- **inference**: what factors are affecting observed outcomes, and how much?
<!--   - (or: prediction/estimation of effects \cemph{with appropriate uncertainty bounds}) -->

This talk will focus on inference.

## why do we pretend some effects are zero?

\cemph{point hypotheses} = *counterfactual null hypotheses*
 
- e.g.: "the delta variant has the same $R_0$ as earlier variants"; "masks do not reduce COVID transmission"
- never\footnote[1]{what, never?} true in multifactorial systems ...

<!--
more generally we might want to test among \emph{discrete} hypotheses:

- "strong inference" [@platt_strong_1964]:  
(binary, discrete) alternative hypotheses 
- testing among *many* discrete hypotheses  
[@taper_evidential_2015]

-->

## Chamberlin's method of multiple working hypotheses

<blockquote>
... the measure of participation of each [process] must be determined before a satisfactory elucidation can be reached. The full solution therefore involves not only a recognition of multiple participation but an estimate of the measure and mode of each participation ...
</blockquote>

(Chamberlin 1890 in @raup_method_1995)

## discretomania

- null hypothesis testing as measure of \emph{clarity}  
(\bemph{OK} @dushoff_i_2019)
- "feature/variable selection" in data science  
(\bemph{OK but maybe not optimal?})
- stepwise regression (\cemph{not OK!})

Why do people use stepwise regression, and why is it a problem?

<!-- animation?

https://gist.github.com/mattblackwell/0d26d5c8f61f231570d61ccd62fe511f 
https://www.mattblackwell.org/blog/2021/04/06/rmd-overlays/

-->

## under/overfitting (1) [@forsythe_computer_1977]

```{r overfit_calc}
## http://www.mathworks.com/products/matlab/demos.html?
##  file=/products/demos/shipping/matlab/census.html
dd <- data.frame(time = seq(1900,2000,by=10),
           pop = c(75.995,91.972,105.711,123.203,131.669,
                   150.697,179.323,203.212,226.505,249.633,281.422)) %>%
    mutate(sct=(time-1950)/50)
predframe <- tibble(time=seq(1900,2020, length=101)) %>% mutate(sct=(time-1950)/50)
orders = c(1:3,8)
pred <- purrr::map_dfr(setNames(orders, orders),
               ~ {
                   m <- lm(pop ~ poly(sct, ., raw=TRUE), data=dd)
                   data.frame(predframe,
                              pop = predict(m, newdata=predframe))
               },
               .id = "order")
```

```{r overfit_plot, overlay.plot=4, fig.width=5}
gg_full <- ggplot(pred, aes(time, pop)) +
    geom_point(data = dd, size=3) +
    geom_line(aes(colour=order)) +
    coord_cartesian(ylim=c(0,350),
                    xlim=c(1890, 2025), expand=FALSE) +
    labs(x="year", y="US population (millions)")
for (i in seq_along(orders)) {
    print(gg_full %+% filter(pred, order %in% orders[1:i]))
}
```

## under/overfitting (2) {.columns-2 .build}

\begincols
\begincol{0.5\textwidth}

- \onslide<1-> avoid omitting important predictors
- \onslide<2-> avoid overfitting
- \onslide<3-> $\approx$ optimize \cemph{bias-variance tradeoff}

\endcol
\begincol{0.5\textwidth}

\onslide<1->
<!-- \pause -->

\includegraphics[width=\textwidth]{pix/330px-The_Three_Bears_-_Project_Gutenberg_eText_17034.jpg}

\tiny Rackham 1837

\endcol
\endcols

## no free lunches

- data-based approaches to selection $\to$  
\cemph{Texas sharpshooter fallacy}
- setting a subset of model coefficients/effects to zero
- can we do better?
- \cemph{model averaging}: instead of picking one model, combine them (a form of *shrinkage estimation*)

\begin{center}
\includegraphics[height=1.5in]{pix/hankin.png} \\
\tiny C. Hankin
\end{center}

## multimodel averaging  (MMA) [@burnham_model_2002]

- one of many forms of model averaging: popular with ecologists
- information-theoretic motivation (Kullback-Leibler distance)
- typical approach
    - fit full model
	- fit all (or many) submodels
	- compute weights
    - compute model-averaged point estimates and CIs

## why *not* use MMA?

three problems:

- conceptual
- computational
- inferential

## conceptual problem: discretization {.build .columns-2}

\begincols

\begincol{0.5\textwidth}

- information-theoretic approaches framed as implementing Chamberlin MMWH  
[@elliott_revisiting_2007]
- but IT/MMA works with **discrete** hypotheses/models  
(counterfactual/straw men)
- ¿ is this actually a problem ?

\endcol
\begincol{0.5\textwidth}

\pause
\includegraphics[width=\textwidth]{pix/eyam_ternary.png}

{\small Estimated contribution of plague transmission modes in Eyam 1665: SW Park}

\endcol
\endcols

## computational problem: efficiency {.columns-2 .build}

\begincols
\begincol{0.5\textwidth}

\newcommand{\bigO}{{\mathcal{O}}}

- model average may mean fitting **lots** of models: for $K$ variables with $n$ observations, $\bigO(2^K K^2 n)$
- we do have lots of computers ...

\endcol
\begincol{0.5\textwidth}

\pause

\includegraphics[width=0.75\textwidth]{pix/bitcoin_nyc.png}

\endcol
\endcols

## inferential problem: undercoverage {.build}

- lots of ways to construct MMA CIs [@burnham_model_2002;@fletcher2012model;@kabaila_model-averaged_2016]
- MMA CIs are generally **too narrow**  
[@turek2013frequentist;@kabaila_model-averaged_2016;@dormann_model_2018] but cf. @burnham_model_2002
- on the other hand, (properly constructed) ridge-regression CI width $\geq$ least-squares CI [@obenchain_classical_1977]


## undercoverage (2)

- **no free lunch**; can we ever gain certainty by (data-driven) shrinkage?
- **coverage**: proportion of time that the confidence interval includes the true value (should be 95% for 95% CIs)

\includegraphics[width=0.9\textwidth]{pix/dormann_coverage.png}

@dormann_model_2018, Figure 5

## better shrinkage estimators

- most machine learning approaches use shrinkage somehow
- lasso & ridge regression; random forests; Bayesian priors
- complexity may be $\bigO(K^2 n)$ or $\bigO(K n \log(n))$ (vs $\bigO(2^K K^2 n)$ for brute force)
[@hardy_machine_2017; @louppe_understanding_2014]  

## example: penalized regression

```{r pen_reg}
sctvec <- seq(min(dd$sct), max(dd$sct), length = 101)
## fit0 <- lm(pop ~ splines::bs(sct, 10), data=dd)
## pv <- attr(terms(fit0), "predvars")
## ss <- mgcv::smoothCon(s(sct, bs = "bs", k=11), data=data.frame(sct=sctvec))
## mm <- ss[[1]]$X
## mm <- cbind(1,eval(pv[[3]], envir=list2env(list(sct = sctvec))))
fit <- mgcv::gam(pop ~ s(sct, bs = "bs", k=11), data=dd)
mm <- predict(fit, type = "lpmatrix", newdata=data.frame(sct=sctvec))
colnames(mm) <- seq(ncol(mm))-1
spldat <- (mm
    %>% as_tibble()
    %>% mutate(time = sctvec*50+1950, sct = sctvec)
    %>% pivot_longer(matches("^[0-9]"), names_to="spl_i",
                     names_transform = as.numeric)
    %>% mutate(across(spl_i, ~factor(., levels = unique(.))))
)

ctab <- tibble(spl_i = factor(colnames(mm), levels = colnames(mm)),
               beta = coef(fit))
spldat2 <- (spldat
    %>% full_join(ctab, by = "spl_i")
    %>% filter(spl_i != "0")
    %>% mutate(pop = coef(fit)[1] + beta*value)
)
gg_full <- ggplot(spldat2, aes(time, pop)) +
    geom_point(data = dd, size=3) +
    geom_smooth(data = dd, method="gam", formula= y~s(x, bs = "bs", k=11)) +
    geom_line(aes(colour=spl_i)) +
    theme(legend.position="none") +
    labs(x="year", y="US population (millions)")
print(gg_full)
```

## inference is still hard!

- **high-dimensional inference** [@dezeure_high-dimensional_2015]
- tries to account for data-dependent shrinkage
- strong assumptions (big data, sparse effects) 

## example: evolution of fish accessory glands

- 12 parameters (rates of evolutionary gain/loss)
- current method [@pagel_bayesian_2006]; compare Bell(12) = 4213597 models
- or we could just estimate the parameters!

\begincols
\begincol{0.66\textwidth}

\includegraphics[width=\textwidth]{pix/phylo.png}

\endcol
\begincol{0.33\textwidth}

\pause

\includegraphics[width=\textwidth]{pix/skinny_contrasts.png}

\endcol
\endcols

## conclusions: what should you do?

- for **inference**:
    - use the full model
    - *a priori* model reduction [@harrell_regression_2001]
- for **prediction**:
    - use appropriate shrinkage estimators, but beware CIs

## References {.refs .columns-2 .allowframebreaks}

\tiny