
```{r echo = FALSE, message = FALSE, warning = FALSE}
# run setup script
source("_common.R")

library(ggridges)
library(lubridate)
library(ggrepel)
```

# Visualizing time series and other functions of an independent variable {#time-series}

The preceding chapter discussed scatter plots, where we plot one quantitative variable against another. A special case arises when one of the two variables can be thought of as time, because time imposes additional structure on the data. Now the data points have an inherent order; we can arrange the points in order of increasing time and define a predecessor and successor for each data point. We frequently want to visualize this temporal order, and we do so with line graphs. Line graphs are not limited to time series, however. They are appropriate whenever one variable imposes an ordering on the data. This scenario also arises, for example, in a controlled experiment where a treatment variable is purposefully set to a range of different values. If we have multiple variables that depend on time, we can either draw separate line plots or draw a regular scatter plot and then connect points that are neighbors in time with lines.
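
The idea of ordering points by time and giving each a predecessor and successor can be sketched in a few lines of base R. The data frame `obs` and its columns `previous` and `following` are hypothetical names chosen for illustration; they are not part of the datasets used in this chapter.

```r
# Hypothetical monthly observations, deliberately stored out of order.
obs <- data.frame(
  date  = as.Date(c("2014-03-01", "2014-01-01", "2014-04-01", "2014-02-01")),
  count = c(120, 80, 150, 95)
)

# Arrange the points in order of increasing time.
obs <- obs[order(obs$date), ]

# Predecessor and successor of each point. The first point has no
# predecessor and the last has no successor, hence the NA values.
n <- nrow(obs)
obs$previous  <- c(NA, obs$count[-n])
obs$following <- c(obs$count[-1], NA)
obs
```

A line graph draws exactly these predecessor and successor relationships as line segments.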


## Individual time series

As a first demonstration of a time series, we will consider the pattern of monthly preprint submissions in biology. Preprints are scientific articles that researchers post online before formal peer review and publication in a scientific journal. The preprint server bioRxiv, which was founded in November 2013 specifically for researchers working in the biological sciences, has seen substantial growth in monthly submissions since. We can visualize this growth by making a form of scatter plot (Chapter \@ref(visualizing-associations)) where we draw dots representing the number of submissions in each month (Figure \@ref(fig:biorxiv-dots)).

(ref:biorxiv-dots) Monthly submissions to the preprint server bioRxiv, from its inception in November 2013 until April 2018. Each dot represents the number of submissions in one month. There has been a steady increase in submission volume throughout the entire 4.5-year period. Data source: Jordan Anaya, http://www.prepubmed.org/

```{r biorxiv-dots, fig.cap = '(ref:biorxiv-dots)'}
preprint_growth %>% filter(archive == "bioRxiv") %>%
  filter(count > 0) -> biorxiv_growth

ggplot(biorxiv_growth, aes(date, count)) + 
  #geom_point(color = "#0072B2") +
  geom_point(color = "white", fill = "#0072B2", shape = 21, size = 2) +
  scale_y_continuous(limits = c(0, 1600), expand = c(0, 0),
                name = "preprints / month") + 
  scale_x_date(name = "year") +
  theme_dviz_open() +
  theme(plot.margin = margin(7, 7, 3, 1.5))
```

There is an important difference, however, between Figure \@ref(fig:biorxiv-dots) and the scatter plots discussed in Chapter \@ref(visualizing-associations). In Figure \@ref(fig:biorxiv-dots), the dots are spaced evenly along the *x* axis, and there is a defined order among them. Each dot has exactly one left and one right neighbor (except the leftmost and rightmost points, which have only one neighbor each). We can visually emphasize this order by connecting neighboring points with lines (Figure \@ref(fig:biorxiv-dots-line)). Such a plot is called a *line graph*.

(ref:biorxiv-dots-line) Monthly submissions to the preprint server bioRxiv, shown as dots connected by lines. The lines do not represent data but are only meant as a guide to the eye. By connecting the individual dots with lines, we emphasize that there is an order between the dots: each dot has exactly one neighbor that comes before and one that comes after. Data source: Jordan Anaya, http://www.prepubmed.org/

```{r biorxiv-dots-line, fig.cap = '(ref:biorxiv-dots-line)'}
ggplot(biorxiv_growth, aes(date, count)) +
  geom_line(size = 0.5, color = "#0072B2") + 
  geom_point(color = "white", fill = "#0072B2", shape = 21, size = 2) +
  scale_y_continuous(limits = c(0, 1600), expand = c(0, 0),
                name = "preprints / month") + 
  scale_x_date(name = "year") +
  theme_dviz_open() +
  theme(plot.margin = margin(7, 7, 3, 1.5))
```

Some people object to drawing lines between points because the lines do not represent observed data. In particular, if there are only a few observations spaced far apart, then observations made at intermediate times would probably not have fallen exactly onto the lines shown. Thus, in a sense, the lines correspond to made-up data. Yet they may help with perception when the points are spaced far apart or are unevenly spaced. We can somewhat resolve this dilemma by pointing it out in the figure caption, for example by writing "lines are meant as a guide to the eye" (see caption of Figure \@ref(fig:biorxiv-dots-line)).
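
To see what the connecting lines implicitly claim, we can compute the values they pass through at unobserved times. The following is a minimal sketch with hypothetical numbers, using base R's `approx()` for linear interpolation. It only illustrates the argument above and is not a recommendation to report such values.

```r
# Hypothetical sparse observations: time (in months) and a measured value.
time  <- c(0, 6, 12)
value <- c(10, 40, 20)

# approx() performs the same linear interpolation that a straight line
# segment between neighboring points implies visually.
implied <- approx(time, value, xout = c(3, 9))
implied$y
```

The interpolated values at months 3 and 9 were never measured; they are exactly the "made-up data" that the lines suggest.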

Using lines to represent time series is generally accepted practice, however, and frequently the dots are omitted altogether (Figure \@ref(fig:biorxiv-line)). Without dots, the figure places more emphasis on the overall trend in the data and less on individual observations. A figure without dots is also visually less busy. In general, the denser the time series, the less important it is to show individual observations with dots. For the preprint dataset shown here, I think omitting the dots is fine.

(ref:biorxiv-line) Monthly submissions to the preprint server bioRxiv, shown as a line graph without dots. Omitting the dots emphasizes the overall temporal trend while de-emphasizing individual observations at specific time points. It is particularly useful when the time points are spaced very densely. Data source: Jordan Anaya, http://www.prepubmed.org/

```{r biorxiv-line, fig.cap = '(ref:biorxiv-line)'}
ggplot(biorxiv_growth, aes(date, count)) +
  geom_line(color = "#0072B2", size = 1) +
  scale_y_continuous(limits = c(0, 1600), expand = c(0, 0),
                name = "preprints / month") + 
  scale_x_date(name = "year") +
  theme_dviz_open() +
  theme(plot.margin = margin(7, 7, 3, 1.5))
```

We can also fill the area under the curve with a solid color (Figure \@ref(fig:biorxiv-line-area)). This choice further emphasizes the overarching trend in the data, because it visually separates the area above the curve from the area below. However, this visualization is only valid if the *y* axis starts at zero, so that the height of the shaded area at each time point represents the data value at that time point.
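
A small numeric sketch shows why the baseline matters. The counts and the alternative baseline of 150 are hypothetical values chosen for illustration.

```r
# Hypothetical monthly counts at two time points.
count <- c(early = 200, late = 400)

# Height of the shaded area when the y axis starts at zero versus at 150.
height_from_zero <- count - 0
height_from_150  <- count - 150

# Ratio of the late to the early shaded height under each baseline.
ratio_from_zero <- unname(height_from_zero["late"] / height_from_zero["early"])
ratio_from_150  <- unname(height_from_150["late"] / height_from_150["early"])

c(from_zero = ratio_from_zero, from_150 = ratio_from_150)
```

With a zero baseline the shaded heights preserve the true ratio between the two counts. With a baseline of 150 the later value appears several times larger relative to the earlier one than it really is.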

(ref:biorxiv-line-area) Monthly submissions to the preprint server bioRxiv, shown as a line graph with filled area underneath. By filling the area under the curve, we put even more emphasis on the overarching temporal trend than if we just draw a line (Figure \@ref(fig:biorxiv-line)). Data source: Jordan Anaya, http://www.prepubmed.org/

```{r biorxiv-line-area, fig.cap = '(ref:biorxiv-line-area)'}
ggplot(biorxiv_growth, aes(date, height = count, y = 0)) + 
  geom_ridgeline(color = "#0072B2", fill = "#0072B240", size = 0.75) +
  scale_y_continuous(limits = c(0, 1600), expand = c(0, 0),
                name = "preprints / month") + 
  scale_x_date(name = "year") +
  theme_dviz_open() +
  theme(plot.margin = margin(7, 7, 3, 1.5))
```


## Multiple time series and dose--response curves

We often have multiple time courses that we want to show at once. In this case, we have to be more careful in how we plot the data, because the figure can become confusing or difficult to read. For example, if we want to show the monthly submissions to multiple preprint servers, a scatter plot is not a good idea, because the individual time courses run into each other (Figure \@ref(fig:bio-preprints-dots)). Connecting the dots with lines alleviates this issue (Figure \@ref(fig:bio-preprints-lines)).

(ref:bio-preprints-dots) Monthly submissions to three preprint servers covering biomedical research: bioRxiv, the q-bio section of arXiv, and PeerJ Preprints. Each dot represents the number of submissions in one month to the respective preprint server. This figure is labeled "bad" because the three time courses visually interfere with each other and are difficult to read. Data source: Jordan Anaya, http://www.prepubmed.org/

```{r bio-preprints-dots, fig.cap = '(ref:bio-preprints-dots)'}
preprint_growth %>% filter(archive %in% c("bioRxiv", "arXiv q-bio", "PeerJ Preprints")) %>%
  filter(count > 0) %>%
  mutate(archive = factor(archive, levels = c("bioRxiv", "arXiv q-bio", "PeerJ Preprints"))) -> preprints

p <- ggplot(preprints, aes(date, count, color = archive, fill = archive, shape = archive)) + 
  geom_point(color = "white", size = 2) +
  scale_shape_manual(values = c(21, 22, 23),
                     name = NULL) + 
  scale_y_continuous(limits = c(0, 600), expand = c(0, 0),
                name = "preprints / month") + 
  scale_x_date(name = "year",
               limits = c(min(biorxiv_growth$date), ymd("2017-01-01"))) +
  scale_color_manual(values = c("#0072b2", "#D55E00", "#009e73"),
                     name = NULL) +
  scale_fill_manual(values = c("#0072b2", "#D55E00", "#009e73"),
                     name = NULL) +
  theme_dviz_open() +
  theme(legend.title.align = 0.5,
        legend.position = c(0.1, .9),
        legend.justification = c(0, 1),
        plot.margin = margin(14, 7, 3, 1.5))

stamp_bad(p)
```

(ref:bio-preprints-lines) Monthly submissions to three preprint servers covering biomedical research. By connecting the dots in Figure \@ref(fig:bio-preprints-dots) with lines, we help the viewer follow each individual time course. Data source: Jordan Anaya, http://www.prepubmed.org/

```{r bio-preprints-lines, fig.cap = '(ref:bio-preprints-lines)'}
ggplot(preprints, aes(date, count, color = archive, fill = archive, shape = archive)) + 
  geom_line(size = 0.75) + geom_point(color = "white", size = 2) +
  scale_y_continuous(limits = c(0, 600), expand = c(0, 0),
                name = "preprints / month") + 
  scale_x_date(name = "year",
               limits = c(min(biorxiv_growth$date), ymd("2017-01-01"))) +
  scale_color_manual(values = c("#0072b2", "#D55E00", "#009e73"),
                     name = NULL) +
  scale_fill_manual(values = c("#0072b2", "#D55E00", "#009e73"),
                     name = NULL) +
  scale_shape_manual(values = c(21, 22, 23),
                     name = NULL) + 
  theme_dviz_open() +
  theme(legend.title.align = 0.5,
        legend.position = c(0.1, .9),
        legend.justification = c(0, 1),
        plot.margin = margin(14, 7, 3, 1.5))
```

Figure \@ref(fig:bio-preprints-lines) represents an acceptable visualization of the preprints dataset. However, the separate legend creates unnecessary cognitive load. We can reduce this cognitive load by labeling the lines directly (Figure \@ref(fig:bio-preprints-direct-label)). We have also eliminated the individual dots in this figure, for a result that is much more streamlined and easy to read than the original starting point, Figure \@ref(fig:bio-preprints-dots).
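
Direct labeling requires knowing where each line ends, so that the label can be placed next to it. One way to find these positions is sketched below in base R. The data frame `series` is a hypothetical stand-in with made-up counts, not the actual preprint data.

```r
# Hypothetical long-format data: one row per archive and month.
series <- data.frame(
  archive = rep(c("bioRxiv", "arXiv q-bio", "PeerJ Preprints"), each = 3),
  date    = rep(as.Date(c("2016-11-01", "2016-12-01", "2017-01-01")), times = 3),
  count   = c(400, 450, 500, 190, 200, 210, 60, 55, 70)
)

# For each archive, keep the row with the latest date. Its count is the
# vertical position at which the label should sit next to the line.
latest <- lapply(
  split(series, series$archive),
  function(d) d[which.max(as.numeric(d$date)), ]
)
label_pos <- do.call(rbind, latest)
label_pos[, c("archive", "count")]
```

Keeping the archive name and the final count in the same row makes it harder to attach a label to the wrong line by accident.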

(ref:bio-preprints-direct-label) Monthly submissions to three preprint servers covering biomedical research. By directly labeling the lines instead of providing a legend, we have reduced the cognitive load required to read the figure. Eliminating the legend also removes the need for points of different shapes, so we can streamline the figure further by eliminating the dots. Data source: Jordan Anaya, http://www.prepubmed.org/

```{r bio-preprints-direct-label, fig.cap = '(ref:bio-preprints-direct-label)'}
preprints_final <- filter(preprints, date == ymd("2017-01-01"))

ggplot(preprints) +
  aes(date, count, color = archive, fill = archive, shape = archive) + 
  geom_line(size = 1) + 
  #geom_point(color = "white", size = 2) +
  scale_y_continuous(
    limits = c(0, 600), expand = c(0, 0),
    name = "preprints / month",
    sec.axis = dup_axis(
      breaks = preprints_final$count,
      labels = c("arXiv\nq-bio", "PeerJ\nPreprints", "bioRxiv"),
      name = NULL)
  ) + 
  scale_x_date(name = "year",
               limits = c(min(biorxiv_growth$date), ymd("2017-01-01")),
               expand = expand_scale(mult = c(0.02, 0))) +
  scale_color_manual(values = c("#0072b2", "#D55E00", "#009e73"),
                     name = NULL) +
  scale_fill_manual(values = c("#0072b2", "#D55E00", "#009e73"),
                     name = NULL) +
  scale_shape_manual(values = c(21, 22, 23),
                     name = NULL) + 
  coord_cartesian(clip = "off") +
  theme_dviz_open() +
  theme(legend.position = "none") +
  theme(axis.line.y.right = element_blank(),
        axis.ticks.y.right = element_blank(),
        axis.text.y.right = element_text(margin = margin(0, 0, 0, 0)),
        plot.margin = margin(14, 7, 3, 1.5))
```

Line graphs are not limited to time series. They are appropriate whenever the data points have a natural order that is reflected in the variable shown along the *x* axis, so that neighboring points can be connected with a line. This situation arises, for example, in dose--response curves, where we measure how changing some numerical parameter in an experiment (the dose) affects an outcome of interest (the response). Figure \@ref(fig:oats-yield) shows a classic experiment of this type, measuring oat yield in response to increasing amounts of fertilization. The line-graph visualization highlights how the dose--response curve has a similar shape for the three oat varieties considered but differs in the starting point in the absence of fertilization (i.e., some varieties have naturally higher yield than others).
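
The data preparation behind a dose--response curve is a grouped mean: one average response per combination of dose and group. The sketch below uses base R's `aggregate()` on the `oats` dataset from the MASS package; the column name `dose` is introduced here for illustration, and the yield is left in the dataset's original units.

```r
# MASS::oats records the yield Y of three oat varieties V
# at four manure levels N.
oats <- MASS::oats

# Convert the manure factor ("0.0cwt", "0.2cwt", ...) to a number.
oats$dose <- as.numeric(sub("cwt", "", as.character(oats$N), fixed = TRUE))

# Mean response for every combination of variety and dose.
dose_response <- aggregate(Y ~ V + dose, data = oats, FUN = mean)
dose_response <- dose_response[order(dose_response$V, dose_response$dose), ]
dose_response
```

Because `dose` is numeric and ordered, neighboring dose levels within each variety can be meaningfully connected by a line.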


(ref:oats-yield) Dose--response curve showing the mean yield of oat varieties after fertilization with manure. The manure serves as a source of nitrogen, and oat yields generally increase as more nitrogen is available, regardless of variety. Here, manure application is measured in cwt (hundredweight) per acre. The hundredweight is an old imperial unit equal to 112 lbs or 50.8 kg. Data source: @Yates1935

```{r oats-yield, fig.cap = '(ref:oats-yield)'}
MASS::oats %>% 
  # 1 long (UK) cwt == 112 lbs == 50.802345 kg
  mutate(N = 1*as.numeric(sub("cwt", "", N, fixed = TRUE))) %>%
  group_by(N, V) %>%
  summarize(mean = 20 * mean(Y)) %>% # factor 20 converts units to lbs/acre
  mutate(variety = ifelse(V == "Golden.rain", "Golden Rain", as.character(V))) ->
  oats_df

oats_df$variety <- factor(oats_df$variety, levels = c("Marvellous", "Golden Rain", "Victory"))
 
ggplot(oats_df,
       aes(N, mean, color = variety, shape = variety, fill = variety)) +
  geom_line(size = 0.75) + geom_point(color = "white", size = 2.5) +
  scale_y_continuous(name = "mean yield (lbs/acre)") + 
  scale_x_continuous(name = "manure treatment (cwt/acre)") +
  scale_shape_manual(values = c(21, 22, 23),
                     name = "oat variety") + 
  scale_color_manual(values = c("#0072b2", "#D55E00", "#009e73"),
                     name = "oat variety") +
  scale_fill_manual(values = c("#0072b2", "#D55E00", "#009e73"),
                     name = "oat variety") +
  coord_cartesian(clip = "off") +
  theme_dviz_open() +
  theme(legend.title.align = 0.5)

```

## Time series of two or more response variables {#time-series-connected-scatter}

In the preceding examples we dealt with time courses of only a single response variable (e.g., preprint submissions per month or oat yield). It is not unusual, however, to have more than one response variable. Such situations arise commonly in macroeconomics. For example, we may be interested in the change in house prices from the previous 12 months as it relates to the unemployment rate. We may expect that house prices rise when the unemployment rate is low and vice versa.
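
A 12-month change compares each month with the same month one year earlier. The following sketch computes it for a hypothetical price index; the variable names and the assumed growth of 1% per month are illustrative and do not describe how the house price data used in this chapter were prepared.

```r
# Hypothetical monthly price index covering 24 months, growing 1% per month.
index <- 100 * 1.01^(0:23)

# 12-month change: relative difference to the value 12 months earlier.
k <- 12
n <- length(index)
change_12m <- (index[(k + 1):n] - index[1:(n - k)]) / index[1:(n - k)]

# The first 12 months have no value because no reference month exists yet.
change_12m <- c(rep(NA, k), change_12m)
round(change_12m, 4)
```

Because the change is relative to the year before, seasonal patterns that repeat every 12 months largely cancel out.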

Given the tools from the preceding subsections, we can visualize such data as two separate line graphs stacked on top of each other (Figure \@ref(fig:house-price-unemploy)). This plot directly shows the two variables of interest, and it is straightforward to interpret. However, because the two variables are shown as separate line graphs, drawing comparisons between them can be cumbersome. If we want to identify temporal regions when both variables move in the same or in opposite directions, we need to switch back and forth between the two graphs and compare the relative slopes of the two curves.

(ref:house-price-unemploy) 12-month change in house prices (a) and unemployment rate (b) over time, from Jan. 2001 through Dec. 2017. Data sources: Freddie Mac House Price Index, U.S. Bureau of Labor Statistics.

```{r house-price-unemploy, fig.width = 5*6/4.2, fig.cap = '(ref:house-price-unemploy)'}
# prepare dataset already for next figure
CA_house_prices <- 
  filter(house_prices, state == "California", year(date) > 2000) %>%
  mutate(
    label = ifelse(
      date %in% c(ymd("2005-01-01"), ymd("2007-07-01"), 
                  ymd("2010-01-01"), ymd("2012-07-01"), ymd("2015-01-01")),
      format(date, "%b %Y"), ""),
    nudge_x = case_when(
      label == "Jan 2005" ~ -0.003,
      TRUE ~ 0.003
    ),
    nudge_y = case_when(
      label == "Jan 2005" ~ 0.01,
      label %in% c("Jul 2007", "Jul 2012") ~ 0.01,
      TRUE ~ -0.01
    ),
    hjust = case_when(
      label == "Jan 2005" ~ 1,
      TRUE ~ 0
    )
  )

p1 <- ggplot(CA_house_prices, aes(date, house_price_perc)) +
  geom_line(size = 1, color = "#0072b2") +
  scale_y_continuous(
    limits = c(-0.3, .32), expand = c(0, 0),
    breaks = c(-.3, -.15, 0, .15, .3),
    name = "12-month change\nin house prices", labels = scales::percent_format(accuracy = 1)
  ) + 
  scale_x_date(name = "", expand = c(0, 0)) +
  coord_cartesian(clip = "off") +
  theme_dviz_grid() +
  theme(
    axis.line = element_blank(),
    plot.margin = margin(12, 1.5, 0, 1.5)
  )

p2 <- ggplot(CA_house_prices, aes(date, unemploy_perc/100)) +
  geom_line(size = 1, color = "#0072b2") +
  scale_y_continuous(
    limits = c(0.037, 0.143),
    name = "unemployment\nrate", labels = scales::percent_format(accuracy = 1),
    expand = c(0, 0)
  ) +
  scale_x_date(name = "year", expand = c(0, 0)) +
  theme_dviz_grid() +
  theme(
    axis.line = element_blank(),
    plot.margin = margin(6, 1.5, 3, 1.5)
  )
 
plot_grid(p1, p2, align = 'v', ncol = 1, labels = "auto") 
```

As an alternative to showing two separate line graphs, we can plot the two variables against each other, drawing a path that leads from the earliest time point to the latest (Figure \@ref(fig:house-price-path)). Such a visualization is called a *connected scatter plot*, because we are technically making a scatter plot of the two variables against each other and then are connecting neighboring points. Physicists and engineers often call this a *phase portrait*, because in their disciplines it is commonly used to represent movement in phase space. We have previously encountered connected scatter plots in Chapter \@ref(coordinate-systems-axes), where we plotted the daily temperature normals in Houston, TX, versus those in San Diego, CA (Figure \@ref(fig:temperature-normals-Houston-San-Diego)).

(ref:house-price-path) 12-month change in house prices versus unemployment rate, from Jan. 2001 through Dec. 2017, shown as a connected scatter plot. Darker shades represent more recent months. The anti-correlation seen in Figure \@ref(fig:house-price-unemploy) between the change in house prices and the unemployment rate causes the connected scatter plot to form two counter-clockwise circles. Data sources: Freddie Mac House Price Index, U.S. Bureau of Labor Statistics. Original figure concept: Len Kiefer

```{r house-price-path, fig.asp = 3/4, fig.cap = '(ref:house-price-path)'}

ggplot(CA_house_prices) +
  aes(unemploy_perc/100, house_price_perc, colour = as.numeric(date)) + 
  geom_path(size = 1, lineend = "round") +
  geom_text_repel(
    aes(label = label), point.padding = .2, color = "black",
    min.segment.length = 0, size = 12/.pt,
    hjust = CA_house_prices$hjust,
    nudge_x = CA_house_prices$nudge_x,
    nudge_y = CA_house_prices$nudge_y,
    direction = "y",
    family = dviz_font_family
  ) +
  scale_x_continuous(
    limits = c(0.037, 0.143),
    name = "unemployment rate", labels = scales::percent_format(accuracy = 1),
    expand = c(0, 0)
  ) +
  scale_y_continuous(
    limits = c(-0.315, .315), expand = c(0, 0),
    breaks = c(-.3, -.15, 0, .15, .3),
    name = "12-month change in house prices", labels = scales::percent_format(accuracy = 1)
  ) + 
  scale_colour_gradient(low = "#E7F0FF", high = "#035B8F") +
  guides(colour = "none") +
  coord_cartesian(clip = "off") +
  theme_dviz_grid() +
  theme(
    axis.ticks.length = unit(0, "pt"),
    plot.margin = margin(21, 14, 3.5, 1.5))
```

In a connected scatter plot, lines going in the direction from the lower left to the upper right represent correlated movement between the two variables (as one variable grows, so does the other), and lines going in the perpendicular direction, from the upper left to the lower right, represent anti-correlated movement (as one variable grows, the other shrinks). If the two variables have a somewhat cyclic relationship, we will see circles or spirals in the connected scatter plot. In Figure \@ref(fig:house-price-path), we see one small circle from 2001 through 2005 and one large circle for the remainder of the time course.
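
The reading rule for segment directions can be expressed as a simple calculation on successive differences. The path below is hypothetical, and the labels are chosen for illustration.

```r
# Hypothetical path through (x, y) space, already ordered by time.
x <- c(1, 2, 3, 2, 1)
y <- c(1, 2, 1, 0, 1)

# Change along each segment of the path.
dx <- diff(x)
dy <- diff(y)

# Same sign: both variables move together (correlated movement).
# Opposite sign: one grows while the other shrinks (anti-correlated).
movement <- ifelse(dx * dy > 0, "correlated",
            ifelse(dx * dy < 0, "anti-correlated", "flat"))
movement
```

A path that alternates regularly between the two kinds of movement, as this one does, closes into a loop, which is the circle or spiral pattern described above.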


When drawing a connected scatter plot, it is important that we indicate both the direction and the temporal scale of the data. Without such hints, the plot can turn into meaningless scribble (Figure \@ref(fig:house-price-path-bad)). In Figure \@ref(fig:house-price-path), I use a gradual darkening of the color to indicate direction. Alternatively, one could draw arrows along the path.
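
A minimal ggplot2 sketch of a connected scatter plot with a color gradient for direction might look as follows. The data frame `cycle` is a hypothetical, perfectly circular relationship made up for illustration; only the two gradient colors are taken from the figure above.

```r
library(ggplot2)

# Hypothetical cyclic relationship between two variables over 48 months.
cycle <- data.frame(
  date = seq(as.Date("2001-01-01"), by = "month", length.out = 48)
)
phase <- seq(0, 2 * pi, length.out = 48)
cycle$x <- cos(phase)
cycle$y <- sin(phase)

# geom_path() connects rows in the order in which they appear in the data,
# so the data must be sorted by time before plotting.
cycle <- cycle[order(cycle$date), ]

p <- ggplot(cycle, aes(x, y, colour = as.numeric(date))) +
  geom_path(lineend = "round") +
  # Light-to-dark gradient: darker segments are more recent.
  scale_colour_gradient(low = "#E7F0FF", high = "#035B8F", guide = "none")
```

Printing `p` draws the figure. Note that `geom_line()` would not work here, because it connects points in order of their *x* values rather than in the order of the rows.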

(ref:house-price-path-bad) 12-month change in house prices versus unemployment rate, from Jan. 2001 through Dec. 2017. This figure is labeled "bad" because without the date markers and color shading of Figure \@ref(fig:house-price-path), we can see neither the direction nor the speed of change in the data. Data sources: Freddie Mac House Price Index, U.S. Bureau of Labor Statistics.

```{r house-price-path-bad, fig.asp = 3/4, fig.cap = '(ref:house-price-path-bad)'}

p <- ggplot(CA_house_prices) +
  aes(unemploy_perc/100, house_price_perc) + 
  geom_path(size = 1, lineend = "round", color = "#0072b2") +
  scale_x_continuous(
    limits = c(0.037, 0.143),
    name = "unemployment rate", labels = scales::percent_format(accuracy = 1),
    expand = c(0, 0)
  ) +
  scale_y_continuous(
    limits = c(-0.315, .315), expand = c(0, 0),
    breaks = c(-.3, -.15, 0, .15, .3),
    name = "12-month change in house prices", labels = scales::percent_format(accuracy = 1)
  ) + 
  coord_cartesian(clip = "off") +
  theme_dviz_grid() +
  theme(
    axis.ticks.length = unit(0, "pt"),
    plot.margin = margin(21, 14, 3.5, 1.5))

stamp_bad(p)
```

Is it better to use a connected scatter plot or two separate line graphs? Separate line graphs tend to be easier to read, but once people are used to connected scatter plots they may be able to extract certain patterns (such as cyclical behavior with some irregularity) that can be difficult to spot in line graphs. In fact, to me the cyclical relationship between change in house prices and unemployment rate is hard to spot in Figure \@ref(fig:house-price-unemploy), but the counter-clockwise spiral in Figure \@ref(fig:house-price-path) clearly shows it. Research reports that readers are more likely to confuse order and direction in a connected scatter plot than in line graphs and less likely to report correlation [@Haroz_et_al_2016]. On the flip side, connected scatter plots seem to result in higher engagement, and thus such plots may be effective tools to draw readers into a story [@Haroz_et_al_2016].

Even though connected scatter plots can show only two variables at a time, we can also use them to visualize higher-dimensional datasets. The trick is to apply dimension reduction first (see Chapter \@ref(visualizing-associations)). We can then draw a connected scatter plot in the dimension-reduced space. As an example of this approach, we will visualize a database of monthly observations of over 100 macroeconomic indicators, provided by the Federal Reserve Bank of St. Louis. We perform a principal components analysis (PCA) of all indicators and then draw a connected scatter plot of PC 2 versus PC 1 (Figure \@ref(fig:fred-md-PCA)a) and versus PC 3 (Figure \@ref(fig:fred-md-PCA)b).
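
The dimension-reduction step can be sketched with base R's `prcomp()`. The matrix `indicators` below is random noise standing in for a real table of monthly indicators, so the resulting path has no economic meaning; the sketch only shows the mechanics.

```r
set.seed(1)

# Hypothetical stand-in for a table of monthly indicators:
# 120 months (rows) by 10 indicators (columns).
months <- 120
indicators <- matrix(rnorm(months * 10), nrow = months)
colnames(indicators) <- paste0("indicator_", 1:10)

# Standardize each indicator, then compute the principal components.
pca <- prcomp(scale(indicators))

# Each row of pca$x is one month. The first two columns give its position
# in the PC 1 / PC 2 plane. Rows stay in time order, so connecting
# consecutive rows traces the path of the connected scatter plot.
path <- data.frame(month = seq_len(months), pca$x[, 1:2])
head(path)
```

Standardizing first keeps indicators measured on large numeric scales from dominating the components.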

(ref:fred-md-PCA) Visualizing a high-dimensional time series as a connected scatter plot in principal components space. The path indicates the joint movement of over 100 macroeconomic indicators from January 1990 to December 2017. Times of recession and recovery are indicated via color, and the end points of the three recessions (March 1991, November 2001, and June 2009) are also labeled. (a) PC 2 versus PC 1. (b) PC 2 versus PC 3. Data source: M. W. McCracken, St. Louis Fed

```{r fred-md-PCA, fig.width = 5*6/4.2, fig.asp = 2*0.618, fig.cap = '(ref:fred-md-PCA)'}
fred_md %>%
  select(-date, -sasdate) %>%
  scale() %>%
  prcomp() -> pca

pca_data <- data.frame(date = fred_md$date, pca$x) %>%
  mutate(
    type = ifelse(
      (ymd("1990-07-01") <= date & date < ymd("1991-03-01")) |
      (ymd("2001-03-01") <= date & date < ymd("2001-11-01")) |
      (ymd("2007-12-01") <= date & date < ymd("2009-06-01")),
      "recession",
      "recovery"
    )
  )

pca_labels <-
  mutate(pca_data,
    label = ifelse(
      date %in% c(ymd("1990-01-01"), ymd("1991-03-01"), ymd("2001-11-01"),
                  ymd("2009-06-01"), ymd("2017-12-01")),
      format(date, "%b %Y"), ""
    )
  ) %>%
  filter(label != "") %>%
  mutate(
    nudge_x = c(.2, -.2, -.2, -.2, .2),
    nudge_y = c(.2, -.2, -.2, -.2, .2),
    hjust = c(0, 1, 1, 1, 0),
    vjust = c(0, 1, 1, 1, 0),
    nudge_x2 = c(.2, .2, .2, -.2, .2),
    nudge_y2 = c(.2, -.2, -1, .2, .2),
    hjust2 = c(0, 0, .2, 1, 0),
    vjust2 = c(0, 1, 1, 1, 0)
  )

colors <- darken(c("#D55E00", "#009E73"), c(0.1, 0.1))

p1 <- ggplot(filter(pca_data, date >= ymd("1990-01-01"))) +
  aes(x=PC1, y=PC2, color=type, alpha = date, group = 1) +
  geom_path(size = 1, lineend = "butt") +
  geom_text_repel(
    data = pca_labels,
    aes(label = label),
    alpha = 1,
    point.padding = .2, color = "black",
    min.segment.length = 0, size = 12/.pt,
    family = dviz_font_family,
    nudge_x = pca_labels$nudge_x,
    nudge_y = pca_labels$nudge_y,
    hjust = pca_labels$hjust,
    vjust = pca_labels$vjust
  ) +
  scale_color_manual(
    values = colors,
    name = NULL
  ) +
  scale_alpha_date(range = c(0.45, 1), guide = "none") +
  scale_x_continuous(limits = c(-2, 15.5), name = "PC 1") +
  scale_y_continuous(limits = c(-5, 5), name = "PC 2") +
  theme_dviz_grid(14) +
  theme(
    legend.position = c(1, 1),
    legend.justification = c(1, 1),
    legend.direction = "horizontal",
    legend.box.background = element_rect(color = NA, fill = "white"),
    legend.box.margin = margin(6, 0, 6, 0),
    plot.margin = margin(3, 1.5, 6, 1.5)
  )

p2 <- ggplot(filter(pca_data, date >= ymd("1990-01-01"))) +
  aes(x=PC3, y=PC2, color=type, alpha = date, group = 1) +
  geom_path(size = 1, lineend = "butt") +
  geom_text_repel(
    data = pca_labels,
    aes(label = label),
    alpha = 1,
    point.padding = .2, color = "black",
    min.segment.length = 0, size = 12/.pt,
    family = dviz_font_family,
    nudge_x = pca_labels$nudge_x2,
    nudge_y = pca_labels$nudge_y2,
    hjust = pca_labels$hjust2,
    vjust = pca_labels$vjust2
  ) +
  scale_color_manual(
    values = colors,
    name = NULL
  ) +
  scale_alpha_date(range = c(0.45, 1), guide = "none") +
  scale_x_continuous(limits = c(-6, 8.5), name = "PC 3") +
  scale_y_continuous(limits = c(-5, 5), name = "PC 2") +
  theme_dviz_grid(14) +
  theme(
    legend.position = c(1, 1),
    legend.justification = c(1, 1),
    legend.direction = "horizontal",
    legend.box.background = element_rect(color = NA, fill = "white"),
    legend.box.margin = margin(6, 0, 6, 0),
    plot.margin = margin(6, 1.5, 3, 1.5)
  )

plot_grid(p1, p2, labels = "auto", ncol = 1)
```

Notably, Figure \@ref(fig:fred-md-PCA)a looks almost like a regular line plot, with time running from left to right. This pattern is caused by a common feature of PCA: The first component often measures the overall size of the system. Here, PC 1 approximately measures the overall size of the economy, which rarely decreases over time.
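
This behavior of the first component can be reproduced with simulated data. In the hypothetical example below, three made-up indicators grow with the system while two only fluctuate; all names and parameters are assumptions chosen for illustration.

```r
set.seed(42)
month <- 1:200

# Hypothetical indicators: three grow over time, two only fluctuate.
indicators <- cbind(
  output     = 1.0 * month + rnorm(200, sd = 5),
  employment = 0.5 * month + rnorm(200, sd = 5),
  prices     = 2.0 * month + rnorm(200, sd = 5),
  cycle_a    = sin(month / 10) + rnorm(200, sd = 0.2),
  cycle_b    = cos(month / 10) + rnorm(200, sd = 0.2)
)

pca <- prcomp(scale(indicators))

# The sign of a principal component is arbitrary, so we look at the
# absolute correlation between PC 1 and time.
pc1_vs_time <- abs(cor(pca$x[, "PC1"], month))
pc1_vs_time
```

When most variables share a common growth trend, PC 1 picks up that trend and therefore changes nearly monotonically with time, which is why plotting against PC 1 resembles plotting against time.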

By coloring the connected scatter plot by times of recession and recovery, we can see that recessions are associated with a drop in PC 2 whereas recoveries do not correspond to a clear feature in either PC 1 or PC 2 (Figure \@ref(fig:fred-md-PCA)a). The recoveries do, however, seem to correspond to a drop in PC 3 (Figure \@ref(fig:fred-md-PCA)b). Moreover, in the PC 2 versus PC 3 plot, we see that the line follows the shape of a clockwise spiral. This pattern emphasizes the cyclical nature of the economy, with recessions following recoveries and vice versa.