5-files.Rmd

---
title: "External data files and literate programming"
output: html_document
---

# About

`targets` can reproducibly watch input files, output files, and literate programming documents. This chapter explores how to configure a target to automatically run when a file changes.

# Setup

Run `tar_destroy()` to remove the targets from the previous chapter.

```{r}
library(targets)
tar_destroy()
```

Load this chapter's quiz questions. Try not to peek in advance.

```{r}
source("R/quiz.R")
source("5-files/answers.R")
```

Run the following command to write the required `_targets.R` script to the your working directory.

```{r, echo = FALSE}
file.copy("5-files/initial_targets.R", "_targets.R", overwrite = TRUE)
```

Open `_targets.R` for editing

```{r}
tar_edit()
```

# Review: input data files

As we saw in `3-changes.Rmd`, `targets` can watch files like `data/churn.csv` for changes. Let's review how this works. First, run the full pipeline from start to finish.

```{r}
tar_make()
```

Verify that all targets are now up to date.

```{r}
tar_make()
```

Now, remove the last line of data in the file.

```{r, message = FALSE}
library(tidyverse)
"data/churn.csv" %>%
  read_csv(col_types = cols()) %>%
  head(n = nrow(.) - 1) %>%
  write_csv("data/churn.csv")
```

When we call `tar_make()` again, all targets should rerun because they all depend on the upstream file target `churn_file`.

```{r}
tar_make()
```

How do we configure `churn_file` and downstream targets to rerun when `data/churn.csv` changes?

A. The target's return value needs to be a character vector of file and directory paths. These paths get resolved at runtime, so we do not need to know them before we call `tar_make()`. 
B. In `tar_target()`, set the `format` argument equal to `"file"`. That way, `tar_make()` knows the return value of `churn_file` is a bunch of file paths that need to be watched.
C. Targets directly downstream need to mention the symbol `churn_file` (as opposed to the literal path `"data/churn.csv"`) so `tar_make()` can discover the correct dependency relationships among targets. Always check with `tar_visnetwork()` to verify that your targets are connected properly in the dependency graph.
D. All the above.

```{r}
answer4_review("E")
```

Hint:

```{r}
tar_read(churn_file)
```

# Output files

In `targets`, we configure output files the exact same way. The only difference between input and output files is that output files are created when the target runs.

As an example, open `_targets.R` and create a new file targets `churn_cor`, which saves a CSV file and a plot of correlations (the correlation of each covariate with customer churn in the preprocessed testing data). Functions `compute_cor()` and `plot_cor()` in `5-files/functions.R` do most of the work.

Open `_targets.R` for editing.

```{r}
tar_edit()
```

Enter the new target below into the target list in `_targets.R`.

```{r, eval = FALSE}
tar_target(
  churn_cor, {
    cor <- compute_cor(churn_recipe)
    plot <- plot_cor(cor)
    write_csv(cor, "cor.csv")
    ggsave(plot = plot, filename = "cor.png", width = 8, height = 8)
    # The return value must be a vector of paths to the files we write:
    paste0("cor.", c("csv", "png"))
  },
  format = "file" # Tells targets to track the return value (path) as a file.
)
```

Run the pipeline. Since all previous targets are up to date, only `churn_cor` should run.

```{r}
tar_make()
```

What is the return value of the new `churn_cor` target?

```{r}
tar_read(churn_cor)
```

A. `c("cor.csv", "cor.png")`, a vector of paths to the files produced by the target.
B. A `ggplot` object with the correlation plot.
C. A data frame of correlations.
D. A function called `churn_cor()`.

```{r}
answer4_return("E")
```

Take a look at the new output file `cor.png`. You should see a plot of each variable's correlation with customer churn. Also glance at `cor.csv`, the output dataset with the correlations.

```{r}
read_csv("cor.csv", col_types = cols())
```

Now, delete one of the output files and rerun the pipeline.

```{r}
tmp <- file.remove("cor.png")
tar_make()
```

What happened? Why?

A. All targets reran because we changed a file.
B. No target reran because `targets` does not track the deleted file.
C. Because `churn_cor` is a correctly configured file target, `tar_make()` noticed when `cor.png` changed and automatically reran the target in order to repair the file.
D. `churn_cor` because `targets` always treats character strings as file names.

```{r}
answer4_delete("E")
```

Whenever a single target like `churn_cor` tracks multiple files or directories, `tar_make()` treats all those files as a single unit. The whole target invalidates when one of the files changes, and downstream targets must accept all the files together. When we come to the chapter on branching, we will use the special `tar_files()` function from the `tarchetypes` package to branch over the available files.

# Literate programming
 
Literate programming is the practice of writing code and explanatory prose in the same source file. This R Markdown document is an example. All this time, we have been using literate programming on top of `targets`. But now, we will explore literate programming *within* a target.

## R Markdown on its own

Let's pull an example R Markdown file.

```{r}
tmp <- file.copy("5-files/results.Rmd", "results.Rmd", overwrite = TRUE)
```

Open `results.Rmd` for editing.

```{r}
library(usethis)
edit_file("results.Rmd", open = TRUE)
```

Notice the calls to `tar_load()` and `tar_read()` in active the code chunks. `results.Rmd`. On its own, this report leverages the results of previous targets.

```{r}
# Copy and paste the following directly into the R console.
# This code chunk will not render results.Rmd properly
# if called inside the R notebook.
library(rmarkdown)
render("results.Rmd")
browseURL("results.html")
```

## R Markdown inside a target

We can put `results.Rmd` in a target so it automatically re-renders when its dependencies change. Open `_targets.R` for editing.

```{r}
tar_edit()
```

Write `library(tarchetypes)` at the very top, and write `tar_render(report_step, "results.Rmd")` as a target in the pipeline.

```{r, eval = FALSE}
# Do not run here.
library(tarchetypes)
# Existing setup code goes here.
list(
  # Existing calls to tar_target() stay here.
  tar_render(report_step, "results.Rmd")
)
```

`tar_render()` analyzes `report.Rmd` and constructs a target that depends on the report's `tar_read()`/`tar_load()` dependencies. To see this for yourself, take a look at the dependency graph.

```{r}
tar_visnetwork()
```

Which targets does `report_step` depend on? Why?

A. None. No upstream targets are mentioned in the target's command.
B. `run_relu` because the report calls `tar_read(run_relu)` in an active code chunk.
C. `run_sigmoid` because the report calls `tar_load(run_sigmoid)` in an active code chunk.
D. `run_relu` and `run_sigmoid` because the report calls `tar_read(run_relu)` and `tar_load(run_sigmoid)` in active code chunks.

```{r}
answer4_deps1("E")
```

## Add a dependency

At the bottom of the `report.Rmd`, add an active code chunk to print out the best model (`tar_read(best_model)`). Then, look at the graph again.

```{r}
tar_visnetwork()
```

What changed? Why?

A. `report_step` is disconnected from the other nodes in the graph because we edited the report.
B. The graph did not change because we did not run the report yet.
C. `report_step` now depends on `best_model` in addition to `run_relu` and `run_sigmoid`. All three are mentioned in active code chunks with `tar_load()` and `tar_read()`.
D. `report_step` is only connected to `best_model` now because you just added it as a dependency.

```{r}
answer4_deps2("E")
```

## Run the report

Run the whole pipeline. The newly added `report_step` target should run. 

```{r}
tar_make()
```

Verify that all targets are up to date now.

```{r}
tar_make() # See also tar_outdated()
```

View `results.html`. You should see a print-out of the best model at the bottom.

```{r}
browseURL("results.html")
```

## Remove the output file

Remove the output HTML file.

```{r}
unlink("results.html")
```

Then rerun the pipeline.

```{r}
tar_make()
```

What happened? Why?

A. All targets reran because the file system change.
B. The `report_step` target reran because `tar_render()` reproducibly tracks the output files of `rmarkdown::render()` and helps `tar_make()` respond to changes in `results.html`.
C. The `report_step` target reran because `results.Rmd` changed.
D. No targets reran because output files from R Markdown reports are not reproducibly tracked.

```{r}
answer4_html("E")
```

## Change the R Markdown source

Add some prose anywhere in the body of `results.Rmd`.

```{r}
library(usethis)
edit_file("results.Rmd", open = TRUE)
# Add comments in the report to explain the results.
```

Now, rerun the pipeline.

```{r}
tar_make()
```

What happened? Why?

A. Only `report_step` reran because the R Markdown source file changed and all the other targets stayed up to date.
B. Only `report_step` reran because R Markdown reports always rerun.
C. All targets reran because `results.Rmd` is a dependency of the whole pipeline.
D. No targets reran because the pipeline does not track changes to the R Markdown source.

```{r}
answer4_rmd("E")
```

## Change an upstream dependency

Remove another line of `data/churn.csv`.

```{r, message = FALSE}
"data/churn.csv" %>%
  read_csv(col_types = cols()) %>%
  head(n = nrow(.) - 1) %>%
  write_csv("data/churn.csv")
```

Now, rerun the pipeline.

```{r}
tar_make()
```

Did `report_step` rerun? Why?

A. No, because `data/churn.csv` is not a dependency of `report_step`.
B. No, because the `report.Rmd` does not read `data/churn.csv`.
C. Yes. The R Markdown report is downstream of a target that depends on `data/churn.csv`, and the change in the data file caused a chain reaction that changed the model outputs and thus invalidated `report_step`.
D. Yes. If a single target imports a data file, all targets in the pipeline rerun.

```{r}
answer4_rmd_data("E")
```

## An aside: parameterized R Markdown

[`tarchetypes::tar_render()`](https://docs.ropensci.org/tarchetypes/reference/tar_render.html) supports [parameterized R Markdown](https://rmarkdown.rstudio.com/developer_parameterized_reports.html), and parameters can be values of upstream targets in the pipeline. In the following pipeline:

```{r, eval = FALSE}
list(
  tar_target(data, data.frame(x = seq_len(26), y = letters))
  tar_render(report, "report.Rmd", params = list(your_param = data))
)
```

the `report` target will run:

```{r, eval = FALSE}
rmarkdown::render("report.Rmd", params = list(your_param = your_target))
```

where `report.Rmd` has the following YAML front matter:

```{yaml}
---
title: report
output_format: html_document
params:
  your_param: "default value"
---
```

and the following code chunk:

```{r, eval = FALSE}
print(params$your_param)
```

See [these examples](https://docs.ropensci.org/tarchetypes/reference/tar_render.html#examples) for a demonstration.