-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path5-files.Rmd
380 lines (268 loc) · 10.8 KB
/
5-files.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
---
title: "External data files and literate programming"
output: html_document
---
# About
`targets` can reproducibly watch input files, output files, and literate programming documents. This chapter explores how to configure a target to automatically run when a file changes.
# Setup
Run `tar_destroy()` to remove the targets from the previous chapter.
```{r}
library(targets)
tar_destroy()
```
Load this chapter's quiz questions. Try not to peek in advance.
```{r}
source("R/quiz.R")
source("5-files/answers.R")
```
Run the following command to write the required `_targets.R` script to the your working directory.
```{r, echo = FALSE}
file.copy("5-files/initial_targets.R", "_targets.R", overwrite = TRUE)
```
Open `_targets.R` for editing
```{r}
tar_edit()
```
# Review: input data files
As we saw in `3-changes.Rmd`, `targets` can watch files like `data/churn.csv` for changes. Let's review how this works. First, run the full pipeline from start to finish.
```{r}
tar_make()
```
Verify that all targets are now up to date.
```{r}
tar_make()
```
Now, remove the last line of data in the file.
```{r, message = FALSE}
library(tidyverse)
"data/churn.csv" %>%
read_csv(col_types = cols()) %>%
head(n = nrow(.) - 1) %>%
write_csv("data/churn.csv")
```
When we call `tar_make()` again, all targets should rerun because they all depend on the upstream file target `churn_file`.
```{r}
tar_make()
```
How do we configure `churn_file` and downstream targets to rerun when `data/churn.csv` changes?
A. The target's return value needs to be a character vector of file and directory paths. These paths get resolved at runtime, so we do not need to know them before we call `tar_make()`.
B. In `tar_target()`, set the `format` argument equal to `"file"`. That way, `tar_make()` knows the return value of `churn_file` is a bunch of file paths that need to be watched.
C. Targets directly downstream need to mention the symbol `churn_file` (as opposed to the literal path `"data/churn.csv"`) so `tar_make()` can discover the correct dependency relationships among targets. Always check with `tar_visnetwork()` to verify that your targets are connected properly in the dependency graph.
D. All the above.
```{r}
answer4_review("E")
```
Hint:
```{r}
tar_read(churn_file)
```
# Output files
In `targets`, we configure output files the exact same way. The only difference between input and output files is that output files are created when the target runs.
As an example, open `_targets.R` and create a new file targets `churn_cor`, which saves a CSV file and a plot of correlations (the correlation of each covariate with customer churn in the preprocessed testing data). Functions `compute_cor()` and `plot_cor()` in `5-files/functions.R` do most of the work.
Open `_targets.R` for editing.
```{r}
tar_edit()
```
Enter the new target below into the target list in `_targets.R`.
```{r, eval = FALSE}
tar_target(
churn_cor, {
cor <- compute_cor(churn_recipe)
plot <- plot_cor(cor)
write_csv(cor, "cor.csv")
ggsave(plot = plot, filename = "cor.png", width = 8, height = 8)
# The return value must be a vector of paths to the files we write:
paste0("cor.", c("csv", "png"))
},
format = "file" # Tells targets to track the return value (path) as a file.
)
```
Run the pipeline. Since all previous targets are up to date, only `churn_cor` should run.
```{r}
tar_make()
```
What is the return value of the new `churn_cor` target?
```{r}
tar_read(churn_cor)
```
A. `c("cor.csv", "cor.png")`, a vector of paths to the files produced by the target.
B. A `ggplot` object with the correlation plot.
C. A data frame of correlations.
D. A function called `churn_cor()`.
```{r}
answer4_return("E")
```
Take a look at the new output file `cor.png`. You should see a plot of each variable's correlation with customer churn. Also glance at `cor.csv`, the output dataset with the correlations.
```{r}
read_csv("cor.csv", col_types = cols())
```
Now, delete one of the output files and rerun the pipeline.
```{r}
tmp <- file.remove("cor.png")
tar_make()
```
What happened? Why?
A. All targets reran because we changed a file.
B. No target reran because `targets` does not track the deleted file.
C. Because `churn_cor` is a correctly configured file target, `tar_make()` noticed when `cor.png` changed and automatically reran the target in order to repair the file.
D. `churn_cor` because `targets` always treats character strings as file names.
```{r}
answer4_delete("E")
```
Whenever a single target like `churn_cor` tracks multiple files or directories, `tar_make()` treats all those files as a single unit. The whole target invalidates when one of the files changes, and downstream targets must accept all the files together. When we come to the chapter on branching, we will use the special `tar_files()` function from the `tarchetypes` package to branch over the available files.
# Literate programming
Literate programming is the practice of writing code and explanatory prose in the same source file. This R Markdown document is an example. All this time, we have been using literate programming on top of `targets`. But now, we will explore literate programming *within* a target.
## R Markdown on its own
Let's pull an example R Markdown file.
```{r}
tmp <- file.copy("5-files/results.Rmd", "results.Rmd", overwrite = TRUE)
```
Open `results.Rmd` for editing.
```{r}
library(usethis)
edit_file("results.Rmd", open = TRUE)
```
Notice the calls to `tar_load()` and `tar_read()` in active the code chunks. `results.Rmd`. On its own, this report leverages the results of previous targets.
```{r}
# Copy and paste the following directly into the R console.
# This code chunk will not render results.Rmd properly
# if called inside the R notebook.
library(rmarkdown)
render("results.Rmd")
browseURL("results.html")
```
## R Markdown inside a target
We can put `results.Rmd` in a target so it automatically re-renders when its dependencies change. Open `_targets.R` for editing.
```{r}
tar_edit()
```
Write `library(tarchetypes)` at the very top, and write `tar_render(report_step, "results.Rmd")` as a target in the pipeline.
```{r, eval = FALSE}
# Do not run here.
library(tarchetypes)
# Existing setup code goes here.
list(
# Existing calls to tar_target() stay here.
tar_render(report_step, "results.Rmd")
)
```
`tar_render()` analyzes `report.Rmd` and constructs a target that depends on the report's `tar_read()`/`tar_load()` dependencies. To see this for yourself, take a look at the dependency graph.
```{r}
tar_visnetwork()
```
Which targets does `report_step` depend on? Why?
A. None. No upstream targets are mentioned in the target's command.
B. `run_relu` because the report calls `tar_read(run_relu)` in an active code chunk.
C. `run_sigmoid` because the report calls `tar_load(run_sigmoid)` in an active code chunk.
D. `run_relu` and `run_sigmoid` because the report calls `tar_read(run_relu)` and `tar_load(run_sigmoid)` in active code chunks.
```{r}
answer4_deps1("E")
```
## Add a dependency
At the bottom of the `report.Rmd`, add an active code chunk to print out the best model (`tar_read(best_model)`). Then, look at the graph again.
```{r}
tar_visnetwork()
```
What changed? Why?
A. `report_step` is disconnected from the other nodes in the graph because we edited the report.
B. The graph did not change because we did not run the report yet.
C. `report_step` now depends on `best_model` in addition to `run_relu` and `run_sigmoid`. All three are mentioned in active code chunks with `tar_load()` and `tar_read()`.
D. `report_step` is only connected to `best_model` now because you just added it as a dependency.
```{r}
answer4_deps2("E")
```
## Run the report
Run the whole pipeline. The newly added `report_step` target should run.
```{r}
tar_make()
```
Verify that all targets are up to date now.
```{r}
tar_make() # See also tar_outdated()
```
View `results.html`. You should see a print-out of the best model at the bottom.
```{r}
browseURL("results.html")
```
## Remove the output file
Remove the output HTML file.
```{r}
unlink("results.html")
```
Then rerun the pipeline.
```{r}
tar_make()
```
What happened? Why?
A. All targets reran because the file system change.
B. The `report_step` target reran because `tar_render()` reproducibly tracks the output files of `rmarkdown::render()` and helps `tar_make()` respond to changes in `results.html`.
C. The `report_step` target reran because `results.Rmd` changed.
D. No targets reran because output files from R Markdown reports are not reproducibly tracked.
```{r}
answer4_html("E")
```
## Change the R Markdown source
Add some prose anywhere in the body of `results.Rmd`.
```{r}
library(usethis)
edit_file("results.Rmd", open = TRUE)
# Add comments in the report to explain the results.
```
Now, rerun the pipeline.
```{r}
tar_make()
```
What happened? Why?
A. Only `report_step` reran because the R Markdown source file changed and all the other targets stayed up to date.
B. Only `report_step` reran because R Markdown reports always rerun.
C. All targets reran because `results.Rmd` is a dependency of the whole pipeline.
D. No targets reran because the pipeline does not track changes to the R Markdown source.
```{r}
answer4_rmd("E")
```
## Change an upstream dependency
Remove another line of `data/churn.csv`.
```{r, message = FALSE}
"data/churn.csv" %>%
read_csv(col_types = cols()) %>%
head(n = nrow(.) - 1) %>%
write_csv("data/churn.csv")
```
Now, rerun the pipeline.
```{r}
tar_make()
```
Did `report_step` rerun? Why?
A. No, because `data/churn.csv` is not a dependency of `report_step`.
B. No, because the `report.Rmd` does not read `data/churn.csv`.
C. Yes. The R Markdown report is downstream of a target that depends on `data/churn.csv`, and the change in the data file caused a chain reaction that changed the model outputs and thus invalidated `report_step`.
D. Yes. If a single target imports a data file, all targets in the pipeline rerun.
```{r}
answer4_rmd_data("E")
```
## An aside: parameterized R Markdown
[`tarchetypes::tar_render()`](https://docs.ropensci.org/tarchetypes/reference/tar_render.html) supports [parameterized R Markdown](https://rmarkdown.rstudio.com/developer_parameterized_reports.html), and parameters can be values of upstream targets in the pipeline. In the following pipeline:
```{r, eval = FALSE}
list(
tar_target(data, data.frame(x = seq_len(26), y = letters))
tar_render(report, "report.Rmd", params = list(your_param = data))
)
```
the `report` target will run:
```{r, eval = FALSE}
rmarkdown::render("report.Rmd", params = list(your_param = your_target))
```
where `report.Rmd` has the following YAML front matter:
```{yaml}
---
title: report
output_format: html_document
params:
your_param: "default value"
---
```
and the following code chunk:
```{r, eval = FALSE}
print(params$your_param)
```
See [these examples](https://docs.ropensci.org/tarchetypes/reference/tar_render.html#examples) for a demonstration.