-
Notifications
You must be signed in to change notification settings - Fork 701
/
Copy pathchoosing_visualization_software.Rmd
171 lines (135 loc) · 18.9 KB
/
choosing_visualization_software.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(ggthemes)
library(forcats)
```
# Choosing the right visualization software {#choosing-visualization-software}
Throughout this book, I have purposefully avoided one critical question of data visualization: How do we actually generate our figures? What tools should we use? This question can generate heated discussions, as many people have strong emotional bonds to the specific tools they are familiar with. I have often seen people vigorously defend their own preferred tools instead of investing time to learn a new approach, even if the new approach has objective benefits. And I will say that sticking with the tools you know is not entirely unreasonable. Learning any new tool will require time and effort, and you will have to go through a painful transition period where getting things done with the new tool is much more difficult than it was with the old tool. Whether going through this period is worth the effort can usually only be evaluated in retrospect, after one has made the investment to learn the new tool. Therefore, regardless of the pros and cons of different tools and approaches, the overriding principle is that you need to pick a tool that works for you. If you can make the figures you want to make, without excessive effort, then that's all that matters.
```{block type='rmdtip', echo=TRUE}
The best visualization software is the one that allows you to make the figures you need.
```
Having said this, I do think there are general principles we can use to assess the relative merits of different approaches to producing visualizations. These principles roughly break down by how reproducible the visualizations are, how easy it is to rapidly explore the data, and to what extent the visual appearance of the output can be tweaked.
## Reproducibility and repeatability
In the context of scientific experiments, we refer to work as reproducible if the overarching scientific finding of the work will remain unchanged if a different research group performs the same type of study. For example, if one research group finds that a new pain medication reduces perceived headache pain significantly without causing noticeable side effects and a different group subsequently studies the same medication on a different patient group and has the same findings, then the work is reproducible. By contrast, work is repeatable if very similar or identical measurements can be obtained by the same person repeating the exact same measurement procedure on the same equipment. For example, if I weigh my dog and find she weighs 41 lbs and then I weigh her again on the same scales and find again that she weighs 41 lbs, then this measurement is repeatable.
With minor modifications, we can apply these concepts to data visualization. A visualization is reproducible if the plotted data are available and any data transformations that may have been applied are exactly specified. For example, if you make a figure and then send me the exact data that you plotted, then I can prepare a figure that looks substantially similar. We may be using slightly different fonts or colors or point sizes to display the same data, so the two figures may not be exactly identical, but your figure and mine convey the same message and therefore are reproductions of each other. A visualization is repeatable, on the other hand, if it is possible to recreate the exact same visual appearance, down to the last pixel, from the raw data. Strictly speaking, repeatability requires that even if there are random elements in the figure, such as jitter (Chapter \@ref(overlapping-points)), those elements were specified in a repeatable way and can be regenerated at a future date. For random data, repeatability generally requires that we specify a particular random number generator for which we set and record a seed.
Throughout this book, we have seen many examples of figures that reproduce but don't repeat other figures. For example, Chapter \@ref(avoid-line-drawings) shows several sets of figures where all figures in each set show the same data but each figure in each set looks somewhat different. Similarly, Figure \@ref(fig:lincoln-repro)a is a repeat of Figure \@ref(fig:lincoln-temp-jittered), down to the random jitter that was applied to each data point, whereas Figure \@ref(fig:lincoln-repro)b is only a reproduction of that figure. Figure \@ref(fig:lincoln-repro)b has different jitter than Figure \@ref(fig:lincoln-temp-jittered), and it also uses a sufficiently different visual design that the two figures look quite distinct, even if they clearly convey the same information about the data.
(ref:lincoln-repro) Repeat and reproduction of a figure. Part (a) is a repeat of Figure \@ref(fig:lincoln-temp-jittered). The two figures are identical down to the random jitter that was applied to each point. By contrast, part (b) is a reproduction but not a repeat. In particular, the jitter in part (b) differs from the jitter in part (a) or in Figure \@ref(fig:lincoln-temp-jittered).
```{r lincoln-repro, fig.width = 5.5*6/4.2, fig.asp = .32, fig.cap = '(ref:lincoln-repro)'}
ggridges::lincoln_weather %>%
mutate(
month_short = fct_recode(
Month,
Jan = "January",
Feb = "February",
Mar = "March",
Apr = "April",
May = "May",
Jun = "June",
Jul = "July",
Aug = "August",
Sep = "September",
Oct = "October",
Nov = "November",
Dec = "December"
)
) %>%
mutate(month_short = fct_rev(month_short)) -> lincoln_df
lincoln1 <- ggplot(
lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`)
) +
geom_point(
position = position_jitter(width = .15, height = 0, seed = 320),
size = .35
) +
xlab("month") +
ylab("mean temperature (°F)") +
theme_dviz_open(.655*14) +
theme(
axis.line = element_line(size = .655*.5),
axis.ticks = element_line(size = .655*.5),
plot.margin = margin(2, 6, 2, 1.5)
)
lincoln2 <- ggplot(
lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`)
) +
geom_point(
position = position_jitter(width = .2, height = 0, seed = 321),
size = .5, color = darken("#0072B2", .3)
) +
xlab("month") +
ylab("mean temperature (°F)") +
theme_dviz_grid(12) +
theme(
axis.ticks.length = grid::unit(0, "pt"),
axis.ticks = element_blank()
)
plot_grid(lincoln1, lincoln2, labels = "auto", ncol = 2)
```
Both reproducibility and repeatability can be difficult to achieve when we're working with interactive plotting software. Many interactive programs allow you to transform or otherwise manipulate the data but don't keep track of every individual data transformation you perform, only of the final product. If you make a figure using this kind of a program, and then somebody asks you to reproduce the figure or create similar one with a different data set, you might have difficulty to do so. During my years as a postdoc and a young assistant professor, I used an interactive program for all my scientific visualizations, and this exact issue happened to me several times. For example, I had made several figures for a scientific manuscript. When I wanted to revise the manuscript a few months later and needed to reproduce a slightly altered version of one of the figures, I realized that I wasn't quite sure anymore how I had made the original figure in the first place. This experience has taught me to stay away from interactive programs as much as possible. I now make figures programmatically, by writing code (scripts) that generates the figures from the raw data. Programmatically generated figures will generally be repeatable by anybody who has access to the generating scripts and the programming language and specific libraries used.
## Data exploration versus data presentation
There are two distinct phases of data visualization, and they have very different requirements. The first is data exploration. Whenever you start working with a new dataset, you need to look at it from different angles and try various ways of visualizing it, just to develop an understanding of the dataset's key features. In this phase, speed and efficiency are of the essence. You need to try different types of visualizations, different data transformations, and different subsets of the data. The faster you can iterate through different ways of looking at the data, the more you will explore, and the higher the likelihood that you will notice an important feature in the data that you might otherwise have overlooked. The second phase is data presentation. You enter it once you understand your dataset and know what aspects of it you want to show to your audience. The key objective in this phase is to prepare a high-quality, publication-ready figure that can be printed in an article or book, included in a presentation, or posted on the internet.
In the exploration stage, whether the figures you make look appealing is secondary. It's fine if the axis labels are missing, the legend is messed up, or the symbols are too small, as long as you can evaluate the various patterns in the data. What is critical, however, is how easy it is for you to change how the data are shown. To truly explore the data, you should be able to rapidly move from a scatter plot to overlapping density distribution plots to boxplots to a heatmap. In Chapter \@ref(aesthetic-mapping), we have discussed how all visualizations consist of mappings from data onto aesthetics. A well-designed data exploration tool will allow you to easily change which variables are mapped onto which aesthetics, and it will provide a wide range of different visualization options within a single coherent framework. In my experience, however, many visualization tools (and in particular libraries for programmatic figure generation) are not set up in this way. Instead, they are organized by plot type, where each different type of plot requires somewhat different input data and has its own idiosyncratic interface. Such tools can get in the way of efficient data exploration, because it's difficult to remember how all the different plot types work. I encourage you to carefully evaluate whether your visualization software allows for rapid data exploration or whether it tends to get in the way. If it more frequently tends to get in the way, you may benefit from exploring alternative visualization options.
Once we have determined how exactly we want to visualize our data, what data transformations we want to make, and what type of plot to use, we will commonly want to prepare a high-quality figure for publication. At this point, we have several different avenues we can pursue. First, we can finalize the figure using same software platform we used for initial exploration. Second, we can switch platform to one that provides us finer control over the final product, even if that platform makes it harder to explore. Third, we can produce a draft figure with a visualization software and then manually post-process with an image manipulation or illustration program such as Photoshop or Illustrator. Fourth, we can manually redraw the entire figure from scratch, either with pen and paper or using an illustration program.
All these avenues are reasonable. However, I would like to caution against manually sprucing up figures in routine data analysis pipelines or for scientific publications. Manual steps in the figure preparation pipeline make repeating or reproducing a figure inherently difficult and time-consuming. And in my experience from working in the natural sciences, we rarely make a figure just once. Over the course of a study, we may redo experiments, expand the original dataset, or repeat an experiment several times with slightly altered conditions. I've seen it many times that late in the publication process, when we think everything is done and finalized, we end up introducing a small modification to how we analyze our data, and consequently all figures have to be redrawn. And I've also seen, in similar situations, that a decision is made not to redo the analysis or not to redraw the figures, either due to the effort involved or because the people who had made the original figure have moved on and aren't available anymore. In all these scenarios, an unnecessarily complicated and non-reproducible data visualization pipeline interferes with producing the best possible science.
Having said this, I have no principled concern about hand-drawn figures or figures that have been manually post-processed, for example to change axis labels, add annotations, or modify colors. These approaches can yield beautiful and unique figures that couldn't easily be made in any other way. In fact, as sophisticated and polished computer-generated visualizations are becoming increasingly commonplace, I observe that manually drawn figures are making somewhat of a resurgence (see Figure \@ref(fig:sequencing-cost) for an example). I think this is the case because such figures represent a unique and personalized take on what might otherwise be a somewhat sterile and routine presentation of data.
(ref:sequencing-cost) After the introduction of next-gen sequencing methods, the sequencing cost per genome has declined much more rapidly than predicted by Moore's law. This hand-drawn figure reproduces a widely publicized visualization prepared by the National Institutes of Health. Data source: National Human Genome Research Institute
```{r sequencing-cost, out.width = "70%", fig.cap='(ref:sequencing-cost)'}
knitr::include_graphics("figures/sequencing_costs.png", auto_pdf = FALSE)
```
## Separation of content and design
A good visualization software should allow you to think separately about the content and the design of your figures. By content, I refer to the specific data set shown, the data transformations applied (if any), the specific mappings from data onto aesthetics, the scales, the axis ranges, and the type of plot (scatter plot, line plot, bar plot, boxplot, etc.). Design, on the other hand, describes features such as the foreground and background colors, font specifications (e.g. font size, face, and family), symbol shapes and sizes, the placement of legends, axis ticks, axis titles, and plot titles, and whether or not the figure has a background grid. When I work on a new visualization, I usually determine first what the contents should be, using the kind of rapid exploration described in the previous subsection. Once the contents is set, I may tweak the design, or more likely I will apply a pre-defined design that I like and/or that gives the figure a consistent look in the context of a larger body of work.
In the software I have used for this book, ggplot2, separation of content and design is achieved via themes. A theme specifies the visual appearance of a figure, and it is easy to take an existing figure and apply different themes to it (Figure \@ref(fig:unemploy-themes)). Themes can be written by third parties and distributed as R packages. Through this mechanism, a thriving ecosystem of add-on themes has developed around ggplot2, and it covers a wide range of different styles and application scenarios. If you're making figures with ggplot2, you can almost certainly find an existing theme that satisfies your design needs.
(ref:unemploy-themes) Number of unemployed persons in the U.S. from 1970 to 2015. The same figure is displayed using four different ggplot2 themes: (a) the default theme for this book; (b) the default theme of ggplot2, the plotting software I have used to make all figures in this book; (c) a theme that mimics visualizations shown in the Economist; (d) a theme that mimics visualizations shown by FiveThirtyEight. FiveThirtyEight often forgoes axis labels in favor of plot titles and subtitles, and therefore I have adjusted the figure accordingly. Data source: U.S. Bureau of Labor Statistics
```{r unemploy-themes, fig.width = 5.5*6/4.2, fig.asp = 0.75, fig.cap = '(ref:unemploy-themes)'}
unemploy_base <- ggplot(economics, aes(x = date, y = unemploy)) +
scale_y_continuous(
name = "unemployed (x1000)",
limits = c(0, 17000),
breaks = c(0, 5000, 10000, 15000),
labels = c("0", "5000", "10,000", "15,000"),
expand = c(0.04, 0)
) +
scale_x_date(
name = "year",
expand = c(0.01, 0)
)
unemploy_p1 <- unemploy_base +
theme_dviz_grid(12) +
geom_line(color = "#0072B2") +
theme(
axis.ticks.length = grid::unit(0, "pt"),
axis.ticks = element_blank(),
plot.margin = margin(6, 6, 6, 2)
)
unemploy_p2 <- unemploy_base + geom_line() + theme_gray()
unemploy_p3 <- unemploy_base +
geom_line(aes(color = "unemploy"), show.legend = FALSE, size = .75) +
theme_economist() +
scale_color_economist() +
theme(panel.grid.major = element_line(size = .75))
unemploy_p4 <- unemploy_base +
geom_line(aes(color = "unemploy"), show.legend = FALSE) +
scale_color_fivethirtyeight() +
labs(
title = "United States unemployment",
subtitle = "Unemployed persons (in thousands) from\n1967 to 2015"
) +
theme_fivethirtyeight() +
theme(
plot.title = element_text(size = 12, margin = margin(0, 0, 3, 0)),
plot.subtitle = element_text(size = 10, lineheight = 1)
)
plot_grid(
unemploy_p1, NULL, unemploy_p2,
NULL, NULL, NULL,
unemploy_p3, NULL, unemploy_p4,
labels = c("a", "", "b", "", "", "", "c", "", "d"),
hjust = -.5,
vjust = 1.5,
rel_widths = c(1, .02, 1),
rel_heights = c(1, .02, 1)
)
```
Separation of content and design allows data scientists and designers to each focus on what they do best. Most data scientists are not designers, and therefore their primary concern should be the data, not the design of a visualization. Likewise, most designers are not data scientists, and they should be able provide a unique and appealing visual language for figures without having to worry about specific data, appropriate transformations, and so on. The same principle of separating content and design has long been followed in the publishing world of books, magazines, newspapers, and websites, where writers provide content but not layout or design. Layout and design are created by a separate group of people who specialize in this area and who ensure that the publication appears in a visually consistent and appealing style. This principle is logical and useful, but it is not yet that widespread in the data visualization world.
In summary, when choosing your visualization software, think about how easily you can reproduce figures and redo them with updated or otherwise changed datasets, whether you can rapidly explore different visualizations of the same data, and to what extent you can tweak the visual design separately from generating the figure content. Depending on your skill level and comfort with programming, it may be beneficial to use different visualization tools at the data exploration and the data presentation stages, and you may prefer to do the final visual tweaking interactively or by hand. If you have to make figures interactively, in particular with a software that does not keep track of all data transformations and visual tweaks you have applied, consider taking careful notes on how you make each figure, so that all your work remains reproducible.