generated from jhudsl/OTTR_Template
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-data-wrangling1.Rmd
192 lines (111 loc) · 10 KB
/
03-data-wrangling1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
# Data Wrangling with Tidy Data, Part 1
```{r, echo=F, message=F, warning=F, error=F}
install.packages("palmerpenguins", repos = "http://cran.us.r-project.org")
library(palmerpenguins)
library(tidyverse)
load(url("https://github.com/fhdsl/Intro_to_R/raw/main/classroom_data/CCLE.RData"))
```
From our first two lessons, we are now equipped with enough fundamental programming skills to apply it to various steps in the data science workflow, which is a natural cycle that occurs in data analysis.
![Data science workflow. Image source: [R for Data Science.](https://r4ds.hadley.nz/whole-game)](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="550"}
For the rest of the course, we focus on *Transform* and *Visualize* with the assumption that our data is in a nice, "Tidy format". First, we need to understand what it means for a data to be "Tidy".
## Tidy Data
Here, we describe a standard of organizing data. It is important to have standards, as it facilitates a consistent way of thinking about data organization and building tools (functions) that make use of that standard. The principles of **tidy data**, developed by Hadley Wickham:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
If you want to be technical about what variables and observations are, Hadley Wickham describes:
> A **variable** contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
![A tidy dataframe. Image source: [R for Data Science](https://r4ds.hadley.nz/data-tidy).](https://r4ds.hadley.nz/images/tidy-1.png){width="800"}
## Examples and counter-examples of Tidy Data:
Consider the following three datasets, which all contain the exact same information:
```{r}
table1
```
This `table1` satisfies the the definition of Tidy Data. The observation is a country's year, and the variables are attributes of each country's year.
```{r}
head(table2)
```
Something is strange able `table2`. The observation is still a country's year, but "type" and "count" are not clear attributes of each country's year.
```{r}
table3
```
In `table3`, we have multiple values for each cell under the "rate" column.
## Our working Tidy Data: DepMap Project
The [Dependency Map project](https://depmap.org/portal/) is a multi-omics profiling of cancer cell lines combined with functional assays such as CRISPR and drug sensitivity to help identify cancer vulnerabilities and drug targets. Here are some of the data that we have public access. We have been looking at the metadata since last session.
- Metadata
- Somatic mutations
- Gene expression
- Drug sensitivity
- CRISPR knockout
- and more...
Let's see how these datasets fit the definition of Tidy data:
```
```
| Dataframe | The observation is | Some variables are | Some values are |
|------------------|------------------|-------------------|------------------|
| metadata | Cell line | ModelID, Age, OncotreeLineage | "ACH-000001", 60, "Myeloid" |
| expression | Cell line | KRAS_Exp | 2.4, .3 |
| mutation | Cell line | KRAS_Mut | TRUE, FALSE |
## Transform: "What do you want to do with this dataframe"?
Remember that a major theme of the course is about: **How we organize ideas \<-\> Instructing a computer to do something.**
Until now, we haven't focused too much on how we organize our scientific ideas to interact with what we can do with code. Let's pivot to write our code driven by our scientific curiosity. After we are sure that we are working with Tidy data, we can ponder how we want to transform our data that satisfies our scientific question. We will look at several ways we can transform tidy data, starting with subsetting columns and rows.
Here's a starting prompt:
> In the `metadata` dataframe, which rows would you filter for and columns would you select that relate to a scientific question?
We should use the implicit subsetting mindset here: ie. "I want to filter for rows such that the Subtype is breast cancer and look at the Age and Sex." and *not* "I want to filter for rows 20-50 and select columns 2 and 8".
*Notice that when we filter for rows in an implicit way, we often formulate our criteria about the columns.*
(This is because we are guaranteed to have column names in dataframes, but not usually row names. Some dataframes have row names, but because the data types are not guaranteed to have the same data type across rows, it makes describing by row properties difficult.)
Let's convert our implicit subsetting criteria into code!
```{r}
metadata_filtered = filter(metadata, OncotreeLineage == "Breast")
breast_metadata = select(metadata_filtered, ModelID, Age, Sex)
head(breast_metadata)
```
Here, `filter()` and `select()` are functions from the `tidyverse` package, which we have to install and load in via `library(tidyverse`) before using these functions.
### Filter rows
Let's carefully a look what how the R Console is interpreting the `filter()` function:
- We evaluate the expression right of `=`.
- The first argument of `filter()` is a dataframe, which we give `metadata`.
- The second argument is strange: the expression we give it looks like a logical indexing vector built from a comparison operator, but the variable `OncotreeLineage` does not exist in our environment! Rather, `OncotreeLineage` is a column from `metadata`, and we are referring to it as a **data variable** in the context of the dataframe `metadata`. So, we make a comparison operation on the column `OncotreeLineage` from `metadata` and its resulting logical indexing vector is the input to the second argument.
- How do we know when a variable being used is a variable from the environment, or a data variable from a dataframe? It's not clear cut, but here's a rule of thumb: most functions from the `tidyverse` package allows you to use data variables to refer to columns of a dataframe. We refer to documentation when we are not sure.
- This encourages more *readable* code at the expense of consistency of referring to variables in the environment. The authors of this package [describes this trade-off](https://dplyr.tidyverse.org/articles/programming.html#data--and-env-variables).
- Putting it together, `filter()` takes in a dataframe, and an logical indexing vector described by data variables as arguments, and returns a data frame with rows that match condition described by the logical indexing vector.
- Store this in `metadata_filtered` variable.
### Select columns
Let's carefully a look what how the R Console is interpreting the `select()` function:
- We evaluate the expression right of `=`.
- The first argument of `filter()` is a dataframe, which we give `metadata`.
- The second and third arguments are **data variables** referring the columns of `metadata`.
- For certain functions like `filter()`, there is no limit on the number of arguments you provide. You can keep adding data variables to select for more column names.
- Putting it together, `select()` takes in a dataframe, and as many data variables you like to select columns, and returns a dataframe with the columns you described by data variables.
- Store this in `breast_metadata` variable.
## Summary Statistics
Now that your dataframe has be transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarizes the observations of a variable in a numeric summary.
If the columns of interest are numeric, then you can try functions such as `mean()`, `median()`, `mode()`, or `summary()` to get summary statistics of the column. If the columns of interest is character or logical, then you can try the `table()` function.
All of these functions take in a vector as input and not a dataframe, so you have to access the column as a vector via the `$` operation.
```{r}
mean(breast_metadata$Age, na.rm = TRUE)
table(breast_metadata$Sex)
```
## Pipes
Often, in data analysis, we want to transform our dataframe in multiple steps via different functions. This leads to nested function calls, like this:
```{r}
breast_metadata = select(filter(metadata, OncotreeLineage == "Breast"), ModelID, Age, Sex)
```
This is a bit hard to read. A computer doesn't care how difficult it is to read this line of code, but there is a lot of instructions going on in one line of code. This multi-step function composition will lead to an unreadable pattern such as:
```
result = function3(function2(function1(dataframe, df_col4, df_col2), arg2), df_col5, arg1)
```
To untangle this, you have to look into the middle of this code, and slowly step out of it.
To make this more readable, programmers came up with an alternative syntax for function composition via the **pipe** metaphor. The ideas is that we push data through a chain of connected pipes, in which the output of a pipe becomes the input of the subsequent pipe.
Instead of a syntax like `result2 = function3(function2(function1(dataframe)))`,
we linearize it with the `%>%` symbol: `result2 = dataframe %>% function1 %>% function2 %>% function3`.
In the previous example,
```
result = dataframe %>% function1(df_col4, df_col2) %>%
function2(arg2) %>%
function3(df_col5, arg1)
```
This looks much easier to read. Notice that we have broken up one expression in to three lines of code for readability. If a line of code is incomplete (the first line of code is piping to somewhere unfinished), the R will treat the next line of code as part of the current line of code.
Try to rewrite the `select()` and `filter()` function composition example above using the pipe metaphor and syntax.
## Exercises
You can find [exercises and solutions on Posit Cloud](https://posit.cloud/content/8245357), or on [GitHub](https://github.com/fhdsl/Intro_to_R_Exercises).