Merge pull request #656 from jhudsl/winter25_data_summarization
Winter25 data summarization
clifmckee authored Jan 9, 2025
2 parents 6ebe834 + 101ba61 commit a5d0453
Showing 3 changed files with 94 additions and 86 deletions.
156 changes: 82 additions & 74 deletions modules/Data_Summarization/Data_Summarization.Rmd
@@ -8,11 +8,8 @@ output:


```{r, echo = FALSE, message=FALSE, error = FALSE}
library(knitr)
opts_chunk$set(comment = "", message = FALSE)
suppressWarnings({library(dplyr)})
library(readr)
library(tidyverse)
knitr::opts_chunk$set(comment = "", message = FALSE)
suppressWarnings(library(tidyverse))
```

<style type="text/css">
@@ -38,7 +35,7 @@ pre { /* Code block - slightly smaller in this lecture */

https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf

```{r, fig.alt="A preview of the Data transformation cheatsheet produced by RStudio.", out.width = "80%", echo = FALSE, align = "center"}
```{r, fig.alt="A preview of the Data transformation cheatsheet produced by RStudio.", out.width = "80%", echo = FALSE, fig.align = "center"}
knitr::include_graphics("images/Manip_cheatsheet.png")
```

@@ -99,7 +96,9 @@ sum(z)

## Some examples

We can use the `mtcars` built-in dataset. The `head` command displays the first rows of an object:
We can use the `mtcars` built-in dataset. "The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models)."

The `head` command displays the first rows of an object:

```{r}
head(mtcars)
@@ -112,23 +111,10 @@ A nice and readable way to chain together multiple R functions.

Changes `f(x, y)` to `x %>% f(y)`.

```{r eval=FALSE}
# Going to work
get_dressed(me,
pack_lunch(
check_pockets(
wallet = TRUE, phone = TRUE, keys = TRUE),
items = c("sandwich", "chips", "apple"), lunchbox = TRUE),
pants = TRUE, shirt = TRUE, footwear = "sandals")
# Going to work, the tidy way
me %>%
get_dressed(pants = TRUE, shirt = TRUE, footwear = "sandals") %>%
pack_lunch(items = c("sandwich", "chips", "apple"), lunchbox = TRUE) %>%
check_pockets(wallet = TRUE, phone = TRUE, keys = TRUE)
```{r, out.width = "50%", echo = FALSE, fig.align = "center"}
knitr::include_graphics("../../images/lol/morning_1.png")
```


## Statistical summarization the "tidy" way

```{r}
@@ -141,7 +127,7 @@ mtcars %>% pull(wt) %>% quantile(probs = 0.6)

## Behavior of `pull()` function

`pull()` converts a single data column into a vector. This allows you to run summary functions on these data. Once you have "pulled" the data column out, you don't have to name it again in any piped summary functions.
`pull()` converts a single data column into a <span style="color:blue">vector</span>. This allows you to run summary functions on these data. Once you have "pulled" the data column out, you don't have to name it again in any piped summary functions.

```{r}
cars_wt <- mtcars %>% pull(wt)
@@ -157,18 +143,29 @@ mtcars %>% pull(wt) %>% range(wt) # Incorrect
mtcars %>% pull(wt) %>% range() # Correct
```
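
As a minimal sketch of why the second call is the correct one: `pull()` returns a plain numeric vector, so later functions in the pipe receive that vector as their first argument and the column name `wt` is no longer needed. The example assumes `dplyr` is installed and uses only the built-in `mtcars` data; `wt_range` is a name made up for illustration.

```r
library(dplyr)

# pull() extracts one column as a plain vector, not a data frame
cars_wt <- mtcars %>% pull(wt)

is.numeric(cars_wt)     # the result is a numeric vector
is.data.frame(cars_wt)  # and no longer a data frame
length(cars_wt)         # one value per row of mtcars

# The vector is passed along as the first argument, so range() needs nothing else
wt_range <- cars_wt %>% range()
wt_range
```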

## GUT CHECK

What kind of object do we need in order to run summary functions like `mean()`?

A. A vector of numbers

B. A vector of characters

C. A dataset

# Summarization on tibbles (data frames)

## TB Incidence
## TB incidence

Let's read in a `tibble` of TB incidence values.

"Tuberculosis incidence, all forms (per 100,000 population per year), for the period 1990-2007 across 208 countries/territories."

```{r}
tb <- read_csv("https://jhudatascience.org/intro_to_r/data/tb.csv")
```

## TB Incidence
## TB incidence

Check out the data:

@@ -177,7 +174,7 @@ head(tb)
```


## TB Incidence
## TB incidence

Check out the data:

@@ -193,7 +190,6 @@ Before we go further, let's rename the first column using the `rename()` functio
In this case, we have to use the backticks (\`) because there are spaces and funky characters in the name.

```{r}
library(dplyr)
tb <- tb %>%
rename(country = `TB incidence, all forms (per 100 000 population per year)`)
```
@@ -220,8 +216,8 @@ You can also do more elaborate summaries across different groups of data using `
```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>%
summarize({summary column name} = {operator(source column)},
{summary column name} = {operator(source column)})
summarize({summary column name} = {function(source column)},
{summary column name} = {function(source column)})
```
</div>

@@ -234,7 +230,7 @@ You can also do more elaborate summaries across different groups of data using `
```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>%
summarize({summary column name} = {operator(source column)})
summarize({summary column name} = {function(source column)})
```
</div>

@@ -291,10 +287,11 @@ summary(tb)

## Summary & Lab Part 1

- summary stats (`mean()`) work with `pull()`
- `pull()` creates a *vector*
- don't forget the `na.rm = TRUE` argument!
- `summary(x)`: quantile information
- `summarize`: creates a summary table of columns of interest
- summary stats (`mean()`) work with vectors or with `summarize()`

🏠 [Class Website](https://jhudatascience.org/intro_to_r/)

@@ -306,6 +303,8 @@ summary(tb)
Here we will be using the Youth Tobacco Survey data:
http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv

* Check out the data at: https://catalog.data.gov/dataset/youth-tobacco-survey-yts-data

```{r}
yts <- read_csv("http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv")
head(yts)
@@ -324,7 +323,7 @@ yts %>%

## How many `distinct()` values?

`n_distinct()` tells you the number of unique elements. _Must pull the column first!_
`n_distinct()` tells you the number of unique elements. It needs a vector, so you _must pull the column first!_

```{r}
yts %>%
@@ -338,7 +337,7 @@ options(max.print = 1000)
```


## `dplyr`: `count`
## Use `count()` to return row count per category.

Use `count` to return a frequency table of unique elements of a data.frame.

@@ -347,31 +346,33 @@ yts %>% count(LocationDesc)
```


## `dplyr`: `count`

Multiple columns listed further subdivides the count.
## Listing multiple columns further subdivides the `count()`

```{r, message = FALSE}
yts %>% count(LocationDesc, TopicDesc)
```

**Note:** `count()` includes NAs

## `dplyr`: `count`

Multiple columns listed further subdivides the count.
## GUT CHECK

```{r, message = FALSE}
yts %>% count(LocationDesc, TopicDesc)
```
The `count()` function can help us tally:

<br>
A. Sample size

**Note:** `count()` includes NAs
B. Rows per each category

C. How many categories

# Grouping

## Perform Operations By Groups: dplyr
## Goal

We want to find the average frequency that youth use tobacco products in the dataset.

_How do we do this?_

## Perform operations by groups: dplyr

`group_by` allows you to group the data set by the variables/columns you specify:

@@ -381,7 +382,7 @@ yts
```


## Perform Operations By Groups: dplyr
## Perform operations by groups: dplyr

`group_by` allows you to group the data set by the variables/columns you specify:

@@ -400,7 +401,7 @@ yts_grouped %>% summarize(avg_percent = mean(Data_Value, na.rm = TRUE))
```


## Use the `pipe` to string these together!
## Do it in one step: use `%>%` to string these together!

Pipe `yts` into `group_by`, then pipe that into `summarize`:

@@ -474,20 +475,19 @@ yts %>%
`count()` and `n()` can give very similar information.

```{r}
mtcars %>% count(cyl)
mtcars %>% group_by(cyl) %>% summarize(n()) # n() typically used with summarize
yts %>% count(YEAR) %>% head(n = 3)
yts %>% group_by(YEAR) %>% summarize(n = n()) %>% head(n = 3) # n() typically used with summarize
```


# A few miscellaneous topics ..
# A few miscellaneous topics


## Base R functions you might see: `length` and `unique`

These functions require a column as a vector using `pull()`.

```{r, message = FALSE}
yts <- read_csv("http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv")
yts_loc <- yts %>% pull(LocationDesc) # pull() to make a vector
yts_loc %>% unique() # similar to distinct()
```
@@ -500,38 +500,26 @@ These functions require a column as a vector using `pull()`.
yts_loc %>% unique() %>% length() # similar to n_distinct()
```

## * New! * Many dplyr functions now have a `.by=` argument

Pipe `yts` into `group_by`, then pipe that into `summarize`:

```{r eval = FALSE}
yts %>%
group_by(Response) %>%
summarize(avg_percent = mean(Data_Value, na.rm = TRUE),
max_percent = max(Data_Value, na.rm = TRUE))
```

is the same as:

```{r eval = FALSE}
yts %>%
summarize(avg_percent = mean(Data_Value, na.rm = TRUE),
max_percent = max(Data_Value, na.rm = TRUE),
.by = Response)
```


## `summary()` vs. `summarize()`

* `summary()` (base R) gives statistics table on a dataset.
* `summarize()` (dplyr) creates a more customized summary tibble/dataframe.
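
To make the contrast concrete, here is a small sketch on the built-in `mtcars` data. It assumes `dplyr` is installed; `wt_stats`, `mean_wt`, and `max_wt` are names chosen only for illustration.

```r
library(dplyr)

# summary() (base R): a fixed set of statistics for every column
summary(mtcars)

# summarize() (dplyr): only the statistics you ask for, returned as a data frame
wt_stats <- mtcars %>%
  summarize(mean_wt = mean(wt, na.rm = TRUE),
            max_wt = max(wt, na.rm = TRUE))
wt_stats
```

`summary()` decides which statistics to report, while `summarize()` returns one row containing exactly the columns you name.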

## Functions you might also see

* `rowwise()`: functions will compute results for each row
* `sum(!is.na())`: # of non-NAs in the data
* `first()`: first value in the data
* `last()`: last value in the data
* `range()`: minimum and maximum of the data
* `IQR()`: interquartile range of the data
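
The functions in this list can be tried out on a tiny example. This is only a sketch: the vector `x` and the two-row tibble are made-up data, all object names are hypothetical, and `dplyr` is assumed to be installed (it provides `first()`, `last()`, `rowwise()`, and `tibble()`).

```r
library(dplyr)

# A small made-up vector with one missing value (illustrative data only)
x <- c(4, 8, NA, 15, 16)

n_present <- sum(!is.na(x))        # number of non-NA values
first_x <- first(x)                # first value
last_x <- last(x)                  # last value
x_range <- range(x, na.rm = TRUE)  # minimum and maximum
x_iqr <- IQR(x, na.rm = TRUE)      # interquartile range

# rowwise() makes later functions compute once per row
row_means <- tibble(a = c(1, 2), b = c(3, 6)) %>%
  rowwise() %>%
  mutate(row_mean = mean(c(a, b))) %>%
  ungroup()
row_means
```

Without `rowwise()`, `mean(c(a, b))` would pool all values from both columns into a single number and repeat it on every row.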

## Summary & Lab Part 2

- `count(x)`: what unique values do you have?
- `distinct()`: what are the distinct values?
- `n_distinct()` with `pull()`: how many distinct values?
- `group_by()`: changes all subsequent functions
- `group_by()`: changes subsequent functions (remove with `ungroup()`)
- combine with `summarize()` to get statistics per group
- combine with `mutate()` to add column
- `summarize()` with `n()` gives the count (NAs included)
@@ -540,7 +528,7 @@ yts %>%

💻 [Lab](https://jhudatascience.org/intro_to_r/modules/Data_Summarization/lab/Data_Summarization_Lab.Rmd)

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
```{r, fig.alt="The End", out.width = "20%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```

@@ -592,3 +580,23 @@ tb %>%
tb %>%
summarize(across(starts_with("year"), ~mean(.x, na.rm = TRUE)))
```

## * New! * Many dplyr functions now have a `.by=` argument

Pipe `yts` into `group_by`, then pipe that into `summarize`:

```{r eval = FALSE}
yts %>%
group_by(Response) %>%
summarize(avg_percent = mean(Data_Value, na.rm = TRUE),
max_percent = max(Data_Value, na.rm = TRUE))
```

is the same as:

```{r eval = FALSE}
yts %>%
summarize(avg_percent = mean(Data_Value, na.rm = TRUE),
max_percent = max(Data_Value, na.rm = TRUE),
.by = Response)
```
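
The equivalence can be checked offline with the built-in `mtcars` data. This is a sketch under two assumptions: `dplyr` 1.1.0 or later is installed (the release that introduced `.by`), and the object names `grouped`, `with_by`, and `by_sorted` are made up for illustration. One difference worth knowing: `group_by()` returns groups in sorted order, while `.by` keeps them in order of first appearance, so the sketch sorts before comparing.

```r
library(dplyr)  # .by is assumed to need dplyr 1.1.0 or later

# Grouping as a separate step
grouped <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg, na.rm = TRUE),
            max_mpg = max(mpg, na.rm = TRUE))

# Grouping inside summarize() with .by
with_by <- mtcars %>%
  summarize(avg_mpg = mean(mpg, na.rm = TRUE),
            max_mpg = max(mpg, na.rm = TRUE),
            .by = cyl)

# Sort the .by result so its rows line up with the group_by() result
by_sorted <- with_by %>% arrange(cyl)

all.equal(grouped$avg_mpg, by_sorted$avg_mpg)
all.equal(grouped$max_mpg, by_sorted$max_mpg)
```

Another practical difference is that `.by` applies only to the one call it appears in, so there is no grouping to remove afterwards with `ungroup()`.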
12 changes: 6 additions & 6 deletions modules/Data_Summarization/lab/Data_Summarization_Lab.Rmd
@@ -26,7 +26,7 @@ bike <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Bike_Lanes.cs

### 1.1

How many bike "lanes" are currently in Baltimore? You can assume each observation/row is a different bike "lane". (hint: how do you get the number of rows of a data set? You can use `dim()` or `nrow()` or another function).
How many streets with designated bike lanes are currently in Baltimore? You can assume each observation/row is a different street with one or more bike lanes. (Hint: how do you get the number of rows of a data set? You can use `dim()` or `nrow()` or another function).

```{r 1.1response}
@@ -47,7 +47,7 @@ Summarize the data to get the `max` of `length` using the `summarize` function.
```
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN))
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN))
```

```{r 1.3response}
@@ -61,8 +61,8 @@ Modify your code from 1.3 to add the `min` of `length` using the `summarize` fun
```
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN),
SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN)
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN),
SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN)
)
```

@@ -80,8 +80,8 @@ Summarize the `bike` data to get the mean of `length` and `dateInstalled`. Make
```
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE),
SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE)
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN, na.rm = TRUE),
SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN, na.rm = TRUE)
)
```
