---
title: "Exploratory data analysis"
author: "Guillem Hurault"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.height = 5,
                      fig.width = 12,
                      dpi = 200)
```
```{r initialisation}
library(tidyverse)
train <- targets::tar_read(training_data)
```
```{r exploration1}
head(train)
summary(train)
pairs(train[, 1:10])
```
## Visualising missing values
```{r exploration-missing}
# Tile plot of missingness: one column per feature, one row per observation
train %>%
  select(-Y) %>%
  is.na() %>%
  as_tibble() %>%
  mutate(Index = 1:n()) %>%
  pivot_longer(-Index) %>%
  ggplot(aes(x = name, y = Index, fill = value)) +
  geom_tile() +
  labs(x = "Variable", fill = "Missing") +
  coord_cartesian(expand = FALSE) +
  scale_fill_manual(values = c("#000000", "#E69F00")) +
  theme_classic(base_size = 15)
```
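As a numeric complement to the tile plot, the share of missing values per feature can be tabulated. The block below is a minimal sketch on a hypothetical toy data frame (`toy`, with made-up columns `X1` to `X3` and outcome `Y`), not the actual training data; it is shown as a static code block rather than an evaluated chunk.

```r
# Hypothetical stand-in for `train`; values are made up for illustration
toy <- data.frame(
  X1 = c(0.5, NA, -1.2, 0.3),
  X2 = c(NA, NA, 0.1, 1.4),
  X3 = c(1.0, 0.2, -0.7, 0.9),
  Y  = c(0, 1, 0, 1)
)

# Keep the features only (drop the outcome), as in the plot above
features <- toy[, setdiff(names(toy), "Y"), drop = FALSE]

# is.na() returns a logical matrix; column means give the proportion missing
prop_missing <- colMeans(is.na(features))

# Most incomplete features first
prop_missing <- sort(prop_missing, decreasing = TRUE)
print(prop_missing)
```

Replacing `toy` with `train` should give the same summary for the real data, assuming the outcome column is named `Y` as in the rest of this report.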
## Multicollinearity
For plotting, we impute missing values with 0, which is the mean of each feature.
```{r exploration-correlation}
# Correlation matrix of the features, excluding the outcome and Xb1, Xb2
train %>%
  select(-Y, -Xb1, -Xb2) %>%
  replace(is.na(.), 0) %>%
  cor() %>%
  corrplot::corrplot.mixed()
```
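Imputing with 0 is only equivalent to mean imputation if each feature is centred at 0. The sketch below illustrates this on a hypothetical toy matrix `x` (simulated, not the training data): it centres the columns, checks that the observed means are numerically zero, and confirms that zero imputation and mean imputation then give the same result. Whether the real features are centred is an assumption that can be checked the same way with `colMeans(..., na.rm = TRUE)`.

```r
# Hypothetical feature matrix with a few missing entries (simulated data)
set.seed(1)
x <- matrix(rnorm(40), nrow = 10, dimnames = list(NULL, paste0("X", 1:4)))
x[c(2, 7), "X1"] <- NA
x[5, "X3"] <- NA

# Centre each column; scale() omits NAs when computing the column means
x <- scale(x, center = TRUE, scale = FALSE)

# Observed column means should now be zero up to floating-point error
col_means <- colMeans(x, na.rm = TRUE)
print(max(abs(col_means)))

# Zero imputation, as in the chunk above
x_zero <- replace(x, is.na(x), 0)

# Mean imputation, column by column
x_mean <- x
for (j in seq_len(ncol(x))) {
  x_mean[is.na(x[, j]), j] <- col_means[j]
}

# The two imputed matrices agree when the features are centred
print(all.equal(x_zero, x_mean, check.attributes = FALSE))
```

Mean imputation tends to shrink the estimated correlations, so the plot is best read as a rough guide. One alternative that avoids imputation is `cor(x, use = "pairwise.complete.obs")`.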
## Distribution of the outcome variable
```{r exploration-outcome}
table(train[["Y"]]) %>%
  barplot(main = "Count of outcome variable")
```