---
title: "Red Wine Quality Analysis"
author: "by Arturo Parrales Salinas"
date: "8/2/2018"
output: html_document
---

***
***

### Main Research Goal

The variable that interests us the most is **quality**, since we want to understand **which chemical properties influence the quality of red wines**.

```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using in your analysis in this code
# chunk.

# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.

# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
#install.packages('corrplot')
#install.packages('psych')


library(corrplot)
library(purrr)
library(tidyr)
library(ggplot2)
library(psych)
library(gridExtra)
```

We load the red wine quality dataset. It is tidy and contains 1,599 red wine records with 12 variables: 11 chemical properties of the wine plus its quality rating. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The 12 variables of the wine are listed below:

```{r echo=FALSE, Load_the_Data}
# Load the Data
red <- read.csv('wineQualityReds.csv')

str(red)
```

We can see a variable X, which is just the index of the record in the dataset. We definitely want to remove X before we move forward.

```{r echo=FALSE, Univariate_Plots_Remove_X}
# remove X
red <- red[,-1]
str(red)
```

With X removed, we can move on to understanding the variables in the dataset.
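Dropping the index by position works here, but running the same chunk twice would silently remove a different column. A minimal sketch of dropping it by name instead, using a small made-up data frame (`toy` and its values are illustrative and not taken from the dataset):

```r
# Toy data frame standing in for the wine data (values are made up)
toy <- data.frame(X = 1:3,
                  fixed.acidity = c(7.4, 7.8, 11.2),
                  quality = c(5L, 5L, 6L))

# Drop the index column by name; this is a no-op if X is already gone
toy <- toy[, setdiff(names(toy), "X"), drop = FALSE]
toy <- toy[, setdiff(names(toy), "X"), drop = FALSE]  # safe to repeat

names(toy)
```

Selecting by name keeps the step idempotent, which helps when chunks are re-run interactively.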

### Attribute Measure Units of Each Variable

   For more information, read [Cortez et al., 2009].

   **Input variables** (based on physicochemical tests):
   
   1. **Fixed acidity** (tartaric acid - $g / dm^3$)
   2. **Volatile acidity** (acetic acid - $g / dm^3$)
   3. **Citric acid** ($g / dm^3$)
   4. **Residual sugar** ($g / dm^3$)
   5. **Chlorides** (sodium chloride - $g / dm^3$)
   6. **Free sulfur dioxide** ($mg / dm^3$)
   7. **Total sulfur dioxide** ($mg / dm^3$)
   8. **Density** ($g / cm^3$)
   9. **pH**
   10. **Sulphates** (potassium sulphate - $g / dm^3$)
   11. **Alcohol** (% by volume)
   
   
   **Output variable** (based on sensory data): 
   
   12. **Quality** (score between 0 and 10)

  >  <font size="2">Note: All Missing Attribute Values are set as None.</font>

### Description of Attributes

   1. **Fixed acidity:** most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

   2. **Volatile acidity:** the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

   3. **Citric acid:** found in small quantities, citric acid can add 'freshness' and flavor to wines

   4. **Residual sugar:** the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

   5. **Chlorides:** the amount of salt in the wine

   6. **Free sulfur dioxide:** the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

   7. **Total sulfur dioxide:** amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

   8. **Density:** the density of wine is close to that of water, depending on the percent alcohol and sugar content

   9. **pH:** describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

   10. **Sulphates:** a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

   11. **Alcohol:** the percent alcohol content of the wine

   **Output variable** (based on sensory data): 
   
   12. **Quality:** the score of a red wine, between 0 (lowest) and 10 (highest)

***

# Univariate Plots Section

In the previous sections we got an overview of the dataset. Here we start with a summary of each variable.

```{r echo=FALSE, Univariate_Plots_Var_Summary}
# dataset summary
summary(red)
```

Let's start by looking at the **quality** summary. The lowest quality among the red wines is 3 and the highest is 8. This tells us there are neither very bad nor very excellent wines in this dataset. Also, we want to make sure our **quality** variable is actually categorical (in R, we need it as a factor).

```{r echo=FALSE, Univariate_Plots_Factor_Quality}
# making quality a factor
red$quality_cat <- factor(red$quality)
str(red$quality_cat)
table(red$quality_cat)
```
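Since quality scores have a natural order, an ordered factor is one option worth considering alongside the plain factor used above. The sketch below uses made-up scores (`scores`, `q_cat` and `q_ord` are illustrative names) and only shows how the two differ:

```r
scores <- c(5, 6, 5, 7, 3, 8, 6)   # made-up quality scores

# Plain factor: categories without an order
q_cat <- factor(scores)

# Ordered factor: categories that also support comparisons such as >=
q_ord <- factor(scores, levels = 3:8, ordered = TRUE)

levels(q_ord)       # includes level 4 even though no toy wine has it
table(q_ord)
sum(q_ord >= "7")   # number of toy wines rated 7 or higher
```

Listing the levels explicitly also keeps empty grades visible in tables and plots.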

We are sure the **quality** variable is categorical and we can continue exploring it more in detail.

```{r echo=FALSE, Univariate_Plots_Quality_Bar}
# bar plot of quality
ggplot(aes(x = quality_cat, fill = quality_cat), data = red) +
  geom_bar(stat = "count") +
  scale_fill_brewer(type='seq') +
  theme(panel.background = element_rect(fill='black'), panel.grid = element_blank())
```

From this plot we can tell the quality 5 is the most frequent and it is closely followed by quality 6. On the other hand, we have 3 and 8 as the least frequent. This plot will help us understand why we might have more medium quality wines in our future plots.
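To put numbers on what the bar plot shows, the counts can be turned into percentages. A small sketch on made-up scores (the vector `quality` here is illustrative, not the real column):

```r
quality <- c(5, 5, 6, 6, 5, 7, 3, 8, 6, 5)   # made-up scores

counts <- table(quality)
shares <- round(100 * prop.table(counts), 1)  # percentage per score
shares

names(counts)[which.max(counts)]              # most frequent score
```

On the real data the same two calls would be applied to `red$quality_cat`.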

There is one variable that stands out at a quick glance in the summary. The **density** seems to have a very tiny difference between minimum, median and maximum values. 

```{r echo=FALSE, Univariate_Plots_Density}
# Summary of the density variable
summary(red$density)
```


```{r echo=FALSE, warning=FALSE, Univariate_Plots_Density_Hist}
# Histogram of density
ggplot(aes(x = density), data = red) +
  geom_histogram(bins=30)
```

Just as expected the density is mainly between 0.995 and 1 $g/cm^3$, which seems an indication of all wines having similar density.

Once we explored quality and density it might be good to look at the other variables.

```{r echo=FALSE, warning=FALSE, Univariate_Plots_All_Hist}
# Histograms of all the variables in the dataset that are numeric
# (X was already dropped above, so no column needs to be excluded here)
red %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(bins=30)
```

From this plot, we can see that density has the same shape as in the histogram we drew before, which confirms that our earlier plot was correct.
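The faceted histogram relies on reshaping the data from wide to long form, with one row per (variable, value) pair. A rough base R sketch of that reshape on made-up numbers follows; `stack()` is used here only as a stand-in for `gather()`, and `toy` and `long` are illustrative names:

```r
toy <- data.frame(pH = c(3.5, 3.2, 3.3),
                  alcohol = c(9.4, 9.8, 10.5))   # made-up values

# stack() returns two columns: `values` and `ind` (the source column name)
long <- stack(toy)
long

# One group of rows per original column, ready for faceting
table(long$ind)
```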

We can also go further and explore a couple of variables in more detail.

```{r echo=FALSE, Univariate_Plots_FixedAcidity}
# Histogram of fixed acidity
ggplot(aes(x = fixed.acidity), data = red) +
  geom_histogram(binwidth=1)

```

```{r echo=FALSE, Univariate_Plots_FixedAcidity_Summary}
# Summary of the fixed acidity variable
summary(red$fixed.acidity)
```

The fixed acidity peaks in the range 7 to 9 $g/dm^3$. Since we know that the most frequent qualities are 5 and 6, this might indicate that fixed acidity levels of 7 to 9 correspond to the quality range 5 to 6. Following the same intuition, the least frequent values in the histogram could belong to either higher or lower quality wines. In the next section, we will investigate this further using two variables.


Now, we can check the volatile acidity.

```{r echo=FALSE, Univariate_Plots_VolatileAcidity}
# Histogram of volatile acidity
ggplot(aes(x = volatile.acidity), data = red) +
  geom_histogram(binwidth=0.02)

```

```{r echo=FALSE, Univariate_Plots_VolatileAcidity_Table}
# Table of the volatile acidity variable
table(round(red$volatile.acidity,1))
```

If we use fine bins for the **volatile.acidity** histogram, we can see two or three peaks at 0.4, 0.5 and 0.6. If we follow the two highest peaks at 0.4 and 0.6, we can imagine them to be related to the most frequent wine qualities, so volatile acidity around these peaks may be mainly related to quality 5-6. To confirm this we will need a more elaborate plot with two variables.

For the time being, we can continue to explore other variables such as **citric.acid**.

```{r echo=FALSE, Univariate_Plots_CitricAcid_Hist}
# Histogram of citric acid
ggplot(aes(x = citric.acid), data = red) +
  geom_histogram(binwidth=0.005)
```
```{r echo=FALSE, Univariate_Plots_CitricAcid_Table}
# Table of the citric acid variable
table(round(red$citric.acid,2))
```

From a finely binned histogram, **citric.acid** does not seem to be a clear contributing factor to red wine quality. However, the fact that there is more than one peak makes us wonder whether each peak is related to a group of quality levels.

```{r echo=FALSE, Univariate_Plots_Residual_Sugars}
# Histogram of residual sugars
ggplot(aes(residual.sugar), data = red) +
  geom_histogram(bins=30)

```

```{r echo=FALSE, Univariate_Plots_ResidualSugar}
# Summary of the residual sugar variable
summary(red$residual.sugar)
```

Surprisingly, most wines have low **residual.sugar**, and it could be that good quality is associated with extremely low or high residual sugar. This might be a good variable to help us distinguish low quality from good quality wines.

Finally, let's check another right-skewed distribution (most values sit on the left, with a long tail to the right) that, judging by its name, seems likely to be related to wine quality: alcohol.

```{r echo=FALSE, Univariate_Plots_Alcohol}
# Histogram of alcohol
ggplot(aes(alcohol), data = red) +
  geom_histogram(bins=10)

```

```{r echo=FALSE, Univariate_Plots_Alcohol_Summary}
# Summary of the alcohol variable
summary(red$alcohol)
```

This histogram seems to have a peak between 9.5 and 10.5, but very low counts between 8.4 and 9.5. Likewise, the higher the alcohol level gets, the smaller the bin height gets. This can be an indication that alcohol is associated with quality, as the low and high quality wines are few.

  > <font size="2">Note: We will need to explore more, but it seems skewed distributions might be related to quality.</font>

***

# Univariate Analysis

This univariate analysis was the first step of the exploration and a way to get familiar with the data. We drew histograms to understand the distributions of the features and to see which quality grades are most frequent among the red wines in the dataset.

### What is the structure of your dataset?

The dataset is tidy and contains 1,599 red wine records with 12 variables: 11 chemical properties of the wine plus its quality rating. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). However, the quality of the wines in the dataset only ranges from 3 to 8, since there are no wines graded 0, 1, 2, 9 or 10 in our data.

### What is/are the main feature(s) of interest in your dataset?

**Quality** is the main interest variable. Our goal is to figure out which elements contribute to the quality of a wine.

From our exploration I could tell that most wines have a quality grade of 5 or 6. Using some intuition at this point, skewed histograms may point to features worth considering, as there is more information on grades 5 and 6 than on lower and higher red wine qualities.

This idea also applies to distributions that seem bimodal or have more than one peak, such as citric acid. In my opinion, these distributions might also have a hidden pattern related to quality, and that might be the reason for having more than one mode.

Thus, after our histograms, some variables that seem promising when understanding quality are:

* Citric acid
* Fixed acidity
* Free sulfur dioxide
* Volatile acidity
* Total sulfur dioxide
* Alcohol
* Residual sugar
* Chlorides
* Sulphates


### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I also explored the density. It varies little between wines, but the interesting part will be to explore whether the tails of the distribution contain wines of all qualities or whether they are related to lower and higher quality wines.

### Did you create any new variables from existing variables in the dataset?

No, there was not a need for now to create a new variable.

### Of the features investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

In the histograms, we found:

**Normal or very close to normal distributions**

  * pH
  * Density

**Right-skewed distributions (long right tail)**

  * Total Sulfur Dioxide
  * Alcohol
  * Fixed Acidity
  * Free Sulfur Dioxide
  * Volatile Acidity
  * Chlorides
  * Sulphates
  * Residual sugar
  
**Bimodal or not normal distributions**

  * Citric Acid
  
It seems that the non-normal distributions might be more related to quality, since we have a higher number of quality 5 and 6 wines. This can mean that the tails explain the lowest and highest quality wines in the dataset.
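One way to check the direction of a tail numerically, rather than by eye, is the sample skewness: a positive value means a long tail toward high values and a value near zero means a roughly symmetric shape. The sketch below defines a simple moment-based `skewness()` helper (an illustrative function written for this example, not one from the packages loaded in this report) and applies it to made-up vectors:

```r
# Sample skewness: third central moment scaled by the 3/2 power of the
# second central moment (simple moment estimator)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

right_tailed <- c(9.2, 9.4, 9.5, 9.8, 10.0, 10.5, 11.0, 12.8, 14.0)  # made up
symmetric    <- c(3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6)                  # made up

skewness(right_tailed)  # positive: long tail toward high values
skewness(symmetric)     # approximately zero
```

Applied to the real columns, for example `skewness(red$alcohol)`, this would give a quick check on the grouping above.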

Luckily this dataset was made from tidy data and the only variable that needed a type change was quality. This was to make it a factor and have it as a true categorical variable rather than numerical.

***

# Bivariate Plots Section

In the previous section, we used our intuition to choose some relevant variables, so it is a good idea to compute the correlation between all variables in the dataset to narrow our exploration and remove possible collinearities.

The first step of our bivariate analysis is to check which variables are correlated with each other. For this we will use Pearson's correlation coefficient.

```{r echo=FALSE, Bivariate_Plots_Corr_Coef}
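# Correlation matrix of the 11 chemical properties; quality and its factor
# version are left out by name rather than by position
cols = length(red)
features <- setdiff(names(red), c("quality", "quality_cat"))
mcor <- cor(red[, features])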
round(mcor, 2)

```
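A note on picking the columns for a correlation matrix: positional arithmetic is easy to get wrong in R because `:` binds tighter than `-`. The sketch below, on a made-up data frame (`toy` and its values are illustrative), shows the pitfall and a name-based selection that avoids it:

```r
# Toy stand-in for the wine data (values are made up)
toy <- data.frame(alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0),
                  density = c(0.9978, 0.9968, 0.9970, 0.9955, 0.9940),
                  quality = c(5L, 5L, 6L, 6L, 7L))
toy$quality_cat <- factor(toy$quality)

# Precedence pitfall: `:` binds tighter than `-`, so 2:5-2 is (2:5) - 2
all(2:5 - 2 == 0:3)

# Selecting by name avoids positional arithmetic altogether
features <- setdiff(names(toy), c("quality", "quality_cat"))
round(cor(toy[, features]), 2)
```

An index of 0 is silently dropped by R, so an expression like `2:n-2` still runs without error, which makes this kind of slip hard to spot.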

Since data is hard to visualize from the correlation matrix, we will plot it.

```{r echo=FALSE, Bivariate_Plots_Corr_Plot}
# Plot of the correlation matrix
corrplot(mcor, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

```

From the plot we can easily see what variables are related and choose which ones to analyze:

Chlorides and sulphates are correlated so choosing only one of them might be good to start our analysis. Thus, I will add **sulphates** to our list.

The same happens for fixed acidity and citric acid. However, **citric acid** is a variable with a strange distribution, so we better keep it to explore how this relates to quality.

Citric acid seems also correlated to volatile acidity. However, volatile acidity is not correlated to fixed acidity. It might be good to keep **volatile acidity** and explore how it can contribute to quality.

Total sulfur dioxide and free sulfur dioxide seem correlated, but not with anything else. Thus, I will choose the **total sulfur dioxide** to investigate it.

We have already decided to keep citric acid, so we can keep **alcohol** even though it is a bit correlated, mainly to test whether the alcohol level actually affects wine quality.

  > <font size="2">Note: Wine is alcohol, so my curiosity drives me to understand if the alcohol level matters.</font>

Finally, from our promising features, we have **residual sugar**, which does not seem correlated with anything. It also had a right-skewed distribution, so it might be worth checking whether it affects quality or is neutral.

Besides these features, we had also decided to check whether **density** plays a part in the quality, so we will explore it too.

Now, we can generate plots to inspect the variables we chose to explore to find insights about the relationship of them to **quality**.

```{r echo=FALSE, warning=FALSE, Bivariate_Plots_All_Vars}
# Plot of relevant variables
interest = c('sulphates','density','citric.acid','alcohol','volatile.acidity','total.sulfur.dioxide', 'residual.sugar')
# Point fill: blue where fixed.acidity > 10, yellow otherwise
pairs.panels(red[,interest],bg=c("yellow","blue")[(red$fixed.acidity > 10)+1],pch=21,
             main = 'Relation between Relevant Variables')
```

From the plot, we can see that alcohol and density show the slight correlation we expected from before. The same applies to volatile acidity and citric acid. Apart from those variables, everything else seems to have almost no correlation, which is a good sign for identifying the independent variables that relate to quality.

  > <font size="2">Note: Density is the variable with most of the slight correlations in the plot, so if we find density to be not very important, we can remove it. It also has a normal distribution, which can be a reason for it correlating slightly with other variables.</font>

We can explore further relations between quality and the variables. Let's start with **density**, since we noticed most wines have similar densities.


## Density

```{r echo=FALSE, Bivariate_Plots_Density_vs_Quality}
# scatter plot of density versus quality
ggplot(aes(x = quality, y = density), data = red) +
  geom_jitter(alpha = 1/2)
```

We can see that there seems to be no correlation between density and quality. However, because the distribution seems symmetric, I applied some math to the density and plotted it again.

```{r echo=FALSE, Bivariate_Plots_abs_log_density_vs_Quality}
# scatter plot of abs(log(density)) versus quality
ggplot(aes(x = quality, y = abs(log(density))), data = red) +
  geom_jitter(alpha = 1/2)
```

After taking the log of the density and then its absolute value, we can see that higher quality wines seem to have a few high/low density outliers, while low quality wines have none. However, most wines have a density between 0.995 and 1.0 $g/cm^3$. This feature seems to contribute slightly to the quality.
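A short note on what this transformation does: because every density is close to 1, $\log(d) \approx d - 1$, so `abs(log(density))` is approximately the distance of each wine from a density of 1 $g/cm^3$, folding values above and below 1 onto the same side. A quick numeric check on made-up values (the vector `density` below is illustrative):

```r
density <- c(0.990, 0.995, 0.9968, 1.000, 1.003)   # made-up values near 1

transformed <- abs(log(density))
distance    <- abs(density - 1)

# Side-by-side comparison of the transform and the plain distance from 1
round(cbind(density, transformed, distance), 5)

# Largest gap between the two columns
max(abs(transformed - distance))
```

For these values the two columns differ by less than 1e-4, so the plot can be read as "how far a wine's density is from 1".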

```{r echo=FALSE, Bivariate_Plots_Density_Quality_Boxplot}
# Boxplot of Quality and Density
ggplot(aes(y = density, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq')
```

```{r echo=FALSE, Bivariate_Plots_Density_Quality_Summary}
# Density distributions by quality levels
by(red$density, red$quality, summary)
```

The box plot shows that there is not much difference in density between wines, which is what we noticed before in the histogram. However, the scatter plot and statistics show a slight difference for wines of quality 7 to 8, where the higher quality wines tend to have slightly lower density on average.


## Alcohol

Another variable that caught my attention only because of the name is alcohol. A wine is alcohol so we should check what is the relation between alcohol and quality.

```{r echo=FALSE, Bivariate_Plots_Alcohol_vs_Quality}
# scatter plot of alcohol versus quality
ggplot(aes(x = quality, y = alcohol), data = red) +
  geom_jitter(alpha = 1/2)
```

From the plot it seems that the higher the alcohol level the more quality a wine tends to have.

```{r echo=FALSE, Bivariate_Plots_Alcohol_Quality_Boxplot}
# Boxplot of Quality and Alcohol
ggplot(aes(y = alcohol, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq')
```

```{r echo=FALSE, Bivariate_Plots_Alcohol_Quality_Summary}
# Alcohol distributions by quality levels
by(red$alcohol, red$quality, summary)
```

The boxplot and statistics confirm that from level 5 upward, the average alcohol content increases with quality, while below level 5 it oscillates.


## Volatile Acidity

Since **quality** is a categorical variable and boxplots were easier to read, we will continue the analysis with boxplots. The next variable is volatile acidity.

```{r echo=FALSE, Bivariate_Plots_VolatileAcidity_Quality_Boxplot}
# Boxplot of Quality and Volatile Acidity
ggplot(aes(y = volatile.acidity, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq')
```

Let's do some math to complement the boxplots.

```{r echo=FALSE, Bivariate_Plots_VolatileAcidity_Quality_Summary}
# Volatile acidity distributions by quality levels
by(red$volatile.acidity, red$quality, summary)
```

Interestingly enough, the median volatile acidity decreases steadily as the wine **quality** grade increases.

At this point the **alcohol** and **volatile acidity** seem to be related to wine quality.
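A trend in group medians like this one can also be checked directly instead of reading it off the `by()` output. A minimal sketch on made-up values (the data frame `toy` is illustrative; the report itself uses `red$volatile.acidity` and `red$quality`):

```r
# Made-up values standing in for the real columns
toy <- data.frame(
  quality          = c(3, 3, 5, 5, 5, 6, 6, 7, 7, 8),
  volatile.acidity = c(0.90, 0.80, 0.60, 0.58, 0.55, 0.50, 0.48, 0.40, 0.38, 0.35)
)

# Median per quality level, named by level
med <- tapply(toy$volatile.acidity, toy$quality, median)
med

# TRUE when each median is lower than the one before it
all(diff(as.vector(med)) < 0)
```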


## Sulphates

Another variable to analyze is sulphates.

```{r echo=FALSE, Bivariate_Plots_Sulphates_Quality_Boxplot}
# Boxplot of Quality and Sulphates
ggplot(aes(y = sulphates, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq')
```

```{r echo=FALSE, Bivariate_Plots_Sulphates_Quality_Summary}
# Sulphates distributions by quality levels
by(red$sulphates, red$quality, summary)
```

The sulphates also seem to be correlated with **quality**, as their mean and median increase with the quality of a red wine. However, the median and mean values of the higher qualities also fall within the third quartile and maximum of lower quality wines. This suggests that there must be another variable that can help us define quality. In fact, we have **alcohol** and **volatile acidity** as possible variables to help define quality, and now we add **sulphates** to the list.


## Citric Acid

Since alcohol and volatile acidity were correlated with wine quality, it will be interesting to check citric acid, which is correlated with volatile acidity according to the correlation matrix.

```{r echo=FALSE, Bivariate_Plots_CitricAcid_Quality_Boxplot}
# Boxplot of Quality and Citric Acid
ggplot(aes(y = citric.acid, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq')
```

```{r echo=FALSE, Bivariate_Plots_CitricAcid_Quality_Summary}
# CitricAcid distributions by quality levels
by(red$citric.acid, red$quality, summary)
```

We noticed that the average citric acid increases with quality, which is the opposite of volatile acidity. In other words, the more citric acid, the higher the quality of a wine, and this makes sense since citric acid and volatile acidity are negatively correlated.


## Total Sulfur Dioxide

```{r echo=FALSE, Bivariate_Plots_TotalSulfurDioxide_Quality_Boxplot}
# Boxplot of Quality and Total Sulfur Dioxide
ggplot(aes(y = total.sulfur.dioxide, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq')
```

```{r echo=FALSE, Bivariate_Plots_TotalSulfurDioxide_Quality_Summary}
# Total Sulfur Dioxide distributions by quality levels
by(red$total.sulfur.dioxide, red$quality, summary)
```

The total sulfur dioxide does not have a clear correlation with the quality of a wine, since its average values peak at quality level 5 and decrease as quality moves away from level 5 in either direction.


## Residual Sugar

Finally, we can analyze the residual sugar and how it is related to quality. There are some wines that are considered sweeter than others. Could this also be related to quality?

```{r echo=FALSE, Bivariate_Plots_ResidualSugar_Quality_Boxplot}
# Boxplot of Quality and Residual Sugar
ggplot(aes(y = residual.sugar, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq')
```

```{r echo=FALSE, Bivariate_Plots_ResidualSugar_Quality_Summary}
# Residual Sugar distributions by quality levels
by(red$residual.sugar, red$quality, summary)
```

Apparently the residual sugar of a wine does not define its quality, since its average values oscillate from quality level 3 to 8. That may explain why sweeter wines can also have high quality.

Once we have analyzed our desired variables, we can make some conclusions out of the plots.

***

# Bivariate Analysis

From the variables we chose, we found that 4 or 5 of them seem to contribute to the quality of a red wine. I would rank them in the following order (1 being the most noticeable contribution):

1. Volatile acidity
2. Sulphates 
3. Citric acid
4. Alcohol
5. Density

### How did the feature(s) of interest vary with other features in the dataset?

In this bivariate analysis we computed the correlation matrix of our variables, leaving quality out of the matrix as it is a categorical variable.

From the correlation matrix, we reduced the number of promising features to explore based on the variables' correlation.

  > <font size="2">Note: pH was correlated to almost all our variables.</font>

Something we mentioned before is that the features of interest had a right-skewed or irregular distribution. This was also seen in the boxplots of each feature vs quality. Moreover, with the correlation matrix we narrowed the features of interest even further.


### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

We had decided to review the density since we thought it might help define the highest and lowest quality wines. It was in fact confirmed that density contributes slightly to quality, since the higher the quality of a wine, the slightly lower its density. This was first shown with a scatter plot and then better appreciated with a box plot (which in practice works better in these cases, as we compare a categorical variable against a numerical one).

### What was the strongest relationship you found?

In order to determine the strongest relationship, we used box plots for all the features of interest selected after the correlation matrix. The analysis went as follows:

From the more promising features chosen, we started with alcohol vs quality. Since wine is alcoholic, this might affect wine taste in a huge proportion. We found that indeed the higher the level of alcohol the higher the quality of the red wine.

Then, we checked volatile acidity and we found that it constantly reduces while the **quality** grade of a wine gets higher.

From there, we did a boxplot of sulphates and they showed to have a trend related to **quality**. The sulphates get higher for higher quality wines.

Citric acid was our next variable to analyze, mainly because its distribution was completely different from those of the other features. Something interesting about citric acid is that it was correlated with volatile acidity. Since both turned out to be related, we were not surprised that, for wines of quality 5 to 8, the more citric acid, the higher the quality of the wine.

Finally, the total sulfur dioxide and the residual sugar didn't show a clear correlation to the quality and no trends were noticed.


If we sort these features of interest together with our additional variable, density, we end up with the following list, ordered from the strongest to the weakest relationship with quality:

1. Volatile acidity - Negative. Lower volatile acidity, higher quality
2. Sulphates - Positive. Higher sulphates, higher quality
3. Citric acid - Positive. Higher citric acid, higher quality. [Negative correlation with Volatile acidity]
4. Alcohol - Positive. Higher alcohol level, higher quality
5. Density - Negative. Lower density, higher quality. [Negative correlation with Alcohol]


However, we can notice that two of these variables, citric acid and density, are correlated with volatile acidity and alcohol, respectively. This might mean we need to drop them. In any case, these two variables were kept only for curiosity and exploration.
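The ranking above comes from reading boxplots and summaries. As a rough numeric cross-check, one could rank features by their Spearman rank correlation with the quality score, which suits an ordinal rating. The sketch below runs on made-up data; the column names mirror the report but the numbers do not, and `toy`, `features` and `rho` are illustrative names:

```r
# Made-up data; column names mirror the report but the numbers do not
toy <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  volatile.acidity = c(0.90, 0.70, 0.75, 0.60, 0.50, 0.55, 0.40, 0.35),
  alcohol          = c(9.5, 9.0, 9.8, 9.4, 10.5, 10.0, 11.5, 12.0),
  residual.sugar   = c(2.0, 2.6, 1.9, 2.4, 2.2, 2.5, 2.1, 2.3)
)

features <- setdiff(names(toy), "quality")

# Spearman rank correlation of each feature with the quality score
rho <- sapply(features, function(f) {
  cor(toy[[f]], toy$quality, method = "spearman")
})

# Strongest absolute association first
round(rho[order(abs(rho), decreasing = TRUE)], 2)
```

A ranking obtained this way on the real data may differ from the visual ranking above, so it is best treated as a complement rather than a replacement.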

***

# Multivariate Plots Section

We start plotting the variables that showed a trend in relation to quality.

```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_Sulphates_Quality}
# Plot of volatile acidity and sulphates
ggplot(aes(x = volatile.acidity, y = sulphates , color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(alpha=0.7, size = 2) +
  scale_fill_brewer( type = 'div') +
  scale_colour_brewer( type = 'seq') +
  theme(panel.background = element_rect(fill='darkgray'))

```

Let's divide this plot in plots per quality category

```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_Sulphates_Quality_FacetWrap}
# Plot of volatile acidity and sulphates with facet wrap of quality
ggplot(aes(x = volatile.acidity, y = sulphates , color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(alpha=0.9, size = 2) +
  scale_fill_brewer( type = 'div') +
  scale_colour_brewer( type = 'seq') +
  theme(panel.background = element_rect(fill='darkgray')) +
  facet_wrap(~ quality_cat)

```

We can see from the plots that there are some areas where the quality levels get clustered. Let's check the volatile acidity and sulphates counts of wines per quality.

Starting with volatile acidity

```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_Quality_Table}
# Quality per volatile acidity levels
by(red$quality_cat, round(red$volatile.acidity,1), table)
```

Then, we do sulphates counts

```{r echo=FALSE, Multivariate_Plots_Sulphates_Quality_Table}
# Quality per sulphates levels
by(red$quality_cat, round(red$sulphates,1), table)
```

Surprisingly, the sulphates and the volatile acidity form clusters of the red wine quality levels. The sulphates range [0.6,0.9] and the volatile acidity range [0.3,0.6] contain the largest number of high quality wines. This might be very insightful for determining the quality of a red wine, and these two variables might be the ones that interest us the most.
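Ranges like these can be counted in a single cross-table by binning the variable into explicit intervals instead of rounding it. A small sketch on made-up values (`sulphates`, `quality`, `bins` and `tab` below are illustrative and stand in for the real columns):

```r
# Made-up values standing in for red$sulphates and red$quality_cat
sulphates <- c(0.45, 0.52, 0.58, 0.63, 0.66, 0.71, 0.74, 0.82, 0.88, 1.10)
quality   <- factor(c(5, 5, 5, 6, 6, 7, 7, 7, 8, 5))

# Bin sulphates into explicit intervals (right-closed by default)
bins <- cut(sulphates, breaks = c(0.3, 0.6, 0.9, 1.2))

# Rows are sulphate ranges, columns are quality levels
tab <- table(bins, quality)
tab
```

The break points here are chosen only to echo the ranges mentioned above; other cut points may be more appropriate for the real data.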

Let's analyze these two variables (volatile acidity and sulphates) and compare each of them against the citric acid variable, which from level 5 quality showed a trend that higher the citric acid, the higher the quality.

We will start with the sulphates and the citric acid plot

```{r echo=FALSE, Multivariate_Plots_Sulphates_CitricAcid_Quality}
# plot of variables sulphates and citric acid
ggplot(aes(x = sulphates, y = citric.acid, 
           color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(size=2) +
  scale_fill_brewer(palette = 'Greens') +
  scale_color_brewer(type='div') +
  theme(panel.background = element_rect(fill='gray'))
```

Surprisingly, citric acid and sulphates do not cluster the quality levels like volatile acidity and sulphates did, despite citric acid being negatively correlated with volatile acidity.

Let's continue the analysis comparing volatile acidity and citric acid. 

```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_CitricAcid_Quality}
# plot of variables volatile acidity and citric acid
ggplot(aes(x = volatile.acidity, y = citric.acid, color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(size=2) +
  scale_fill_brewer(palette = 'Greens') +
  scale_color_brewer(type='div') +
  theme(panel.background = element_rect(fill='gray'))
```

```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_CitricAcid_Correlation}
# correlation of volatile acidity and citric acid
cor(red$volatile.acidity, red$citric.acid, method = "spearman")
```

After comparing the volatile acidity and the citric acid, we can see a slight negative correlation between these two variables. However, the level 8 quality values are very sparse. Level 7 seems to be clustered in two blobs, but it is pretty similar to what we saw in the citric acid and sulphates plot. Thus, citric acid might have a small influence, but it is not a major factor. Also, volatile acidity is slightly correlated with citric acid, so if we were to build a model we would rather take volatile acidity and sulphates for now.


Now that we have checked the first three variables that showed clearer trends related to quality, we can move on to our last two variables: alcohol and density. 
Let's move forward in the analysis using the alcohol variable and compare it against our two main variables.

As a reminder, alcohol showed a trend in which the higher the alcohol, the higher the quality.

Let's start comparing alcohol and volatile acidity.

```{r echo=FALSE, Multivariate_Plots_Alcohol_VolatileAcidity_Quality}
# plot of alcohol and volatile acidity
ggplot(aes(x = alcohol, y = volatile.acidity, color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(size=2) +
  scale_fill_brewer(type='seq') +
  scale_color_brewer(type = 'seq') +
  theme(panel.background = element_rect(fill='darkgray')) +
  facet_wrap(~quality_cat)
```

```{r echo=FALSE, Multivariate_Plots_Alcohol_VolatileAcidity_Quality_Table}
# Quality per alcohol levels
by(red$quality_cat, round(red$alcohol), table)
```

In the plot, it seems that quality tends to be good when both volatile acidity and alcohol are low. However, as alcohol gets higher, volatile acidity can also go higher and still yield some good quality wines.

The alcohol range for which we get higher qualities goes from 10 to 14 % volume, while the volatile acidity, as we saw in previous plots, goes from 0.3 to 0.6 $g/dm^3$.
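
The per-level counts produced by `by(..., table)` can also be read as a single cross-tabulation, which may be easier to scan. This is only a sketch on a made-up `toy` data frame; it assumes `quality_cat` is a factor version of `quality`, as it appears to be in the earlier sections.

```r
# Toy stand-in for the wine data (values are made up for illustration).
toy <- data.frame(
  alcohol     = c(9.2, 9.4, 10.6, 11.1, 11.4, 12.8, 13.1),
  quality_cat = factor(c(5, 5, 6, 6, 7, 7, 8), levels = 3:8)
)

# One row per rounded alcohol level, one column per quality level.
counts <- table(alcohol = round(toy$alcohol), quality = toy$quality_cat)
print(counts)
```

Each row of `counts` corresponds to one block of the `by()` output, so the alcohol levels where higher qualities start to appear can be compared side by side.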

Now that we have compared alcohol to volatile acidity, let's compare it to the sulphates.

```{r echo=FALSE, Multivariate_Plots_Alcohol_Sulphates_Quality}
# Plot of strongest variable alcohol and the third strongest sulphates
ggplot(aes(x = alcohol, y = sulphates, color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(alpha=0.85, size = 8) +
  scale_fill_brewer( palette = 'Purples') +
  scale_colour_brewer(palette = 'Purples') +
  theme(panel.background = element_rect(fill='darkgray'))

```

This plot shows that the higher quality wines fall in a specific range of sulphates and alcohol values. We can count the wines per quality level according to alcohol and sulphates to find the ranges where the best quality wines live.

Starting with the alcohol and quality counts:

```{r echo=FALSE, Multivariate_Plots_Alcohol_Quality_Table}
# Quality per Alcohol levels
by(red$quality_cat, round(red$alcohol), table)
```

Then, we move to sulphates and quality counts:

```{r echo=FALSE, Multivariate_Plots_Sulphates_Quality_Table2}
# Quality per Sulphates levels
by(red$quality_cat, round(red$sulphates,1), table)
```

From the plot and statistics, we can see that it is more common to have good wines above 10 % volume alcohol (except at 15 % vol), and the higher the alcohol level, the higher the chances of a grade 8 wine. Also, it seems grade 8 wines mainly appear when sulphates are in the range of 0.6 to 0.9.
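
One way to turn such counts into a range statement is to bin the variable and compute the share of good wines per bin. The sketch below uses a made-up `toy` data frame; the break points and the "quality of 7 or more" threshold are illustrative assumptions, not values derived from the dataset.

```r
# Toy stand-in for the wine data (values are made up for illustration).
toy <- data.frame(
  sulphates = c(0.45, 0.52, 0.58, 0.63, 0.70, 0.76, 0.82, 0.88, 1.10, 1.30),
  quality   = c(5,    5,    6,    7,    8,    7,    8,    6,    5,    6)
)

# Bin sulphates into three ranges; intervals are open on the left,
# closed on the right, e.g. (0.6,0.9].
toy$sulphate_bin <- cut(toy$sulphates, breaks = c(0.3, 0.6, 0.9, 1.4))

# Share of wines with quality >= 7 inside each sulphates range.
share_good <- tapply(toy$quality >= 7, toy$sulphate_bin, mean)
print(share_good)
```

On the real data, the same two lines could be applied to `red$sulphates` and `red$quality` to check whether the 0.6 to 0.9 range really concentrates the best wines.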


Having compared alcohol with sulphates and found no very strong pattern, we can plot alcohol versus citric acid.

```{r echo=FALSE, Multivariate_Plots_Alcohol_CitricAcid_Quality}
# plot of variables alcohol and citric acid
ggplot(aes(x = alcohol, y = citric.acid, color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(size=2) +
  scale_fill_brewer(palette = 'Greens') +
  scale_color_brewer(type='div') +
  theme(panel.background = element_rect(fill='gray'))
```

The higher quality wines are very sparse, so these variables together might not be a good indicator of quality.

Finally, let's explore the alcohol and density since they show a correlation in our correlation matrix.

```{r echo=FALSE, Multivariate_Plots_Alcohol_Density_Quality}
# plot of variables alcohol and density
ggplot(aes(x = alcohol, y = density, color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(alpha=0.7, size=2) +
  scale_fill_brewer(palette ='Purples') +
  scale_color_brewer(palette ='Purples') +
  theme(panel.background = element_rect(fill='darkgray'))
```

Although alcohol and density are slightly correlated, together they do not explain the high quality wines.

Before drawing conclusions about the multivariate analysis, it might be interesting to see whether volatile acidity per % volume of alcohol actually helps alcohol play a part in this analysis.

```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_per_Alcohol_vs_Sulphates_Quality}
# Plot of volatile acidity per alcohol and sulphates
ggplot(aes(x = volatile.acidity/alcohol, y = sulphates , color = quality_cat, fill = quality_cat), data = red) +
  geom_hex(alpha=0.7, size = 2) +
  scale_fill_brewer( type = 'div') +
  scale_colour_brewer( type = 'seq') +
  theme(panel.background = element_rect(fill='darkgray'))

```

Surprisingly, dividing the volatile acidity by the % volume of alcohol brought the higher quality wines closer together, making them easier to cluster than with the undivided volatile acidity.

We can compare this plot against the first plot of this multivariate analysis if we use some numbers.

```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_per_Alcohol_Quality_Table}
# Quality per volatile acidity by alcohol levels
by(red$quality_cat, round(red$volatile.acidity/red$alcohol,2), table)
```

From this table and the table for the first plot of the section, we can see that dividing the volatile acidity by alcohol definitely helped bring the high quality wines together. In the first plot of volatile acidity versus sulphates, the maximum number of quality 8 wines was 8, and it occurred when volatile acidity was 0.4 $g/dm^3$. On the other hand, when we plotted volatile acidity per alcohol versus sulphates, the maximum number of quality 8 wines was 12, and it occurred at 0.03 $g/dm^3$ per % vol of alcohol. This definitely shows a tighter cluster for wine quality.
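
The comparison of the two maximum counts can be sketched in code as follows. The `toy` data frame is made up and deliberately constructed so that the ratio pulls the quality 8 wines into one bin; it only illustrates the mechanics and says nothing about the real counts.

```r
# Toy stand-in for the wine data (values are made up for illustration).
toy <- data.frame(
  volatile.acidity = c(0.31, 0.36, 0.44, 0.42, 0.60, 0.70),
  alcohol          = c(10.0, 11.5, 13.0, 12.5, 9.5,  9.8),
  quality          = c(8,    8,    8,    7,    5,    5)
)

# Derived feature: volatile acidity per % volume of alcohol.
toy$va_per_alcohol <- toy$volatile.acidity / toy$alcohol

# Keep only the top-quality wines and count them per rounded level.
top <- toy[toy$quality == 8, ]
counts_raw   <- table(round(top$volatile.acidity, 1))
counts_ratio <- table(round(top$va_per_alcohol, 2))

# Largest group of quality 8 wines under each variable.
print(c(raw = max(counts_raw), ratio = max(counts_ratio)))
```

A larger maximum for the ratio means the top-quality wines are concentrated in fewer levels, which is the sense in which the cluster is described as tighter.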


If we want to go a bit further we can try to measure correlation for `cor( volatile acidity, sulphates)` and `cor( volatile acidity per % volume of alcohol, sulphates)`. 

  > <font size="2">Note: We only need numbers for the volatile acidity divided by alcohol since sulphates stayed the same.</font>

```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_per_Alcohol_Sulphates_Corr}
# Correlation with sulphates, before and after dividing volatile acidity by alcohol
cor(red$volatile.acidity, red$sulphates,method = "spearman")
cor(red$volatile.acidity/red$alcohol, red$sulphates, method= "spearman")
```

We can see that the first correlation coefficient is closer to 0, which means volatile acidity and sulphates are less correlated than volatile acidity per alcohol and sulphates.

I still see that they are not strongly correlated, but this is more of a classification problem, so multiple logistic regression would be a better candidate for building a model in the future.

***

# Multivariate Analysis

### Were there features that strengthened each other in terms of looking at your feature(s) of interest?

It was surprising to find that sulphates in the range of 0.6 to 0.9 $g/dm^3$ and volatile acidity in the range of 0.3 to 0.6 $g/dm^3$ contain the highest quality wines. It was also interesting to find that these two chemical properties form clusters of the quality levels. This makes sense since quality is a category and we are facing a classification problem.

### Were there any interesting or surprising interactions between features?

Finding that volatile acidity and sulphates are very promising for determining the quality of a wine was surprising, but it was even more interesting to find that dividing the volatile acidity by the alcohol and plotting that against the sulphates gave an even better relation for identifying the chemicals that may be driving the quality of a wine.

In fact, the largest group of quality 8 wines was 50% greater when volatile acidity was divided by alcohol than when it was left alone.
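
The 50% figure follows from the two maximum counts of quality 8 wines reported in the tables above (8 for volatile acidity alone, 12 for volatile acidity per alcohol):

$$
\frac{12 - 8}{8} = 0.5 = 50\%
$$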

This really exemplified the importance of combining features.

***

# Final Plots and Summary

### Plot One
```{r echo=FALSE, Plot_One}
# bar plot of quality
ggplot(aes(x = quality_cat, fill = quality_cat ), data = red) +
  geom_bar(stat = "count") +
  scale_fill_brewer(type='seq') +
  ggtitle("Quality of Wines") +
  labs(x = "Quality Level", y = "Number of Wines",
       caption = '\nNote: Wine quality goes from 0 (lowest) to 10 (highest)') +
  theme(panel.background = element_rect(fill='black'), 
        panel.grid = element_blank(),
        legend.position = "None", 
        plot.title = element_text(face = "bold", hjust = 0.5,size = 18),
        axis.title.x = element_text(color = 'black'),
        axis.title.y = element_text(color = 'black'),
        plot.caption = element_text(hjust = 1, size = 9, color = 'darkgray'))
```

### Description One
This was the first plot we made, and it helped us understand the number of wines per quality category in our dataset. Clearly, we have more quality 5 and 6 wines than any other category. Knowing that helped us accept that the chemical components will not necessarily show normal distributions. In other words, we will have many skewed distributions, and these might be the ones to look at to find promising variables that explain the quality of a red wine.
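
The imbalance shown in the bar plot can also be expressed as shares per category. This is a minimal sketch on a made-up `quality` vector, assuming `quality_cat` is built as a factor of the quality scores.

```r
# Made-up quality scores, only to illustrate the computation.
quality <- c(3, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8)
quality_cat <- factor(quality, levels = 3:8)

# Counts per category, the numbers behind a bar plot.
counts <- table(quality_cat)

# Proportion of wines in each category.
shares <- prop.table(counts)
print(round(shares, 2))
```

Shares that are heavily concentrated in one or two categories are a quick numeric signal of the imbalance discussed in this description.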


### Plot Two
```{r echo=FALSE, Plot_Two}
# Boxplot of Quality and Volatile Acidity
p1 <- ggplot(aes(y = volatile.acidity, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq') +
  labs(title='Volatile Acidity by Quality',
       x = '[Low]                     Quality                     [High]',
       y = 'Volatile Acidity [g/dm^3]') +
  theme(legend.position = "None", 
        plot.title = element_text(face = "bold", hjust = 0.5,size = 15),
        axis.title.x = element_text(color = 'black', size = 10),
        axis.title.y = element_text(color = 'black', size = 10))

# Boxplot of Quality and Sulphates
p2 <- ggplot(aes(y = sulphates, x = quality_cat ), data = red) +
  geom_boxplot(aes(fill = quality_cat)) +
  scale_fill_brewer(type = 'seq') +
  labs(title='Sulphates by Quality',
       x = '[Low]                     Quality                     [High]',
       y = 'Sulphates [g/dm^3]') +
  theme(legend.position = "None", 
        plot.title = element_text(face = "bold", hjust = 0.5,size = 15),
        axis.title.x = element_text(color = 'black', size = 10),
        axis.title.y = element_text(color = 'black', size = 10))

# Putting together the plots in a single image
grid.arrange(p1,p2,ncol=2)

```

### Description Two
Once we analyzed the variable distributions by quality, more details emerged about which chemical properties are related to quality levels. The box plots allowed us to discover two important chemical properties: volatile acidity and sulphates.
The box plots show that volatile acidity ($g/dm^3$) decreases as the quality of a wine increases, while sulphates ($g/dm^3$) increase in higher quality wines.
Finding these two variables helped us understand that there are some chemical properties that can explain the quality of a red wine.
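
The trends read off the box plots can be checked numerically by comparing medians per quality level. The sketch below uses a made-up `toy` data frame whose values are chosen to mimic the described trends; it is not a summary of the real data.

```r
# Toy stand-in for the wine data (values are made up for illustration).
toy <- data.frame(
  quality_cat      = factor(c(5, 5, 5, 6, 6, 6, 7, 7, 7)),
  volatile.acidity = c(0.70, 0.58, 0.62, 0.50, 0.54, 0.46, 0.40, 0.36, 0.38),
  sulphates        = c(0.50, 0.56, 0.54, 0.62, 0.66, 0.64, 0.74, 0.76, 0.72)
)

# Median of each variable per quality level (the middle line of each box).
medians <- aggregate(cbind(volatile.acidity, sulphates) ~ quality_cat,
                     data = toy, FUN = median)
print(medians)
```

Medians that fall steadily for volatile acidity and rise steadily for sulphates would match what the two box plots show.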


### Plot Three
```{r echo=FALSE, Plot_Three}
# Plot of volatile acidity per alcohol and sulphates
ggplot(aes(x = volatile.acidity/alcohol, y = sulphates , color = quality_cat, 
           fill = quality_cat), data = red) +
  geom_hex(alpha=0.8, size = 2) +
  coord_cartesian(ylim=c(0.3, 1.4), xlim=c(0.008,0.13)) +
  scale_fill_brewer( type = 'div') +
  scale_colour_brewer( type = 'seq') +
  labs(title = 'Quality of Wines Clustered by \nSulphates and Volatile Acidity per Alcohol % Vol',
       x = 'Volatile Acidity per % Alcohol [g/dm^3]',
       y = 'Sulphates [g/dm^3]',
       fill = 'Wine Quality', colour = 'Wine Quality',
       caption = '\nNote: Wine quality goes from 0 (lowest) to 10 (highest)') +
  theme(panel.background = element_rect(fill='darkgray'),
        plot.title = element_text(hjust = 0.5, face = 'bold'),
        legend.title = element_text(hjust = 0.5, size = 9, face = 'bold'),
        legend.justification = "right",
        plot.caption = element_text(hjust = 1, size = 9, color = 'darkgray'))

```

### Description Three
This plot shows how the different quality levels of red wine overlap, and at the same time it shows clusters of the quality levels. This is very important when identifying the quality of a specific wine, since this is a classification problem. 
The fact that we see clusters using volatile acidity per alcohol and sulphates means that these chemical properties are good indicators of a red wine's quality. 
We can also see some outliers in the data, but mainly we see groups of our quality levels. One can also see the quality 5 and 6 clusters spread all over the plot, which confirms what we learned from our dataset at the very beginning.
All in all, this plot can be very useful to see the different quality levels and the ranges of volatile acidity per % alcohol and sulphates in which one can find higher quality wines.

***

# Reflection

In this project, I learned a lot about red wine quality and the chemical properties that are measured to give a wine a quality grade. 

The hardest part of the analysis was starting to understand the data and finding something useful to begin with, then continuing to unveil more and more in each new analysis. 

While plotting univariate distributions, it was very hard to determine whether a variable might be related to an important finding; relying on intuition was both key and challenging. Later on, during the bivariate analysis, it was a bit easier to decide which variables to choose based on the trends with quality seen in the box plots, but the possible plots to explore and relations to find were too many to consider them all. At that point I decided to explore correlated variables and to start simple, so that my decisions would rest on simple relations rather than very complicated ones.

However, I felt I was approaching a dead end whenever a chemical property showed no relation to quality. It was challenging that things that made sense to me, such as alcohol being related to wine quality, did not really hold when exploring the data. Still, I was successful in finding volatile acidity and sulphates as important variables that seemed to follow a trend related to quality levels. From there things became a bit easier, and I knew exactly what I wanted to do at the very beginning of the multivariate analysis.

In this final analysis, I started by plotting volatile acidity versus sulphates, colored by quality, to see if there were any visible signs of clusters of wine quality. In fact, there were clusters, a bit spread out but clear enough to see. I then reached a dead end in which no other chemical components formed a cluster or explained quality. That was when I decided that my first plot of volatile acidity and sulphates was the best I had gotten so far. A day later, I still could not believe that alcohol % volume was not related to quality, since wine is alcohol. That led me to divide volatile acidity by alcohol, and what I found once I plotted those variables blew my mind: the clusters I had found in the plot of volatile acidity versus sulphates were now tighter. 
This finding was a reward for the hard effort of figuring out whether other variables that logically seemed connected to the quality of a wine, especially alcohol, were actually connected.

All in all, I had a great experience analyzing the data, struggling and figuring out a new way when things didn't look good on the path I was taking. Other times, I just had to explore until I found something useful to move further with. This project definitely taught me how to approach an EDA to find something interesting, and also to trust intuition once in a while.

# Future Work

As for future work with these findings, the best next step would be to create a multi-class logistic regression model. I would also be tempted to get more data to obtain a more balanced dataset across all categories, including quality levels 0, 1, 2, 9 and 10. This would definitely benefit the model we build, and we might even have to do another quick exploration to determine whether our current chemical properties still explain red wine quality.
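
As a rough sketch of what such a model could look like, the example below fits a multinomial logistic regression with `nnet::multinom` (the `nnet` package is normally installed with R as a recommended package). Everything about the data is an assumption: the `toy` data frame is simulated, the three quality labels and the rule that generates them are made up, and the two predictors merely echo the variables highlighted in this report.

```r
library(nnet)  # provides multinom() for multinomial logistic regression

set.seed(42)
n <- 150

# Simulated predictors in ranges loosely resembling the plots above.
toy <- data.frame(
  va_per_alcohol = runif(n, 0.01, 0.12),
  sulphates      = runif(n, 0.3, 1.2)
)

# Made-up rule: lower acidity per alcohol and higher sulphates raise the score.
score <- -40 * toy$va_per_alcohol + 3 * toy$sulphates + rnorm(n, sd = 0.5)

# Split the score into three equally sized, hypothetical quality classes.
toy$quality_cat <- cut(score,
                       breaks = quantile(score, c(0, 1/3, 2/3, 1)),
                       labels = c("low", "medium", "high"),
                       include.lowest = TRUE)

# Fit the multi-class model and predict a class for every wine.
fit  <- multinom(quality_cat ~ va_per_alcohol + sulphates,
                 data = toy, maxit = 500, trace = FALSE)
pred <- predict(fit, newdata = toy, type = "class")

# Share of wines whose predicted class matches the simulated class.
accuracy <- mean(pred == toy$quality_cat)
print(accuracy)
```

On the real dataset the formula would use `red$quality_cat` and the measured variables, and the accuracy should be estimated on held-out wines rather than on the training data as done here.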

# Limitations

The limitations of this analysis are related to class imbalance and a possible bias toward the quality 5 and 6 red wines. Also, we explored only one of the relations that seemed strong enough to explain quality; since this is an open-ended problem, someone else could find different relations that come close to explaining the quality of a red wine. Another limitation is that the dataset contains 12 variables related to a wine, but there might be more, such as how long the wine was left to mature, the type of grapes used, the region where the wine is from, etc. More variables can contribute to determining the quality, and we might be looking at only a small set of them. All in all, be aware that this analysis helps understand quality from the current dataset, but not for every type of wine you can find out there in the world.

## Resources

No special resources were used while creating this project other than the [documentation of R ggplot](https://ggplot2.tidyverse.org/index.html).