---
title: "Red Wine Quality Analysis"
author: "by Arturo Parrales Salinas"
date: "8/2/2018"
output: html_document
---
***
***
### Main Research Goal
The variable that interests us most is **quality**, since we want to understand **which chemical properties influence the quality of red wines**.
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
#install.packages('corrplot')
#install.packages('psych')
library(corrplot)
library(purrr)
library(tidyr)
library(ggplot2)
library(psych)
library(gridExtra)
```
We load the red wine quality dataset. It is tidy and contains 1,599 red wine records with 12 variables: 11 chemical properties of the wine plus a quality rating. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The 12 variables of the wine are listed below:
```{r echo=FALSE, Load_the_Data}
# Load the Data
red <- read.csv('wineQualityReds.csv')
str(red)
```
We can see a variable X, which is just the index of each record in the dataset. We definitely want to remove X before moving forward.
```{r echo=FALSE, Univariate_Plots_Remove_X}
# remove X
red <- red[,-1]
str(red)
```
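Dropping a column by position works here because X happens to be the first column. As a minimal sketch of the same step done by name, the example below uses a small made-up data frame (`toy`, `toy_clean` and their values are illustrative assumptions, not the wine data):

```r
# Toy data frame standing in for the wine data (values are made up)
toy <- data.frame(X = 1:3,
                  fixed.acidity = c(7.4, 7.8, 11.2),
                  quality = c(5L, 5L, 6L))

# Drop the index column by name instead of by position
toy_clean <- toy[, setdiff(names(toy), "X"), drop = FALSE]

names(toy_clean)
```

Selecting by name keeps working even if the column order of the file changes.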
With X removed, we can continue to understand the variables in the dataset.
### Attribute Measure Units of Each Variable
For more information, read [Cortez et al., 2009].
**Input variables** (based on physicochemical tests):
1. **Fixed acidity** (tartaric acid - $g / dm^3$)
2. **Volatile acidity** (acetic acid - $g / dm^3$)
3. **Citric acid** ($g / dm^3$)
4. **Residual sugar** ($g / dm^3$)
5. **Chlorides** (sodium chloride - $g / dm^3$)
6. **Free sulfur dioxide** ($mg / dm^3$)
7. **Total sulfur dioxide** ($mg / dm^3$)
8. **Density** ($g / cm^3$)
9. **pH**
10. **Sulphates** (potassium sulphate - $g / dm^3$)
11. **Alcohol** (% by volume)
**Output variable** (based on sensory data):
12. **Quality** (score between 0 and 10)
> <font size="2">Note: The dataset has no missing attribute values.</font>
### Description of Attributes
1. **Fixed acidity:** most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2. **Volatile acidity:** the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3. **Citric acid:** found in small quantities, citric acid can add 'freshness' and flavor to wines
4. **Residual sugar:** the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
5. **Chlorides:** the amount of salt in the wine
6. **Free sulfur dioxide:** the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7. **Total sulfur dioxide:** amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8. **Density:** the density of wine is close to that of water, depending on the percent alcohol and sugar content
9. **pH:** describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10. **Sulphates:** a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11. **Alcohol:** the percent alcohol content of the wine
**Output variable** (based on sensory data):
12. **Quality:** the score of a red wine, between 0 (lowest) and 10 (highest)
***
# Univariate Plots Section
The previous sections gave an overview of the dataset; here we start with a summary of each variable.
```{r echo=FALSE, Univariate_Plots_Var_Summary}
# dataset summary
summary(red)
```
Let's start by looking at the **quality** summary. The lowest quality of red wines was 3 and the highest was 8. This tells us there are neither very bad nor very excellent wines in this dataset. Also, we want to make sure our **quality** variable is actually categorical (we need it as a factor in R).
```{r echo=FALSE, Univariate_Plots_Factor_Quality}
# making quality a factor
red$quality_cat <- factor(red$quality)
str(red$quality_cat)
table(red$quality_cat)
```
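Since the quality grades have a natural order, one option (not used in this report) is an ordered factor. A minimal sketch with made-up scores, where `scores` and `scores_ord` are hypothetical names:

```r
# Made-up quality scores standing in for red$quality
scores <- c(5, 6, 5, 7, 3, 8, 6)

# An ordered factor keeps the ranking of the levels (3 < 5 < 6 < ...)
scores_ord <- factor(scores, ordered = TRUE)

levels(scores_ord)
scores_ord[1] < scores_ord[2]   # comparisons are allowed on ordered factors
```

A plain factor, as used in the chunk above, is enough for grouping and colouring in the plots; the ordered variant only matters when levels need to be compared.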
We are sure the **quality** variable is categorical and we can continue exploring it more in detail.
```{r echo=FALSE, Univariate_Plots_Quality_Bar}
# bar plot of quality
ggplot(aes(x = quality_cat, fill = quality_cat), data = red) +
geom_bar(stat = "count") +
scale_fill_brewer(type='seq') +
theme(panel.background = element_rect(fill='black'), panel.grid = element_blank())
```
From this plot we can tell the quality 5 is the most frequent and it is closely followed by quality 6. On the other hand, we have 3 and 8 as the least frequent. This plot will help us understand why we might have more medium quality wines in our future plots.
There is one variable that stands out at a quick glance in the summary. The **density** seems to have a very tiny difference between minimum, median and maximum values.
```{r echo=FALSE, Univariate_Plots_Density}
# Summary of the density variable
summary(red$density)
```
```{r echo=FALSE, warning=FALSE, Univariate_Plots_Density_Hist}
# Histogram of density
ggplot(aes(x = density), data = red) +
geom_histogram(bins=30)
```
Just as expected the density is mainly between 0.995 and 1 $g/cm^3$, which seems an indication of all wines having similar density.
Once we explored quality and density it might be good to look at the other variables.
```{r echo=FALSE, warning=FALSE, Univariate_Plots_All_Hist}
# Histograms of all the numeric variables in the dataset
# (X was already removed above, so no column is dropped here)
red %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins=30)
```
From this plot, we can see that density is plotted just as before, which confirms that our earlier plot was correct.
Also, we can go further and explore a couple of variables in more detail.
```{r echo=FALSE, Univariate_Plots_FixedAcidity}
# Histogram of fixed acidity
ggplot(aes(x = fixed.acidity), data = red) +
geom_histogram(binwidth=1)
```
```{r echo=FALSE, Univariate_Plots_FixedAcidity_Summary}
# Summary of the fixed acidity variable
summary(red$fixed.acidity)
```
The fixed acidity peaks in the range 7 to 9 $g/dm^3$. Since the most frequent qualities are 5 and 6, fixed acidity levels of 7 to 9 might correspond to the quality range 5 to 6. Following the same intuition, the least frequent values in the histogram might belong to either higher or lower quality wines. In the next section, we will investigate this further using two variables.
Now, we can check the volatile acidity.
```{r echo=FALSE, Univariate_Plots_VolatileAcidity}
# Histogram of volatile acidity
ggplot(aes(x = volatile.acidity), data = red) +
geom_histogram(binwidth=0.02)
```
```{r echo=FALSE, Univariate_Plots_VolatileAcidity_Table}
# Table of the volatile acidity variable
table(round(red$volatile.acidity,1))
```
If we use fine bins for the **volatile.acidity** histogram, we can see two or three peaks at 0.4, 0.5 and 0.6. If we follow the highest two peaks at 0.4 and 0.6, we can imagine them to be related to the most frequent wine qualities, so volatile acidity around these peaks might be mainly related to quality 5-6. To confirm this, we will need a more elaborate plot with two variables.
For the time being, we can continue to explore other variables such as **citric.acid**.
```{r echo=FALSE, Univariate_Plots_CitricAcid_Hist}
# Histogram of citric acid
ggplot(aes(x = citric.acid), data = red) +
geom_histogram(binwidth=0.005)
```
```{r echo=FALSE, Univariate_Plots_CitricAcid_Table}
# Table of the citric acid variable
table(round(red$citric.acid,2))
```
From a fine histogram the **citric.acid** seems not to be a clear contributing factor to a red wine quality. However, the fact that there is still more than one peak makes us doubt if each peak can be related to a group of quality.
```{r echo=FALSE, Univariate_Plots_Residual_Sugars}
# Histogram of residual sugars
ggplot(aes(residual.sugar), data = red) +
geom_histogram(bins=30)
```
```{r echo=FALSE, Univariate_Plots_ResidualSugar}
# Summary of the residual sugar variable
summary(red$residual.sugar)
```
Surprisingly, most wines have low **residual.sugar**, and it could be that good quality is associated with extremely low or high residual sugar. This might be a good variable to help us distinguish low quality from good quality wines.
Finally, let's check another right-skewed distribution (with a long right tail) that, judging by its name, seems likely to be related to wine quality: alcohol.
```{r echo=FALSE, Univariate_Plots_Alcohol}
# Histogram of alcohol
ggplot(aes(alcohol), data = red) +
geom_histogram(bins=10)
```
```{r echo=FALSE, Univariate_Plots_Alcohol_Summary}
# Summary of the alcohol variable
summary(red$alcohol)
```
This histogram seems to have a peak between 9.5 and 10.5, but very low counts between 8.4 and 9.5. Similarly, the higher the alcohol level gets, the smaller the bin height gets. This could indicate that alcohol is associated with quality, as both low and high quality wines are few.
> <font size="2">Note: We will need to explore more, but it seems skewed distributions might be related to quality.</font>
***
# Univariate Analysis
This univariate analysis was the first step of the exploration and a way to get familiar with the data. Basically, some histograms were plotted to understand the distributions of the features and to find the most frequent quality grades of red wines in the dataset.
### What is the structure of your dataset?
The dataset is tidy and contains 1,599 red wine records with 12 variables: 11 chemical properties of the wine plus a quality rating. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). However, the quality of the wines in the dataset only ranges from 3 to 8, since there are no wines graded 0, 1, 2, 9 or 10 in our data.
### What is/are the main feature(s) of interest in your dataset?
**Quality** is the main interest variable. Our goal is to figure out which elements contribute to the quality of a wine.
From our exploration I could tell that the quality is mainly grade 5 or 6. Using some intuition at this point, we might consider features with tailed histograms, as there is more information on grades 5 and 6 than on lower and higher red wine qualities.
This idea also applies to distributions that seem bimodal or have more than one peak, such as citric acid. In my opinion, these distributions might also have a hidden pattern related to quality, and that might be the reason for having more than one mode.
Thus, after our histograms, some variables that seem promising when understanding quality are:
* Citric acid
* Fixed acidity
* Free sulfur dioxide
* Volatile acidity
* Total sulfur dioxide
* Alcohol
* Residual sugar
* Chlorides
* Sulphates
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
I also explored the density, and it seems that it can be different for all wines, but the interesting part will be to explore whether the tails of the distribution contain wines of all qualities or whether they are related to lower and higher quality wines.
### Did you create any new variables from existing variables in the dataset?
No, there was not a need for now to create a new variable.
### Of the features investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
In the histograms, we found:
**Normal or very close to normal distributions**
* pH
* Density
**Right-skewed distributions (long right tail)**
* Total Sulfur Dioxide
* Alcohol
* Fixed Acidity
* Free Sulfur Dioxide
* Volatile Acidity
* Chlorides
* Sulphates
* Residual sugar
**Bimodal or not normal distributions**
* Citric Acid
It seems that the non-normal distributions might be more related to quality, since most wines have quality 5 or 6. This could mean that the tails correspond to the lowest and highest quality wines in the dataset.
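The grouping above was made by eye from the histograms. A quantitative check could use sample skewness, where a value near 0 suggests a symmetric distribution and a positive value a longer right tail. The sketch below uses a simple moment-based estimator on made-up samples; the `skewness` helper and the `sym` and `right` vectors are illustrative assumptions, not part of the original analysis:

```r
# Sample skewness: third central moment divided by the variance to the power 1.5
# (a simple moment-based estimator; other definitions exist)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
sym   <- rnorm(1000)   # roughly symmetric, made-up sample
right <- rexp(1000)    # long right tail, made-up sample

skewness(sym)     # close to 0
skewness(right)   # clearly positive
```

Applied to a column such as `red$alcohol`, a clearly positive value would support placing that variable in the skewed group.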
Luckily this dataset was made from tidy data and the only variable that needed a type change was quality. This was to make it a factor and have it as a true categorical variable rather than numerical.
***
# Bivariate Plots Section
In the previous section, we used our intuition to choose some relevant variables, so it is a good idea to find the correlation between all variables in the dataset to narrow our exploration and remove possible collinearities.
To start our bivariate analysis, we check which variables might be correlated with each other. For this we will use Pearson's correlation coefficient.
```{r echo=FALSE, Bivariate_Plots_Corr_Coef}
# Correlation Matrix
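# Keep the 11 chemical variables; the last two columns (quality and
# quality_cat) are left out of the correlation matrix
cols <- ncol(red)
mcor <- cor(red[, 1:(cols - 2)])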
round(mcor, 2)
```
Since the relationships are hard to see in the correlation matrix, we will plot it.
```{r echo=FALSE, Bivariate_Plots_Corr_Plot}
# Plot of the correlation matrix
corrplot(mcor, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
```
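As a complement to the plot, the strongest pairwise correlations can also be listed numerically. The following is a minimal sketch on made-up data; `toy`, `m` and `cor_pairs` are hypothetical names, and in the report `mcor` would take the place of `m`:

```r
# Made-up data standing in for the numeric wine variables
set.seed(42)
a <- rnorm(50)
toy <- data.frame(a = a,
                  b = a + rnorm(50, sd = 0.1),   # strongly related to a
                  c = rnorm(50))                 # unrelated noise

m <- cor(toy)

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(m), arr.ind = TRUE)
cor_pairs <- data.frame(var1 = rownames(m)[idx[, "row"]],
                        var2 = colnames(m)[idx[, "col"]],
                        r    = m[idx])
cor_pairs[order(-abs(cor_pairs$r)), ]
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which is what matters when looking for collinearity.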
From the plot we can easily see what variables are related and choose which ones to analyze:
Chlorides and sulphates are correlated so choosing only one of them might be good to start our analysis. Thus, I will add **sulphates** to our list.
The same happens for fixed acidity and citric acid. However, **citric acid** is a variable with a strange distribution, so we better keep it to explore how this relates to quality.
Citric acid seems also correlated to volatile acidity. However, volatile acidity is not correlated to fixed acidity. It might be good to keep **volatile acidity** and explore how it can contribute to quality.
Total sulfur dioxide and free sulfur dioxide seem correlated, but not with anything else. Thus, I will choose the **total sulfur dioxide** to investigate it.
We have already decided to keep citric acid, so we can keep **alcohol** even though it is slightly correlated with other variables, mainly to test whether the alcohol level actually impacts wine quality.
> <font size="2">Note: Wine is alcohol, so my curiosity drives me to understand if the alcohol level matters.</font>
Finally, from our promising features, we have the **residual sugar**, which seems not to be correlated to anything. It also had a slightly right-skewed distribution, so it might be worth checking whether it affects quality or is neutral.
Besides these features, we had also decided to check whether **density** plays a part in quality, so we will explore it as well.
Now, we can generate plots to inspect the variables we chose to explore to find insights about the relationship of them to **quality**.
```{r echo=FALSE, warning=FALSE, Bivariate_Plots_All_Vars}
# Plot of relevant variables
interest = c('sulphates','density','citric.acid','alcohol','volatile.acidity','total.sulfur.dioxide', 'residual.sugar')
pairs.panels(red[,interest],bg=c("yellow","blue")[(red > 10)+1],pch=21,
main = 'Relation between Relevant Variables')
```
From the plot, we can see that alcohol and density show the slight correlation we expected from before. The same applies to volatile acidity and citric acid. Besides those variables, everything else seems to have almost no correlation, which is a good sign when looking for independent variables that relate to quality.
> <font size="2">Note: Density is actually the variable with most of the slight correlations in the plot, so if we find density to be not very important, we can remove it. It also has a normal distribution, which can be a reason for it correlating slightly with other variables.</font>
We can explore further relations between quality and the other variables. Let's start with **density**, since we noticed most wines have similar densities.
## Density
```{r echo=FALSE, Bivariate_Plots_Density_vs_Quality}
# scatter plot of density versus quality
ggplot(aes(x = quality, y = density), data = red) +
geom_jitter(alpha = 1/2)
```
There seems to be no correlation between density and quality. However, because the distribution looks symmetric, I applied some math to the density and plotted it again.
```{r echo=FALSE, Bivariate_Plots_abs_log_density_vs_Quality}
# scatter plot of abs(log(density)) versus quality
ggplot(aes(x = quality, y = abs(log(density))), data = red) +
geom_jitter(alpha = 1/2)
```
After taking the log of the density and then its absolute value, we can see that higher quality wines seem to have a few high/low density outliers, while low quality wines have none. However, most wines have a density between 0.995 and 1.0 $g/cm^3$. This feature seems to contribute only slightly to quality.
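To see what this transformation does: for values close to 1, $\log(d) \approx d - 1$, so `abs(log(density))` is essentially the distance of the density from 1 $g/cm^3$, folding values below and above 1 onto the same side. A small sketch with illustrative densities (the vector `d` is made up, not taken from the data):

```r
# Densities in the range seen in the histogram (illustrative values)
d <- c(0.990, 0.995, 1.000, 1.003)

abs(log(d))   # transformation used in the plot
abs(d - 1)    # distance from 1 g/cm^3

# Largest gap between the two; below 1e-4 for these values
max(abs(abs(log(d)) - abs(d - 1)))
```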
```{r echo=FALSE, Bivariate_Plots_Density_Quality_Boxplot}
# Boxplot of Quality and Density
ggplot(aes(y = density, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq')
```
```{r echo=FALSE, Bivariate_Plots_Density_Quality_Summary}
# Density distributions by quality levels
by(red$density, red$quality, summary)
```
The box plot shows that there is not much difference in density between wines, which is what we noticed before in the histogram. However, the scatter plot and statistics show a slight difference for wines of quality 7 and 8: higher quality wines tend to have slightly lower density on average.
## Alcohol
Another variable that caught my attention, only because of its name, is alcohol. Wine is an alcoholic drink, so we should check the relation between alcohol and quality.
```{r echo=FALSE, Bivariate_Plots_Alcohol_vs_Quality}
# scatter plot of alcohol versus quality
ggplot(aes(x = quality, y = alcohol), data = red) +
geom_jitter(alpha = 1/2)
```
From the plot it seems that the higher the alcohol level, the higher the quality of a wine tends to be.
```{r echo=FALSE, Bivariate_Plots_Alcohol_Quality_Boxplot}
# Boxplot of Quality and Alcohol
ggplot(aes(y = alcohol, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq')
```
```{r echo=FALSE, Bivariate_Plots_Alcohol_Quality_Summary}
# Alcohol distributions by quality levels
by(red$alcohol, red$quality, summary)
```
The boxplot and statistics confirm that from level 5 upward, the average amount of alcohol increases as quality increases, while below level 5 it fluctuates.
## Volatile Acidity
Boxplots turned out to be easier to read because **quality** is a categorical variable, so we will continue the analysis with boxplots. The next variable is volatile acidity.
```{r echo=FALSE, Bivariate_Plots_VolatileAcidity_Quality_Boxplot}
# Boxplot of Quality and Volatile Acidity
ggplot(aes(y = volatile.acidity, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq')
```
Let's do some math to complement the boxplots.
```{r echo=FALSE, Bivariate_Plots_VolatileAcidity_Quality_Summary}
# Volatile acidity distributions by quality levels
by(red$volatile.acidity, red$quality, summary)
```
Interestingly enough, the median volatile acidity steadily decreases as the wine **quality** grade increases.
At this point the **alcohol** and **volatile acidity** seem to be related to wine quality.
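A compact way to check such a trend is to compute one median per quality level and test whether the sequence is monotone. The sketch below uses made-up values; `acidity`, `quality` and `med` are hypothetical stand-ins for the wine columns:

```r
# Made-up values standing in for red$volatile.acidity and red$quality
acidity <- c(0.90, 0.80, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35)
quality <- c(3,    3,    5,    5,    6,    6,    8,    8)

# One median per quality level, returned as a named vector
med <- tapply(acidity, quality, median)
med

# TRUE when every step up in quality lowers the median
all(diff(as.vector(med)) < 0)
```

Unlike `by(..., summary)`, `tapply` returns a single named vector, which is easier to compare across levels.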
## Sulphates
Another variable to analyze is sulphates.
```{r echo=FALSE, Bivariate_Plots_Sulphates_Quality_Boxplot}
# Boxplot of Quality and Sulphates
ggplot(aes(y = sulphates, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq')
```
```{r echo=FALSE, Bivariate_Plots_Sulphates_Quality_Summary}
# Sulphates distributions by quality levels
by(red$sulphates, red$quality, summary)
```
The sulphates also seem to be correlated with **quality**, as their mean and median increase with the quality of a red wine. However, the median and mean values of the higher qualities also fall between the third quartile and the maximum of lower quality wines. This suggests that there must be another variable that helps define quality. In fact, we have **alcohol** and **volatile acidity** as possible variables to help define quality, and now we add **sulphates** to the list.
## Citric Acid
Since alcohol and volatile acidity showed a relation to wine quality, it will be interesting to check citric acid, which is correlated to volatile acidity according to the correlation matrix.
```{r echo=FALSE, Bivariate_Plots_CitricAcid_Quality_Boxplot}
# Boxplot of Quality and Citric Acid
ggplot(aes(y = citric.acid, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq')
```
```{r echo=FALSE, Bivariate_Plots_CitricAcid_Quality_Summary}
# CitricAcid distributions by quality levels
by(red$citric.acid, red$quality, summary)
```
We noticed that the average citric acid increases with quality, the opposite of volatile acidity. In other words, the more citric acid, the higher the quality of a wine, which is consistent with the negative correlation between citric acid and volatile acidity.
## Total Sulfur Dioxide
```{r echo=FALSE, Bivariate_Plots_TotalSulfurDioxide_Quality_Boxplot}
# Boxplot of Quality and Total Sulfur Dioxide
ggplot(aes(y = total.sulfur.dioxide, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq')
```
```{r echo=FALSE, Bivariate_Plots_TotalSulfurDioxide_Quality_Summary}
# Total Sulfur Dioxide distributions by quality levels
by(red$total.sulfur.dioxide, red$quality, summary)
```
The total sulfur dioxide does not have a clear correlation with the quality of a wine, since its average values peak at quality level 5 and decrease as quality moves away from level 5.
## Residual Sugar
Finally, we can analyze the residual sugar and how it is related to quality. There are some wines that are considered sweeter than others. Could this also be related to quality?
```{r echo=FALSE, Bivariate_Plots_ResidualSugar_Quality_Boxplot}
# Boxplot of Quality and Residual Sugar
ggplot(aes(y = residual.sugar, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq')
```
```{r echo=FALSE, Bivariate_Plots_ResidualSugar_Quality_Summary}
# Residual Sugar distributions by quality levels
by(red$residual.sugar, red$quality, summary)
```
Apparently the residual sugar of a wine does not define its quality, since its average values oscillate from quality level 3 to 8. Maybe that explains why sweeter wines can also have high quality.
Once we have analyzed our desired variables, we can make some conclusions out of the plots.
***
# Bivariate Analysis
From the variables we chose, we found that 4 or 5 of them seem to contribute to the quality of a red wine. I would rank them in the following order (1 being the most noticeable contribution):
1. Volatile acidity
2. Sulphates
3. Citric acid
4. Alcohol
5. Density
### How did the feature(s) of interest vary with other features in the dataset?
In this bivariate analysis we computed the correlation matrix of our variables, leaving quality out of the matrix as it is a categorical variable.
Based on the correlation matrix, we reduced the number of promising features to explore.
> <font size="2">Note: pH was correlated to almost all our variables.</font>
Something we mentioned before is that the features of interest had a right-skewed or otherwise irregular distribution. This was also seen in the boxplots of each feature vs quality. Moreover, with the correlation matrix we narrowed the features of interest even further.
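Quality was left out of the Pearson matrix, but because its grades are ordered, a rank-based coefficient is one possible way to relate it to a numeric feature. This is a sketch of that alternative on made-up values, not something the original analysis did; `alcohol`, `quality` and `rho` here are hypothetical stand-ins:

```r
# Made-up values standing in for red$alcohol and red$quality
alcohol <- c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8)
quality <- c(5,   5,   5,    6,    6,    7,    8)

# Spearman's rho works on ranks, so it only assumes quality is ordered
rho <- cor(alcohol, quality, method = "spearman")
rho
```

A value near 1 or -1 would indicate a monotone relationship, in line with the trends read from the boxplots.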
### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
We had decided to review the density since we thought it might help define the highest and lowest quality wines. It was in fact confirmed that density slightly contributes to quality, since the higher the quality of a wine, the slightly lower its density. This was first shown with a scatter plot and then better appreciated with a box plot (which in practice works better in these cases, as I am comparing a categorical variable with a numerical one).
### What was the strongest relationship you found?
In order to determine the strongest relationship, we used box plots for all the features of interest selected after the correlation matrix. The analysis was as follows:
From the more promising features chosen, we started with alcohol vs quality. Since wine is alcoholic, this might affect wine taste in a huge proportion. We found that indeed the higher the level of alcohol the higher the quality of the red wine.
Then, we checked volatile acidity and found that it steadily decreases as the **quality** grade of a wine gets higher.
From there, we did a boxplot of sulphates and they showed to have a trend related to **quality**. The sulphates get higher for higher quality wines.
Citric acid was our next variable to analyze, mainly because it had a completely different distribution compared to other features. Something interesting about citric acid is that it was correlated to volatile acidity. Since both turned out to be related, we were not surprised to find that, for wines of quality 5 to 8, the more citric acid, the higher the quality.
Finally, the total sulfur dioxide and the residual sugar didn't show a clear correlation to the quality and no trends were noticed.
If we sort these features of interest, together with our additional variable density, we end up with the following list, ordered from strongest to weakest relationship with quality:
1. Volatile acidity - Negative. Lower volatile acidity, higher quality
2. Sulphates - Positive. Higher sulphates, higher quality
3. Citric acid - Positive. Higher citric acid, higher quality. [Negative correlation with Volatile acidity]
4. Alcohol - Positive. Higher alcohol level, higher quality
5. Density - Negative. Lower density, higher quality. [Negative correlation with Alcohol]
However, two of these variables, citric acid and density, are correlated to volatile acidity and alcohol, respectively. This might mean we need to drop these two variables. In any case, they were kept only for curiosity and exploration.
***
# Multivariate Plots Section
We start by plotting the variables that showed a trend in relation to quality.
```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_Sulphates_Quality}
# Plot of volatile acidity and sulphates
ggplot(aes(x = volatile.acidity, y = sulphates , color = quality_cat, fill = quality_cat), data = red) +
geom_hex(alpha=0.7, size = 2) +
scale_fill_brewer( type = 'div') +
scale_colour_brewer( type = 'seq') +
theme(panel.background = element_rect(fill='darkgray'))
```
Let's divide this plot into one plot per quality category.
```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_Sulphates_Quality_FacetWrap}
# Plot of volatile acidity and sulphates with facet wrap of quality
ggplot(aes(x = volatile.acidity, y = sulphates , color = quality_cat, fill = quality_cat), data = red) +
geom_hex(alpha=0.9, size = 2) +
scale_fill_brewer( type = 'div') +
scale_colour_brewer( type = 'seq') +
theme(panel.background = element_rect(fill='darkgray')) +
facet_wrap(~ quality_cat)
```
We can see from the plots that there are some areas where the quality levels cluster. Let's check the counts of wines per quality for volatile acidity and sulphates.
We start with volatile acidity.
```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_Quality_Table}
# Quality per volatile acidity levels
by(red$quality_cat, round(red$volatile.acidity,1), table)
```
Then, we do the sulphates counts:
```{r echo=FALSE, Multivariate_Plots_Sulphates_Quality_Table}
# Quality per sulphates levels
by(red$quality_cat, round(red$sulphates,1), table)
```
Surprisingly, sulphates and volatile acidity form clusters of the red wine quality levels. The sulphates range [0.6, 0.9] and the volatile acidity range [0.3, 0.6] contain the highest number of high quality wines. This might be very insightful for determining the quality of a red wine, and these two variables might be the ones we are most interested in.
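The counts above can also be arranged as a single cross-table, which makes it easier to pick out the bin with the most high quality wines. This is a minimal sketch on made-up data: `toy` is a hypothetical stand-in for `red`, and treating levels 7 and 8 as "high quality" is our assumption.

```r
# Hypothetical toy data; `quality_cat` mirrors the quality factor used in this report.
toy <- data.frame(
  sulphates   = c(0.56, 0.68, 0.64, 0.58, 0.74, 0.86, 0.77, 0.82, 0.52, 0.83),
  quality_cat = factor(c(5, 5, 5, 6, 7, 8, 7, 7, 5, 8), levels = 3:8, ordered = TRUE)
)

# One row per rounded sulphates value, one column per quality level
counts <- table(sulphates = round(toy$sulphates, 1), quality = toy$quality_cat)
print(counts)

# Number of high quality wines (levels 7 and 8, an assumed cut-off) per sulphates bin
high <- rowSums(counts[, c("7", "8"), drop = FALSE])

# Sulphates bin that holds the most high quality wines
best_bin <- names(high)[which.max(high)]
print(best_bin)
```

The same call works for volatile acidity by swapping the column that is rounded.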
Let's analyze these two variables (volatile acidity and sulphates) and compare each of them against citric acid, which showed a trend from quality level 5 upward: the higher the citric acid, the higher the quality.
We will start with the sulphates and citric acid plot:
```{r echo=FALSE, Multivariate_Plots_Sulphates_CitricAcid_Quality}
# plot of variables sulphates and citric acid
ggplot(aes(x = sulphates, y = citric.acid,
color = quality_cat, fill = quality_cat), data = red) +
geom_hex(size=2) +
scale_fill_brewer(palette = 'Greens') +
scale_color_brewer(type='div') +
theme(panel.background = element_rect(fill='gray'))
```
Surprisingly, citric acid and sulphates do not cluster the quality levels like volatile acidity and sulphates did, despite citric acid being negatively correlated with volatile acidity.
Let's continue the analysis by comparing volatile acidity and citric acid.
```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_CitricAcid_Quality}
# plot of variables volatile acidity and citric acid
ggplot(aes(x = volatile.acidity, y = citric.acid, color = quality_cat, fill = quality_cat), data = red) +
geom_hex(size=2) +
scale_fill_brewer(palette = 'Greens') +
scale_color_brewer(type='div') +
theme(panel.background = element_rect(fill='gray'))
```
```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_CitricAcid_Correlation}
# correlation of volatile acidity and citric acid
cor(red$volatile.acidity, red$citric.acid, method = "spearman")
```
After comparing volatile acidity and citric acid, we can see a slight negative correlation between these two variables. However, the quality 8 wines are very sparse. The quality 7 wines seem to be clustered in two blobs, but this is pretty similar to what we saw in the citric acid and sulphates plot. Thus, citric acid might have a small influence, but it is not a major factor. Also, volatile acidity is slightly correlated with citric acid, so if we were to build a model we would rather take volatile acidity and sulphates for now.
Now that we have checked the three variables that showed the clearest trends related to quality, we can move on to our last two variables: alcohol and density.
Let's move forward in the analysis using the alcohol variable and compare it against our two main variables.
As a reminder, alcohol showed a trend in which the higher the alcohol, the higher the quality.
Let's start by comparing alcohol and volatile acidity.
```{r echo=FALSE, Multivariate_Plots_Alcohol_VolatileAcidity_Quality}
# plot of alcohol and volatile acidity
ggplot(aes(x = alcohol, y = volatile.acidity, color = quality_cat, fill = quality_cat), data = red) +
geom_hex(size=2) +
scale_fill_brewer(type='seq') +
scale_color_brewer(type = 'seq') +
theme(panel.background = element_rect(fill='darkgray')) +
facet_wrap(~quality_cat)
```
```{r echo=FALSE, Multivariate_Plots_Alcohol_VolatileAcidity_Quality_Table}
# Quality per alcohol levels
by(red$quality_cat, round(red$alcohol), table)
```
In the plot, it seems that quality tends to be good when both volatile acidity and alcohol are low. However, as alcohol gets higher, volatile acidity can go higher and we still get some good quality wines.
The alcohol range in which we get higher qualities goes from 10 to 14 % volume, while the volatile acidity range, as we saw in previous plots, goes from 0.3 to 0.6 $g/dm^3$.
Now that we have compared alcohol to volatile acidity, let's compare it to sulphates.
```{r echo=FALSE, Multivariate_Plots_Alcohol_Sulphates_Quality}
# Plot of alcohol and sulphates colored by quality
ggplot(aes(x = alcohol, y = sulphates, color = quality_cat, fill = quality_cat), data = red) +
geom_hex(alpha=0.85, size = 8) +
scale_fill_brewer( palette = 'Purples') +
scale_colour_brewer(palette = 'Purples') +
theme(panel.background = element_rect(fill='darkgray'))
```
This plot shows that the higher quality wines fall in a specific range of sulphates and alcohol values. We can count wines per quality level according to alcohol and sulphates to find the ranges where the best quality wines live.
Starting with the alcohol and quality counts:
```{r echo=FALSE, Multivariate_Plots_Alcohol_Quality_Table}
# Quality per Alcohol levels
by(red$quality_cat, round(red$alcohol), table)
```
Then, we move to sulphates and quality counts:
```{r echo=FALSE, Multivariate_Plots_Sulphates_Quality_Table2}
# Quality per Sulphates levels
by(red$quality_cat, round(red$sulphates,1), table)
```
From the plot and the counts, we can see that good wines are more common above 10 % volume of alcohol (except at 15 % vol), and that the higher the alcohol level, the higher the chance of a grade 8 wine. Also, grade 8 wines seem to appear mainly when sulphates are in the range of 0.6 to 0.9 $g/dm^3$.
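As a complement to reading the ranges off the count tables, they can be computed directly. This is a different technique from the binned counts used above, shown here only as a minimal sketch: `toy` is made-up data, and the cut-off of quality 7 or above for "good" wines is our assumption.

```r
# Hypothetical toy data standing in for `red`; values are made up for illustration.
toy <- data.frame(
  alcohol   = c(9.4, 9.8, 10.5, 11.2, 12.8, 12.5, 13.0, 9.5),
  sulphates = c(0.56, 0.68, 0.82, 0.74, 0.86, 0.77, 0.69, 0.52),
  quality   = c(5, 5, 7, 7, 8, 7, 8, 5)
)

# Keep only the wines at or above the assumed "good" cut-off
high <- toy[toy$quality >= 7, c("alcohol", "sulphates")]

# Minimum and maximum of each variable among those wines
high_ranges <- sapply(high, range)
rownames(high_ranges) <- c("min", "max")
print(high_ranges)
```

Note that `range()` is sensitive to outliers, so on the real data a pair of quantiles (for example via `quantile()`) may describe the typical range better.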
Having compared alcohol with sulphates and not found a very strong pattern, we can plot alcohol versus citric acid.
```{r echo=FALSE, Multivariate_Plots_Alcohol_CitricAcid_Quality}
# plot of variables alcohol and citric acid
ggplot(aes(x = alcohol, y = citric.acid, color = quality_cat, fill = quality_cat), data = red) +
geom_hex(size=2) +
scale_fill_brewer(palette = 'Greens') +
scale_color_brewer(type='div') +
theme(panel.background = element_rect(fill='gray'))
```
The higher quality wines are very sparse, so these two variables together might not be a good indicator of quality.
Finally, let's explore the alcohol and density since they show a correlation in our correlation matrix.
```{r echo=FALSE, Multivariate_Plots_Alcohol_Density_Quality}
# plot of variables alcohol and density
ggplot(aes(x = alcohol, y = density, color = quality_cat, fill = quality_cat), data = red) +
geom_hex(alpha=0.7, size=2) +
scale_fill_brewer(palette ='Purples') +
scale_color_brewer(palette ='Purples') +
theme(panel.background = element_rect(fill='darkgray'))
```
Although alcohol and density are slightly correlated, together they do not explain the high quality wines.
Before drawing conclusions from the multivariate analysis, it might be interesting to see whether volatile acidity per % volume of alcohol actually helps alcohol play a part in this analysis.
```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_per_Alcohol_vs_Sulphates_Quality}
# Plot of volatile acidity per % alcohol and sulphates
ggplot(aes(x = volatile.acidity/alcohol, y = sulphates , color = quality_cat, fill = quality_cat), data = red) +
geom_hex(alpha=0.7, size = 2) +
scale_fill_brewer( type = 'div') +
scale_colour_brewer( type = 'seq') +
theme(panel.background = element_rect(fill='darkgray'))
```
Surprisingly, dividing the volatile acidity by the % volume of alcohol brought the higher quality wines closer together, so they cluster more easily than with volatile acidity alone.
We can compare this plot against the first plot of this multivariate analysis using some numbers.
```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_per_Alcohol_Quality_Table}
# Quality per volatile acidity by alcohol levels
by(red$quality_cat, round(red$volatile.acidity/red$alcohol,2), table)
```
From this table and the table for the first plot of the section, we can see that dividing the volatile acidity by alcohol definitely helped bring the high quality wines together. In the first plot of volatile acidity versus sulphates, the maximum number of quality 8 wines in a single bin was 8, and it occurred when volatile acidity was 0.4 $g/dm^3$. On the other hand, when we plotted volatile acidity per alcohol versus sulphates, the maximum number of quality 8 wines was 12, at 0.03 $g/dm^3$ of volatile acidity per % vol of alcohol. This definitely shows a better cluster for wine quality.
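The comparison of the two peaks is simple arithmetic, sketched below. The helper names `peak_count()` and `relative_increase()` are hypothetical, the short vectors `va` and `quality` are made up, and only the counts 8 and 12 come from the tables above.

```r
# Largest number of wines of a given quality level found in any single bin of x
peak_count <- function(x, quality, level, digits) {
  bins <- round(x[quality == level], digits)
  max(table(bins))
}

# Relative change between two peak counts, in percent
relative_increase <- function(before, after) {
  100 * (after - before) / before
}

# Peak counts of quality 8 wines reported in the text: 8 (raw) and 12 (per alcohol)
relative_increase(8, 12)  # 50, i.e. the peak is 50% larger

# Made-up example of finding a peak count from raw values
va      <- c(0.41, 0.38, 0.44, 0.52, 0.36, 0.62)
quality <- c(8, 8, 8, 8, 7, 5)
peak_count(va, quality, level = 8, digits = 1)
```

On the real data the two peaks would come from `peak_count(red$volatile.acidity, ...)` with one decimal and from the ratio `red$volatile.acidity / red$alcohol` with two decimals, matching the rounding used in the tables.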
If we want to go a bit further we can try to measure correlation for `cor( volatile acidity, sulphates)` and `cor( volatile acidity per % volume of alcohol, sulphates)`.
> <font size="2">Note: We only need numbers for the volatile acidity divided by alcohol since sulphates stayed the same.</font>
```{r echo=FALSE, Multivariate_Plots_VolatileAcidity_per_Alcohol_Sulphates_Corr}
# Spearman correlation with sulphates: raw volatile acidity vs. volatile acidity per alcohol
cor(red$volatile.acidity, red$sulphates,method = "spearman")
cor(red$volatile.acidity/red$alcohol, red$sulphates, method= "spearman")
```
We can see that the first correlation coefficient is closer to 0, which means volatile acidity and sulphates are less correlated than volatile acidity per alcohol and sulphates.
They are still not strongly correlated, but this is more of a classification problem, so multinomial logistic regression would be a better candidate for building a model in the future.
***
# Multivariate Analysis
### Were there features that strengthened each other in terms of looking at your feature(s) of interest?
It was surprising to find that the sulphates range of 0.6 to 0.9 $g/dm^3$ and the volatile acidity range of 0.3 to 0.6 $g/dm^3$ contain the highest number of high quality wines. It was also interesting to find that these two chemical properties form clusters of the quality levels. This makes sense since quality is a category and we are facing a classification problem.
### Were there any interesting or surprising interactions between features?
Finding that volatile acidity and sulphates are very promising for determining the quality of a wine was surprising. It was even more interesting to find that dividing the volatile acidity by the alcohol and plotting that against the sulphates gave an even better relation between variables for finding the chemicals that may be driving the quality of a wine.
In fact, the largest group of quality 8 wines was 50% greater (12 wines instead of 8) when volatile acidity was divided by alcohol than when it was left alone.
This really exemplified the importance of combining features.
***
# Final Plots and Summary
### Plot One
```{r echo=FALSE, Plot_One}
# bar plot of quality
ggplot(aes(x = quality_cat, fill = quality_cat ), data = red) +
geom_bar(stat = "count") +
scale_fill_brewer(type='seq') +
ggtitle("Quality of Wines") +
labs(x = "Quality Level", y = "Number of Wines",
caption = '\nNote: Wine quality goes from 0 (lowest) to 10 (highest)') +
theme(panel.background = element_rect(fill='black'),
panel.grid = element_blank(),
legend.position = "none",
plot.title = element_text(face = "bold", hjust = 0.5,size = 18),
axis.title.x = element_text(color = 'black'),
axis.title.y = element_text(color = 'black'),
plot.caption = element_text(hjust = 1, size = 9, color = 'darkgray'))
```
### Description One
This was the first plot we made, and it helped us understand how many wines we have per quality category in our dataset. Clearly, we have more quality 5 and 6 wines than any other category. Knowing that, it is not surprising that the chemical components do not show normal distributions. In other words, we will have many skewed distributions, and these might be the ones to look at to find promising variables that explain the quality of a red wine.
### Plot Two
```{r echo=FALSE, Plot_Two}
# Boxplot of Quality and Volatile Acidity
p1 <- ggplot(aes(y = volatile.acidity, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq') +
labs(title='Volatile Acidity by Quality',
x = '[Low] Quality [High]',
y = 'Volatile Acidity [g/dm^3]') +
theme(legend.position = "none",
plot.title = element_text(face = "bold", hjust = 0.5,size = 15),
axis.title.x = element_text(color = 'black', size = 10),
axis.title.y = element_text(color = 'black', size = 10))
# Boxplot of Quality and Sulphates
p2 <- ggplot(aes(y = sulphates, x = quality_cat ), data = red) +
geom_boxplot(aes(fill = quality_cat)) +
scale_fill_brewer(type = 'seq') +
labs(title='Sulphates by Quality',
x = '[Low] Quality [High]',
y = 'Sulphates [g/dm^3]') +
theme(legend.position = "none",
plot.title = element_text(face = "bold", hjust = 0.5,size = 15),
axis.title.x = element_text(color = 'black', size = 10),
axis.title.y = element_text(color = 'black', size = 10))
# Putting together the plots in a single image
grid.arrange(p1,p2,ncol=2)
```
### Description Two
Once we analyzed the variable distributions by quality, more details emerged about which chemical properties are related to quality levels. The box plots allowed us to discover two important chemical properties: volatile acidity and sulphates.
The box plots show that volatile acidity ($g/dm^3$) decreases as wine quality increases, while sulphates ($g/dm^3$) increase in higher quality wines.
Finding these two variables helped us understand that some chemical properties can explain the quality of a red wine.
### Plot Three
```{r echo=FALSE, Plot_Three}
# Plot of volatile acidity per alcohol and sulphates
ggplot(aes(x = volatile.acidity/alcohol, y = sulphates , color = quality_cat,
fill = quality_cat), data = red) +
geom_hex(alpha=0.8, size = 2) +
coord_cartesian(ylim=c(0.3, 1.4), xlim=c(0.008,0.13)) +
scale_fill_brewer( type = 'div') +
scale_colour_brewer( type = 'seq') +
labs(title = 'Quality of Wines Clustered by \nSulphates and Volatile Acidity per Alcohol % Vol', x = 'Volatile Acidity per % Alcohol [g/dm^3]' , y = 'Sulphates [g/dm^3]',
fill = 'Wine Quality', colour = 'Wine Quality',
caption = '\nNote: Wine quality goes from 0 (lowest) to 10 (highest)') +
theme(panel.background = element_rect(fill='darkgray'),
plot.title = element_text(hjust = 0.5, face = 'bold'),
legend.title = element_text(hjust = 0.5, size = 9, face = 'bold'),
legend.justification = "right",
plot.caption = element_text(hjust = 1, size = 9, color = 'darkgray'))
```
### Description Three
This plot shows how the different quality levels of red wine overlap and, at the same time, how they form clusters. This is very important when identifying the quality of a specific wine, since this is a classification problem.
The fact that we see clusters using volatile acidity per alcohol and sulphates means that these chemical properties are good indicators of a red wine's quality.
We can also see some outliers in the data, but mainly we see groups of our quality levels. One can also see the quality 5 and 6 clusters spread all over the plot, which confirms what we learned from our dataset at the very beginning.
All in all, this plot can be very useful to see the different quality levels and the ranges of volatile acidity per % alcohol and sulphates (in $g/dm^3$) in which one can find higher quality wines.
***
# Reflection
In this project, I learned a lot about red wine quality and the chemical properties that are measured to give a wine a quality grade.
The hardest part of the analysis was getting to understand the data at first and finding something useful to start with, and then continuing to unveil more and more with each new analysis.
While plotting univariate distributions, it was very hard to determine whether a variable might lead to an important finding; relying on intuition was both key and challenging. Later on, during the bivariate analysis, it was a bit easier to decide which variables to choose based on the box plots and the trends seen with quality, but the possible plots to explore and relations to find were too many to consider them all. At that point I decided to explore correlated variables and to start simple, so that my decisions rested on simple relations rather than very complicated ones. However, I felt I was approaching a dead end whenever a chemical property showed no relation to quality. It was challenging that things that made sense to me, such as alcohol being related to wine quality, did not really hold up when exploring the data.

However, I was successful in finding volatile acidity and sulphates as important variables that seemed to follow a trend related to quality levels. From there things became a bit easier, and I knew exactly what I wanted to do at the very beginning of the multivariate analysis. In this final analysis, I started by plotting volatile acidity versus sulphates and coloring them by quality to see if there were any visible clusters of wine quality. In fact, there were clusters, a bit spread out but clear enough to see. I then hit a dead end in which no other chemical components formed a cluster or explained quality. That was when I decided that my first plot of volatile acidity and sulphates was the best I had gotten so far.

A day later, I still could not believe that alcohol % volume was not related to quality, since wine is alcohol. That led me to divide volatile acidity by alcohol, and what I found once I plotted those variables blew my mind: the clusters I had found in the plot of volatile acidity versus sulphates were now tighter.
This finding was a reward for the hard effort to figure out whether other variables that logically seemed connected to the quality of a wine, especially alcohol, were actually connected.
All in all, I had a great experience analyzing the data, struggling, and figuring out a new way forward when things didn't look good on the path I was taking. Other times, I just had to explore until I found something useful to move further with. This project definitely taught me how to approach an EDA to find something interesting, and also to trust intuition once in a while.
# Future Work
As for future work with these findings, the best next step would be to create a multi-class (multinomial) logistic regression model. I would also be tempted to get more data to have a dataset that is more balanced across all categories and that also includes quality levels 0, 1, 2, 9 and 10. This would definitely benefit the model we build, and we might even have to do another quick exploration to determine whether our current chemical properties still explain red wine quality.
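As a rough starting point for that next step, here is a minimal sketch of a multi-class logistic regression using `multinom()` from the `nnet` package. It is not a model of the wine data: the data frame `sim`, the invented relationship behind `score`, the three-level `quality_cat`, and the derived column name `va_per_alcohol` are all assumptions made only to keep the example self-contained.

```r
library(nnet)  # recommended package that ships with standard R installations

set.seed(42)
n <- 300

# Simulated stand-in for `red`; the relationships below are invented for illustration
sim <- data.frame(
  volatile.acidity = runif(n, 0.2, 1.0),
  sulphates        = runif(n, 0.4, 1.2),
  alcohol          = runif(n, 9, 14)
)

# Derived feature explored in this report: volatile acidity per % alcohol
sim$va_per_alcohol <- sim$volatile.acidity / sim$alcohol

# Invented latent score, cut into three equally sized quality classes
score <- -40 * sim$va_per_alcohol + 3 * sim$sulphates + rnorm(n, sd = 0.5)
sim$quality_cat <- cut(score,
                       breaks = quantile(score, c(0, 1/3, 2/3, 1)),
                       labels = c("low", "medium", "high"),
                       include.lowest = TRUE)

# Multinomial logistic regression on the two features of interest
fit <- multinom(quality_cat ~ va_per_alcohol + sulphates, data = sim, trace = FALSE)

# In-sample predictions, accuracy and confusion table
pred      <- predict(fit, newdata = sim, type = "class")
accuracy  <- mean(pred == sim$quality_cat)
confusion <- table(predicted = pred, actual = sim$quality_cat)
print(confusion)
```

On the real dataset the evaluation should use held-out wines rather than in-sample predictions, and the class imbalance toward quality 5 and 6 would need to be taken into account when reading the accuracy.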
# Limitations
The limitations of this analysis are related to class imbalance and a possible bias toward the quality 5 and 6 red wines. Also, we explored only one of the relations that seemed strong enough to explain quality; since this is an open-ended problem, someone else could find different relations that come close to explaining the quality of a red wine. Another limitation is that the dataset contains 12 variables related to a wine, but there might be more, such as how long the wine was left to mature, the type of grapes used, the region where the wine comes from, etc. More variables can contribute to determining quality, and we might be looking at only a small set of them. All in all, be aware that this analysis helps us understand quality from the current dataset, but not for every type of wine you can find out there in the world.
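The imbalance mentioned above can be quantified with a frequency table. This is a minimal sketch on made-up labels: the vector `quality_cat` here is hypothetical, and in this report the real labels live in `red$quality_cat`.

```r
# Hypothetical quality labels; values are made up for illustration
quality_cat <- factor(c(5, 6, 5, 6, 7, 5, 6, 5, 4, 8, 6, 5),
                      levels = 3:8, ordered = TRUE)

# Number of wines per quality level
counts <- table(quality_cat)

# Share of each quality level, in percent
shares <- round(100 * prop.table(counts), 1)
print(shares)

# Most frequent quality level(s)
majority <- names(counts)[counts == max(counts)]

# Fraction of all wines that are quality 5 or 6
share_5_6 <- sum(counts[c("5", "6")]) / sum(counts)
print(share_5_6)
```

A large value of `share_5_6` on the real data would be a warning that a classifier could look accurate simply by predicting the two middle categories.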
## Resources
No special resources were used while creating this project other than the [documentation of R ggplot](https://ggplot2.tidyverse.org/index.html).