<!DOCTYPE html>
<html>
<head>
<title>Course Project for Practical Machine Learning</title>
<style type="text/css">
body, td {
  font-family: sans-serif;
  background-color: white;
  font-size: 13px;
}
</style>
</head>
<body>
<h1>Course Project: Practical Machine Learning | Coursera</h1>
<h2>Author: Nabeel Mukhtar &lt;[email protected]&gt;</h2>
<!--begin.rcode, message=FALSE
library(knitr)
opts_chunk$set(fig.width=8, fig.height=8)
end.rcode-->
<p>This is the course project for <a href="https://class.coursera.org/predmachlearn-006">Coursera Practical Machine Learning</a>.
</p>
<p><h3>Objective</h3>
The baseline performance for this HAR dataset is 99% accuracy (see References). However, for the purposes of this assignment, we will target an out-of-sample accuracy of 95% on the testing set.
</p>
<p><h3>Initialization</h3>
Here we initialize the random number generator with a seed and load the training data. Note that the training data contains #DIV/0! values, which need to be parsed as NA.</p>
<!--begin.rcode initialize, message=FALSE
library(caret)
## set the seed for reproducibility
set.seed(32343)
## treat empty strings, "NA" and Excel-style "#DIV/0!" as missing values
wle_data <- read.csv("data/pml-training.csv", na.strings = c("", "NA", "#DIV/0!"))
end.rcode-->
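<p>
As a quick sanity check on the NA parsing, one can count the missing cells after loading. This is a minimal sketch (not evaluated when the document is knit), using only the data frame loaded above.
</p>
<!--begin.rcode nacheck, eval=FALSE
## total number of cells parsed as NA (covers "", "NA" and "#DIV/0!")
sum(is.na(wle_data))
## how many columns are complete vs. contain at least one NA
table(colSums(is.na(wle_data)) == 0)
end.rcode-->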
<p>
<h3>Features and Training Parameters</h3>
We decided to use all the remaining predictors for training, as PCA did not yield any improvement. Here we select every predictor that contains no NAs and is not a text/date column (such as user_name or raw_timestamp_part_1). There was no need to remove zero-variance predictors because none remained among the leftover variables (see the verification sketch after the next chunk).<br/>
We also initialize the training control parameters with 5-fold repeated cross-validation, shared by all classifiers.
</p>
<!--begin.rcode cleanup, message=FALSE
## keep only the columns that contain no missing values
predictors <- colnames(wle_data)
predictors <- predictors[colSums(is.na(wle_data)) == 0]
## drop the first seven bookkeeping columns (row id, user_name, timestamps, windows)
predictors <- predictors[-(1:7)]
# nsv <- nearZeroVar(wle_data[, predictors])
# predictors <- predictors[-nsv]
classes <- unique(wle_data$classe)
class_colors <- 1 + as.integer(classes)
fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 1,
                           verboseIter = FALSE)
end.rcode-->
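<p>
The claim above can be verified directly: caret's nearZeroVar() returns the indices of near-zero-variance columns, so an empty result confirms that none remain. A minimal sketch, not evaluated when knitting:
</p>
<!--begin.rcode nzvcheck, eval=FALSE
## verification sketch: integer(0) means no near-zero-variance
## predictors remain among the selected columns
nearZeroVar(wle_data[, predictors])
end.rcode-->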
<p>
<h3>Data Partitioning</h3>
Here we split the data into training (49%), testing (21%) and validation (30%) datasets. The validation dataset is used for the ensemble classifier at the end.<br/>
Finally we remove unused variables and call gc() to reclaim memory for later analysis.
</p>
<!--begin.rcode splitting, message=FALSE
## 70% of the data for model building, 30% held out for validating the ensemble
inBuild <- createDataPartition(y = wle_data$classe,
                               p = 0.7, list = FALSE)
validation <- wle_data[-inBuild, predictors]
buildData <- wle_data[inBuild, predictors]
## split the building set again: 70% training, 30% testing
inTrain <- createDataPartition(y = buildData$classe,
                               p = 0.7, list = FALSE)
training <- buildData[inTrain, ]
testing <- buildData[-inTrain, ]
## drop the large intermediate objects and reclaim memory
rm(buildData, wle_data, inBuild, inTrain)
clean <- gc(FALSE)
rm(clean)
end.rcode-->
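<p>
As a sanity check on the partitioning arithmetic (0.7 &times; 0.7 = 49% training, 0.7 &times; 0.3 = 21% testing, 30% validation), the sketch below (not evaluated when knitting) reports the actual proportions.
</p>
<!--begin.rcode splitcheck, eval=FALSE
## proportions of the three partitions, as percentages of the full data
n <- nrow(training) + nrow(testing) + nrow(validation)
round(100 * c(training = nrow(training),
              testing = nrow(testing),
              validation = nrow(validation)) / n)
end.rcode-->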
<p>
<h3>First Attempt: Decision Tree</h3>
The first classifier we tried was a decision tree, because the resulting model is highly interpretable and also gives insight into which predictors matter most, which is useful for further feature selection.<br/>
Unfortunately the accuracy of the tree on the testing set was not very good (56%). We tried different configurations, but they did not help much. Here are the tree and the results.
</p>
<!--begin.rcode tree, message=FALSE, fig.align='center', results='asis'
modeltree <- train(classe ~ ., data = training, method = "rpart", trControl = fitControl)
library(rattle)
fancyRpartPlot(modeltree$finalModel)
predicttree <- predict(modeltree, newdata=testing)
cmtree <- confusionMatrix(predicttree, testing$classe)
plot(cmtree$table, col = class_colors, main = paste("Decision Tree Confusion Matrix: Accuracy=", round(cmtree$overall['Accuracy'], 2)))
kable(cmtree$byClass, digits = 2, caption = "Per Class Metrics")
end.rcode-->
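<p>
Since interpretability was the main motivation for the tree, its variable importance ranking is worth inspecting; such a ranking can inform later feature selection. A minimal sketch using caret's varImp() on the fitted model, not evaluated when knitting:
</p>
<!--begin.rcode treeimp, eval=FALSE
## rank the predictors the tree relied on most
varImp(modeltree)
end.rcode-->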
<p>
<h3>Second Attempt: Linear Discriminant Analysis</h3>
The second classifier we tried was LDA, which was also taught in the course. We used the default parameters with 5-fold cross-validation. The accuracy improved substantially to 71%, but it was still not very good. Here are the results.
</p>
<!--begin.rcode lda, message=FALSE, fig.align='center', results='asis'
modellda <- train(classe ~ ., data = training, method = "lda", trControl = fitControl)
predictlda <- predict(modellda, newdata=testing)
cmlda <- confusionMatrix(predictlda, testing$classe)
plot(cmlda$table, col = class_colors, main = paste("LDA Confusion Matrix: Accuracy=", round(cmlda$overall['Accuracy'], 2)))
kable(cmlda$byClass, digits = 2, caption = "Per Class Metrics")
end.rcode-->
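<p>
To see where the improvement over the tree comes from, the per-class sensitivities of the two models can be placed side by side. A minimal sketch, not evaluated when knitting:
</p>
<!--begin.rcode ldacompare, eval=FALSE
## per-class sensitivity of the decision tree vs. LDA
cbind(tree = cmtree$byClass[, "Sensitivity"],
      lda  = cmlda$byClass[, "Sensitivity"])
end.rcode-->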
<p>
<h3>Third Attempt: Generalized Boosted Regression Modeling</h3>
Finally we tried the GBM classifier. We ran it with repeated cross-validation, and its accuracy on the testing set was much better (96%), even though it took much longer to train.
</p>
<!--begin.rcode gbm, message=FALSE, fig.align='center', results='asis'
modelgbm <- train(classe ~ ., data = training, method = "gbm", trControl = fitControl, verbose = FALSE)
predictgbm <- predict(modelgbm, newdata=testing)
cmgbm <- confusionMatrix(predictgbm, testing$classe)
plot(cmgbm$table, col = class_colors, main = paste("GBM Confusion Matrix: Accuracy=", round(cmgbm$overall['Accuracy'], 2)))
kable(cmgbm$byClass, digits = 2, caption = "Per Class Metrics")
end.rcode-->
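<p>
For reference, the boosting parameters selected by cross-validation can be inspected on the fitted caret object. A minimal sketch, not evaluated when knitting:
</p>
<!--begin.rcode gbmtune, eval=FALSE
## tuning parameters chosen by repeated cross-validation
## (number of trees, interaction depth, shrinkage, min. node size)
modelgbm$bestTune
end.rcode-->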
<p>
<h3>Final Attempt: Ensemble Classifier</h3>
Next we built an ensemble of the three classifiers above by stacking their predictions with a random forest, to see if this would further improve the accuracy. Unfortunately the accuracy was still 96%, with only a slight improvement for class E.
</p>
<!--begin.rcode ensemble, message=FALSE, fig.align='center', results='asis'
predicttesting <- data.frame(predicttree, predictgbm, predictlda, classe = testing$classe)
modelensemble <- train(classe ~ ., data = predicttesting, method = "rf")
predictvalidation <- data.frame(predicttree = predict(modeltree, newdata=validation),
predictgbm = predict(modelgbm, newdata=validation),
predictlda = predict(modellda, newdata=validation),
classe = validation$classe)
predictensemble <- predict(modelensemble, predictvalidation)
cmensemble <- confusionMatrix(predictensemble, validation$classe)
plot(cmensemble$table, col = class_colors, main = paste("Ensemble Confusion Matrix: Accuracy=", round(cmensemble$overall['Accuracy'], 2)))
kable(cmensemble$byClass, digits = 2, caption = "Per Class Metrics")
end.rcode-->
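<p>
One way to understand why stacking adds so little is to look at how heavily the stacked random forest leans on each base classifier; if the GBM predictions dominate, the ensemble has little room to improve. A minimal sketch, not evaluated when knitting:
</p>
<!--begin.rcode ensembleimp, eval=FALSE
## importance of each base classifier's predictions in the stacked model
varImp(modelensemble)
end.rcode-->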
<p>
<h3>Even Better Attempt: Random Forest</h3>
We had tried a random forest initially but could not get it to finish in a reasonable amount of time. While evaluating other assignments we realized that a random forest would be an even more accurate classifier, and would finish in reasonable time, if we performed proper feature selection and limited the number of trees to 100. Here is another attempt, which achieved the best accuracy of 98%.
</p>
<!--begin.rcode rf, message=FALSE, fig.align='center', results='asis'
modelrf <- train(classe ~ roll_belt + pitch_forearm + magnet_dumbbell_z + yaw_belt + magnet_dumbbell_y + roll_forearm + pitch_belt, data=training, method="rf", ntree = 100)
predictrf <- predict(modelrf, newdata=testing)
cmrf <- confusionMatrix(predictrf, testing$classe)
plot(cmrf$table, col = class_colors, main = paste("Random Forest Confusion Matrix: Accuracy=", round(cmrf$overall['Accuracy'], 2)))
kable(cmrf$byClass, digits = 2, caption = "Per Class Metrics")
end.rcode-->
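<p>
The seven predictors above came from manual feature selection. One way to arrive at such a short list is to rank variables by importance in an already-fitted model and keep the top few; a minimal sketch (not evaluated when knitting), here using the GBM model fitted earlier:
</p>
<!--begin.rcode rfimp, eval=FALSE
## rank all predictors by their importance in the fitted GBM model
imp <- varImp(modelgbm)$importance
head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 10)
end.rcode-->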
<p>
<h3>Conclusion</h3>
The ensemble classifier could not improve on the individual classifiers, probably because the accuracies of LDA and the decision tree were much lower than that of GBM. So in the end we decided to go with the GBM classifier, which had the best accuracy (96%) at submission time. Alas, we could not evaluate the random forest before the assignment deadline.
</p>
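<p>
For completeness, the expected out-of-sample error of a model is simply one minus its held-out accuracy. A minimal sketch, not evaluated when knitting:
</p>
<!--begin.rcode ooserror, eval=FALSE
## estimated out-of-sample error, from the held-out data
1 - cmgbm$overall["Accuracy"]   # GBM (chosen model), testing set
1 - cmrf$overall["Accuracy"]    # random forest, for comparison
end.rcode-->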
<p>
<h3>References</h3>
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.<br/>
Read more: <a href="http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises">http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises</a>
</p>
</body>
</html>