en_Evaluation Metrics in Classification.srt
0
00:00:00,440 --> 00:00:02,380
Hello, and welcome!
1
00:00:02,380 --> 00:00:07,140
In this video, we’ll be covering evaluation metrics for classifiers.
2
00:00:07,140 --> 00:00:09,749
So let’s get started.
3
00:00:09,749 --> 00:00:12,829
Evaluation metrics explain the performance of a model.
4
00:00:12,829 --> 00:00:18,490
Let’s talk more about the model evaluation metrics that are used for classification.
5
00:00:18,490 --> 00:00:24,680
Imagine that we have a historical dataset that shows customer churn for a telecommunications
6
00:00:24,680 --> 00:00:25,720
company.
7
00:00:25,720 --> 00:00:31,550
We have trained the model, and now we want to calculate its accuracy using the test set.
8
00:00:31,550 --> 00:00:36,890
We pass the test set to our model, and we find the predicted labels.
9
00:00:36,890 --> 00:00:40,579
Now the question is, “How accurate is this model?”
10
00:00:40,579 --> 00:00:46,030
Basically, we compare the actual values in the test set with the values predicted by
11
00:00:46,030 --> 00:00:50,769
the model, to calculate the accuracy of the model.
12
00:00:50,769 --> 00:00:56,030
Evaluation metrics play a key role in the development of a model, as they provide insight
13
00:00:56,030 --> 00:00:59,410
into areas that might require improvement.
14
00:00:59,410 --> 00:01:05,160
There are different model evaluation metrics, but we will discuss just three of them here,
15
00:01:05,160 --> 00:01:10,110
specifically: Jaccard index, F1-score, and Log Loss.
16
00:01:10,110 --> 00:01:16,220
Let’s first look at one of the simplest accuracy measurements, the Jaccard index -- also
17
00:01:16,220 --> 00:01:19,110
known as the Jaccard similarity coefficient.
18
00:01:19,110 --> 00:01:24,100
Let’s say y shows the true labels of the churn dataset.
19
00:01:24,100 --> 00:01:28,770
And ŷ shows the values predicted by our classifier.
20
00:01:28,770 --> 00:01:33,980
Then we can define Jaccard as the size of the intersection divided by the size of the
21
00:01:33,980 --> 00:01:37,390
union of two label sets.
22
00:01:37,390 --> 00:01:44,570
For example, for a test set of size 10, with 8 correct predictions, or 8 intersections,
23
00:01:44,570 --> 00:01:49,460
the accuracy by the Jaccard index would be 0.66.
24
00:01:49,460 --> 00:01:54,820
If the entire set of predicted labels for a sample strictly matches the true set
25
00:01:54,820 --> 00:02:02,840
of labels, then the subset accuracy is 1.0; otherwise it is 0.0.
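As a rough illustration, here is a minimal Python sketch of that calculation, assuming NumPy is available; the arrays are hypothetical and chosen to reproduce the 8-out-of-10 example above.

import numpy as np

# Hypothetical true and predicted labels for a test set of size 10,
# with 8 matching predictions (the "intersections" mentioned above).
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])

# Jaccard index: size of the intersection divided by the size of the union,
# i.e. matches / (|y| + |y_hat| - matches).
matches = int(np.sum(y == y_hat))                     # 8
jaccard = matches / (len(y) + len(y_hat) - matches)   # 8 / 12 = 0.666...
print(jaccard)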
26
00:02:02,840 --> 00:02:09,060
Another way of looking at the accuracy of classifiers is to look at a confusion matrix.
27
00:02:09,060 --> 00:02:14,190
For example, let’s assume that our test set has only 40 rows.
28
00:02:14,190 --> 00:02:19,950
This matrix shows the correct and wrong predictions, in comparison with the actual
29
00:02:19,950 --> 00:02:21,030
labels.
30
00:02:21,030 --> 00:02:27,670
Each confusion matrix row shows the Actual/True labels in the test set, and the columns show
31
00:02:27,670 --> 00:02:30,660
the labels predicted by the classifier.
32
00:02:30,660 --> 00:02:32,750
Look at the first row.
33
00:02:32,750 --> 00:02:38,890
The first row is for customers whose actual churn value in the test set is 1.
34
00:02:38,890 --> 00:02:45,520
As you can calculate, out of 40 customers, the churn value of 15 of them is 1.
35
00:02:45,520 --> 00:02:52,270
And out of these 15, the classifier correctly predicted 6 of them as 1, and incorrectly predicted 9 of them as 0.
36
00:02:53,600 --> 00:02:59,470
This means that for 6 customers, the actual churn value in the test set was 1, and the
37
00:02:59,470 --> 00:03:03,410
classifier also correctly predicted those as 1.
38
00:03:03,410 --> 00:03:10,770
However, while the actual label of 9 customers was 1, the classifier predicted those as 0,
39
00:03:10,770 --> 00:03:12,850
which is not very good.
40
00:03:12,850 --> 00:03:17,380
We can consider this as an error of the model for the first row.
41
00:03:17,380 --> 00:03:20,590
What about the customers with a churn value of 0?
42
00:03:20,590 --> 00:03:23,040
Let’s look at the second row.
43
00:03:23,040 --> 00:03:28,060
It looks like there were 25 customers whose churn value was 0.
44
00:03:28,060 --> 00:03:34,210
The classifier correctly predicted 24 of them as 0, and wrongly predicted one of them as 1.
45
00:03:35,210 --> 00:03:40,930
So, it has done a good job in predicting the customers with a churn value of 0.
46
00:03:40,930 --> 00:03:46,250
A good thing about the confusion matrix is that it shows the model’s ability to correctly
47
00:03:46,250 --> 00:03:49,360
predict or separate the classes.
48
00:03:49,360 --> 00:03:54,960
In the specific case of a binary classifier, such as this example, we can interpret these
49
00:03:54,960 --> 00:04:03,190
numbers as the count of true positives, false positives, true negatives, and false negatives.
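As a minimal sketch, such a matrix could be produced with scikit-learn's confusion_matrix; the labels below are hypothetical, constructed to reproduce the 40-customer example described above.

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for the 40-customer example:
# 15 actual churners (6 predicted as 1, 9 as 0) and 25 non-churners (24 predicted as 0, 1 as 1).
y_test = np.array([1] * 15 + [0] * 25)
yhat   = np.array([1] * 6 + [0] * 9 + [0] * 24 + [1] * 1)

# Rows are the actual labels, columns are the predicted labels;
# labels=[1, 0] puts churn=1 first, as in the matrix described above.
cm = confusion_matrix(y_test, yhat, labels=[1, 0])
print(cm)
# [[ 6  9]   <- actual churn=1: 6 predicted as 1, 9 predicted as 0
#  [ 1 24]]  <- actual churn=0: 1 predicted as 1, 24 predicted as 0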
50
00:04:03,190 --> 00:04:09,060
Based on the count of each section, we can calculate the precision and recall of each
51
00:04:09,060 --> 00:04:10,420
label.
52
00:04:10,420 --> 00:04:15,980
Precision is a measure of the accuracy, provided that a class label has been predicted.
53
00:04:15,980 --> 00:04:24,440
It is defined by: precision = True Positive / (True Positive + False Positive).
54
00:04:24,440 --> 00:04:27,810
And Recall is the true positive rate.
55
00:04:27,810 --> 00:04:35,850
It is defined as: Recall = True Positive / (True Positive + False Negative).
56
00:04:35,850 --> 00:04:41,800
So, we can calculate the precision and recall of each class.
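A small sketch of these per-class calculations with scikit-learn, reusing the illustrative y_test and yhat from the confusion-matrix sketch above.

from sklearn.metrics import precision_score, recall_score

# Precision and recall for each class, treating that class as the positive label.
for label in [1, 0]:
    p = precision_score(y_test, yhat, pos_label=label)
    r = recall_score(y_test, yhat, pos_label=label)
    print(label, round(p, 2), round(r, 2))

# With the counts above: class 1 -> precision 6/(6+1) ≈ 0.86, recall 6/(6+9) = 0.40
#                        class 0 -> precision 24/(24+9) ≈ 0.73, recall 24/(24+1) = 0.96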
57
00:04:41,800 --> 00:04:47,470
Now we’re in a position to calculate the F1 scores for each label, based on the precision
58
00:04:47,470 --> 00:04:50,420
and recall of that label.
59
00:04:50,420 --> 00:04:57,580
The F1 score is the harmonic average of the precision and recall, where an F1 score reaches
60
00:04:57,580 --> 00:05:05,180
its best value at 1 (which represents perfect precision and recall) and its worst at 0.
61
00:05:05,180 --> 00:05:12,030
It is a good way to show that a classifier has a good value for both recall and precision.
62
00:05:12,030 --> 00:05:16,530
It is defined using the F1-score equation.
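For reference, the standard form of that equation is: F1 = 2 × (precision × recall) / (precision + recall).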
63
00:05:16,530 --> 00:05:26,570
For example, the F1-score for class 0 (i.e., churn=0) is 0.83, and the F1-score for class 1
64
00:05:26,570 --> 00:05:31,780
(i.e., churn=1) is 0.55.
65
00:05:31,780 --> 00:05:38,290
And finally, we can say that the average accuracy for this classifier is the average of the
66
00:05:38,290 --> 00:05:42,970
F1-score for both labels, which is 0.72 in our case.
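A minimal scikit-learn sketch of these F1 calculations, again reusing the illustrative y_test and yhat from the confusion-matrix sketch above; note that it is the average weighted by class support that reproduces the 0.72 quoted here.

from sklearn.metrics import f1_score

f1_churn1 = f1_score(y_test, yhat, pos_label=1)          # ≈ 0.55
f1_churn0 = f1_score(y_test, yhat, pos_label=0)          # ≈ 0.83
f1_avg    = f1_score(y_test, yhat, average='weighted')   # ≈ 0.72 (weighted by class support)
print(f1_churn1, f1_churn0, f1_avg)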
67
00:05:42,970 --> 00:05:49,780
Please note that both Jaccard and F1-score can be used for multi-class classifiers as
68
00:05:49,780 --> 00:05:53,320
well, which is out of scope for this course.
69
00:05:53,320 --> 00:05:57,570
Now let's look at another accuracy metric for classifiers.
70
00:05:57,570 --> 00:06:03,389
Sometimes, the output of a classifier is the probability of a class label, instead of the
71
00:06:03,389 --> 00:06:04,389
label.
72
00:06:04,389 --> 00:06:10,169
For example, in logistic regression, the output can be the probability of customer churn,
73
00:06:10,169 --> 00:06:14,150
i.e., yes (churn = 1).
74
00:06:14,150 --> 00:06:18,550
This probability is a value between 0 and 1.
75
00:06:18,550 --> 00:06:24,440
Logarithmic loss (also known as Log loss) measures the performance of a classifier where
76
00:06:24,440 --> 00:06:29,180
the predicted output is a probability value between 0 and 1.
77
00:06:29,180 --> 00:06:36,110
So, for example, predicting a probability of 0.13 when the actual label is 1, would
78
00:06:36,110 --> 00:06:40,470
be bad and would result in a high log loss.
79
00:06:40,470 --> 00:06:46,040
We can calculate the log loss for each row using the log loss equation, which measures
80
00:06:46,040 --> 00:06:49,490
how far each prediction is from the actual label.
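For reference, the usual form of that equation for a single row is: LogLoss = −( y × log(ŷ) + (1 − y) × log(1 − ŷ) ), where y is the actual label and ŷ is the predicted probability.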
81
00:06:49,490 --> 00:06:55,479
Then, we calculate the average log loss across all rows of the test set.
82
00:06:55,479 --> 00:07:01,259
Naturally, better classifiers have progressively smaller values of log loss.
83
00:07:01,259 --> 00:07:06,600
So, the classifier with lower log loss has better accuracy.
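As a final sketch, scikit-learn's log_loss computes this average directly; yhat_prob is assumed to hold the predicted probability of churn = 1 for each test row, for example from a logistic regression model's predict_proba (the names and values below are illustrative).

import numpy as np
from sklearn.metrics import log_loss

# Hypothetical actual labels and predicted probabilities of churn = 1.
y_test    = np.array([1, 1, 0, 0, 0])
yhat_prob = np.array([0.91, 0.13, 0.22, 0.05, 0.40])

# Average log loss across the test rows; lower values indicate a better classifier.
print(log_loss(y_test, yhat_prob))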
84
00:07:06,600 --> 00:07:08,159
Thanks for watching!