0
00:00:00,590 --> 00:00:02,600
Hello and welcome!
1
00:00:02,600 --> 00:00:06,220
In this video, we’ll be covering the k-nearest neighbors algorithm.
2
00:00:06,220 --> 00:00:09,210
So let’s get started.
3
00:00:09,210 --> 00:00:14,530
Imagine that a telecommunications provider has segmented its customer base by service
4
00:00:14,530 --> 00:00:20,140
usage patterns, categorizing the customers into four groups.
5
00:00:20,140 --> 00:00:25,470
If demographic data can be used to predict group membership, the company can customize
6
00:00:25,470 --> 00:00:29,010
offers for individual prospective customers.
7
00:00:29,010 --> 00:00:32,160
This is a classification problem.
8
00:00:32,160 --> 00:00:38,140
That is, given a dataset with predefined labels, we need to build a model to be used
9
00:00:38,140 --> 00:00:42,950
to predict the class of a new or unknown case.
10
00:00:42,950 --> 00:00:49,510
The example focuses on using demographic data, such as region, age, and marital status, to
11
00:00:49,510 --> 00:00:52,020
predict usage patterns.
12
00:00:52,020 --> 00:00:57,879
The target field, called custcat, has four possible values that correspond to the four
13
00:00:57,879 --> 00:01:06,670
customer groups, as follows: Basic Service, E-Service, Plus Service, and Total Service.
14
00:01:06,670 --> 00:01:13,479
Our objective is to build a classifier, for example using rows 0 to 7, to predict
15
00:01:13,479 --> 00:01:15,319
the class of row 8.
16
00:01:15,319 --> 00:01:21,889
We will use a specific type of classification called K-nearest neighbor.
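A rough sketch of this setup in Python follows; the file name and column names are hypothetical stand-ins for the demographic data described above, not the video's actual dataset:

```python
import pandas as pd

# Hypothetical file and column names, assumed purely for illustration.
df = pd.read_csv("teleCust.csv")

X = df[["region", "age", "marital", "income"]].values  # demographic predictors
y = df["custcat"].values  # 1=Basic, 2=E-Service, 3=Plus, 4=Total Service
```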
17
00:01:21,889 --> 00:01:27,700
Just for the sake of demonstration, let’s use only two fields as predictors - specifically,
18
00:01:27,700 --> 00:01:32,640
Age and Income, and then plot the customers based on their group membership.
19
00:01:32,640 --> 00:01:39,319
Now, let’s say that we have a new customer, for example, record number 8 with a known
20
00:01:39,319 --> 00:01:41,430
age and income.
21
00:01:41,430 --> 00:01:45,240
How can we find the class of this customer?
22
00:01:45,240 --> 00:01:51,969
Can we find one of the closest cases and assign the same class label to our new customer?
23
00:01:51,969 --> 00:01:58,770
Can we also say that the class of our new customer is most probably group 4 (i.e., Total
24
00:01:58,770 --> 00:02:02,719
Service) because its nearest neighbor is also of class 4?
25
00:02:02,719 --> 00:02:04,250
Yes, we can.
26
00:02:04,250 --> 00:02:07,969
In fact, it is the first-nearest neighbor.
27
00:02:07,969 --> 00:02:14,349
Now, the question is, “To what extent can we trust our judgment, which is based on the
28
00:02:14,349 --> 00:02:16,140
first nearest neighbor?”
29
00:02:16,140 --> 00:02:21,549
It might be a poor judgment, especially if the first nearest neighbor is a very specific
30
00:02:21,549 --> 00:02:24,459
case, or an outlier, correct?
31
00:02:24,459 --> 00:02:28,060
Now, let’s look at our scatter plot again.
32
00:02:28,060 --> 00:02:34,200
Rather than choose the first nearest neighbor, what if we chose the five nearest neighbors,
33
00:02:34,200 --> 00:02:39,269
and did a majority vote among them to define the class of our new customer?
34
00:02:39,269 --> 00:02:44,980
In this case, we’d see that three out of five nearest neighbors tell us to go for class
35
00:02:44,980 --> 00:02:47,400
3, which is “Plus Service.”
36
00:02:47,400 --> 00:02:49,930
Doesn’t this make more sense?
37
00:02:49,930 --> 00:02:52,790
Yes, in fact, it does!
38
00:02:52,790 --> 00:02:58,439
In this case, the value of K in the k-nearest neighbors algorithm is 5.
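A minimal sketch of that five-neighbor majority vote, with hypothetical neighbor labels matching the scenario above:

```python
from collections import Counter

# Hypothetical classes of the five nearest neighbors: three vote for
# class 3 ("Plus Service"), so the majority vote picks class 3.
neighbor_labels = [3, 3, 2, 3, 4]

predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # 3
```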
39
00:02:58,439 --> 00:03:04,110
This example highlights the intuition behind the k-nearest neighbors algorithm.
40
00:03:04,110 --> 00:03:08,870
Now, let’s define the k-nearest neighbors algorithm.
41
00:03:08,870 --> 00:03:15,359
The k-nearest-neighbors algorithm is a classification algorithm that takes a bunch of labelled points
42
00:03:15,359 --> 00:03:19,469
and uses them to learn how to label other points.
43
00:03:19,469 --> 00:03:24,420
This algorithm classifies cases based on their similarity to other cases.
44
00:03:24,420 --> 00:03:31,249
In k-nearest neighbors, data points that are near each other are said to be “neighbors.”
45
00:03:31,249 --> 00:03:37,879
K-nearest neighbors is based on this paradigm: “Similar cases with the same class labels
46
00:03:37,879 --> 00:03:39,560
are near each other.”
47
00:03:39,560 --> 00:03:45,730
Thus, the distance between two cases is a measure of their dissimilarity.
48
00:03:45,730 --> 00:03:50,969
There are different ways to calculate the similarity, or conversely, the distance or
49
00:03:50,969 --> 00:03:53,700
dissimilarity of two data points.
50
00:03:53,700 --> 00:03:57,709
For example, this can be done using Euclidean distance.
51
00:03:57,709 --> 00:04:02,420
Now, let’s see how the k-nearest neighbors algorithm actually works.
52
00:04:02,420 --> 00:04:08,930
In a classification problem, the k-nearest neighbors algorithm works as follows:
53
00:04:09,930 --> 00:04:13,870
1. Pick a value for K.
54
00:04:13,870 --> 00:04:20,769
2. Calculate the distance from the new case (the holdout) to each of the cases in the dataset.
55
00:04:22,139 --> 00:04:27,530
3. Search for the K observations in the training data that are ‘nearest’ to the measurements
56
00:04:27,530 --> 00:04:30,020
of the unknown data point.
57
00:04:31,300 --> 00:04:36,639
4. Predict the response of the unknown data point using the most popular response value from
58
00:04:36,639 --> 00:04:39,410
the K nearest neighbors.
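A minimal from-scratch sketch of these four steps, assuming numeric NumPy feature arrays and Euclidean distance (discussed next); step 1 corresponds to choosing the k argument:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: distance from the new case to each case in the dataset.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: the k observations nearest to the unknown data point.
    nearest = np.argsort(distances)[:k]
    # Step 4: the most popular response value among those k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```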
59
00:04:39,410 --> 00:04:43,830
There are two parts in this algorithm that might be a bit confusing.
60
00:04:43,830 --> 00:04:51,390
First, how to select the correct K; and second, how to compute the similarity between cases,
61
00:04:51,390 --> 00:04:53,780
for example, among customers?
62
00:04:53,780 --> 00:05:00,470
Let’s start with the second concern, that is, how can we calculate the similarity between
63
00:05:00,470 --> 00:05:03,419
two data points?
64
00:05:03,419 --> 00:05:07,650
Assume that we have two customers, customer 1 and customer 2.
65
00:05:07,650 --> 00:05:13,900
And, for a moment, assume that these 2 customers have only one feature, Age.
66
00:05:13,900 --> 00:05:19,970
We can easily use a specific type of Minkowski distance to calculate the distance between these
67
00:05:19,970 --> 00:05:21,400
2 customers.
68
00:05:21,400 --> 00:05:25,080
It is, indeed, the Euclidean distance.
69
00:05:25,080 --> 00:05:33,690
The distance of x1 from x2 is the square root of (34 minus 30) to the power of 2, which is 4.
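In code, that single-feature calculation is simply:

```python
import math

# One feature, Age: customer 1 is 34, customer 2 is 30.
distance = math.sqrt((34 - 30) ** 2)
print(distance)  # 4.0
```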
70
00:05:33,690 --> 00:05:40,009
What if we have more than one feature, for example, Age and Income?
71
00:05:40,009 --> 00:05:46,410
If we have income and age for each customer, we can still use the same formula, but this
72
00:05:46,410 --> 00:05:50,669
time, we’re using it in a 2-dimensional space.
73
00:05:50,669 --> 00:05:56,169
We can also use the same distance metric for multi-dimensional vectors.
74
00:05:56,169 --> 00:06:02,389
Of course, we have to normalize our feature set to get the accurate dissimilarity measure.
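A short sketch of that normalization step, with hypothetical age and income values:

```python
import numpy as np

# Hypothetical customers; columns are [age, income].
X = np.array([[34.0, 19000.0],
              [30.0, 20000.0],
              [55.0, 57000.0],
              [41.0, 33000.0]])

# Normalize each feature to zero mean and unit variance so that income,
# with its much larger scale, does not dominate the distance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance between the first two customers in the scaled space.
d = np.sqrt(((X_scaled[0] - X_scaled[1]) ** 2).sum())
```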
75
00:06:02,389 --> 00:06:07,780
There are other dissimilarity measures as well that can be used for this purpose but,
76
00:06:07,780 --> 00:06:13,699
as mentioned, it is highly dependent on the data type and also on the domain in which
77
00:06:13,699 --> 00:06:16,129
the classification is being done.
78
00:06:16,129 --> 00:06:22,250
As mentioned, K in k-nearest neighbors is the number of nearest neighbors to examine.
79
00:06:22,250 --> 00:06:25,909
It is supposed to be specified by the user.
80
00:06:25,909 --> 00:06:28,720
So, how do we choose the right K?
81
00:06:28,720 --> 00:06:35,990
Assume that we want to find the class of the customer marked with a question mark on the chart.
82
00:06:35,990 --> 00:06:42,710
What happens if we choose a very low value of K, let’s say, k=1?
83
00:06:42,710 --> 00:06:47,140
The first nearest point would be Blue, which is class 1.
84
00:06:47,140 --> 00:06:52,789
This would be a bad prediction, since more of the points around it are Magenta, or class 4.
85
00:06:53,789 --> 00:06:59,879
In fact, since its nearest neighbor is Blue, we can say that we captured the noise in the
86
00:06:59,879 --> 00:07:04,889
data, or we chose one of the points that was an anomaly in the data.
87
00:07:04,889 --> 00:07:12,039
A low value of K causes a highly complex model as well, which might result in over-fitting
88
00:07:12,039 --> 00:07:13,750
of the model.
89
00:07:13,750 --> 00:07:20,880
It means the prediction process is not generalized enough to be used for out-of-sample cases.
90
00:07:20,880 --> 00:07:26,199
Out-of-sample data is data that is outside of the dataset used to train the model.
91
00:07:26,199 --> 00:07:31,350
In other words, it cannot be trusted to be used for prediction of unknown samples.
92
00:07:31,350 --> 00:07:37,800
It’s important to remember that over-fitting is bad, as we want a general model that works
93
00:07:37,800 --> 00:07:42,210
for any data, not just the data used for training.
94
00:07:42,210 --> 00:07:47,699
Now, on the opposite side of the spectrum, if we choose a very high value of K, such
95
00:07:47,699 --> 00:07:52,639
as K=20, then the model becomes overly generalized.
96
00:07:52,639 --> 00:07:57,150
So, how can we find the best value for K?
97
00:07:57,150 --> 00:08:01,830
The general solution is to reserve a part of your data for testing the accuracy of the
98
00:08:01,830 --> 00:08:02,830
model.
99
00:08:02,830 --> 00:08:09,969
Once you’ve done so, choose k=1, then use the training part for modeling, and calculate
100
00:08:09,969 --> 00:08:15,080
the accuracy of prediction using all samples in your test set.
101
00:08:15,080 --> 00:08:21,090
Repeat this process, increasing k, and see which value of k is best for your model.
102
00:08:21,090 --> 00:08:28,289
For example, in our case, k=4 will give us the best accuracy.
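One way to script this search, assuming scikit-learn and the feature matrix X and labels y from earlier (a sketch under those assumptions, not the video's exact code):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Reserve part of the data for testing, then try k = 1..10.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

accuracies = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, model.predict(X_test))

best_k = max(accuracies, key=accuracies.get)  # k with the highest test accuracy
```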
103
00:08:28,289 --> 00:08:34,170
Nearest neighbors analysis can also be used to compute values for a continuous target.
104
00:08:34,170 --> 00:08:40,190
In this situation, the average or median target value of the nearest neighbors is used to
105
00:08:40,190 --> 00:08:44,090
obtain the predicted value for the new case.
106
00:08:44,090 --> 00:08:49,480
For example, assume that you are predicting the price of a home based on its feature set,
107
00:08:49,480 --> 00:08:55,440
such as number of rooms, square footage, the year it was built, and so on.
108
00:08:55,440 --> 00:09:01,340
You can easily find the three nearest neighbor houses -- not only based on distance,
109
00:09:01,340 --> 00:09:06,580
but based on all the attributes -- and then predict the price of the house as the
110
00:09:06,580 --> 00:09:09,450
median of those neighbors.
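A minimal sketch of that regression variant, reusing the same distance logic with a median instead of a vote (feature normalization is assumed to have been done already):

```python
import numpy as np

def knn_regress(X_train, prices, x_new, k=3):
    # Distance from the new house to every house in the training data,
    # over all (normalized) attributes: rooms, square footage, year, etc.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Predicted price: the median price among the k nearest houses.
    return float(np.median(prices[nearest]))
```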
111
00:09:09,450 --> 00:09:12,540
This concludes this video. Thanks for watching!