en_More on K-Means.srt
0
00:00:00,590 --> 00:00:05,360
Hello, and welcome! In this video, we’ll look at k-Means accuracy
1
00:00:05,360 --> 00:00:08,140
and characteristics. Let’s get started.
2
00:00:08,140 --> 00:00:13,910
Let’s define the algorithm more concretely before we talk about its accuracy.
3
00:00:13,910 --> 00:00:20,840
The k-Means algorithm works by randomly placing k centroids, one for each cluster.
4
00:00:20,840 --> 00:00:24,130
The farther apart the centroids are placed, the better.
5
00:00:24,130 --> 00:00:30,369
The next step is to calculate the distance of each data point (or object) from the centroids.
6
00:00:30,369 --> 00:00:35,250
Euclidean distance is used to measure the distance from the object to the centroid.
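A minimal sketch of this distance calculation in Python; the function name and the example point and centroid below are illustrative, not from the video:

```python
import math

def euclidean_distance(point, centroid):
    # Square root of the sum of squared coordinate differences
    # between a data point and a centroid.
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))

# Example: a 2-D point at (1, 2) and a centroid at (4, 6) are 5.0 apart.
print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # 5.0
```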
7
00:00:35,250 --> 00:00:40,090
Please note, however, that you can also use different types of distance measurements,
8
00:00:40,090 --> 00:00:44,450
not just Euclidean distance. Euclidean distance is used because it’s
9
00:00:44,450 --> 00:00:49,520
the most popular. Then, assign each data point (or object) to
10
00:00:49,520 --> 00:00:56,070
its closest centroid, creating a group. Next, once each data point has been assigned
11
00:00:56,070 --> 00:01:02,550
to a group, recalculate the position of the k centroids. The new centroid position is
12
00:01:02,550 --> 00:01:06,539
determined by the mean of all points in the group.
13
00:01:06,539 --> 00:01:11,439
Finally, this continues until the centroids no longer move.
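Putting these steps together, here is a minimal NumPy sketch of the loop just described: place k centroids, assign each point to its closest centroid by Euclidean distance, recompute each centroid as the mean of its group, and stop when the centroids no longer move. The function name, the initialization by sampling data points, and the iteration cap are illustrative assumptions:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly place k centroids (here, by sampling k data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its closest centroid, forming groups.
        labels = distances.argmin(axis=1)
        # Step 4: recalculate each centroid as the mean of its group
        # (keeping the old centroid if a group ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```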
14
00:01:11,439 --> 00:01:16,700
Now the question is, "How can we evaluate the 'goodness' of the clusters formed
15
00:01:16,700 --> 00:01:20,789
by k-Means?" In other words, "How do we calculate the accuracy
16
00:01:20,789 --> 00:01:25,179
of k-Means clustering?" One way is to compare the clusters with the
17
00:01:25,179 --> 00:01:31,070
ground truth, if it’s available. However, because k-Means is an unsupervised
18
00:01:31,070 --> 00:01:36,420
algorithm, we usually don’t have ground truth available to use in real-world problems.
19
00:01:36,420 --> 00:01:42,579
But there is still a way to measure how bad each cluster is, based on the objective of
20
00:01:42,579 --> 00:01:46,739
k-Means. This value is the average distance between
21
00:01:46,739 --> 00:01:52,539
data points within a cluster. Also, the average of the distances of data points
22
00:01:52,539 --> 00:01:59,029
from their cluster centroids can be used as a metric of error for the clustering algorithm.
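As a sketch, this error metric (the mean distance of each data point from its assigned centroid) can be computed from the outputs of the loop above; the function name is illustrative:

```python
import numpy as np

def mean_distance_to_centroids(X, labels, centroids):
    # Average Euclidean distance from each data point to the centroid
    # of the cluster it was assigned to; lower means denser clusters.
    return np.mean(np.linalg.norm(X - centroids[labels], axis=1))
```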
23
00:01:59,029 --> 00:02:05,409
Essentially, determining the number of clusters in a data set, or k, as in the k-Means algorithm,
24
00:02:05,409 --> 00:02:12,000
is a frequent problem in data clustering. The correct choice of k is often ambiguous,
25
00:02:12,000 --> 00:02:16,290
because it’s very dependent on the shape and scale of the distribution of points in
26
00:02:16,290 --> 00:02:20,340
a data set. There are some approaches to address this
27
00:02:20,340 --> 00:02:25,150
problem, but one commonly used technique is to run the clustering across
28
00:02:25,150 --> 00:02:31,000
different values of k and look at a metric of accuracy for the clustering.
29
00:02:31,000 --> 00:02:36,620
This metric can be “mean distance between data points and their cluster centroid,” which
30
00:02:36,620 --> 00:02:43,709
indicates how dense our clusters are, or to what extent we have minimized the clustering error.
31
00:02:43,709 --> 00:02:49,790
Then, by looking at how this metric changes, we can find the best value for k.
32
00:02:49,790 --> 00:02:55,049
But the problem is that as the number of clusters increases, the distance from the centroids
33
00:02:55,049 --> 00:03:02,879
to the data points will always decrease. This means that increasing k will always decrease the “error.”
34
00:03:02,879 --> 00:03:10,329
So, the value of the metric as a function of k is plotted, and the "elbow point" is determined,
35
00:03:10,329 --> 00:03:16,120
where the rate of decrease sharply shifts. That is the right k for clustering.
36
00:03:16,120 --> 00:03:19,090
This method is called the “elbow” method.
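Here is a minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available; the random dataset and the range of k values are illustrative choices, not from the video:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)  # illustrative dataset

ks = range(1, 11)
errors = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of points to their
    # closest centroid; it always decreases as k grows.
    errors.append(model.inertia_)

# Plot error versus k and look for the "elbow," where the rate
# of decrease sharply shifts.
plt.plot(list(ks), errors, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("clustering error (inertia)")
plt.show()
```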
37
00:03:19,090 --> 00:03:26,400
So, let’s recap k-Means clustering: k-Means is a partition-based clustering algorithm, which is:
38
00:03:26,400 --> 00:03:30,900
a) Relatively efficient on medium- and large-sized datasets;
39
00:03:30,900 --> 00:03:37,300
b) Produces sphere-like clusters, because the clusters are shaped around the centroids;
40
00:03:37,300 --> 00:03:40,889
and c) Its drawback is that we must pre-specify
41
00:03:40,889 --> 00:03:44,870
the number of clusters, and this is not an easy task.
42
00:03:44,870 --> 00:03:46,270
Thanks for watching!