# Assignment Three

## K-Means Clustering Algorithm

Implement iterative k-means using Python. Please use the data files attached to this problem:

1. `data.txt` contains the dataset, which has 4601 rows and 58 columns. Each row is a document represented as a 58-dimensional feature vector; each component in the vector represents the importance of a word in the document.
2. `cent1.txt` (referred to as c1.txt below) contains k initial cluster centroids, chosen by selecting k = 10 random points from the input data.
3. `cent2.txt` (referred to as c2.txt below) contains initial cluster centroids that are as far apart as possible: select the first centroid c1 randomly, then find the point c2 that is farthest from c1, then select c3 that is farthest from both c1 and c2, and so on. "Farthest" is measured with the similarity measure used in the algorithm.

Set the number of iterations (MAX_ITER) to 20 and the number of clusters k to 10 for all experiments. Your driver program should ensure that the correct number of iterations is run.

### Problem A

Use Euclidean distance as the distance measure for clustering, and use SSE to compute the cost of the clustering at every iteration t:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

This means that, for your first iteration, you compute the cost function using the initial centroids located in one of the two centroid files. Run k-means on data.txt using c1.txt and then using c2.txt. Plot the cost function SSE(t) as a function of the number of iterations t = 1, ..., 20 for c1.txt and for c2.txt; you may use a single plot or two separate plots, whichever best answers the theoretical questions below. A sketch of the iterative algorithm is given below.
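A minimal sketch of the Euclidean-distance loop, assuming the data and centroid files are whitespace-separated numeric text readable with `np.loadtxt` (the loading details are an assumption, not something the assignment specifies):

```python
import numpy as np

MAX_ITER = 20  # number of iterations required by the assignment

def kmeans_sse(data, centroids, max_iter=MAX_ITER):
    """Iterative k-means with Euclidean distance; returns the SSE of each iteration."""
    centroids = centroids.copy()
    sse_per_iter = []
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)  # assign each point to its nearest centroid
        # Cost of this iteration, computed with the centroids in force before the update.
        sse_per_iter.append(dists[np.arange(len(data)), labels].sum())
        # Update step: each centroid becomes the mean of its assigned points.
        for i in range(len(centroids)):
            members = data[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return sse_per_iter

data = np.loadtxt("data.txt")                        # 4601 x 58 document-feature matrix
sse_c1 = kmeans_sse(data, np.loadtxt("cent1.txt"))   # k = 10 random initial centroids
```

With this structure, SSE(1) is computed using the initial centroids, as the problem statement requires.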
Answer the following questions:

1. What is the percentage change in cost after 10 iterations of the k-means algorithm when the cluster centroids are initialized using c1.txt vs. c2.txt and the distance metric is Euclidean distance? (To be clear, the percentage refers to (cost[0] − cost[9]) / cost[9].)
2. Is random initialization of k-means using c1.txt better than initialization using c2.txt in terms of the cost function SSE(t)? Explain your reasoning.
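Assuming `sse_c1`, `kmeans_sse`, `data`, and `MAX_ITER` from the sketch above (with `sse_c2` obtained by re-running it on cent2.txt), the percentage change and the plot might be produced along these lines:

```python
import numpy as np
import matplotlib.pyplot as plt

sse_c2 = kmeans_sse(data, np.loadtxt("cent2.txt"))   # far-apart initial centroids

# Percentage change in cost after 10 iterations, per the hint above.
pct_c1 = (sse_c1[0] - sse_c1[9]) / sse_c1[9] * 100
pct_c2 = (sse_c2[0] - sse_c2[9]) / sse_c2[9] * 100
print(f"c1.txt: {pct_c1:.2f}%   c2.txt: {pct_c2:.2f}%")

iters = range(1, MAX_ITER + 1)
plt.plot(iters, sse_c1, marker="o", label="c1.txt (random init)")
plt.plot(iters, sse_c2, marker="o", label="c2.txt (far-apart init)")
plt.xlabel("iteration t")
plt.ylabel("SSE(t)")
plt.legend()
plt.savefig("sse_euclidean.png")
```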
### Problem B

Repeat the experiment from Problem A and answer requirements 1 and 2, but using cosine similarity as the similarity measure for clustering. Again use SSE to compute the cost of the clustering at every iteration t:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

Note: with cosine similarity, a greater value means the points are closer (i.e. more similar).

Is Euclidean-distance k-means better than cosine-similarity k-means in terms of the cost function? Explain your reasoning.
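Only the assignment step changes: each point joins the centroid with the highest cosine similarity. Since the assignment does not spell out the cost definition under cosine similarity, the sketch below assumes dist = 1 − cosine similarity and sums its squares; adjust if your course defines it differently.

```python
import numpy as np

def kmeans_cosine_sse(data, centroids, max_iter=20):
    """k-means variant using cosine similarity for the assignment step.

    Cost per iteration is taken as the sum of squared cosine distances
    (1 - similarity) -- an assumption, not the assignment's wording.
    """
    centroids = centroids.copy()
    sse_per_iter = []
    for _ in range(max_iter):
        # Cosine similarity of every point with every centroid.
        d_norm = np.linalg.norm(data, axis=1, keepdims=True)
        c_norm = np.linalg.norm(centroids, axis=1, keepdims=True)
        sims = (data @ centroids.T) / (d_norm * c_norm.T + 1e-12)
        labels = sims.argmax(axis=1)  # higher similarity = closer point
        cos_dist = 1.0 - sims[np.arange(len(data)), labels]
        sse_per_iter.append((cos_dist ** 2).sum())
        # Update step is unchanged: mean of the assigned points.
        for i in range(len(centroids)):
            members = data[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return sse_per_iter
```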
### Problem C

Use the Silhouette Coefficient to find the best number of clusters k for the dataset. Use cent2.txt for the initial centroids, i.e. select the first centroid c1 randomly, then find the point c2 that is farthest from c1, then select c3 that is farthest from both c1 and c2, and so on, where "farthest" is measured with the similarity measure used in the algorithm. Find the best number of clusters k in both cases (see the sketch after this list):

1. when Euclidean distance is used as the distance measure for clustering;
2. when cosine similarity is used as the similarity measure for clustering.
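A possible sketch for the Euclidean case, assuming scikit-learn is available; `run_kmeans` stands in for a hypothetical helper (e.g. the Problem A sketch modified to also return the final labels), and the candidate range of k is illustrative, not specified by the assignment:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def farthest_init(data, k, seed=0):
    """Far-apart initialization: start from a random point, then repeatedly
    add the point whose distance to its nearest chosen centroid is largest."""
    rng = np.random.default_rng(seed)
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(data - c, axis=1) for c in centroids], axis=0)
        centroids.append(data[dists.argmax()])
    return np.array(centroids)

data = np.loadtxt("data.txt")
best_k, best_score = None, -1.0
for k in range(2, 21):                                   # candidate values of k (illustrative range)
    labels = run_kmeans(data, farthest_init(data, k))    # hypothetical helper returning final labels
    score = silhouette_score(data, labels, metric="euclidean")
    if score > best_score:
        best_k, best_score = k, score
print("best k (Euclidean):", best_k, "silhouette:", best_score)
```

For the cosine case, the same loop applies with the farthest-point search and `silhouette_score` both using cosine distance (`metric="cosine"`).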