K-means clustering

The goal of the K-means clustering algorithm is to find K groups in the data such that the data points within each group are similar to one another. The algorithm works iteratively, assigning each data point to one of the K groups based on the features provided, so that points are clustered by feature similarity.

The value of K is chosen before the algorithm starts, and different values of K generally produce different clusterings. It is the initial cluster centroids (the seeds), not K itself, that are often picked at random. Once K has been selected and the algorithm is under way, two major steps keep repeating until the clusters no longer change.

The two major steps that get repeated are Step 2 and Step 3:

  • Step 2: Assign each data point in the dataset to one of the K clusters. This is done by calculating the distance from the data point to each cluster centroid and choosing the closest centroid. Any of the distance functions that we discussed already could be used for this calculation.
  • Step 3: Recalculate each centroid. This is done by taking the mean of all data points assigned to that centroid's cluster.
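
As a rough illustration of Step 2 and Step 3, the following sketch runs one assignment pass and one centroid update on four two-dimensional points. The function names `assign_points` and `update_centroids`, the sample data, and the use of Euclidean distance (`math.dist`) are assumptions made for this example; any other distance function could be substituted.

```python
import math


def assign_points(points, centroids):
    """Step 2: for each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Step 3: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            # Assumption: a cluster with no points keeps its previous centroid.
            new_centroids.append(old)
            continue
        # zip(*members) groups the coordinates by dimension.
        new_centroids.append(
            tuple(sum(axis) / len(members) for axis in zip(*members))
        )
    return new_centroids


# Hypothetical sample data: two points near the origin, two near (8, 8).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign_points(points, centroids)                 # [0, 0, 1, 1]
centroids = update_centroids(points, labels, centroids)   # [(1.0, 1.5), (8.5, 8.0)]
```

After a single pass, each centroid has moved from its starting position to the middle of the points assigned to it. Repeating the two calls until `labels` stops changing gives the full algorithm.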

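The final output of the algorithm is K clusters of similar data points. The complete sequence of steps is as follows:

  1. Select K seeds (initial centroids) such that the distance between any two seeds exceeds a minimum threshold: d(ki, kj) > dmin
  2. Assign each point to the cluster whose centroid is at minimum distance.
  3. Compute the new cluster centroids as the mean of the points assigned to each cluster.
  4. Reassign points to the clusters (as in Step 2).
  5. Iterate until no point changes its cluster.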
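
The whole procedure, including the seed condition d(ki, kj) > dmin and the stopping rule, could be sketched as below. This is a minimal sketch rather than a reference implementation: the names `select_seeds`, `k_means`, `d_min`, and `max_iter`, the strategy of drawing seeds from a shuffled copy of the data, the Euclidean distance, and the iteration cap are all assumptions made for illustration.

```python
import math
import random


def select_seeds(points, k, d_min, rng):
    """Seed selection: pick k data points whose pairwise distances exceed d_min."""
    candidates = list(points)
    rng.shuffle(candidates)
    seeds = []
    for p in candidates:
        if all(math.dist(p, s) > d_min for s in seeds):
            seeds.append(p)
        if len(seeds) == k:
            return seeds
    raise ValueError("could not find k seeds farther apart than d_min")


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def centroid_of(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def k_means(points, k, d_min, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = select_seeds(points, k, d_min, rng)        # seed selection
    labels = [nearest(p, centroids) for p in points]       # initial assignment
    for _ in range(max_iter):
        # Centroid update: mean of the points currently in each cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = centroid_of(members)
        # Reassignment, using the same rule as the initial assignment.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # stop when no point changes its cluster
            break
        labels = new_labels
    return labels, centroids


# Hypothetical sample data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = k_means(points, k=2, d_min=3.0)
```

With this data, any two points farther apart than `d_min = 3.0` belong to different groups, so the first three points end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the shuffle.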

Here are some areas where clustering algorithms are used:

  • City planning
  • Earthquake studies
  • Insurance
  • Marketing
  • Medicine, for the analysis of antimicrobial activity and medical imaging 
  • Crime analysis
  • Robotics, for anomaly detection and natural language processing