K-means clustering

The goal of the K-means clustering algorithm is to find K groups in the data such that the data points within each group are similar to one another. The algorithm works iteratively, assigning each data point to one of the K groups based on the features provided, so that data points are clustered by feature similarity.

The value of K is chosen before the algorithm starts (it is an input, not something the algorithm learns), and different values of K generally produce different clusterings. Once K has been selected and the algorithm is initiated, two major steps keep repeating until there is no further scope for changes in the clusters.

The two major steps that get repeated are Step 2 and Step 3:

Step 2: Assign each data point in the dataset to one of the K clusters. This is done by calculating the distance from the data point to each cluster centroid and choosing the nearest one. Any of the distance functions discussed earlier can be used for this calculation.
Step 3: Recalculate each centroid. This is done by taking the mean of all data points assigned to that centroid's cluster.
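
As a rough illustration of these two steps, here is a minimal sketch in plain Python. The function names assign_points and update_centroids, the sample points, and the starting centroids are all hypothetical, chosen only for this example. Euclidean distance is assumed, although any of the distance functions mentioned above could be substituted, and an empty cluster is assumed to keep its previous centroid.

```python
import math

def assign_points(points, centroids):
    """Step 2: return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Step 3: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        # Average the members coordinate by coordinate.
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

# Hypothetical data: two visibly separate groups and two rough starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign_points(points, centroids)                # [0, 0, 1, 1]
centroids = update_centroids(points, labels, centroids)  # [(1.0, 1.5), (8.5, 8.0)]
```

One pass already moves each centroid to the middle of its group. Repeating the two calls would leave the labels unchanged, which is the stopping condition described above.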

The final output of the algorithm is K clusters of similar data points. The complete procedure is as follows:

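Step 1: Select K seeds k_1, ..., k_K such that every pair of seeds is farther apart than a minimum distance: d(k_i, k_j) > d_min.
Step 2: Assign each point to the cluster whose seed (centroid) is at the minimum distance from it.
Step 3: Compute the new cluster centroids as the mean of the points assigned to each cluster.
Step 4: Reassign points to the clusters (as in Step 2).
Step 5: Iterate until no points change cluster.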
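
The whole procedure can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation. The names select_seeds and k_means and the max_iter safety cap are hypothetical. The seeds are picked greedily in input order, which is only one possible reading of the d(k_i, k_j) > d_min condition. Euclidean distance is assumed, and an empty cluster keeps its previous centroid.

```python
import math

def select_seeds(points, k, d_min):
    """Step 1: greedily pick k seeds that are pairwise more than d_min apart."""
    seeds = []
    for p in points:
        if all(math.dist(p, s) > d_min for s in seeds):
            seeds.append(p)
        if len(seeds) == k:
            return seeds
    raise ValueError("could not find k seeds separated by more than d_min")

def k_means(points, k, d_min, max_iter=100):
    centroids = select_seeds(points, k, d_min)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical data: two small groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
labels, centroids = k_means(points, k=2, d_min=3.0)
```

With this sample data the first three points end up in one cluster and the last three in the other. Changing k or d_min changes which seeds are selected and can therefore change the resulting clusters.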

Here are some areas where clustering algorithms are used:

City planning
Earthquake studies
Insurance
Marketing
Medicine, for the analysis of antimicrobial activity and medical imaging
Crime analysis
Robotics, for anomaly detection and natural language processing
