Unsupervised learning method grouping data points by proximity to cluster centroids.
Centroid-based clustering is a family of unsupervised machine learning methods that partition a dataset into a predefined number of groups, where each group is represented by a central point called a centroid, typically the mean of all data points assigned to that cluster. The canonical procedure, Lloyd's algorithm, works iteratively: data points are assigned to the nearest centroid under a distance metric (usually Euclidean distance), centroids are then recomputed as the average of their assigned points, and the two steps repeat until assignments stabilize or an iteration limit is reached. The most widely used algorithm in this family is k-means, though variants like k-medoids and k-medians offer greater robustness to outliers by using alternative centroid definitions.
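A minimal sketch of this assign-and-update loop in NumPy may make the mechanics concrete. The function name run_kmeans, the uniform random initialization, and the iteration cap are illustrative choices, not a reference implementation:

```python
import numpy as np

def run_kmeans(X, k, max_iters=100, seed=0):
    """Minimal Lloyd's algorithm sketch: assign each point to its nearest
    centroid, recompute centroids as cluster means, and repeat until
    assignments stop changing (or the iteration cap is hit)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Naive initialization: k distinct points chosen uniformly at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Assignment step: squared Euclidean distance to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stabilized: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

On well-separated data this typically converges in a handful of iterations; production implementations add multiple restarts and smarter seeding, as discussed below.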
The appeal of centroid-based clustering lies in its computational efficiency and interpretability. For large datasets, k-means scales well and produces clusters that are easy to describe and reason about. This has made it a go-to technique across domains including customer segmentation, image compression, document clustering, and anomaly detection. However, the method requires the number of clusters k to be specified in advance, which is not always known. Practitioners often use heuristics like the elbow method or silhouette scores to guide this choice.
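As a sketch of the silhouette heuristic, the snippet below fits k-means for each candidate k and keeps the one with the highest mean silhouette score. It uses scikit-learn's KMeans and silhouette_score; the helper name pick_k_by_silhouette and the candidate range are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, k_candidates=range(2, 11), seed=0):
    """Fit k-means for each candidate k and return the k whose labeling
    maximizes the mean silhouette score (higher = better-separated)."""
    scores = {}
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```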
A key limitation of centroid-based approaches is sensitivity to initialization: poor starting centroids can trap the algorithm in suboptimal local minima. The k-means++ initialization strategy addresses this by choosing each new centroid with probability proportional to its squared distance from the nearest already-chosen centroid, which markedly improves both convergence speed and final clustering quality. The method also struggles with non-convex cluster shapes, high-dimensional data, and clusters of very different sizes or densities, where density-based or hierarchical alternatives may perform better.
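A compact sketch of the k-means++ seeding step, again in NumPy, shows the squared-distance weighting; the function name kmeans_pp_init is hypothetical:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding sketch: the first centroid is drawn uniformly at
    random; each subsequent centroid is drawn with probability proportional
    to the squared distance to the nearest already-chosen centroid."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        chosen = np.array(centroids)
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Far-away points are proportionally more likely to be selected.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

The squared weighting spreads the initial centroids across the data, which is what gives k-means++ its improvement over uniform random seeding.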
Despite these limitations, centroid-based clustering remains one of the most practically important tools in the unsupervised learning toolkit. Its simplicity makes it an effective baseline and a useful preprocessing step for more complex pipelines, such as initializing neural network weights or reducing the output space for reinforcement learning agents. Ongoing research continues to extend the framework to handle streaming data, mixed data types, and distributed computing environments.