Envisioning is an emerging technology research institute and advisory.


2011 — 2026


Centroid-Based Clustering

Unsupervised learning method grouping data points by proximity to cluster centroids.

Year: 1982
Generality: 694

Centroid-based clustering is a family of unsupervised machine learning methods that partition a dataset into a predefined number of groups, where each group is represented by a central point called a centroid — typically the mean of all data points assigned to that cluster. The algorithm works iteratively: data points are assigned to the nearest centroid based on a distance metric (usually Euclidean distance), centroids are then recomputed as the average of their assigned points, and this process repeats until assignments stabilize. The most widely used algorithm in this family is k-means, though variants like k-medoids and k-medians offer greater robustness to outliers by using alternative centroid definitions.
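The assign-and-update loop described above can be written out in a few lines of NumPy. This is an illustrative sketch, not a library API: the `kmeans` helper and the toy two-blob dataset below are our own assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign points to the nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid per point (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of assigned points (keep old centroid if cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated 2D blobs, around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

On data this cleanly separated, the loop converges in a handful of iterations, with one centroid settling near each blob's mean.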

The appeal of centroid-based clustering lies in its computational efficiency and interpretability. For large datasets, k-means scales well and produces clusters that are easy to describe and reason about. This has made it a go-to technique across domains including customer segmentation, image compression, document clustering, and anomaly detection. However, the method requires the number of clusters k to be specified in advance, which is not always known. Practitioners often use heuristics like the elbow method or silhouette scores to guide this choice.
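The elbow heuristic mentioned above can be sketched directly: run k-means for a range of k, record the inertia (within-cluster sum of squares), and look for the k where the decline flattens. The `kmeans_inertia` helper and the three-blob dataset below are our own illustrative assumptions; random restarts are used because single runs can land in poor local minima.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(0)
# Three synthetic blobs centered at 0, 4, and 8: the "true" k is 3
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

# Best inertia over 10 random restarts for each candidate k
inertias = {k: min(kmeans_inertia(X, k, seed=s) for s in range(10))
            for k in range(1, 7)}
# Inertia drops sharply up to k=3, then flattens: that bend is the "elbow"
```

Inertia always decreases as k grows, so the heuristic looks for diminishing returns rather than a minimum; silhouette scores offer a complementary criterion that does peak at a preferred k.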

A key limitation of centroid-based approaches is sensitivity to initialization: poor starting centroids can lead to suboptimal local minima. The k-means++ initialization strategy addresses this by selecting each new initial centroid with probability proportional to its squared distance from the nearest already-chosen centroid, which significantly improves the expected quality of the final clustering. The method also struggles with non-convex cluster shapes, high-dimensional data, and clusters of very different sizes or densities, where density-based or hierarchical alternatives may perform better.
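The k-means++ seeding step alone can be sketched as follows. The `kmeans_pp_init` helper is hypothetical, written for illustration; in practice, scikit-learn's `KMeans` applies this seeding by default via `init='k-means++'`.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest centroid so far."""
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()              # far-away points are strongly favored
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
# Three well-separated blobs: k-means++ tends to place one seed in each
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 5, 10)])
seeds = kmeans_pp_init(X, k=3, rng=rng)
```

Because a point already close to a chosen centroid has a near-zero sampling weight, the seeds spread across the data, and subsequent Lloyd iterations start near a good solution.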

Despite these limitations, centroid-based clustering remains one of the most practically important tools in the unsupervised learning toolkit. Its simplicity makes it an effective baseline and a useful preprocessing step for more complex pipelines, such as initializing neural network weights or reducing the output space for reinforcement learning agents. Ongoing research continues to extend the framework to handle streaming data, mixed data types, and distributed computing environments.

Related

Clustering
An unsupervised learning technique that groups similar data points together automatically.
Generality: 838

Agglomerative Clustering
A bottom-up hierarchical clustering method that progressively merges similar data points.
Generality: 694

Unsupervised Learning
Machine learning that discovers hidden patterns in data without labeled examples.
Generality: 850

Mixture Model
A probabilistic model representing data as drawn from multiple component distributions.
Generality: 796

GMM (Gaussian Mixture Models)
Probabilistic models representing data as a weighted mixture of Gaussian distributions.
Generality: 731

kNN (k-Nearest Neighbors)
A non-parametric algorithm that classifies points by majority vote of their nearest neighbors.
Generality: 796