Envisioning is an emerging technology research institute and advisory.

Clustering

An unsupervised learning technique that groups similar data points together automatically.

Year: 1967
Generality: 838

Clustering is an unsupervised machine learning technique that organizes data points into groups, called clusters, such that members of a cluster are more similar to one another than to members of other clusters. Unlike supervised learning, clustering requires no labeled training data: the algorithm discovers structure already present in the dataset. This makes it especially valuable when the underlying categories or patterns are unknown in advance and the goal is exploratory analysis or data compression rather than prediction against a known target.

Several algorithmic families approach clustering differently. K-means partitions data into a fixed number of clusters by iteratively assigning points to the nearest centroid and updating centroids until convergence. It is simple and scalable, but it is sensitive to initialization and implicitly assumes roughly spherical clusters of similar size. Hierarchical clustering builds a tree of nested groupings (a dendrogram) by successively merging or splitting clusters, offering interpretability without requiring a predetermined cluster count. Density-based methods like DBSCAN identify clusters as dense regions separated by sparser areas, naturally handling irregular shapes and flagging outliers as noise. Other approaches include Gaussian Mixture Models, which treat each cluster as a probability distribution and assign points soft (probabilistic) memberships, and spectral clustering, which uses graph-theoretic techniques to capture complex geometric structure.
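The assign-and-update loop behind k-means can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name kmeans, the init and seed parameters, and the toy points are all assumptions introduced here, and real libraries add smarter initialization and vectorized distance computations.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Hypothetical helper: Lloyd-style k-means on a list of tuples.

    init: optional list of k starting centroids; if omitted, k points
    are sampled at random (which is why results depend on the seed).
    """
    if init is not None:
        centroids = [tuple(c) for c in init]
    else:
        centroids = random.Random(seed).sample(points, k)

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = mean_point(members)
    return labels, centroids


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are sampled at random unless init is given, different seeds can number the clusters differently or, on harder data, settle on different partitions. This is the sensitivity to initialization mentioned above.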

Clustering matters because real-world data rarely arrives pre-labeled, yet it often contains usable structure. In customer segmentation, it reveals distinct behavioral groups without prior assumptions. In genomics, it uncovers gene expression patterns linked to disease subtypes. In computer vision, it underlies image segmentation and feature quantization. In natural language processing, it groups semantically similar documents or embeddings. The technique also plays a supporting role in semi-supervised learning and as a preprocessing step before supervised modeling.

Evaluating clustering quality is inherently difficult because there is usually no ground truth to compare against. Internal metrics like the silhouette score and the Davies-Bouldin index measure cohesion and separation within the discovered structure, while external metrics like the adjusted Rand index apply only when reference labels are available for validation. Choosing the right algorithm and the right number of clusters remains as much an art as a science, requiring domain knowledge alongside quantitative guidance.
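As an illustration of an internal metric, the silhouette score can be computed directly from its definition. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster; the point's score is (b - a) / max(a, b), and the overall score is the average over all points. The sketch below assumes Euclidean distance and scores single-member clusters as 0; the function name and toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Hypothetical helper: mean silhouette over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len - 1).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # mixes the groups
print(good, bad)
```

Scores near 1 indicate tight, well-separated clusters, while scores near 0 or below suggest overlapping or misassigned points. Comparing the score across candidate cluster counts is one common form of quantitative guidance, though it does not replace domain judgment.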

Related

Centroid-Based Clustering

Unsupervised learning method grouping data points by proximity to cluster centroids.

Generality: 694
Unsupervised Learning

Machine learning that discovers hidden patterns in data without labeled examples.

Generality: 850
Agglomerative Clustering

A bottom-up hierarchical clustering method that progressively merges similar data points.

Generality: 694
Mixture Model

A probabilistic model representing data as drawn from multiple component distributions.

Generality: 796
Data Mining

Automatically discovering patterns, correlations, and insights from large datasets.

Generality: 836
kNN (k-Nearest Neighbors)

A non-parametric algorithm that classifies points by majority vote of their nearest neighbors.

Generality: 796