An unsupervised learning technique that groups similar data points together automatically.
Clustering is an unsupervised machine learning technique that organizes data points into groups, called clusters, such that members within a cluster are more similar to one another than to members of other clusters. Unlike supervised learning, clustering requires no labeled training data — the algorithm discovers structure already present in the dataset. This makes it especially valuable when the underlying categories or patterns are unknown in advance, and the goal is exploratory analysis or data compression rather than prediction against a known target.
Several algorithmic families approach clustering differently. K-means partitions data into a fixed number of clusters by iteratively assigning points to the nearest centroid and updating centroids until convergence — simple and scalable, but sensitive to initialization and biased toward roughly spherical, similarly sized clusters. Hierarchical clustering builds a tree of nested groupings (a dendrogram) by successively merging or splitting clusters, offering interpretability without requiring a predetermined cluster count. Density-based methods like DBSCAN identify clusters as dense regions separated by sparser areas, naturally handling irregular shapes and flagging outliers as noise. Other approaches include Gaussian Mixture Models, which treat clusters as probability distributions, and spectral clustering, which uses graph-theoretic techniques to capture complex geometric structure.
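To make the assign-and-update loop of k-means concrete, here is a minimal NumPy sketch. The function name, the synthetic blob data, and the parameter values are illustrative assumptions, not drawn from the text above; a production implementation (e.g., scikit-learn's KMeans) adds smarter initialization and tolerance-based stopping.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then recompute centroids as cluster means, until assignments stabilize."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy example: three well-separated Gaussian blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 4])])
labels, centroids = kmeans(X, k=3)
```

Because the result depends on the random initialization, practical use typically runs the loop from several different starting centroids and keeps the solution with the lowest within-cluster variance.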
Clustering matters because real-world data rarely arrives pre-labeled, yet structure is almost always present. In customer segmentation, it reveals distinct behavioral groups without prior assumptions. In genomics, it uncovers gene expression patterns linked to disease subtypes. In computer vision, it underlies image segmentation and feature quantization. In natural language processing, it groups semantically similar documents or embeddings. The technique also plays a supporting role in semi-supervised learning and as a preprocessing step before supervised modeling.
Evaluating clustering quality is inherently difficult since there is no ground truth to compare against. Internal metrics like silhouette score and Davies-Bouldin index measure cohesion and separation within the discovered structure, while external metrics like adjusted Rand index apply when reference labels are available for validation. Choosing the right algorithm and the right number of clusters remains as much an art as a science, requiring domain knowledge alongside quantitative guidance.
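As a brief illustration of the metrics named above, the following scikit-learn sketch clusters synthetic data for which reference labels happen to exist, so both internal and external measures can be computed. The dataset and parameter choices are arbitrary assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             adjusted_rand_score)

# Synthetic data with known generating labels (rarely available in practice).
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

# Cluster without using the labels, as in a real unsupervised setting.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette:     ", silhouette_score(X, labels))          # internal; higher is better, range [-1, 1]
print("davies-bouldin: ", davies_bouldin_score(X, labels))      # internal; lower is better
print("adjusted rand:  ", adjusted_rand_score(y_true, labels))  # external; needs reference labels
```

A common use of the internal metrics is to sweep the candidate number of clusters and compare scores across the sweep, treating the result as guidance to weigh against domain knowledge rather than as a definitive answer.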