Training models using small labeled datasets together with much larger unlabeled datasets.
Semi-supervised learning is a machine learning paradigm that trains models using a combination of a small amount of labeled data and a much larger pool of unlabeled data. It occupies the space between supervised learning, which requires fully labeled datasets, and unsupervised learning, which operates without any labels at all. This approach is especially valuable in domains where labeling data is costly, slow, or requires specialized expertise — such as medical imaging, speech recognition, or natural language processing — while raw unlabeled data is abundant and cheap to collect.
The core assumption underlying most semi-supervised methods is that the structure of the unlabeled data carries meaningful information about the underlying data distribution, which can guide the learning process. Several techniques exploit this idea in different ways. Self-training iteratively uses a model's own high-confidence predictions on unlabeled examples as pseudo-labels, expanding the effective training set over successive rounds. Co-training trains two models on different feature views of the data, allowing each to label examples for the other. Graph-based methods construct a similarity graph over all data points and propagate labels from labeled nodes to unlabeled ones. More recently, consistency regularization methods — used in approaches like MixMatch, FixMatch, and Unsupervised Data Augmentation (UDA) — enforce that a model's predictions remain stable under data augmentation, leveraging unlabeled data to improve generalization.
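As a concrete illustration of the self-training idea, the sketch below fits a classifier on the labeled set, pseudo-labels the unlabeled points it is most confident about, and repeats. It is a minimal example rather than any particular library's implementation; the choice of LogisticRegression, the 0.95 confidence threshold, and the number of rounds are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    """Simple self-training loop: grow the labeled set with confident pseudo-labels."""
    X, y = X_labeled.copy(), y_labeled.copy()
    pool = X_unlabeled.copy()
    model = LogisticRegression(max_iter=1000)

    for _ in range(rounds):
        model.fit(X, y)
        if len(pool) == 0:
            break

        # Predict class probabilities on the remaining unlabeled pool.
        proba = model.predict_proba(pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing confident enough to pseudo-label this round

        # Move confident examples into the training set with predicted labels.
        pseudo_labels = model.classes_[proba[confident].argmax(axis=1)]
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, pseudo_labels])
        pool = pool[~confident]

    return model
```

The confidence threshold is the key knob in this kind of loop: a high threshold admits fewer pseudo-labels per round but reduces the risk that early mistakes are reinforced in later rounds, which is the main failure mode of self-training.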
The practical impact of semi-supervised learning has grown substantially with deep learning. Modern methods can approach the performance of fully supervised models using only a fraction of the labeled examples, dramatically reducing annotation costs. This is particularly significant in fields like healthcare, where expert annotation is scarce and expensive, or in low-resource languages where labeled corpora are limited.
Semi-supervised learning also connects closely to related paradigms such as self-supervised learning, transfer learning, and active learning, all of which address the challenge of learning effectively when labeled data is scarce. As datasets grow larger and labeling bottlenecks persist, semi-supervised techniques remain a critical tool for building capable models under real-world constraints.