A training strategy where a model selectively queries the most informative unlabeled examples to learn efficiently.
Active learning is a machine learning paradigm in which a model actively participates in its own training by selecting which data points it wants labeled, rather than passively consuming a pre-labeled dataset. The core motivation is practical: in many real-world domains — medical imaging, legal document analysis, scientific literature — acquiring raw data is cheap but obtaining expert annotations is expensive and slow. By strategically choosing which examples to present to a human annotator, an active learner aims to achieve high accuracy with far fewer labeled samples than standard supervised learning would require.
The selection process is driven by query strategies that estimate which unlabeled examples would be most informative if labeled. Uncertainty sampling picks the examples the model is least confident about, such as those lying near a decision boundary; common uncertainty measures are least confidence, the margin between the top two classes, and predictive entropy. Query-by-committee maintains an ensemble of models and selects the examples on which the members disagree most. Expected model change and expected error reduction choose the examples that would most alter the model's parameters or most reduce generalization error, respectively. In practice, uncertainty sampling is the most widely used because it is simple and computationally cheap, while the more sophisticated strategies, which typically cost more computation per query, tend to be reserved for cases where the labeling budget is extremely tight.
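The scoring step behind uncertainty sampling and query-by-committee can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the model exposes a list of class probabilities per unlabeled example, and the function names (least_confidence, margin_uncertainty, entropy, vote_entropy, select_queries) are hypothetical names chosen for this example.

```python
import math


def least_confidence(probs):
    """1 minus the top class probability; higher means more uncertain."""
    return 1.0 - max(probs)


def margin_uncertainty(probs):
    """1 minus the gap between the two most probable classes.

    A small gap means the model can barely separate its top two guesses,
    so the score is flipped to make higher mean more uncertain.
    """
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])


def entropy(probs):
    """Shannon entropy in nats; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote_entropy(votes, n_classes):
    """Query-by-committee disagreement: entropy of the committee's hard votes.

    `votes` holds one predicted class index per committee member.
    """
    counts = [votes.count(c) for c in range(n_classes)]
    return entropy([c / len(votes) for c in counts])


def select_queries(pool_probs, k, score=entropy):
    """Return the indices of the k pool examples with the highest score."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: score(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Assumed toy predictions for three unlabeled examples (binary task).
pool_probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
print(select_queries(pool_probs, k=1))  # prints [1]
```

All three uncertainty scores agree on binary problems and can rank examples differently when there are more than two classes, which is one reason the choice of score is usually treated as a tunable design decision.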
Active learning has become increasingly relevant as deep learning models demand massive labeled datasets that are costly to produce. It integrates naturally with semi-supervised learning and self-supervised pretraining, where a small, actively selected labeled set fine-tunes a model pretrained on abundant unlabeled data. Applications span drug discovery, autonomous driving data curation, and low-resource NLP. Two persistent challenges remain. The first is the cold-start problem: the model needs some initial labeled data before it can make meaningful queries. The second is sampling bias: because queried examples are not drawn at random, the labeled set can misrepresent the data distribution, particularly if the query strategy systematically avoids certain regions of the input space.
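A toy pool-based loop makes the cold-start step and the bias risk concrete. The sketch below rests on strong simplifying assumptions: inputs are numbers in [0, 1], the "model" is a single threshold, and oracle is a hypothetical stand-in for a human annotator whose true boundary (0.5) is hidden from the learner. None of these names come from a particular library.

```python
import random


def oracle(x):
    """Hypothetical annotator; the learner never sees the 0.5 boundary directly."""
    return 1 if x > 0.5 else 0


def fit_threshold(labeled):
    """Toy model: midpoint between the largest negative and smallest positive.

    Falls back to the ends of [0, 1] when a class has no labels yet.
    """
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    low = max(negatives) if negatives else 0.0
    high = min(positives) if positives else 1.0
    return (low + high) / 2.0


def active_learning_loop(pool, budget, seed_size=2, rng=None):
    """Spend `budget` labels in total and return (threshold, labeled examples)."""
    rng = rng or random.Random(0)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Cold start: with no model yet, the first labels are chosen at random.
    labeled = [(x, oracle(x)) for x in unlabeled[:seed_size]]
    unlabeled = unlabeled[seed_size:]

    for _ in range(budget - seed_size):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # Uncertainty sampling: query the point closest to the current boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))

    return fit_threshold(labeled), labeled


pool = [i / 1000 for i in range(1001)]
threshold, labeled = active_learning_loop(pool, budget=20)
print(threshold, len(labeled))
```

After the random seed set, every query in this sketch lands near the estimated boundary, so the labeled set ends up concentrated in one narrow region of the input space. That is harmless for a single-threshold problem but is the same mechanism that produces sampling bias in realistic settings.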