Compressing a large teacher model's knowledge into a smaller, efficient student model.
Knowledge distillation is a model compression technique in which a compact "student" model is trained to replicate the behavior of a larger, more capable "teacher" model. Rather than training the student solely on ground-truth labels, distillation leverages the teacher's output probability distributions — often called soft labels — which encode nuanced information about class relationships and model uncertainty. For example, a teacher classifying an image of a cat might assign small but nonzero probabilities to related categories like "lynx" or "tiger," and these inter-class similarities provide a richer training signal than a simple one-hot label ever could.
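The soft-label idea can be sketched with a toy example. The class names and logit values below are invented for illustration; any real teacher would produce its own distribution, but the qualitative picture is the same: the correct class dominates while visually similar classes receive small, nonzero probability that a one-hot label would discard.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for an image of a cat.
classes = ["cat", "lynx", "tiger", "car"]
teacher_logits = [6.0, 3.5, 3.0, -2.0]

soft_labels = softmax(teacher_logits)
one_hot = [1.0, 0.0, 0.0, 0.0]  # the ground-truth label: no similarity structure

for name, p in zip(classes, soft_labels):
    print(f"{name}: {p:.3f}")
```

Here "lynx" and "tiger" end up orders of magnitude more probable than "car", and it is exactly this relative structure that the student learns to imitate.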
The mechanics of distillation typically involve minimizing a combination of two losses: a standard cross-entropy loss against the true labels, and a divergence loss (often KL divergence) between the student's and teacher's output distributions, computed at a raised "temperature" that softens the probability peaks and amplifies the informative structure in the teacher's predictions. More advanced variants extend this idea beyond output logits, matching intermediate feature representations, attention maps, or relational structures between data points — approaches collectively known as feature-based or relation-based distillation.
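The two-part loss can be written out in a few lines. This is a minimal stdlib sketch of the Hinton-style objective, not a production implementation: the temperature `T`, the mixing weight `alpha`, and the example logits are illustrative choices, and the conventional `T**2` factor is included to keep the soft-label gradients on a comparable scale as the temperature is raised.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the peaks."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_idx, T=4.0, alpha=0.5):
    """alpha * CE(student, hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label cross-entropy, computed at T = 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[true_idx])

    # Soft-label KL divergence between teacher and student at temperature T.
    p_teacher_T = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher_T, p_student_T))

    return alpha * ce + (1 - alpha) * T**2 * kl

# Illustrative logits: the student roughly tracks the teacher but not exactly.
loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 1.0, -0.5], true_idx=0)
print(f"combined loss: {loss:.4f}")
```

Note that when the student's logits match the teacher's, the KL term vanishes and only the cross-entropy part remains, which is one quick sanity check on an implementation.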
Distillation matters enormously in practical deployment. Large models like GPT-scale transformers or ResNet-family vision networks deliver strong performance but are expensive to run at inference time. Distillation offers a principled path to shrinking these models for use on mobile devices, embedded systems, or latency-sensitive APIs without retraining from scratch on limited data. Notable real-world examples include DistilBERT, which retains roughly 97% of BERT's language understanding performance with 40% fewer parameters and roughly 60% faster inference, and TinyBERT, which applies distillation at multiple layers for even greater compression.
Beyond compression, distillation has found broader applications: it is used in ensemble learning to merge multiple models into one, in semi-supervised learning to propagate knowledge from labeled to unlabeled data, and increasingly in reinforcement learning and generative modeling. Its versatility and effectiveness have made it one of the most widely adopted techniques in modern machine learning pipelines.