A compression technique that trains a small student model to mimic a larger teacher model.
Model distillation is a knowledge transfer technique in which a compact "student" neural network is trained to replicate the behavior of a larger, more capable "teacher" model. Rather than learning only from hard training labels, the student is trained on the teacher's output distribution: the soft probabilities assigned across all classes. These soft labels carry richer information than hard ground-truth labels, encoding the teacher's learned sense of similarity between categories (for instance, that a given image is 90% likely to be a cat, with most of the remaining probability on dog rather than on unrelated classes). This richer supervisory signal allows the student to absorb nuanced representations that would be difficult to learn from sparse one-hot labels alone.
The mechanics of distillation typically involve a temperature-scaled softmax: the teacher's logits are divided by a temperature greater than 1 before the softmax, which smooths the output distribution and raises the relative weight of low-probability classes. The student's logits are softened with the same temperature when the two distributions are compared. The student minimizes a combined loss: a standard cross-entropy term against ground-truth labels and a distillation term, usually a KL divergence or cross-entropy, measuring divergence from the teacher's softened outputs. The relative weighting of these two terms is a key hyperparameter, as is the temperature itself. More advanced variants extend beyond output matching to intermediate-layer distillation, where the student is also trained to reproduce the teacher's internal feature representations, attention maps, or hidden states.
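The combined loss described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `softmax` and `distillation_loss`, the default values of `temperature` and `alpha`, the choice of KL divergence for the distillation term, and the multiplication of that term by the squared temperature (a common convention for keeping gradient magnitudes comparable) are all choices made for this example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    scaled = np.asarray(logits, dtype=float) / temperature
    # Subtract the row maximum for numerical stability; the result is unchanged.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a hard-label term and a soft-label term.

    alpha weights the cross-entropy against ground-truth labels;
    (1 - alpha) weights the KL divergence from the teacher's softened
    distribution to the student's. Both defaults are illustrative.
    """
    eps = 1e-12  # guards against log(0)
    labels = np.asarray(labels)
    n = labels.shape[0]

    # Hard-label term: cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    hard = -np.log(student_probs[np.arange(n), labels] + eps).mean()

    # Soft-label term: KL(teacher || student), both softened with the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = (p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps))).sum(axis=-1)
    soft = kl.mean()

    # Assumed convention: scale the soft term by temperature squared.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft

# Example: a batch of two samples over three classes.
teacher = np.array([[4.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
student = np.array([[2.0, 1.5, 0.5], [0.5, 1.0, 1.0]])
labels = np.array([0, 1])
print(distillation_loss(student, teacher, labels))
```

Raising `temperature` flattens both distributions, so the student is pushed to match how the teacher ranks the unlikely classes as well as its top prediction. Setting `alpha` to 1 recovers ordinary supervised training, and setting it to 0 trains on the teacher's outputs alone.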
Model distillation has become a cornerstone technique for deploying large-scale models in resource-constrained environments such as mobile devices, embedded systems, and real-time inference pipelines. It has proven especially impactful in natural language processing, where distilled models like DistilBERT retain roughly 97% of BERT's language-understanding performance while being 40% smaller and 60% faster at inference. Beyond compression, distillation can improve generalization by acting as a form of regularization, and it underpins ensemble distillation, where knowledge from multiple teacher models is consolidated into a single efficient network.