Envisioning is an emerging technology research institute and advisory.

Distillation

Compressing a large teacher model's knowledge into a smaller, efficient student model.

Year: 2015
Generality: 792

Knowledge distillation is a model compression technique in which a compact "student" model is trained to replicate the behavior of a larger, more capable "teacher" model. Rather than training the student solely on ground-truth labels, distillation leverages the teacher's output probability distributions — often called soft labels — which encode nuanced information about class relationships and model uncertainty. For example, a teacher classifying an image of a cat might assign small but nonzero probabilities to related categories like "lynx" or "tiger," and these inter-class similarities provide a richer training signal than a simple one-hot label ever could.
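To make the contrast between a one-hot label and a soft label concrete, here is a minimal NumPy sketch. The class names and teacher logits are made-up illustrative values, not the output of any real model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical classes and teacher logits for a single image of a cat.
classes = ["cat", "lynx", "tiger", "truck"]
teacher_logits = np.array([6.0, 3.0, 2.0, -2.0])

hard_label = np.array([1.0, 0.0, 0.0, 0.0])  # one-hot ground truth
soft_label = softmax(teacher_logits)         # teacher's full distribution

# The soft label keeps "cat" on top but still ranks "lynx" and "tiger"
# above "truck", which the one-hot label cannot express.
for name, hard, soft in zip(classes, hard_label, soft_label):
    print(f"{name:>5}: hard={hard:.0f}  soft={soft:.4f}")
```

In this sketch the student would be trained to match `soft_label` rather than only `hard_label`.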

The mechanics of distillation typically involve minimizing a weighted combination of two losses: a standard cross-entropy loss against the true labels, and a divergence loss (often KL divergence) between the student's and teacher's output distributions. The divergence term is computed with both models' logits divided by a "temperature" greater than 1 before the softmax, which softens the probability peaks and amplifies the informative structure in the teacher's predictions. More advanced variants extend this idea beyond output logits by matching intermediate feature representations, attention maps, or relational structures between data points; these approaches are collectively known as feature-based or relation-based distillation.
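The two-term objective described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: the function names, the weighting parameter `alpha`, the default `temperature=4.0`, and the example logits are all hypothetical. Multiplying the soft term by the squared temperature is a common convention for keeping its gradient scale comparable to the hard-label term.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature (last axis)."""
    scaled = logits / temperature
    shifted = scaled - np.max(scaled, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """alpha * CE(labels, student) + (1 - alpha) * T^2 * KL(teacher || student).

    `alpha` and `temperature` are illustrative hyperparameters (assumptions).
    """
    n = student_logits.shape[0]

    # Hard-label term: cross-entropy against the true classes at T = 1.
    log_p_student = log_softmax(student_logits)
    cross_entropy = -np.mean(log_p_student[np.arange(n), labels])

    # Soft-label term: KL divergence between temperature-softened outputs.
    log_p_teacher_t = log_softmax(teacher_logits, temperature)
    log_p_student_t = log_softmax(student_logits, temperature)
    p_teacher_t = np.exp(log_p_teacher_t)
    kl = np.mean(np.sum(p_teacher_t * (log_p_teacher_t - log_p_student_t),
                        axis=-1))

    return alpha * cross_entropy + (1.0 - alpha) * temperature**2 * kl

# Made-up logits for a batch of one example with four classes.
teacher_logits = np.array([[6.0, 3.0, 2.0, -2.0]])
student_logits = np.array([[2.0, 1.0, 0.5, 0.0]])
labels = np.array([0])

loss = distillation_loss(student_logits, teacher_logits, labels)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the same expression would be written in an autodiff framework so that gradients flow only into the student; the teacher's logits are treated as constants.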

Distillation matters most at deployment time. Large models like GPT-scale transformers or ResNet-family vision networks deliver strong performance but are expensive to run at inference time. Distillation offers a principled path to shrinking these models for use on mobile devices, embedded systems, or latency-sensitive APIs, rather than training a small model from scratch on limited data. Notable real-world examples include DistilBERT, which retains roughly 97% of BERT's language understanding performance with 40% fewer parameters, and TinyBERT, which applies distillation at multiple layers for even greater compression.

Beyond compression, distillation has found broader applications: it is used in ensemble learning to merge multiple models into one, in semi-supervised learning to let a teacher trained on labeled data supply targets for unlabeled data, and increasingly in reinforcement learning and generative modeling. This versatility has made it a widely adopted technique in modern machine learning pipelines.
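As an illustration of the ensemble case, one simple option is to average the teachers' softened probabilities and use the result as the student's soft targets. The sketch below assumes that averaging scheme; it is only one of several possible ways to combine teachers, and the function names, temperature, and logits are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature (last axis)."""
    scaled = logits / temperature
    shifted = scaled - np.max(scaled, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def ensemble_soft_targets(teacher_logits_list, temperature=2.0):
    """Average the temperature-softened distributions of several teachers."""
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    return np.mean(probs, axis=0)

# Made-up logits from three teachers for a batch of two examples, three classes.
teachers = [
    np.array([[4.0, 1.0, 0.0], [0.5, 2.5, 0.0]]),
    np.array([[3.0, 2.0, 0.0], [0.0, 3.0, 1.0]]),
    np.array([[5.0, 0.5, 0.5], [1.0, 2.0, 0.5]]),
]

targets = ensemble_soft_targets(teachers)
print(targets.shape)        # one distribution per example
print(targets.sum(axis=1))  # each row is a valid probability distribution
```

A single student trained against `targets` then stands in for the whole ensemble at inference time.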

Related

Model Distillation

A compression technique that trains a small student model to mimic a larger teacher model.

Generality: 713
Distillation Tax

Performance ceiling when training smaller models from larger model outputs.

Generality: 519
Teacher Model

A large, pre-trained model that transfers knowledge to a smaller student model.

Generality: 620
TAID (Temporally Adaptive Interpolated Distillation)

A distillation technique that trains the student against an intermediate distribution interpolated between student and teacher, shifting toward the teacher as training progresses.

Generality: 380
Model Compression

Techniques that shrink machine learning models while preserving predictive accuracy.

Generality: 795
Teacher Committee

An ensemble of expert models that jointly guide a student model's training.

Generality: 520