Envisioning is an emerging technology research institute and advisory.

Model Distillation

A compression technique that trains a small student model to mimic a larger teacher model.

Year: 2015
Generality: 713

Model distillation is a knowledge transfer technique in which a compact "student" neural network is trained to replicate the behavior of a larger, more capable "teacher" model. Rather than learning only from raw training labels, the student is also trained on the teacher's output distribution: the soft probabilities the teacher assigns across all classes. These soft labels carry richer information than hard ground-truth labels, because they encode the teacher's learned sense of similarity between categories (for instance, that a given image is 90% likely to be a cat but also somewhat dog-like). This richer supervisory signal lets the student absorb nuanced representations that would be difficult to learn from sparse one-hot labels alone.
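To make the contrast between hard and soft labels concrete, here is a minimal sketch in plain Python. The class names and teacher logits are made-up values for illustration only, not taken from any real model; the entropy comparison is just one rough way to show that the soft label carries extra information.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical three-class problem and invented teacher scores for one image.
classes = ["cat", "dog", "truck"]
teacher_logits = [5.0, 2.8, -1.0]

hard_label = [1.0, 0.0, 0.0]          # one-hot ground truth: "cat"
soft_label = softmax(teacher_logits)  # the teacher's full distribution

for name, p in zip(classes, soft_label):
    print(f"{name}: {p:.3f}")

# The one-hot label has zero entropy; the soft label does not, because it
# also says how plausible the teacher finds each wrong class.
print("hard-label entropy:", entropy_bits(hard_label))
print("soft-label entropy:", round(entropy_bits(soft_label), 3))
```

In this toy case the soft label ranks "dog" well above "truck", which is exactly the similarity information a one-hot label discards.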

The mechanics of distillation typically involve a temperature-scaled softmax: the logits are divided by a temperature greater than 1 before the softmax is applied, which smooths the output distribution and gives low-probability classes more weight in the training signal. The same temperature is normally applied to both teacher and student when their softened outputs are compared. The student minimizes a combined loss: a standard cross-entropy term against ground-truth labels and a distillation term measuring divergence from the teacher's softened outputs. The relative weighting of these two terms is a key hyperparameter. More advanced variants extend beyond output matching to intermediate-layer distillation, where the student is also trained to reproduce the teacher's internal feature representations, attention maps, or hidden states.
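The combined loss described above can be sketched for a single example as follows. This is an illustrative sketch, not a reference implementation: the names `alpha` and `temperature`, their default values, the choice of KL divergence as the divergence measure, and the multiplication of the soft term by the squared temperature (a common convention for keeping gradient magnitudes comparable) are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(probs[true_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a hard-label term and a soft-label term.

    alpha (assumed name) weights the ground-truth cross-entropy;
    (1 - alpha) weights the divergence from the teacher.
    """
    # Hard term: ordinary cross-entropy at temperature 1.
    hard = cross_entropy(softmax(student_logits), true_index)
    # Soft term: compare softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft

# Invented logits for one training example whose true class is index 0.
teacher = [6.0, 2.0, -1.0]
student = [2.5, 1.5, 0.0]
print(round(distillation_loss(student, teacher, true_index=0), 4))
```

Setting `alpha` to 1 recovers ordinary supervised training, while `alpha` at 0 trains the student purely to match the teacher.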

Model distillation has become a cornerstone technique for deploying large-scale models in resource-constrained environments such as mobile devices, embedded systems, and real-time inference pipelines. It has proven especially impactful in natural language processing, where distilled models like DistilBERT retain roughly 97% of BERT's language-understanding performance while being 40% smaller and 60% faster at inference. Beyond compression, distillation can improve generalization by acting as a form of regularization, and it underpins ensemble distillation, where knowledge from multiple teacher models is consolidated into a single efficient network.
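One simple way to picture ensemble distillation is to average the softened predictions of several teachers and use the result as the student's soft target. The sketch below assumes that averaging scheme, a hypothetical helper named `ensemble_soft_targets`, and invented logits; other ways of combining teachers exist.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(teacher_logits_list, temperature=2.0):
    """Average the softened distributions of several teachers, class by class."""
    per_teacher = [softmax(logits, temperature) for logits in teacher_logits_list]
    n = len(per_teacher)
    return [sum(class_probs) / n for class_probs in zip(*per_teacher)]

# Invented logits from three hypothetical teachers for the same input.
teachers = [
    [4.0, 1.0, 0.0],
    [3.0, 2.5, -1.0],
    [5.0, 0.5, 0.5],
]
target = ensemble_soft_targets(teachers)
print([round(p, 3) for p in target])
```

The averaged distribution can then stand in for a single teacher's output in the distillation term of the student's loss.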

Related

Distillation

Compressing a large teacher model's knowledge into a smaller, efficient student model.

Generality: 792
Teacher Model

A large, pre-trained model that transfers knowledge to a smaller student model.

Generality: 620
Distillation Tax

The performance ceiling a smaller model faces when trained on a larger model's outputs.

Generality: 519
Model Compression

Techniques that shrink machine learning models while preserving predictive accuracy.

Generality: 795
TAID (Temporally Adaptive Interpolated Distillation)

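A distillation technique that trains the student against an intermediate target distribution, which shifts adaptively from the student's own distribution toward the teacher's over the course of training.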

Generality: 380
Post-Training

Techniques applied after initial training to refine, compress, or adapt neural networks.

Generality: 694