Envisioning is an emerging technology research institute and advisory.



Knowledge Distillation

Training a compact student model to mimic a larger teacher model's behavior through soft target distributions.

Year: 2015 · Generality: 880

Knowledge distillation transfers capabilities from a large, high-performing model to a smaller, more efficient one by training the student on the teacher's output distributions rather than raw labels.

Instead of learning from one-hot correct answers, the student learns the full probability distribution over all possible outputs that the teacher produces. A "cat" image might receive 0.7 probability for the cat class and 0.25 for dog; training on those relative probabilities teaches the student which classes the teacher considers similar and how confident it is. These soft targets capture dark knowledge: inter-class relationships and confidence calibration that one-hot labels cannot express. Modern approaches often combine temperature-scaled softmax outputs with contrastive learning signals to preserve both the teacher's predictions and its learned representations.
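The standard recipe behind this paragraph can be sketched in a few lines: soften both teacher and student outputs with a temperature-scaled softmax, then mix a KL-divergence term on the soft targets with ordinary cross-entropy on the hard label. A minimal pure-Python sketch (the temperature, alpha, and logit values here are illustrative assumptions, not values from the text):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution,
    exposing the teacher's relative preferences among non-top classes."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence between the
    temperature-softened teacher and student distributions.
    The T^2 factor rescales soft-target gradients (they shrink as 1/T^2)
    so both terms stay on a comparable scale."""
    student_soft = softmax(student_logits, temperature)
    teacher_soft = softmax(teacher_logits, temperature)
    # KL(teacher || student) over the softened distributions
    kl = sum(p * math.log(p / q)
             for p, q in zip(teacher_soft, student_soft) if p > 0)
    # Standard cross-entropy against the one-hot label (T = 1)
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * ce + (1 - alpha) * (temperature ** 2) * kl
```

When the student's logits match the teacher's exactly, the KL term vanishes and only the hard-label cross-entropy remains; as the distributions diverge, the soft-target term grows, pulling the student toward the teacher's full output distribution rather than just its top prediction.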

Distilled models can achieve 90–95% of the teacher's performance while requiring substantially less compute and memory, but performance gaps remain for rare or ambiguous cases where the teacher's probability distributions are themselves uninformative. The teacher must be substantially larger and more capable than the student for the knowledge transfer to be meaningful, and the student inherits any biases present in the teacher.

The theoretical conditions under which dark knowledge is maximally useful for a student are not fully characterized. Whether certain knowledge types — factual, procedural, stylistic — distill more effectively than others remains an open empirical question. The relationship between teacher-student capacity ratio and downstream task performance is not yet captured by reliable scaling laws.