Envisioning is an emerging technology research institute and advisory.

Distillation Tax

Performance ceiling when training smaller models from larger model outputs

Year: 2024
Generality: 519

Distillation Tax refers to the performance gap between a student model trained through knowledge distillation and the teacher model it learns from. Knowledge distillation, which trains a small model to mimic a large one, is an attractive way to build efficient, deployable models. In practice, however, student models usually underperform their teachers, and the gap can be hard to close by changing the training technique or adding data. This gap is the distillation tax. It is not merely a temporary setback that more training resolves but a structural limitation: the student inherits not only the teacher's knowledge but also its blind spots, failure modes, and implicit biases.
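One simple way to put a number on the gap, offered here as an illustration rather than a standard definition, is to compare teacher and student scores on the same benchmark. The function name `distillation_tax` and the scores below are hypothetical, and the sketch assumes a metric where higher is better.

```python
def distillation_tax(teacher_score, student_score):
    """Return (absolute gap, gap relative to the teacher's score).

    Assumes both scores come from the same benchmark and that a
    higher score is better. This is an illustrative convention,
    not a definition taken from the entry above.
    """
    if teacher_score <= 0:
        raise ValueError("teacher_score must be positive")
    absolute = teacher_score - student_score
    return absolute, absolute / teacher_score


# Hypothetical accuracies, chosen only to show the arithmetic.
absolute, relative = distillation_tax(teacher_score=0.80, student_score=0.72)
print(f"absolute gap: {absolute:.2f}, relative gap: {relative:.0%}")
```

With these made-up numbers the student gives up about 0.08 accuracy points, or roughly 10% of the teacher's score.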

The distillation tax arises because the student learns a compressed approximation of the teacher's behavior. The teacher encodes its capabilities and knowledge in its weights; distillation tries to transfer them by training the student to match the teacher's soft targets (output probability distributions) and, in some methods, its hidden states. The student has fewer parameters and therefore less capacity, so it must trade off what to memorize against what to generalize. If the teacher makes a systematic error or has a particular weakness, the student tends to reproduce it and can even amplify it. The student also cannot learn from the teacher what the teacher does not know: on the distilled signal, it is bounded by the teacher's competence. And even if the transfer were perfect, the student's smaller capacity would keep it from expressing everything the teacher knows. For example, a 7B model trained on outputs from a 70B teacher should be expected to hit a performance ceiling below the teacher's, often significantly so, however careful the distillation process.
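The soft-target part of this process can be sketched in a few lines. The example below is a minimal, pure-Python version of a commonly used distillation objective: a temperature-softened KL term toward the teacher plus a cross-entropy term on the true label. The function names, the `temperature` and `alpha` defaults, and the logits are assumptions made for illustration; hidden-state matching is left out.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    `alpha` and `temperature` are illustrative defaults, not values
    taken from the entry above.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Scaling by temperature**2 is a common convention that keeps the
    # soft term's gradient magnitude comparable across temperatures.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Standard cross-entropy of the student against the true label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
matched = distillation_loss(teacher, teacher, label=0)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher, label=0)
print(matched < mismatched)
```

The soft term is smallest when the student reproduces the teacher's distribution exactly, whether or not that distribution is right. This is one way to see why a teacher's systematic errors become part of what the student is trained toward.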

The distillation tax matters for AI development strategy. As models scale up, pressure grows to produce smaller, cheaper, faster versions. Distillation is a key technique for doing so, but the tax is a reminder that efficiency comes at a cost. Organizations must decide whether a distilled model is sufficient for their use case or whether they need the full capability of the teacher. If the tax is fundamental, then truly capable, efficient models may require alternatives to pure distillation, such as architectural innovation or different training paradigms; otherwise the tradeoff has to be accepted as the cost of deployment.

Related

Distillation

Compressing a large teacher model's knowledge into a smaller, efficient student model.

Generality: 792
Model Distillation

A compression technique that trains a small student model to mimic a larger teacher model.

Generality: 713
Teacher Model

A large, pre-trained model that transfers knowledge to a smaller student model.

Generality: 620
Alignment Tax

Performance cost of making AI models safer and aligned with human values.

Generality: 693
TAID (Temporally Adaptive Interpolated Distillation)

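A distillation technique that trains the student against an intermediate distribution interpolated between the student's and the teacher's, shifting adaptively toward the teacher as training progresses.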

Generality: 380
Saturation Effect

Diminishing performance returns as model complexity or training data increases beyond a threshold.

Generality: 590