Performance ceiling when training smaller models from larger model outputs
Distillation Tax refers to the fundamental performance gap between a student model trained through knowledge distillation and the teacher model from which it learns. Knowledge distillation—training a small model to mimic a large one—is an attractive approach for creating efficient, deployable models. However, empirical evidence shows that student models consistently underperform their teachers, with a gap that can be difficult to close regardless of training technique or data quantity. This gap is the distillation tax. It's not merely a temporary setback that more training resolves, but a structural limitation: the student inherits not just the teacher's knowledge, but also its blind spots, failure modes, and implicit biases.
The distillation tax manifests because the student learns a compressed approximation of the teacher's behavior. The teacher model encodes capabilities and knowledge in its weights; distillation attempts to extract and transfer this through soft targets (the teacher's output probability distributions) and, in some variants, by matching hidden states. The student, however, has fewer parameters and lower capacity, and must trade off what to memorize against what to generalize. If the teacher makes a systematic error or has a particular weakness, the student often inherits and sometimes amplifies it. Nor can the student invent knowledge the teacher lacks; its competence is upper-bounded by the teacher's. Even with perfect knowledge transfer, the student's smaller capacity means it cannot express everything the teacher knows. As a result, a 7B model trained on outputs from a 70B teacher will have a performance ceiling well below the teacher, no matter how careful the distillation process.
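To make the mechanism concrete, here is a minimal sketch of the soft-target objective commonly used in knowledge distillation, assuming a PyTorch setup. The function name, the `temperature` and `alpha` values, and the toy tensors are illustrative choices, not details from the text above.

```python
# Sketch of soft-target distillation (assumed PyTorch setup; values are illustrative).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend KL divergence to the teacher's softened distribution
    with ordinary cross-entropy on the hard labels."""
    # Soft targets: the teacher's full probability distribution, softened by T.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage: random logits standing in for a frozen teacher and a smaller student.
vocab, batch = 100, 8
teacher_logits = torch.randn(batch, vocab)
student_logits = torch.randn(batch, vocab, requires_grad=True)
labels = torch.randint(0, vocab, (batch,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```

Note that the student's only signal from the teacher is `soft_teacher`: whatever systematic errors or biases the teacher's distribution contains are passed straight into the student's training target, which is one way the tax arises.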
The distillation tax matters for AI development strategy. As models scale up, there is pressure to create smaller, cheaper, faster versions, and distillation is a key technique for doing so; the tax is a reminder that this efficiency comes at a cost. Organizations must decide: is a distilled model sufficient for our use case, or do we need the full capability? The existence of a fundamental tax also suggests that truly capable, efficient models may require alternatives to pure distillation, whether through architecture innovation, different training paradigms, or simply accepting the gap as the cost of deployment.