Training a compact student model to mimic a larger teacher model's behavior through soft target distributions.
Knowledge distillation transfers capabilities from a large, high-performing model to a smaller, more efficient one by training the student on the teacher's output distributions rather than raw labels.
Instead of learning from one-hot correct answers, the student learns to match the full probability distribution the teacher produces over all possible outputs. A "cat" image might receive 0.7 probability for the cat class and 0.25 for dog, and those relative probabilities tell the student that the teacher treats dog as a far more plausible confusion than any other class. These soft targets carry the so-called dark knowledge: relationships between classes and confidence calibration that hard labels cannot express. Modern approaches often combine temperature-scaled softmax outputs with contrastive learning signals so the student preserves both the teacher's predictions and its learned representations.
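A minimal sketch of the temperature-scaled objective described above, written in PyTorch as an assumed framework; the function name, the temperature of 4.0, and the equal weighting alpha=0.5 are illustrative choices rather than values prescribed by the text:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend a soft-target term (match the teacher) with a hard-label term.

    A higher temperature softens both distributions so the teacher's
    relative probabilities (the "dark knowledge") carry more signal.
    alpha balances the soft-target term against the ground-truth term.
    """
    # Teacher probabilities at raised temperature (soft targets).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    # Student log-probabilities at the same temperature.
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between student and teacher distributions; the T^2
    # factor is a common convention that keeps this term's gradients
    # comparable in scale to the hard-label term.
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the one-hot ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

In practice the teacher logits would be computed once per batch under torch.no_grad() with the teacher in eval mode, so that only the student receives gradient updates.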
Distilled models can achieve 90–95% of the teacher's performance while requiring substantially less compute and memory, but performance gaps remain for rare or ambiguous cases where the teacher's probability distributions are themselves uninformative. The teacher must be substantially larger and more capable than the student for the knowledge transfer to be meaningful, and the student inherits any biases present in the teacher.
The theoretical conditions under which dark knowledge is maximally useful for a student are not fully characterized. Whether certain knowledge types — factual, procedural, stylistic — distill more effectively than others remains an open empirical question. The relationship between teacher-student capacity ratio and downstream task performance is not yet captured by reliable scaling laws.