A large, pre-trained model that transfers knowledge to a smaller student model.
In machine learning, a teacher model is a large, high-capacity neural network that has been trained to high accuracy on a given task and serves as a source of structured knowledge for training a smaller, more efficient counterpart known as a student model. Rather than training the student model directly on raw labeled data alone, the teacher-student framework leverages the teacher's output distributions — often called "soft labels" or "soft targets" — which encode richer information about class relationships and model uncertainty than hard one-hot labels do. This additional signal helps the student learn more effectively, often achieving performance that would be difficult to reach through conventional supervised training alone.
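The contrast between hard and soft labels can be sketched in a few lines of plain Python. The class names and logit values below are illustrative assumptions, not outputs of any real trained model:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution.

    A temperature > 1 flattens the distribution, exposing the
    teacher's relative confidence in the non-argmax classes.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for an image of a cat over the classes
# [cat, dog, car]: "dog" scores closer to "cat" than "car" does.
teacher_logits = [5.0, 3.0, -2.0]

hard_label = [1, 0, 0]                   # one-hot ground truth: just "cat"
soft_label = softmax(teacher_logits, temperature=4.0)

# The soft label preserves the cat/dog similarity that the
# one-hot label throws away.
print([round(p, 3) for p in soft_label])
```

Here the soft label assigns substantial probability to "dog" and almost none to "car", which is exactly the kind of class-relationship signal the student cannot get from the one-hot target alone.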
The mechanics of knowledge distillation, the primary context in which teacher models operate, involve minimizing a loss function that combines the student's error on ground-truth labels with a divergence measure between the student's and teacher's output distributions. Both models' logits are typically divided by a temperature parameter before the softmax, softening the resulting probability distributions and amplifying the informational content of low-probability predictions. This allows the student to absorb nuanced patterns the teacher has internalized — for instance, the degree to which two classes are visually or semantically similar — rather than simply learning which class is most likely.
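A minimal sketch of that combined loss, in plain Python. The temperature, weighting factor, and logit values are illustrative assumptions; a real implementation would compute this over batches with an autodiff framework:

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, true_index):
    """Student's error on the hard ground-truth label."""
    return -math.log(probs[true_index])

def kl_divergence(p, q):
    """KL(p || q): divergence of student distribution q from teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      T=2.0, alpha=0.5):
    """Weighted sum of hard-label loss and soft-target divergence.

    The T**2 factor is the conventional rescaling that keeps the
    soft-target gradients comparable in magnitude across temperatures.
    """
    hard = cross_entropy(softmax(student_logits), true_index)
    soft = kl_divergence(softmax(teacher_logits, T),
                         softmax(student_logits, T))
    return alpha * hard + (1 - alpha) * (T ** 2) * soft

# Illustrative logits only (not from real trained models):
loss = distillation_loss(student_logits=[2.0, 1.0, 0.1],
                         teacher_logits=[5.0, 3.0, -2.0],
                         true_index=0)
```

Note that when the student's logits match the teacher's exactly, the soft term vanishes, so the divergence piece of the loss is zero precisely when the student has fully absorbed the teacher's output distribution.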
Teacher models matter because they make it practical to deploy powerful AI capabilities in resource-constrained environments. A teacher might be a massive transformer or ensemble that is too slow or memory-intensive for edge devices, mobile applications, or real-time inference systems. By distilling its knowledge into a compact student, practitioners can retain much of the teacher's predictive power at a fraction of the computational cost. Beyond compression, the teacher-student paradigm has expanded into self-distillation, where a model teaches itself across training stages, and into semi-supervised learning, where an unlabeled dataset is annotated by the teacher before the student trains on it. These extensions have made the teacher model concept a versatile and widely adopted tool across modern deep learning workflows.
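The semi-supervised extension mentioned above can be sketched as a pseudo-labeling step: the teacher annotates unlabeled inputs, and only its confident predictions become training data for the student. The toy classifier and confidence threshold below are illustrative assumptions, not a specific library API:

```python
import math

def pseudo_label(teacher_predict, unlabeled, threshold=0.9):
    """Keep only examples the teacher labels confidently."""
    labeled = []
    for x in unlabeled:
        probs = teacher_predict(x)
        best = max(range(len(probs)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            labeled.append((x, best))    # (input, pseudo-label) pair
    return labeled

# Toy "teacher": classifies a number as class 0 (negative) or
# class 1 (non-negative), and is least confident near zero.
def toy_teacher(x):
    p1 = 1 / (1 + math.exp(-x))          # logistic confidence in class 1
    return [1 - p1, p1]

unlabeled = [-5.0, -0.1, 0.2, 4.0]
train_set = pseudo_label(toy_teacher, unlabeled)
# Only the confidently classified points (-5.0 and 4.0) pass the
# threshold; the student would then train on `train_set`.
```

The threshold trades coverage for label quality: lowering it gives the student more data but lets more of the teacher's uncertain (and potentially wrong) predictions through.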