A lightweight model trained on internal representations to reveal what a neural network has learned.
A probe is a diagnostic technique in machine learning interpretability where a simple secondary model — typically a linear classifier or shallow network — is trained on the internal activations of a larger, pre-trained model. The goal is to test whether specific types of information, such as part-of-speech tags, syntactic structure, or semantic properties, are encoded within a particular layer's representations. By holding the target model's weights fixed and only training the probe, researchers can attribute any predictive success to the information already present in the representations rather than to the probe's own capacity. This makes probing a relatively controlled method for interrogating what a model has implicitly learned during training.
Probing became especially prominent with the rise of large pretrained language models like ELMo and BERT in the late 2010s, where researchers sought to understand why contextual embeddings transferred so effectively across tasks. Studies using probes revealed that different layers of these models encode qualitatively different linguistic properties — lower layers capturing surface-level features and higher layers encoding more abstract semantic content. This layered structure of learned representations was not obvious from model architecture alone, and probing provided a tractable empirical window into it.
Despite its utility, probing has important limitations. A probe's success does not necessarily mean the target model actively uses that information during inference — it only demonstrates that the information is recoverable from the representations. Critics have also noted that a sufficiently expressive probe can learn to extract information that is only weakly or incidentally present, overstating how robustly the model encodes it. These concerns have spurred refinements such as minimum description length probes and control tasks, which help calibrate how much a probe's accuracy reflects genuine encoding versus probe-side learning. Probing remains a widely used tool in mechanistic interpretability, model evaluation, and the study of transfer learning.
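The control-task idea mentioned above can be illustrated with the same kind of synthetic setup: train one probe on real labels and a second probe of identical capacity on randomly assigned labels, then compare. A large gap ("selectivity") suggests the probe's accuracy reflects information genuinely present in the representations rather than the probe's own fitting capacity. This is a hedged sketch under the same synthetic assumptions as before, not a reproduction of any specific study's protocol.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = rng.normal(size=(300, 16))               # frozen activations (synthetic)
direction = rng.normal(size=16)
real_labels = (hidden @ direction > 0).astype(float)          # property encoded
control_labels = rng.integers(0, 2, size=300).astype(float)   # random control task

def probe_accuracy(acts, y, lr=0.5, steps=300):
    """Train a logistic-regression probe and return its training accuracy."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        w -= lr * acts.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return np.mean(((acts @ w + b) > 0) == y.astype(bool))

real_acc = probe_accuracy(hidden, real_labels)
control_acc = probe_accuracy(hidden, control_labels)
# Selectivity: accuracy attributable to encoded structure, not probe capacity.
selectivity = real_acc - control_acc
```

Here the control probe can only memorize noise, so its accuracy stays near chance while the real-task probe succeeds, yielding positive selectivity. An overly expressive probe would shrink this gap by fitting the random labels too, which is precisely the failure mode control tasks are designed to expose.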