
Envisioning is an emerging technology research institute and advisory.



Unembedding

The linear projection that converts a transformer's internal representations back into vocabulary predictions.

Year: 2021 · Generality: 0.45

Unembedding is the final transformation step in language models and other neural architectures that maps learned internal representations back into a human-interpretable output space. In transformer-based language models specifically, the unembedding matrix (sometimes called the output projection or language model head) takes the high-dimensional hidden state produced by the final layer and projects it into a probability distribution over the model's vocabulary. This is typically implemented as a learned weight matrix, and in many modern architectures the same matrix is shared with the input embedding layer — a technique known as weight tying — which reduces parameter count and often improves performance.
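Weight tying, as described above, can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions, not a real model: the names `W_E` and `W_U` follow common interpretability notation, and the tied unembedding is simply the transpose of the input embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_model = 100, 16  # toy sizes; real models use tens of thousands / thousands

# Input embedding: maps a token id to a d_model-dimensional vector.
W_E = rng.normal(size=(vocab_size, d_model))

# Weight tying: the unembedding reuses the same parameters, transposed,
# so no separate output projection matrix is learned.
W_U = W_E.T  # shape (d_model, vocab_size)

token_id = 42
h = W_E[token_id]      # embed the token
logits = h @ W_U       # project straight back to vocabulary scores

# With tied weights, the logit a token assigns to itself is exactly
# the squared norm of its own embedding vector.
assert np.isclose(logits[token_id], h @ h)
```

The parameter saving is the point: a separate output head would add another `vocab_size × d_model` matrix, which for large vocabularies is a substantial fraction of a small model's weights.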

The mechanics of unembedding are straightforward: the hidden state vector is multiplied by the unembedding matrix to produce a vector of raw scores (logits) over all possible output tokens. These logits are then passed through a softmax function to yield probabilities, from which the next token is sampled or selected. Despite its apparent simplicity, the unembedding matrix encodes rich structure — research in mechanistic interpretability has shown that individual directions in the residual stream can be decoded through the unembedding matrix to reveal meaningful semantic content, making it a key tool for understanding what information models have learned to represent internally.
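The hidden-state-to-probabilities pipeline above can be written out directly. A minimal NumPy sketch with toy dimensions; the hidden state and unembedding matrix are random stand-ins for a trained model's values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 50            # toy dimensions for illustration

W_U = rng.normal(size=(d_model, vocab_size))  # unembedding matrix
h = rng.normal(size=(d_model,))               # final-layer hidden state

# 1. Project the hidden state to raw scores (logits) over the vocabulary.
logits = h @ W_U

# 2. Softmax turns logits into a probability distribution
#    (subtracting the max first for numerical stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# 3. Select the next token -- greedily here; sampling from probs is also common.
next_token = int(np.argmax(probs))
```

In practice step 3 is usually temperature-scaled sampling rather than argmax, but the projection and softmax are exactly this computation.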

The concept gained particular relevance in the early 2020s as mechanistic interpretability emerged as a subfield focused on reverse-engineering the computations performed by large language models. Researchers began treating the unembedding matrix not just as an output layer but as a lens for probing intermediate model states — a technique sometimes called the "logit lens." By applying the unembedding matrix to hidden states at intermediate layers, practitioners can observe how a model's token predictions evolve across depth, revealing how information is progressively refined. This perspective has made unembedding central to interpretability research, circuit analysis, and efforts to understand how transformers store and retrieve factual knowledge.
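The logit-lens idea reduces to applying the same unembedding projection at every layer instead of only the last. A hedged sketch: the per-layer hidden states below are random placeholders standing in for a real transformer's residual-stream activations (which would also normally be layer-normalized before decoding).

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab_size, n_layers = 8, 50, 4  # toy dimensions

W_U = rng.normal(size=(d_model, vocab_size))  # unembedding matrix

# Stand-ins for the residual-stream state after each layer.
hidden_states = [rng.normal(size=(d_model,)) for _ in range(n_layers)]

# Logit lens: decode every intermediate state through the unembedding,
# not just the final one, to watch the top prediction evolve with depth.
top_tokens = []
for layer, h in enumerate(hidden_states):
    logits = h @ W_U
    top_tokens.append(int(np.argmax(logits)))
    print(f"layer {layer}: top token id = {top_tokens[-1]}")
```

With a trained model, the sequence of top tokens often converges toward the final prediction over the later layers, which is what makes this a useful probe of how the answer is assembled.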

Related

Embedding
A dense vector representation that encodes semantic relationships between discrete items.
Generality: 0.88

Unified Embedding
A single vector space representation that integrates multiple heterogeneous data types for AI models.
Generality: 0.62

Contextual Embedding
Word representations that dynamically shift meaning based on surrounding context.
Generality: 0.75

Joint Embedding Architecture
A neural network design that maps multiple data modalities into a shared representational space.
Generality: 0.65

Embedding Space
A learned vector space where similar data points cluster geometrically close together.
Generality: 0.79

Encoder-Decoder Transformer
A transformer architecture that encodes input sequences and decodes them into outputs.
Generality: 0.72