A mechanistic interpretability technique that extends sparse autoencoders to analyze features across model layers and across models.
Sparse crosscoders are a mechanistic interpretability technique that extends sparse autoencoders (SAEs) to operate across multiple layers, or even multiple models, simultaneously. While a standard SAE learns to reconstruct the activations of a single layer using a sparse set of learned features, a crosscoder learns one shared dictionary of features that reads from and reconstructs activations at several layers at once, or at corresponding layers of two distinct models. This cross-layer or cross-model design allows researchers to identify features and computational structures that persist or transform across depth, rather than examining each layer in isolation.
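As a rough illustration, the sketch below shows what such a shared-dictionary architecture might look like in PyTorch. The class name `Crosscoder`, the tensor shapes, and the initialization scale are assumptions made for the example, not a reference implementation.

```python
# Minimal sketch of a cross-layer crosscoder (illustrative; not from any specific library).
import torch
import torch.nn as nn


class Crosscoder(nn.Module):
    """One shared dictionary of sparse features that reads from and
    reconstructs activations at several layers (or several models)."""

    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        # Per-layer encoder weights: each layer's activations contribute to the same features.
        self.W_enc = nn.Parameter(torch.randn(n_layers, d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        # Per-layer decoder weights: the same feature writes a different direction at each layer.
        self.W_dec = nn.Parameter(torch.randn(n_layers, n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(n_layers, d_model))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_layers, d_model) -- activations at corresponding positions.
        # Sum per-layer encoder contributions, then apply a ReLU to get sparse feature activations.
        pre = torch.einsum("bld,ldf->bf", acts, self.W_enc) + self.b_enc
        feats = torch.relu(pre)                          # (batch, n_features), mostly zeros
        # Reconstruct every layer's activations from the single shared feature vector.
        recon = torch.einsum("bf,lfd->bld", feats, self.W_dec) + self.b_dec
        return feats, recon
```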
The core mechanism relies on the same sparse coding objective familiar from SAEs: a dictionary of learned feature vectors is used to decompose neural network activations into a small number of active components at any given time. Because a single sparse feature vector must account for activations at every layer (or every model) it is trained on, crosscoders can reveal which features are shared, which are transformed, and which emerge or disappear as information flows through a network. This makes them especially useful for studying how representations evolve across layers and for comparing the internal structure of different models trained on similar tasks.
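A hedged sketch of a corresponding training objective, building on the module above: reconstruction error is summed over the layers (or models) being modeled, and the sparsity penalty weights each feature's activity by the norm of its decoder directions. That weighting is one common choice rather than a fixed standard, and the coefficient value here is illustrative.

```python
# Hypothetical training loss for the Crosscoder sketched above.
import torch


def crosscoder_loss(model, acts: torch.Tensor, l1_coeff: float = 3e-4):
    feats, recon = model(acts)
    # Reconstruction error summed over layers (or models), averaged over the batch.
    recon_loss = (recon - acts).pow(2).sum(dim=(1, 2)).mean()
    # Sparsity penalty: each feature's activation scaled by the total norm of its
    # decoder vectors across layers (an assumed, commonly used weighting).
    dec_norms = model.W_dec.norm(dim=-1).sum(dim=0)      # (n_features,)
    sparsity = (feats * dec_norms).sum(dim=-1).mean()
    return recon_loss + l1_coeff * sparsity

# Usage sketch (shapes assumed): collect acts of shape (batch, n_layers, d_model)
# from hooks on the model(s) being studied, then minimize this loss with any
# standard optimizer such as Adam.
```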
Sparse crosscoders have become a valuable tool in the broader effort to reverse-engineer large language models. They enable researchers to ask questions like: do two models that behave similarly actually use similar internal representations? How does a feature present in an early layer get transformed or utilized by later layers? These questions are central to understanding model generalization, capability transfer, and the mechanisms behind emergent behaviors. The technique also has practical implications for model editing and steering, since identifying shared or transformed features can inform targeted interventions.
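One simple comparative analysis of this kind, sketched below under assumed shapes and thresholds, is to train a crosscoder on paired activations from two models and then sort features by how strongly their decoder directions write into each model, separating apparently shared features from model-specific ones. The thresholds and function name are illustrative.

```python
# Sketch of a decoder-norm comparison for "model diffing"; values are illustrative assumptions.
import torch


def classify_features(W_dec: torch.Tensor, eps: float = 1e-8):
    # W_dec: (2, n_features, d_model) -- decoder directions for model A (index 0)
    # and model B (index 1) from a crosscoder trained on both.
    norms = W_dec.norm(dim=-1)                        # (2, n_features)
    rel = norms[0] / (norms[0] + norms[1] + eps)      # 1.0 = A-only, 0.0 = B-only
    shared = (rel > 0.3) & (rel < 0.7)                # writes substantially into both models
    a_only = rel >= 0.7
    b_only = rel <= 0.3
    return shared, a_only, b_only
```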
As a relatively recent development within mechanistic interpretability, sparse crosscoders represent a natural evolution of the SAE paradigm toward more relational and comparative analyses of neural network internals. Their ability to bridge layers and models positions them as a promising tool for building a more systematic understanding of how modern deep learning systems represent and process information.