Selective transformer computation that skips layers for tokens needing less processing
Mixture of Depths (MoD) is a technique that applies different amounts of computation to different tokens within a transformer model, rather than processing every token through every layer equally. The key insight is that not all tokens require the same computational depth to produce high-quality outputs. Some tokens, such as common words, punctuation, or redundant spans, can be handled well by fewer transformer layers, while others benefit from deeper processing. Routing tokens selectively saves computation without sacrificing output quality.
Mixture of Depths works by routing tokens to different processing depths dynamically. A routing mechanism (typically a small learned router trained jointly with the model) decides at each block whether a token should receive that block's full computation or skip it, carrying its current representation forward along the residual stream. This is similar to mixture-of-experts (MoE) approaches, but the routing happens along the depth dimension rather than across parallel expert networks. Tokens that skip blocks avoid unnecessary computation, while tokens that need deeper processing continue through more layers. The routing is differentiable, allowing the entire system to be trained end-to-end. The result is a variable-depth computation graph in which different tokens traverse different paths through the network, reducing compute and latency while preserving accuracy.
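The per-block routing decision can be sketched in a few lines of plain Python. This is a minimal toy model, not the paper's implementation: `mod_block`, the scalar "tokens", the given router `scores`, and the fixed `capacity` fraction are all illustrative assumptions. The router sends only the top-k highest-scoring tokens through the block's computation; the rest pass through unchanged, standing in for the residual skip path.

```python
def mod_block(tokens, scores, process, capacity=0.5):
    """Toy Mixture-of-Depths block: route only the top-k highest-scoring
    tokens through `process`; the rest skip the block via the residual
    path (identity). `capacity` is the fraction of tokens processed."""
    k = max(1, int(len(tokens) * capacity))
    # indices of the k tokens the router sends through the block
    chosen = set(
        sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    )
    # processed tokens get the block's transformation; skipped tokens
    # keep their current representation
    return [process(t) if i in chosen else t for i, t in enumerate(tokens)]

# Toy usage: "processing" doubles a token's value; router scores are given.
tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.2, 0.8]   # router favors tokens 1 and 3
out = mod_block(tokens, scores, process=lambda t: t * 2, capacity=0.5)
# tokens 1 and 3 are processed; tokens 0 and 2 skip the block
```

A fixed top-k capacity, as used here, is one way real implementations keep the compute graph static per block even though which tokens are processed varies.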
Mixture of Depths matters for economic and practical reasons. Transformers are compute-intensive: large language models spend enormous amounts of compute on matrix multiplication. If 30-40% of tokens can exit after just a few layers without quality degradation, the overall computation budget drops significantly. This has immediate implications for inference cost, latency (important for real-time applications), and carbon footprint. It also points toward more efficient AI systems: the future of scaling may lie not in bigger models but in smarter routing. MoD represents a shift from uniform computation to adaptive computation, where the system learns to allocate resources where they matter most.
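The budget arithmetic behind that claim is simple to make concrete. The sketch below uses illustrative numbers (a 35% exit fraction and exit after 25% of the layers, both assumptions, not measured results) to show how per-token FLOPs fall relative to a model that runs every token through every layer.

```python
def relative_flops(exit_fraction, exit_depth_fraction):
    """Per-token FLOPs relative to full-depth processing, if
    `exit_fraction` of tokens traverse only `exit_depth_fraction`
    of the layers and the rest traverse all of them."""
    return (1 - exit_fraction) * 1.0 + exit_fraction * exit_depth_fraction

# Illustrative: 35% of tokens stop after 25% of the layers.
savings = 1 - relative_flops(0.35, 0.25)   # fraction of FLOPs saved
```

Under these assumed numbers the model does roughly a quarter less work overall, which compounds directly into lower inference cost and latency.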