
Mixture of Depths

Selective transformer computation that skips layers for tokens needing less processing

Year: 2024 · Generality: 468

Mixture of Depths (MoD) is a technique that applies different amounts of computational processing to different tokens within a transformer model, rather than processing all tokens through all layers equally. The key insight is that not all tokens require the same computational depth to produce high-quality outputs. Some tokens—such as common words, punctuation, or redundant information—can be processed effectively through fewer transformer layers, while others benefit from deeper processing. This selective routing saves computation without sacrificing quality.

Mixture of Depths works by dynamically routing tokens to different processing depths. A routing mechanism, typically learned during training, decides at each layer whether a token should be processed by that layer or skip it, carrying its current representation forward unchanged along the residual stream. This is similar to mixture-of-experts (MoE) approaches, but it routes along the depth dimension rather than across parallel expert networks. Skipped tokens avoid unnecessary computation, while tokens that need deeper reasoning pass through more layers. The routing is differentiable, allowing the entire system to be trained end-to-end. The result is a variable-depth computation graph in which different tokens traverse different paths through the network, balancing latency and accuracy.
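To make the routing concrete, here is a minimal sketch of a MoD-style block in PyTorch. The names and gating formula (MoDBlock, capacity, the sigmoid-weighted update) are illustrative assumptions, not the reference implementation: a learned router scores each token, the top fraction of tokens is processed by the block, and the rest skip it via the residual stream.

```python
# Minimal sketch of a Mixture-of-Depths style block (illustrative, not the
# paper's code). A learned router selects a fraction of tokens per block to
# process; the remaining tokens skip the block and keep their residual-stream
# representation unchanged.
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, dim: int, block: nn.Module, capacity: float = 0.5):
        super().__init__()
        self.block = block          # any (B, T, D) -> (B, T, D) transformer block
        self.router = nn.Linear(dim, 1)
        self.capacity = capacity    # fraction of tokens this block processes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))

        scores = self.router(x).squeeze(-1)                  # (B, T) routing scores
        topk = scores.topk(k, dim=-1).indices                # tokens chosen for processing
        mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(1, topk, True)

        # Run only the selected tokens through the block; they attend among
        # themselves, which is what keeps the compute savings real.
        selected = x[mask].view(B, k, D)
        processed = self.block(selected)

        # Scale the block's update by the (sigmoid of the) router score so
        # gradients flow back into the router; unselected tokens pass through.
        gate = torch.sigmoid(scores[mask]).view(B, k, 1)
        out = x.clone()
        out[mask] = (selected + gate * (processed - selected)).view(-1, D)
        return out
```

Interleaving such routed blocks with ordinary dense blocks, and handling autoregressive sampling (where top-k over a whole sequence is not causal), are further details this sketch omits.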

Mixture of Depths matters for economic and practical reasons. Transformers are compute-intensive; large language models require enormous amounts of matrix multiplication. If 30-40% of tokens can skip most layers without quality degradation, the overall computation budget drops significantly. This has immediate implications for inference cost, latency (important for real-time applications), and carbon footprint. It also points toward more efficient AI systems: the future of scaling may not be "bigger models" but "smarter routing." MoD represents a shift from uniform computation to adaptive computation, where the system learns to allocate resources where they matter most.
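A back-of-envelope calculation makes the budget intuition concrete; the layer counts and capacity below are assumptions chosen for illustration, not measurements.

```python
# Rough estimate of compute saved when some blocks process only a fraction of
# tokens. All numbers are illustrative assumptions.
n_layers = 24          # total transformer blocks
routed_layers = 12     # blocks that use MoD routing (e.g. every other block)
capacity = 0.5         # fraction of tokens each routed block processes

dense_cost = n_layers                                        # relative cost, no routing
mod_cost = (n_layers - routed_layers) + routed_layers * capacity
print(f"relative compute: {mod_cost / dense_cost:.2f}")      # 0.75 -> ~25% saved
```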

Related

MoT (Mixture of Transformers)

An architecture combining multiple specialized transformers to capture richer, more diverse representations.

Generality: 337
Mixture of Experts (MoE)

An architecture routing inputs to specialized sub-networks via a learned gating mechanism.

Generality: 724
Mixture of a Million Experts

A sparse architecture routing each input to a tiny fraction of millions of specialized subnetworks.

Generality: 94
Scalable MatMul-free Language Modeling

Language modeling architectures that replace matrix multiplication with cheaper, scalable alternatives.

Generality: 111
DLMs (Deep Language Models)

Deep neural networks trained to understand, generate, and translate human language.

Generality: 796
L2M (Large Memory Model)

A decoder-only Transformer with addressable auxiliary memory enabling reasoning far beyond its attention window.

Generality: 189