Mixture of Experts (MoE)

An architecture routing inputs to specialized sub-networks via a learned gating mechanism.

Year: 1991 · Generality: 724

Mixture of Experts (MoE) is a machine learning architecture that replaces a single monolithic model with a collection of specialized sub-networks, called experts, each trained to handle a distinct region of the input space. A learned gating network sits atop these experts and determines, for each incoming input, which expert or combination of experts should produce the output. Because only a subset of experts is activated per input, MoE models can scale total parameter count dramatically without proportionally increasing inference cost — a property that makes them especially attractive for large-scale language and multimodal models.
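In the usual formulation, an MoE layer with experts E_1, …, E_N and gating weights g(x) computes

y(x) = Σ_i g_i(x) · E_i(x),   with Σ_i g_i(x) = 1,

and sparse variants force most g_i(x) to zero so that only the selected experts are evaluated for a given input.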

During training, the gating network and the expert networks are optimized jointly. The gating network typically outputs a probability distribution over experts, and the final prediction is a weighted combination of expert outputs. In practice, sparse gating — where only the top-k experts are selected and the rest receive zero weight — is preferred because it keeps computation tractable. Auxiliary load-balancing losses are often added to prevent the gating network from collapsing onto a small number of favored experts, ensuring that specialization is distributed across the full ensemble.
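As a rough sketch of how these pieces fit together, the example below shows top-k routing and a load-balancing term in PyTorch. The class and parameter names are invented for illustration, and production systems add batched dispatch, capacity limits, and expert parallelism that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparsely gated MoE layer: top-k routing plus an auxiliary balancing loss."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.gate = nn.Linear(d_model, num_experts)          # learned gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)              # distribution over experts per token
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)      # sparse gating: keep only top-k experts
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)   # renormalize surviving weights

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (topk_i == e).nonzero(as_tuple=True) # tokens routed to expert e
            if tok.numel():
                out[tok] += topk_p[tok, slot].unsqueeze(-1) * expert(x[tok])

        # Auxiliary load-balancing loss: product of each expert's mean router
        # probability and the fraction of tokens whose first choice it is.
        importance = probs.mean(dim=0)
        load = F.one_hot(topk_i[:, 0], self.num_experts).float().mean(dim=0)
        aux_loss = self.num_experts * (importance * load).sum()
        return out, aux_loss
```

The auxiliary loss returned here would typically be added to the main training objective with a small coefficient, so that balancing the experts does not dominate the task loss.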

MoE architectures became central to modern deep learning after sparse gating techniques were scaled to transformer models. Landmark systems such as Google's Switch Transformer and GLaM demonstrated that replacing dense feed-forward layers with MoE layers could achieve state-of-the-art performance at a fraction of the compute budget of equivalent dense models. This insight has since been adopted widely, with MoE layers appearing in frontier language models where hundreds of billions of parameters can be maintained while keeping per-token FLOPs manageable.

The practical appeal of MoE lies in its ability to decouple model capacity from computational cost. By routing each input only through the experts most relevant to it, the architecture achieves a form of conditional computation that dense networks cannot match at scale. This makes MoE a foundational design pattern for building efficient, high-capacity AI systems across language, vision, and multimodal domains.

Related

Mixture of a Million Experts

A sparse architecture routing each input to a tiny fraction of millions of specialized subnetworks.

Generality: 94
MoT (Mixture of Transformers)

An architecture combining multiple specialized transformers to capture richer, more diverse representations.

Generality: 337
Panel-of-Experts

Multiple specialized models or experts collaborate to produce better collective decisions.

Generality: 624
Mixture of Depths

Selective transformer computation that skips layers for tokens needing less processing.

Generality: 468
Mixture Model

A probabilistic model representing data as drawn from multiple component distributions.

Generality: 796
Gating Mechanism

A learned control system that selectively regulates information flow through a neural network.

Generality: 781