Envisioning is an emerging technology research institute and advisory.


Mixture of a Million Experts

A sparse architecture routing each input to a tiny fraction of millions of specialized subnetworks.

Year: 2021
Generality: 94

Mixture of a Million Experts refers to an extreme-scale variant of the Mixture-of-Experts (MoE) architecture, where a learned routing mechanism selects only a small handful of active experts — out of hundreds of thousands or even millions — to process each input token. This design exploits the principle of conditional computation: rather than passing every input through every parameter, the model activates only the relevant specialists, keeping per-token floating-point operations comparable to a much smaller dense model while the total parameter count scales dramatically. The result is a system that can store vastly more knowledge and handle greater linguistic or domain diversity than a uniformly dense network of equivalent compute cost.
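The scaling arithmetic behind conditional computation can be made concrete with a back-of-the-envelope sketch. The sizes below are hypothetical, chosen only to illustrate how total parameters grow while per-token cost stays flat:

```python
# Hypothetical sizes for illustration only — not from any specific model.
num_experts = 1_000_000      # total experts in the sparse layer
active_experts = 2           # experts routed per token (top-k)
params_per_expert = 10_000   # parameters in each tiny expert subnetwork

total_params = num_experts * params_per_expert
active_params = active_experts * params_per_expert

print(f"total parameters: {total_params:,}")    # 10,000,000,000
print(f"active per token: {active_params:,}")   # 20,000
print(f"active fraction:  {active_params / total_params:.6%}")
```

With these numbers, the layer stores ten billion parameters but touches only twenty thousand per token — the per-token compute of a very small dense layer, which is the trade the architecture is built around.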

The routing mechanism is central to how these systems work. A gating function — typically a learned softmax with top-k selection, a hashing scheme, or a hybrid approach — maps each token to a small subset of experts and aggregates their outputs, usually as a weighted sum. At this scale, naive routing becomes a serious engineering challenge: ensuring that experts receive roughly equal traffic (load balancing) is critical, since collapsed or overloaded routing wastes capacity and destabilizes training. Practical solutions include auxiliary load-balancing losses, expert dropout, and hierarchical or hash-based routing that reduces the metadata overhead of managing millions of expert assignments. Systems-level concerns — all-to-all communication across devices, memory fragmentation, topology-aware expert placement, and efficient checkpointing — are equally important and have driven innovations in distributed training frameworks.
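A minimal sketch of softmax top-k routing with an auxiliary load-balancing term may help. This is an illustration in NumPy, not any particular system's implementation; the gate weights are random stand-ins for learned parameters, and the balancing loss follows the common fraction-routed × mean-probability form as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_route(x, W_gate, k=2):
    """Route each token to its top-k experts via a learned softmax gate.

    x:      (tokens, d_model) token representations
    W_gate: (d_model, num_experts) gating weights (stand-in for learned params)
    Returns top-k expert indices, renormalized gate weights, and an
    auxiliary load-balancing loss (assumed form for this sketch).
    """
    logits = x @ W_gate                                # (tokens, experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)         # softmax over experts

    top_idx = np.argsort(probs, axis=-1)[:, -k:]       # top-k expert ids
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    top_p /= top_p.sum(axis=-1, keepdims=True)         # weights for the sum

    # Load-balancing term: penalize experts that attract both a large
    # share of routed tokens and a high average routing probability.
    num_experts = W_gate.shape[1]
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    load = counts / top_idx.size                       # fraction of slots
    importance = probs.mean(axis=0)                    # mean gate probability
    aux_loss = num_experts * np.sum(load * importance)
    return top_idx, top_p, aux_loss

tokens, d_model, num_experts = 8, 16, 32
x = rng.normal(size=(tokens, d_model))
W_gate = rng.normal(size=(d_model, num_experts))
idx, w, loss = top_k_route(x, W_gate, k=2)
print(idx.shape, w.shape)  # (8, 2) (8, 2)
```

Each selected expert's output would then be scaled by its entry in `w` and summed. At million-expert scale the `argsort` over a dense probability vector becomes the bottleneck, which is exactly why hash-based and hierarchical routing schemes exist.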

The appeal of million-expert MoEs lies in their ability to specialize at a granularity impossible for dense models. Rare languages, niche domains, and long-tail knowledge can each be handled by dedicated expert subnetworks without penalizing the majority of inputs. This makes the architecture particularly attractive for massively multilingual models, personalized or federated expert pools, and modular multimodal systems. Researchers have also noted improved sample efficiency for rare-context modeling, since relevant experts can be updated selectively rather than diffusing gradient signal across all parameters.

The concept builds on MoE ideas from the early 1990s, but became practically relevant to large-scale deep learning with the sparsely-gated MoE work of 2017 and accelerated through the Switch Transformer, GShard, and subsequent extreme-scale efforts in the early 2020s, when routing algorithms and distributed infrastructure finally made million-expert counts tractable. The field continues to evolve rapidly, with ongoing work on quantized or offloaded expert weights, dynamic expert addition, and better optimization methods for sparse gradient paths.

Related

Mixture of Experts (MoE)
An architecture routing inputs to specialized sub-networks via a learned gating mechanism.
Generality: 724

MoT (Mixture of Transformers)
An architecture combining multiple specialized transformers to capture richer, more diverse representations.
Generality: 337

Mixture of Depths
Selective transformer computation that skips layers for tokens needing less processing.
Generality: 468

Panel-of-Experts
Multiple specialized models or experts collaborate to produce better collective decisions.
Generality: 624

Router
A mechanism that directs queries to the most suitable model or component in a multi-model system.
Generality: 521

Scalable MatMul-free Language Modeling
Language modeling architectures that replace matrix multiplication with cheaper, scalable alternatives.
Generality: 111