An architecture that routes inputs to specialized sub-networks via a learned gating mechanism.
Mixture of Experts (MoE) is a machine learning architecture that replaces a single monolithic model with a collection of specialized sub-networks, called experts, each of which learns to handle a different region of the input space. A learned gating network sits atop these experts and determines, for each incoming input, which expert or combination of experts should produce the output. Because only a subset of experts is activated per input, MoE models can scale total parameter count dramatically without a proportional increase in inference cost, a property that makes them especially attractive for large-scale language and multimodal models.
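As a minimal sketch of this mechanism, the following PyTorch module implements a dense MoE layer in which every expert processes every input and the gate weights their outputs. The class and parameter names (DenseMoE, d_model, d_hidden, num_experts) are illustrative rather than taken from any particular system:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    """Minimal dense mixture-of-experts layer (illustrative sketch):
    every expert runs on every input; the gate weights their outputs."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The learned gating network scores each expert per input.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        weights = F.softmax(self.gate(x), dim=-1)                       # (batch, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, d_model)
        # Final output is the gate-weighted combination of expert outputs.
        return torch.einsum("be,bed->bd", weights, expert_out)
```

Running every expert like this defeats the efficiency argument, of course; the sparse variant described next activates only a few experts per input.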
During training, the gating network and the expert networks are optimized jointly. The gating network typically outputs a probability distribution over experts, and the final prediction is a weighted combination of expert outputs. In practice, sparse gating, in which only the top-k experts are selected and the rest receive zero weight, is preferred because it keeps computation tractable. Auxiliary load-balancing losses are often added to prevent the gating network from collapsing onto a small number of favored experts, ensuring that specialization is distributed across the full ensemble.
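A sketch of sparse top-k gating with a load-balancing term is shown below. The auxiliary loss follows the general shape used in Switch Transformer-style routing, pushing the dispatch fraction and mean gate probability per expert toward uniform; exact formulations and coefficients vary across papers, and the function name topk_gate is hypothetical:

```python
import torch
import torch.nn.functional as F

def topk_gate(logits: torch.Tensor, k: int = 2):
    """Keep the k largest gate probabilities per token, renormalize them,
    zero out the rest, and return a load-balancing auxiliary loss."""
    # logits: (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_vals, topk_idx = probs.topk(k, dim=-1)
    gates = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    gates = gates / gates.sum(dim=-1, keepdim=True)  # renormalize over selected experts

    # Load balancing: encourage the fraction of tokens dispatched to each
    # expert (f) and the mean gate probability per expert (P) to be uniform.
    # f is a hard count, so gradients flow only through P.
    num_experts = logits.size(-1)
    with torch.no_grad():
        dispatch = torch.zeros_like(probs).scatter_(-1, topk_idx, 1.0)
        f = dispatch.mean(dim=0)   # (num_experts,)
    P = probs.mean(dim=0)          # (num_experts,)
    aux_loss = num_experts * torch.sum(f * P)
    return gates, topk_idx, aux_loss

# Example: route 8 token representations across 4 experts, 2 active each.
gates, idx, aux = topk_gate(torch.randn(8, 4), k=2)
# `gates` is zero except at each token's top-2 experts; `aux` is added to
# the task loss (scaled by a small coefficient) during training.
```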
MoE architectures became central to modern deep learning after sparse gating techniques were scaled to transformer models. Landmark systems such as Google's Switch Transformer and GLaM demonstrated that replacing dense feed-forward layers with MoE layers could match or exceed the quality of dense models at a fraction of their compute budget. This insight has since been adopted widely, with MoE layers appearing in frontier language models where hundreds of billions of parameters can be maintained while keeping per-token FLOPs manageable.
The practical appeal of MoE lies in its ability to decouple model capacity from computational cost. By routing each input only through the experts most relevant to it, the architecture achieves a form of conditional computation that dense networks cannot match at scale. This makes MoE a foundational design pattern for building efficient, high-capacity AI systems across language, vision, and multimodal domains.
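A back-of-the-envelope calculation makes the decoupling concrete. The sizes below are hypothetical, chosen only to illustrate the ratio between stored and active parameters:

```python
# Hypothetical sizes: an MoE feed-forward layer with 64 experts and top-2
# routing stores the parameters of all 64 experts, but each token only
# pays the compute of the 2 experts it is routed to.
d_model, d_hidden = 4096, 16384
params_per_expert = 2 * d_model * d_hidden      # two linear projections
num_experts, top_k = 64, 2

total_params = num_experts * params_per_expert  # capacity: what is stored
active_params = top_k * params_per_expert       # cost: what runs per token

print(f"total expert params: {total_params / 1e9:.1f}B")   # 8.6B
print(f"active per token:    {active_params / 1e9:.2f}B "  # 0.27B
      f"({100 * top_k / num_experts:.1f}% of total)")      # 3.1%
```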