An architecture combining multiple specialized transformers to capture richer, more diverse representations.
Mixture of Transformers (MoT) is a neural network architecture that combines multiple transformer models, each specializing in different aspects of the input data, and aggregates their outputs to produce richer, more expressive representations than any single transformer could achieve alone. The approach draws on principles from ensemble learning and mixture-of-experts modeling, extending them to the transformer paradigm. Rather than relying on one monolithic model to handle all patterns in the data, MoT distributes the representational burden across several transformer components, allowing each to develop complementary strengths.
At the core of MoT is a routing or gating mechanism that determines how inputs are directed to, or weighted across, the constituent transformers. In some formulations, all transformers process the input and their outputs are combined via learned weights; in others, only a subset of transformers is activated for any given input, so compute scales with the number of active transformers rather than the total. This selective activation mirrors the sparse gating approach popularized by Mixture of Experts (MoE) models, but applies it at the level of full transformer blocks rather than individual feed-forward layers.
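The two gating variants above can be sketched in a few lines of numpy. This is a minimal illustration, not any particular published implementation: `TinyTransformerBlock` is a stand-in for a full transformer, and the class name, gate parameterization, and `top_k` interface are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class TinyTransformerBlock:
    """Stand-in for a full transformer: one linear map + ReLU (illustrative only)."""
    def __init__(self, d):
        self.W = rng.standard_normal((d, d)) / np.sqrt(d)
    def __call__(self, x):
        return np.maximum(x @ self.W, 0.0)

class MixtureOfTransformers:
    """Hypothetical MoT layer: a learned gate either weights all expert
    transformers (dense mixture) or keeps only the top-k per input (sparse)."""
    def __init__(self, d, n_experts, top_k=None):
        self.experts = [TinyTransformerBlock(d) for _ in range(n_experts)]
        self.Wg = rng.standard_normal((d, n_experts)) / np.sqrt(d)
        self.top_k = top_k  # None => dense mixture; k => sparse activation

    def __call__(self, x):               # x: (batch, d)
        gate = softmax(x @ self.Wg)      # (batch, n_experts)
        if self.top_k is not None:
            # Zero out all but the k largest gate weights per input, renormalize.
            drop = np.argsort(gate, axis=-1)[:, :-self.top_k]
            np.put_along_axis(gate, drop, 0.0, axis=-1)
            gate = gate / gate.sum(axis=-1, keepdims=True)
        out = np.zeros_like(x)
        for e, expert in enumerate(self.experts):
            active = gate[:, e] > 0      # skip experts no input was routed to
            if active.any():
                out[active] += gate[active, e:e+1] * expert(x[active])
        return out

x = rng.standard_normal((4, 8))
dense = MixtureOfTransformers(8, n_experts=3)(x)          # all experts run
sparse = MixtureOfTransformers(8, n_experts=3, top_k=1)(x)  # one expert per input
print(dense.shape, sparse.shape)  # (4, 8) (4, 8)
```

The sparse path is where the efficiency comes from: experts whose gate weight is zero for every input in the batch are never evaluated at all.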
MoT architectures are particularly valuable in settings where data is heterogeneous or multi-modal — for example, tasks that require jointly reasoning over text, images, or structured data. By allowing different transformers to specialize on different modalities or feature types, the overall model can adapt more flexibly to complex inputs. This specialization also tends to improve generalization, since the ensemble of transformers is less likely to overfit to any single pattern or representation strategy.
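One simple way to realize this specialization is hard routing by modality tag, where each input carries a label that selects its expert. The sketch below assumes that setup; the `encode` stand-in, the expert dictionary, and the `mot_forward` name are illustrative, and a learned gate could replace the tag lookup.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(x, W):
    """Stand-in for a modality-specific transformer encoder."""
    return np.tanh(x @ W)

d = 16
# Hypothetical setup: one expert transformer (here, one weight matrix) per modality.
experts = {m: rng.standard_normal((d, d)) / np.sqrt(d)
           for m in ("text", "image", "table")}

def mot_forward(batch):
    """Route each item to the expert for its modality, then stack the outputs.
    `batch` is a list of (modality, vector) pairs; all vectors share dimension d."""
    return np.stack([encode(x, experts[m]) for m, x in batch])

batch = [("text", rng.standard_normal(d)),
         ("image", rng.standard_normal(d)),
         ("text", rng.standard_normal(d))]
out = mot_forward(batch)
print(out.shape)  # (3, 16)
```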
The practical appeal of MoT lies in its balance between capacity and efficiency. Scaling a single transformer to handle diverse tasks typically requires enormous parameter counts and compute budgets. MoT offers an alternative path: effective capacity grows with the number of specialized transformers, while sparse activation keeps per-input inference cost close to that of only the selected experts. As large-scale language and vision models have pushed the boundaries of what single-architecture transformers can do, MoT and related mixture-based designs have emerged as a compelling direction for building more capable and efficient systems.
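The capacity-versus-compute tradeoff is easiest to see with back-of-the-envelope numbers. The figures below are purely hypothetical, chosen only to show the ratio between total and active parameters under top-k routing.

```python
# Hypothetical sizing: 8 expert transformers of 1.0B parameters each,
# with top-2 sparse routing, so only 2 experts run per input.
n_experts, params_per_expert, top_k = 8, 1.0e9, 2

total_params = n_experts * params_per_expert   # model capacity: 8.0B parameters
active_params = top_k * params_per_expert      # per-input compute: 2.0B parameters

# Capacity grows 4x faster than per-input cost in this configuration.
print(total_params / active_params)  # 4.0
```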