Selective state space model architecture replacing attention with linear-time sequence modeling
Mamba is a sequence modeling architecture developed by Albert Gu and Tri Dao in 2023 that challenges the dominance of attention-based transformers. Rather than computing attention scores between every pair of tokens, an O(n²) operation in sequence length, Mamba uses a state space model (SSM) framework that scales linearly with sequence length, enabling much longer contexts at lower computational cost.
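The linear scaling comes from the SSM recurrence itself: the model makes a single pass over the sequence, performing one constant-cost state update per token. A minimal sketch (shapes and parameter names are illustrative, not Mamba's actual implementation):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state space recurrence: one pass over the sequence,
    so total cost grows linearly with length (illustrative shapes)."""
    h = np.zeros(A.shape[0])          # hidden state summarizing all past tokens
    ys = []
    for x_t in x:                     # O(n): one constant-cost update per token
        h = A @ h + B @ x_t           # state update folds in the new input
        ys.append(C @ h)              # output is a projection of the state
    return np.stack(ys)

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
L, d_in, d_state, d_out = 6, 3, 4, 2
x = rng.normal(size=(L, d_in))
A = 0.9 * np.eye(d_state)             # stable decay, chosen for the demo
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
y = ssm_scan(x, A, B, C)              # y.shape == (6, 2)
```

Contrast this with self-attention, where producing the output for position n requires comparing against all n previous positions, giving quadratic total cost.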
The critical innovation in Mamba is its selection mechanism, which makes the model content-aware. Previous SSMs used fixed, input-independent (time-invariant) parameters that treated all inputs identically. Mamba instead makes key parameters data-dependent: the model learns to control, per token, how much of the current input is written into its hidden state and how the state is read out. This lets the same architecture focus on relevant information without explicit attention weights. Because the selection parameters are computed from the input in a learnable way, the effective receptive field becomes dynamic and input-dependent rather than fixed in advance.
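Concretely, in a Mamba-style selective SSM the step size and the state-write and state-read projections are recomputed from each token before the recurrence is applied. The sketch below follows that idea with simplified, assumed names and shapes (it is not the paper's exact parameterization or its hardware-aware scan):

```python
import numpy as np

def selective_scan(x, A, W_delta, W_B, W_C):
    """Sketch of a selective SSM over an (L, D) input.

    A: (D, N) diagonal state decay per channel; W_delta: (D, D);
    W_B, W_C: (N, D). delta, B_t, C_t are derived from each token x_t,
    which is what makes the recurrence content-aware."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    ys = np.empty((L, D))
    for t in range(L):
        x_t = x[t]
        delta = np.logaddexp(0.0, W_delta @ x_t)   # softplus: positive per-channel step size
        B_t = W_B @ x_t                            # input-dependent write projection
        C_t = W_C @ x_t                            # input-dependent read projection
        A_bar = np.exp(delta[:, None] * A)         # discretized decay, (D, N)
        h = A_bar * h + (delta[:, None] * B_t[None, :]) * x_t[:, None]
        ys[t] = h @ C_t                            # per-channel readout, (D,)
    return ys

rng = np.random.default_rng(1)
L, D, N = 5, 3, 4
x = rng.normal(size=(L, D))
A = -np.abs(rng.normal(size=(D, N)))   # negative entries keep exp(delta * A) in (0, 1]
y = selective_scan(x, A,
                   rng.normal(size=(D, D)),
                   rng.normal(size=(N, D)),
                   rng.normal(size=(N, D)))
```

A small step size makes the update nearly ignore the current token, while a large one overwrites much of the state with it; learning delta per token is how the model "selects" what to remember.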
Mamba processes tokens through a recurrent state update: each token updates a fixed-size hidden state, and output is generated through a learned projection of that state. This recurrent structure, combined with selection, recovers much of the in-context pattern matching that made transformers successful while preserving the linear scaling of SSMs. Reported experiments show Mamba-based models matching or exceeding transformers of similar size on language modeling tasks while maintaining better memory efficiency and throughput on very long sequences. The architecture has implications for real-time inference, long-context understanding, and domains such as genomics and time series, where sequence lengths make quadratic attention prohibitive.
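The fixed-size state is what makes the real-time inference angle attractive: at generation time each new token can be folded into the state in constant time and memory, whereas a transformer's key-value cache grows with context length. A hypothetical streaming sketch (non-selective parameters used only for brevity; names are assumptions):

```python
import numpy as np

class RecurrentStepper:
    """Streaming-inference sketch: the entire context is compressed into a
    fixed-size state, so each new token costs O(1) time and memory."""
    def __init__(self, A, B, C):
        self.A, self.B, self.C = A, B, C
        self.h = np.zeros(A.shape[0])   # constant-size state, regardless of context length

    def step(self, x_t):
        self.h = self.A @ self.h + self.B @ x_t   # fold the new token into the state
        return self.C @ self.h                    # readout for this position

rng = np.random.default_rng(2)
d_in, d_state, d_out = 3, 4, 2
stepper = RecurrentStepper(0.9 * np.eye(d_state),
                           rng.normal(size=(d_state, d_in)),
                           rng.normal(size=(d_out, d_state)))
outs = [stepper.step(rng.normal(size=d_in)) for _ in range(5)]   # 5 tokens, O(1) each
```

During training, the same recurrence can instead be evaluated in parallel over the whole sequence with a scan, which is how linear-time SSMs stay efficient on accelerators.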