An attention mechanism combining persistent memory tokens with sparse connectivity for efficient long-range modeling.
Memory Sparse Attention is a hybrid attention mechanism designed to address two fundamental limitations of standard dense self-attention: its quadratic computational cost with respect to sequence length, and its inability to retain information beyond the current context window. The approach fuses sparse attention patterns — where each token attends to only a structured subset of other tokens rather than all of them — with a set of persistent "memory" tokens that are always attended to regardless of sparsity constraints. This combination allows the model to maintain global information pathways through the memory slots while dramatically reducing the number of attention operations required for local or positional reasoning.
In practice, Memory Sparse Attention layers partition the attention computation into two components. The sparse component uses patterns such as local windows, strided or dilated connections, or learned dynamic routing to limit each query's key-value lookups to O(√n) or O(log n) positions. The memory component appends a small fixed set of trainable or contextually updated vectors to the key-value pool, ensuring every token can access a compressed global summary of the sequence. During training, gradients flow through both pathways, encouraging the memory tokens to distill long-range dependencies that the sparse local pattern would otherwise miss.
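The two-component computation described above can be sketched as a single-head attention function in which every query sees the persistent memory slots plus a local window of neighbors. This is a minimal illustrative sketch, not a reference implementation: the function name, the fixed-width window pattern, and the choice to prepend memory vectors to the key-value pool are all assumptions for clarity (real implementations use fused kernels, multiple heads, and separate key/value projections).

```python
import numpy as np

def memory_sparse_attention(x, mem, window=2):
    """Single-head sketch (hypothetical API): each of the n tokens in `x`
    (shape (n, d)) attends to positions within +/- `window` of itself plus
    all m persistent memory vectors in `mem` (shape (m, d))."""
    n, d = x.shape
    m = mem.shape[0]
    kv = np.concatenate([mem, x], axis=0)        # memory slots join the key-value pool
    scores = x @ kv.T / np.sqrt(d)               # (n, m + n) scaled dot-product scores
    mask = np.full((n, m + n), -np.inf)
    mask[:, :m] = 0.0                            # memory tokens are always attended to
    for i in range(n):                           # sparse local window: |i - j| <= window
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, m + lo:m + hi] = 0.0
    w = np.exp(scores + mask)                    # disallowed positions get weight 0
    w /= w.sum(axis=-1, keepdims=True)           # softmax over allowed positions only
    return w @ kv

rng = np.random.default_rng(0)
out = memory_sparse_attention(rng.normal(size=(8, 4)), rng.normal(size=(2, 4)))
print(out.shape)  # (8, 4)
```

Note that each query row attends to at most `2 * window + 1 + m` positions regardless of sequence length, which is what makes the per-query cost sub-linear in n for structured patterns like strided or dilated windows.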
This architecture is particularly valuable for processing very long documents, genomic sequences, audio waveforms, and other modalities where sequence lengths reach tens of thousands of tokens. Because the memory token count is small (typically 16–512 vectors), the overhead it adds is negligible relative to the savings from sparsification. Models like Longformer, BigBird, and related variants incorporate memory-like global tokens alongside their sparse local attention, demonstrating strong empirical performance on tasks requiring both fine-grained local pattern recognition and broad contextual understanding.
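To make the overhead claim concrete, a back-of-the-envelope count of attention score computations helps. The sequence length, window size, and memory count below are illustrative values chosen within the ranges mentioned above, not figures from any particular model.

```python
n = 32768            # sequence length (tens of thousands of tokens)
w = 256              # local window size per query (assumed sparse pattern)
m = 128              # persistent memory tokens (within the typical 16-512 range)

dense = n * n        # dense self-attention: every token scores every token
sparse = n * (w + m) # sparse + memory: window plus global memory slots per token

print(dense / sparse)  # ≈ 85.3, i.e. roughly 85x fewer score computations
```

Even though the memory tokens add `n * m` extra scores, that term is dwarfed by the quadratic cost removed by sparsification, which is why the memory pathway is effectively free at these scales.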
The significance of Memory Sparse Attention extends beyond efficiency: it offers a principled inductive bias that mirrors how biological and cognitive systems separate working memory from moment-to-moment perception. As language models scale to handle book-length contexts and multimodal inputs, mechanisms that decouple global memory from local computation are increasingly central to practical deployment, enabling transformers to operate within realistic hardware budgets without sacrificing the representational power that makes attention so effective.