An attention mechanism that selectively computes on a subset of tokens.
Sparse attention is an efficient attention mechanism that selectively computes relevance scores for only a subset of tokens rather than the full sequence, dramatically reducing the quadratic computational cost of standard attention.
Unlike full attention, which computes dot products between every query and every key-value pair, sparse attention introduces a pre-filtering stage that determines which token pairs merit computation. Common strategies include local window attention (attending only to nearby tokens), global attention tokens that communicate across the full sequence, and blockwise sparse patterns that partition the attention matrix. This reduces complexity from O(n²) to approximately O(n√n) or O(n log n), depending on the sparsity pattern.
The tradeoffs center on the tension between efficiency and expressiveness. Sparse patterns may miss long-range dependencies that full attention would capture, potentially degrading performance on tasks requiring global context. However, when designed carefully — as in Mistral's sliding window attention or MiniMax's MSA — sparse mechanisms can match full attention on most benchmarks while dramatically improving throughput and memory efficiency for long contexts.
Open questions remain about optimal sparsity patterns. Whether fixed local windows, learned sparse masks, or content-dependent selection yields the best results depends on the task. There is also active research into combining sparse attention with other efficiency techniques like flash attention and quantization, and into theoretical understanding of when sparse approximations are sufficient versus when full attention is truly necessary.