Envisioning is an emerging technology research institute and advisory.

Sparse Attention

An attention mechanism that computes scores for only a subset of token pairs.

Year: 2023 | Generality: 750 | Added: Jun 2, 2026

Sparse attention is an efficient attention mechanism that computes relevance scores for only a subset of token pairs rather than for every pair in the sequence, avoiding much of the quadratic computational cost of standard attention.

Unlike full attention, which computes a dot product between every query and every key, sparse attention restricts each query to a subset of keys, chosen either by a fixed pattern or by a selection step that decides which token pairs merit computation. Common strategies include local window attention (attending only to nearby tokens), global attention tokens that communicate across the full sequence, and blockwise sparse patterns that partition the attention matrix. Depending on the sparsity pattern, this reduces complexity from O(n²) to roughly O(n·w) for a fixed window of w tokens, or to approximately O(n√n) or O(n log n) for other patterns.
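The three strategies named above can be sketched as boolean masks over the n × n attention matrix. This is a minimal illustration in plain Python, not any particular model's implementation; the function names, the symmetric (non-causal) window, and the choice of token 0 as a global token are all assumptions made for the example.

```python
import math

def local_window_mask(n, window):
    """mask[i][j] is True when query i may attend to key j, i.e. |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def global_token_mask(n, global_idx):
    """Global tokens attend to every position, and every position attends to them."""
    g = set(global_idx)
    return [[(i in g) or (j in g) for j in range(n)] for i in range(n)]

def block_mask(n, block):
    """Tokens attend only to tokens inside their own fixed-size block."""
    return [[i // block == j // block for j in range(n)] for i in range(n)]

def union(a, b):
    """Combine two patterns: a pair is allowed if either mask allows it."""
    return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def count_pairs(mask):
    """Number of query-key pairs that would actually be scored."""
    return sum(sum(row) for row in mask)

def sparse_attention(q, k, v, mask):
    """Scaled dot-product attention that scores only the pairs the mask allows.

    Assumes every row of the mask allows at least one key.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in allowed]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for w, j in zip(weights, allowed)) / total
                    for c in range(len(v[0]))])
    return out

n = 16
local = local_window_mask(n, window=2)
combined = union(local, global_token_mask(n, [0]))
print(count_pairs(local), count_pairs(combined), n * n)
```

For n = 16, the local mask scores 74 of the 256 possible pairs, and adding one global token raises that to 100. The gap widens as n grows, because the local pattern scales with n·w rather than n².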

The tradeoffs center on the tension between efficiency and expressiveness. Sparse patterns may miss long-range dependencies that full attention would capture, potentially degrading performance on tasks requiring global context. However, when designed carefully, as in Mistral's sliding window attention or MiniMax's MSA, sparse mechanisms can match full attention on most benchmarks while substantially improving throughput and memory efficiency for long contexts.
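One way to see the long-range tradeoff is to track which input positions can influence a token as sliding-window layers are stacked. The sketch below assumes a causal window (each token sees itself and the previous window - 1 tokens) and uses small, hypothetical sizes. It illustrates the general idea only and is not taken from any named model's code.

```python
def causal_window_mask(n, window):
    """Query i may attend to key j when j is one of the last `window` positions up to i."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def reach_after_layers(mask, layers):
    """reach[i] is the set of input positions that can influence position i
    after `layers` stacked attention layers that all share the same mask."""
    n = len(mask)
    reach = [{i} for i in range(n)]
    for _ in range(layers):
        # Position i inherits everything reachable by the positions it attends to.
        reach = [set().union(*(reach[j] for j in range(n) if mask[i][j]))
                 for i in range(n)]
    return reach

n, window = 32, 4
mask = causal_window_mask(n, window)
for layers in (1, 3, 11):
    earliest = min(reach_after_layers(mask, layers)[n - 1])
    print(layers, "layers: last token can be influenced back to position", earliest)
```

With a window of 4, the last of 32 tokens sees back only to position 28 in one layer, to position 22 after three layers, and to the start of the sequence after eleven. Distant context is therefore reachable only indirectly and only with enough depth, which is why a single sparse layer can miss dependencies that full attention captures directly.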

Open questions remain about optimal sparsity patterns: whether fixed local windows, learned sparse masks, or content-dependent selection yields the best results depends on the task. There is also active research into combining sparse attention with other efficiency techniques such as FlashAttention and quantization, and into the theoretical understanding of when sparse approximations are sufficient and when full attention is truly necessary.
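To make content-dependent selection concrete, here is a minimal top-k sketch in which each query keeps only its highest-scoring keys. The function name and the top-k rule are assumptions chosen for illustration. Note that this naive version still scores every key before selecting, so it shows the selection behavior rather than the cost savings; a practical system would need a cheaper way to pick candidates.

```python
import math

def topk_attention(q, k, v, top_k):
    """Each query attends only to its `top_k` highest-scoring keys."""
    d = len(q[0])
    out = []
    for qi in q:
        # Naive: score all keys, then keep the best top_k (ties keep earlier keys).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        keep = sorted(range(len(k)), key=lambda j: scores[j], reverse=True)[:top_k]
        m = max(scores[j] for j in keep)  # stabilize the softmax
        weights = [math.exp(scores[j] - m) for j in keep]
        total = sum(weights)
        out.append([sum(w * v[j][c] for w, j in zip(weights, keep)) / total
                    for c in range(len(v[0]))])
    return out

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(topk_attention([[1.0, 0.0]], keys, values, top_k=1))
```

With top_k=1 the query selects only the key it aligns with and returns that key's value unchanged. Setting top_k to the number of keys recovers ordinary full attention.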