
Envisioning is an emerging technology research institute and advisory.


Attention Masking

A technique that controls which positions a transformer's attention mechanism can access.

Year: 2017 · Generality: 694

Attention masking is a mechanism used in transformer-based models to selectively restrict which positions in a sequence a given token is allowed to attend to. In the standard self-attention operation, every token in a sequence can interact with every other token, producing a full attention matrix. Masking modifies this matrix by setting certain entries to negative infinity before the softmax step, effectively zeroing out those attention weights and preventing information from flowing along specific paths. This gives practitioners precise control over the information structure the model sees during both training and inference.
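The masking step described above can be sketched in a few lines of NumPy. This is an illustrative toy, not any particular framework's implementation: `masked_softmax` and the uniform `scores` matrix are hypothetical names chosen for the example, and the convention assumed here is that `True` in the mask means "may attend".

```python
import numpy as np

def masked_softmax(scores, mask):
    """Apply an attention mask before the softmax.

    scores: (seq_len, seq_len) raw attention scores.
    mask:   (seq_len, seq_len) boolean; True = position may be attended to.
    """
    # Disallowed positions are set to -inf so softmax gives them zero weight.
    masked = np.where(mask, scores, -np.inf)
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                       # uniform scores, for illustration
causal = np.tril(np.ones((4, 4), dtype=bool))   # lower-triangular causal mask
weights = masked_softmax(scores, causal)
print(weights[1])  # token 1 attends only to positions 0 and 1, equally
```

Because the masked entries become exactly zero after the softmax, no information flows along those paths, yet every row still sums to one over the allowed positions.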

Two primary forms of attention masking are used in practice. Padding masks handle variable-length sequences within a batch by marking positions occupied by padding tokens — placeholder values added to make sequences equal length — so the model does not treat them as meaningful content. Causal masks (also called autoregressive or look-ahead masks) enforce a strict left-to-right ordering by preventing each position from attending to any future position. This is essential for language modeling and text generation tasks, where predicting the next token must rely only on previously observed tokens, not on tokens that haven't been generated yet.
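The two mask types can be constructed as boolean matrices before being combined with the attention scores. A minimal NumPy sketch, assuming `True` means "may attend" and a hypothetical padding token id of 0:

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular: each position sees itself and earlier positions only.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def padding_mask(token_ids, pad_id=0):
    # True where tokens are real content; broadcast to (seq_len, seq_len)
    # so that no token attends to a padding position.
    keep = token_ids != pad_id
    return keep[:, None] & keep[None, :]

ids = np.array([7, 3, 9, 0, 0])   # hypothetical batch row padded to length 5
pad = padding_mask(ids)           # only the 3x3 block of real tokens is True
combined = pad & causal_mask(5)   # decoders apply both masks at once
```

In practice the two masks are simply AND-ed together, as in the last line: a decoder position must be both a real (non-padding) token and not in the future to receive attention.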

Attention masking became central to NLP with the 2017 "Attention Is All You Need" paper, which introduced the Transformer architecture. The decoder in that model uses causal masking to preserve autoregressive generation, while the encoder uses padding masks to handle batched inputs cleanly. Subsequent architectures refined these ideas: BERT-style models use bidirectional attention with padding masks and introduce special masked-language-modeling objectives, while GPT-style models rely heavily on causal masking to enable coherent text generation at scale.

Beyond NLP, attention masking has proven valuable wherever structured constraints on information flow are needed — in vision transformers, graph-based models, and multimodal systems. Its importance lies in the fact that it is not merely a technical workaround for batching or causality; it is a principled way to encode inductive biases directly into the attention computation, shaping what a model can and cannot learn from its context.

Related

Masking

Blocking certain input positions from attention to enforce valid information flow.

Generality: 694
Sequence Masking

Technique that selectively hides input tokens to control what a model attends to.

Generality: 628
Similarity Masking

Suppressing redundant or overly similar features to sharpen model focus on distinct information.

Generality: 293
Self-Attention

A mechanism that lets neural networks weigh relationships between all parts of an input simultaneously.

Generality: 794
Attention Block

A neural network module that selectively weighs input elements by their contextual relevance.

Generality: 752
Attention Mechanisms

Neural network components that dynamically weight input elements by their contextual relevance.

Generality: 865