A technique that controls which positions a transformer's attention mechanism can access.
Attention masking is a mechanism used in transformer-based models to selectively restrict which positions in a sequence a given token is allowed to attend to. In standard self-attention, every token in a sequence can interact with every other token, producing a full matrix of attention scores. Masking sets selected entries of this score matrix to negative infinity before the softmax step, so the corresponding attention weights become exactly zero and no information flows along those paths. This gives practitioners precise control over the information structure the model sees during both training and inference.
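To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention with masking, written in PyTorch. The function name `masked_attention` and the tensor shapes are illustrative assumptions, not part of any particular library or paper; the essential step is replacing blocked score entries with negative infinity before the softmax.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    """Scaled dot-product attention with an optional boolean mask.

    q, k, v: (batch, seq_len, d_k). mask: broadcastable to
    (batch, seq_len, seq_len), True where attention is allowed.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # raw attention logits
    if mask is not None:
        # Blocked positions get -inf, so softmax assigns them zero weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # each row sums to 1 over allowed keys
    return weights @ v
```

After the softmax, the masked positions contribute nothing to the weighted sum over values, which is exactly the "zeroed-out attention weights" described above.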
Two primary forms of attention masking are used in practice. Padding masks handle variable-length sequences within a batch by marking the positions occupied by padding tokens (placeholder values appended so that all sequences in the batch share the same length), so the model does not treat them as meaningful content. Causal masks (also called autoregressive or look-ahead masks) enforce a strict left-to-right ordering by preventing each position from attending to any later position. This is essential for language modeling and text generation, where the prediction of the next token must depend only on tokens already observed, never on tokens that have not yet been generated. The two forms can also be combined, as the sketch below illustrates.
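Both mask types can be expressed as boolean tensors and merged with a logical AND before being passed to the attention computation. The sketch below, again in PyTorch, builds each kind; the helper names `padding_mask` and `causal_mask` are hypothetical, chosen for clarity.

```python
import torch

def padding_mask(lengths, max_len):
    # lengths: (batch,) true sequence lengths. Returns (batch, 1, max_len),
    # True over real tokens and False over padding in the key dimension.
    positions = torch.arange(max_len)
    return (positions[None, :] < lengths[:, None]).unsqueeze(1)

def causal_mask(seq_len):
    # Lower-triangular (seq_len, seq_len): position i may attend only to j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Example: a batch of two sequences with true lengths 3 and 5, padded to 5.
pad = padding_mask(torch.tensor([3, 5]), max_len=5)  # (2, 1, 5)
causal = causal_mask(5)                              # (5, 5)
combined = pad & causal                              # broadcasts to (2, 5, 5)
```

The resulting `combined` tensor could be passed as the `mask` argument of a function like `masked_attention` above, which is the typical arrangement in a decoder that must respect both batching and autoregressive order.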
Attention masking became central to NLP with the 2017 "Attention Is All You Need" paper, which introduced the Transformer architecture. The decoder in that model uses causal masking to preserve autoregressive generation, while the encoder uses padding masks to handle batched inputs cleanly. Subsequent architectures refined these ideas: BERT-style models use bidirectional attention with padding masks and introduce special masked-language-modeling objectives, while GPT-style models rely heavily on causal masking to enable coherent text generation at scale.
Beyond NLP, attention masking has proven valuable wherever structured constraints on information flow are needed: in vision transformers, graph-based models, and multimodal systems. Its importance stems from the fact that it is not merely a technical workaround for batching or causality; it is a principled way to encode inductive biases directly into the attention computation, shaping what a model can and cannot learn from its context.