Technique that selectively hides input tokens to control what a model attends to.
Sequence masking is a technique used in machine learning to selectively prevent certain positions in an input sequence from influencing a model's computations. It works by constructing a binary or boolean mask, a tensor whose batch and sequence dimensions match the input's, in which designated positions are marked to be ignored or zeroed out during forward passes. This lets a model distinguish meaningful tokens from irrelevant ones, such as the padding added to make variable-length sequences uniform for batch processing.
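As a minimal sketch of mask construction, assuming PyTorch (the pad token id and the toy batch are illustrative values, not from any particular tokenizer), positions equal to the pad id are marked False so downstream computations can skip them:

```python
import torch

pad_id = 0  # assumed padding token id
# Batch of two sequences; the second is padded to length 5 with pad_id.
batch = torch.tensor([
    [7, 4, 9, 2, 5],
    [3, 8, 6, 0, 0],
])

# Boolean mask: True for real tokens, False for padding.
mask = batch != pad_id
print(mask)
# tensor([[ True,  True,  True,  True,  True],
#         [ True,  True,  True, False, False]])
```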
The most common application is padding masking: shorter sequences in a batch are padded to the length of the longest sequence, and a mask is applied so the model does not treat the padding tokens as real input. Without it, attention mechanisms and loss functions would incorporate padding into their calculations, degrading model performance. A second critical form is causal masking (also called look-ahead masking), used in autoregressive models such as GPT. Here the mask is a lower-triangular matrix of ones that prevents each position from attending to future tokens, enforcing the constraint that the prediction at step t depends only on tokens at positions 1 through t.
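Both masks are cheap to build. The sketch below, again assuming PyTorch, constructs a causal mask with torch.tril; the row for step t keeps positions 1 through t and blocks everything after:

```python
import torch

seq_len = 5
# Lower-triangular matrix of ones: row t may attend to columns 1..t only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```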
Sequence masking became especially prominent with the rise of the Transformer architecture, introduced in the 2017 paper Attention Is All You Need by Vaswani et al. In self-attention, the mask is applied to the raw attention logits before the softmax operation: masked positions are set to a large negative value (such as −10⁹, or −∞), driving their post-softmax weight to zero. This elegant integration made masking a first-class citizen in modern NLP pipelines, enabling both efficient batched training and correct autoregressive decoding.
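A sketch of that mechanism, assuming PyTorch and single-head scaled dot-product attention; the names q, k, v and the tensor sizes are illustrative assumptions, not from the source:

```python
import math
import torch
import torch.nn.functional as F

batch, seq_len, d_k = 2, 5, 16
q = torch.randn(batch, seq_len, d_k)  # queries
k = torch.randn(batch, seq_len, d_k)  # keys
v = torch.randn(batch, seq_len, d_k)  # values

# Causal mask, broadcast over the batch dimension.
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # raw attention logits
scores = scores.masked_fill(~mask, float("-inf"))  # blocked entries -> -inf
weights = F.softmax(scores, dim=-1)                # blocked weights become 0
output = weights @ v
```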
Beyond padding and causal masking, the concept extends to span masking in pretraining objectives such as BERT's masked language modeling and T5's span corruption, where individual tokens or contiguous spans are hidden to force the model to learn contextual representations. Sequence masking is therefore not a single technique but a family of related strategies that shape what information flows through a model at training and inference time, making it foundational to virtually all modern sequence modeling architectures.
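For illustration, a simplified span-masking sketch in PyTorch; the mask token id, the span position, and the -100 ignore-index convention are assumptions loosely following common BERT-style setups, not the exact corruption procedure of either paper:

```python
import torch

mask_id = 103  # assumed [MASK] token id; depends on the tokenizer
tokens = torch.tensor([12, 45, 78, 33, 91, 27, 64, 50])

start, span_len = 2, 3  # hide a contiguous span of 3 tokens
corrupted = tokens.clone()
corrupted[start:start + span_len] = mask_id

# Labels: only the hidden span is predicted; -100 is the index
# PyTorch's cross-entropy loss ignores by default.
labels = torch.full_like(tokens, -100)
labels[start:start + span_len] = tokens[start:start + span_len]
# A model would be trained to reconstruct `labels` from `corrupted`.
```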