Blocking certain input positions from attention to enforce valid information flow.
Masking is a technique used in neural network models—particularly transformers—to control which positions in a sequence are allowed to influence the computation at any given step. By setting certain attention scores to negative infinity before the softmax (so that the corresponding attention weights become zero after it), masking effectively prevents a model from "seeing" specific tokens during training or inference. This is essential for preserving the causal or structural constraints that govern how information should flow through a model.
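The mechanism can be sketched in a few lines. This is a minimal illustration using NumPy, with made-up scores and a hand-written mask; the helper name `masked_softmax` is ours, not a library function:

```python
import numpy as np

def masked_softmax(scores, mask):
    """Additive masking: blocked positions get -inf before the softmax,
    so their attention weights come out exactly zero after it.
    `mask` is True where attention is allowed."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.5])        # hypothetical attention scores
mask   = np.array([True, True, False])    # block the last key position
weights = masked_softmax(scores, mask)
# weights[2] is exactly 0.0; the remaining weights still sum to 1.
```

Because the mask is applied to the scores rather than the weights, the softmax renormalizes over the visible positions automatically.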
The most common form is causal masking (also called autoregressive masking), used in decoder-style models like GPT. Here, a triangular mask ensures that when predicting the token at position t, the model can only attend to positions 1 through t—never to future tokens. This mirrors the real-world constraint that a language model generating text cannot know what comes next before it has generated it. Without this mask, a model trained on next-token prediction could trivially cheat by attending to the answer directly.
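The triangular structure of a causal mask is easy to see concretely. Below is a sketch for a toy sequence of length 4 with uniform scores, so each position simply averages over its visible prefix; the variable names are illustrative, not from any particular framework:

```python
import numpy as np

T = 4  # toy sequence length
# Lower-triangular boolean matrix: row t is True for positions 0..t only,
# so position t can never attend to future tokens.
causal = np.tril(np.ones((T, T), dtype=bool))

scores = np.zeros((T, T))                  # uniform scores for illustration
masked = np.where(causal, scores, -np.inf) # -inf on future positions

# Row-wise softmax: each position attends uniformly over its visible prefix.
m = masked - masked.max(axis=-1, keepdims=True)
e = np.exp(m)
weights = e / e.sum(axis=-1, keepdims=True)
# Row 0 puts all weight on position 0; row 3 spreads 0.25 over all four.
```

In practice the same mask is added to the score matrix inside every attention head, so next-token prediction at every position is trained in parallel without leaking future tokens.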
A second major form is padding masking, which prevents the model from attending to padding tokens added to make variable-length sequences uniform within a batch. A third, more distinctive form is masked language modeling (MLM), introduced with BERT in 2018. In MLM, random tokens in the input are replaced with a special [MASK] token, and the model is trained to predict the original values. Unlike causal masking, MLM allows bidirectional attention over the unmasked context, enabling richer representations suited for understanding tasks rather than generation.
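The MLM corruption step can be sketched as follows. This assumes a toy vocabulary and a hypothetical `[MASK]` token id (103 is BERT's, but any reserved id works); the 15% corruption rate and the −100 ignore-label convention are common choices, not requirements:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([5, 17, 42, 8, 99, 3])  # hypothetical token ids
MASK_ID = 103                             # reserved [MASK] id (assumption)

# Select ~15% of positions to corrupt, as in BERT-style MLM pretraining.
positions = rng.random(tokens.shape) < 0.15

# Replace selected tokens with [MASK]; the model must recover the originals.
inputs = np.where(positions, MASK_ID, tokens)

# Loss is computed only at masked positions; -100 marks "ignore" elsewhere.
labels = np.where(positions, tokens, -100)
```

Note that unlike a causal mask, nothing here restricts the attention pattern: the model attends bidirectionally over the corrupted sequence, and the masking lives in the input and the loss instead.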
Masking matters because it is the mechanism that enforces the learning objective itself. Without causal masks, autoregressive models cannot learn proper sequential dependencies; without MLM masks, BERT-style pretraining loses its self-supervised signal. As transformers have become the dominant architecture across NLP, vision, and multimodal AI, masking strategies have grown more sophisticated—spanning sparse attention masks, structured dropout masks, and task-specific attention patterns—making masking one of the most practically consequential design choices in modern deep learning.