A mechanism that lets neural networks weigh relationships between all parts of an input simultaneously.
Self-attention is a neural network mechanism that computes relationships between every element in a sequence and every other element, allowing the model to dynamically determine how much focus to place on each part of the input when representing any given position. Rather than processing tokens in isolation or relying on fixed local context windows, self-attention produces a weighted combination of all input representations, where the weights reflect learned relevance scores between pairs of elements. This enables the model to capture both short- and long-range dependencies within a single, parallelizable operation.
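The core idea — a softmax-weighted combination of every position's representation — can be sketched in a few lines of NumPy. This is a deliberately stripped-down illustration: the token values are arbitrary, and the raw dot products stand in for the learned relevance scores a trained model would produce.

```python
import numpy as np

# Toy example: 3 token representations, each 4-dimensional (arbitrary values).
tokens = np.array([[1.0, 0.0, 0.5, 0.2],
                   [0.1, 1.0, 0.0, 0.3],
                   [0.4, 0.2, 1.0, 0.0]])

# Pairwise relevance scores between every position and every other position.
# Here plain dot products stand in for learned scores.
scores = tokens @ tokens.T                                   # shape (3, 3)

# Softmax turns each row into a probability distribution over positions.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Each position's new representation is a weighted sum over ALL positions.
output = weights @ tokens                                    # shape (3, 4)
```

Every row of `weights` sums to 1, so each output vector is a convex combination of the full input — the sense in which the model "focuses" on some positions more than others.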
The mechanism works by projecting each input token into three vectors — queries, keys, and values — using learned weight matrices. Attention scores are computed as the dot product between a query vector and every key vector, scaled by the square root of the key dimension to keep the magnitudes stable, then passed through a softmax to produce a probability distribution over positions. The resulting weights scale the corresponding value vectors, and their sum becomes the new representation for that position. In practice, multiple sets of query-key-value projections are run in parallel (multi-head attention), allowing the model to simultaneously attend to different types of relationships across different representation subspaces.
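The full query-key-value computation, including the √d_k scaling and a two-head variant, can be sketched as follows. The projection matrices here are random placeholders (a real model learns them), and the function and variable names (`self_attention`, `Wq`, `Wk`, `Wv`) are illustrative, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) input token representations.
    Wq, Wk, Wv: projection matrices; learned in a real model.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project into queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len), scaled
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))

# Multi-head attention: run independent projections in parallel
# and concatenate the per-head outputs.
heads = []
for _ in range(2):
    Wq = rng.normal(size=(d_model, d_k))
    Wk = rng.normal(size=(d_model, d_k))
    Wv = rng.normal(size=(d_model, d_k))
    heads.append(self_attention(X, Wq, Wk, Wv))
out = np.concatenate(heads, axis=-1)          # (seq_len, 2 * d_k)
```

Note that nothing in the loop over heads depends on sequence order, which is why the whole computation parallelizes so well compared with recurrent models.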
Self-attention became central to machine learning with the 2017 paper "Attention Is All You Need" by Vaswani et al., which introduced the Transformer architecture built entirely around this mechanism. Prior sequence models relied on recurrent or convolutional layers that struggled with long-range dependencies and were difficult to parallelize during training. Self-attention addressed both limitations, enabling dramatically faster training on modern hardware and superior modeling of distant contextual relationships.
The impact of self-attention on the field has been profound. It underpins virtually every major language model developed since 2017, including BERT, GPT, T5, and their successors, and has since been applied far beyond NLP to vision, audio, protein structure prediction, and multimodal learning. Its ability to flexibly route information across an entire context window — without the bottlenecks of sequential processing — makes it one of the most consequential architectural innovations in modern deep learning.