A mechanism enabling neural networks to dynamically focus on relevant parts of input.
Attention is a neural network mechanism that allows models to selectively weight different parts of their input when producing an output, rather than treating all input elements equally. Instead of compressing an entire input sequence into a single fixed-length representation, attention computes a set of scores indicating how relevant each input element is to a given output step. These scores are normalized into a probability distribution and used to form a weighted sum of input representations, effectively directing the model's focus toward the most contextually important information at each step.
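The score-then-weighted-sum idea above can be sketched in a few lines of NumPy. The scores and input vectors here are arbitrary toy values chosen for illustration, not from any particular model:

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical relevance scores for three input elements.
scores = np.array([2.0, 0.5, -1.0])
weights = softmax(scores)      # normalized into a probability distribution

# One 4-dimensional representation per input element (toy values).
inputs = np.arange(12, dtype=float).reshape(3, 4)
context = weights @ inputs     # weighted sum: the model's attended output
```

The element with the highest score (here the first) dominates the weighted sum, which is exactly how attention directs focus toward the most relevant input.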
The mechanics of attention typically involve three components: queries, keys, and values. A query represents what the model is currently trying to compute; keys represent the available input elements; and values carry the actual content to be aggregated. Compatibility between a query and each key is measured, often via a dot product scaled by the square root of the key dimension to keep the scores in a stable range, and the results are passed through a softmax function to produce attention weights. The final output is a weighted combination of the values, emphasizing those most relevant to the query. This formulation, known as scaled dot-product attention, underpins the Transformer architecture introduced in 2017.
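The full query/key/value formulation can be written directly from that description. This is a minimal NumPy sketch of scaled dot-product attention; the matrix sizes (2 queries, 3 key/value pairs, dimension 4) are arbitrary choices for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # query-key compatibility, scaled
    scores -= scores.max(axis=-1, keepdims=True)  # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights              # weighted combination of values

# Toy example: 2 queries attending over 3 key/value pairs, d_k = d_v = 4.
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
# Each row of `weights` is a distribution over the 3 inputs;
# `output` has one attended vector per query.
```

Each output row is a convex combination of the value vectors, with the mixing proportions determined by how well the corresponding query matches each key.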
Attention became central to machine learning with Bahdanau et al.'s 2014 work on neural machine translation, where it solved the bottleneck problem in encoder-decoder models by allowing the decoder to look back at all encoder states rather than relying on a single compressed vector. This dramatically improved translation quality on long sentences. The 2017 Transformer paper then demonstrated that attention alone—without recurrence or convolution—could achieve state-of-the-art results, making attention the dominant paradigm in sequence modeling.
The impact of attention on modern AI is difficult to overstate. It is the foundational operation in large language models such as GPT and BERT, and has been extended to computer vision, multimodal learning, and reinforcement learning. Multi-head attention, which runs several attention operations in parallel and concatenates their outputs, allows models to simultaneously capture different types of relationships within data. Attention mechanisms have enabled models to scale to unprecedented sizes while maintaining the ability to capture long-range dependencies that earlier architectures struggled to represent.
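The multi-head variant described above can be sketched by splitting the projected queries, keys, and values into per-head slices, attending within each slice, and concatenating the results. The shapes here (5 tokens, model width 8, 2 heads) and the random projection matrices are assumptions for illustration only:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Run num_heads attention operations in parallel and merge them."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):          # each head sees its own slice
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    # Concatenate the heads' outputs and apply the output projection.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy setup: a sequence of 5 tokens, model width 8, 2 heads.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))
W_q, W_k, W_v, W_o = (rng.standard_normal((8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)  # shape (5, 8)
```

Because each head operates on its own projection of the input, the heads can learn to track different relationships (e.g. syntactic vs. positional), which is the property the paragraph above highlights.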