Neural network components that dynamically weight input elements by their contextual relevance.
Attention mechanisms are components in neural networks that allow a model to selectively focus on different parts of its input when producing each element of its output. Rather than compressing an entire input sequence into a single fixed-length representation, attention computes a weighted sum over input elements, where the weights reflect how relevant each element is to the current computation step. These weights are learned dynamically based on the relationship between the current query — what the model is trying to produce — and the available keys and values derived from the input. This soft, differentiable selection process allows gradients to flow cleanly during training, making attention both powerful and practical.
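The soft selection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `attend` and the toy key/value numbers are hypothetical, and a real model would learn the keys and values via trained projections rather than take them as given.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys, values):
    """Soft selection: weight each value by its key's relevance to the query."""
    scores = keys @ query        # one raw relevance score per input element
    weights = softmax(scores)    # differentiable distribution over inputs
    return weights @ values      # weighted sum of the value vectors

# Toy example (illustrative numbers only):
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([1.0, 0.0])
out = attend(query, keys, values)  # leans toward values whose keys match the query
```

Because every step (dot product, softmax, weighted sum) is differentiable, gradients flow through the selection itself, which is what makes this "soft" focus trainable end to end.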
In practice, attention is most commonly implemented as scaled dot-product attention, where query, key, and value matrices are derived from the input via learned linear projections. The dot product between a query and each key produces a raw relevance score, which is scaled and passed through a softmax to yield a probability distribution over inputs. The output is then a weighted combination of the value vectors. Multi-head attention extends this by running several attention operations in parallel across different learned subspaces, allowing the model to simultaneously capture multiple types of relationships within the data.
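The two operations above can be sketched together in NumPy: scaled dot-product attention as the core, and a multi-head wrapper that splits the model dimension into parallel subspaces. The dimensions, seed, and random projection matrices below are hypothetical stand-ins for what a trained model would learn; this is a sketch of the computation, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Scale scores by sqrt(d_k) so the softmax does not saturate as dimensions grow.
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Split d_model into heads, attend in each subspace, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def project(W):
        # Project, then reshape to (num_heads, seq_len, d_head).
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)
    heads = scaled_dot_product_attention(Q, K, V)      # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                 # final output projection

# Toy dimensions and random weights (a real model learns these projections):
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)  # shape (4, 8)
```

Note that each head operates on a d_model/num_heads slice, so the total computation stays comparable to single-head attention while letting different heads specialize in different relational patterns.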
Attention mechanisms became central to machine learning after Bahdanau et al. introduced them in 2014 to address the bottleneck of fixed-length encodings in neural machine translation. The approach allowed translation models to align source and target words dynamically, dramatically improving performance on long sentences. The concept reached its full expression in the 2017 Transformer architecture, which dispensed with recurrence entirely and built deep networks purely from attention and feed-forward layers. This shift unlocked massive parallelism during training and enabled scaling to unprecedented model sizes.
The impact of attention mechanisms extends well beyond NLP. Vision Transformers apply attention directly to image patches, and multimodal models use cross-attention to align representations across text, images, and audio. Attention weights also offer a degree of interpretability, revealing which inputs a model emphasizes for a given prediction. Today, attention is arguably the single most important architectural primitive in modern deep learning.