A neural network module that selectively weighs input elements by their contextual relevance.
An attention block is a modular component in neural networks that enables a model to dynamically focus on the most relevant parts of an input sequence when generating each element of its output. Rather than treating all input positions equally, the block learns to assign varying weights to different positions based on their contextual importance — a capability that proves especially powerful when modeling long-range dependencies that simpler architectures struggle to capture.
The mechanism works by computing three learned projections from the input: queries, keys, and values. The query represents what the current output position is "looking for," while keys represent what each input position "offers." A compatibility score is computed between each query and all keys, typically a dot product scaled by the square root of the key dimension to keep the softmax well-conditioned, and the scores are passed through a softmax to produce a probability distribution of attention weights. These weights are then used to form a weighted sum of the value vectors, producing a context-aware representation. In practice, modern architectures use multi-head attention, running several attention operations in parallel across different learned subspaces and concatenating the results, which lets the model attend to information from multiple representational perspectives simultaneously.
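A minimal sketch of this pipeline in PyTorch may make the steps concrete. It assumes plain self-attention on a single input tensor and omits masking and dropout; the names `scaled_dot_product_attention` and `MultiHeadAttention`, and the example sizes, are illustrative rather than taken from any particular library.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # Compatibility scores between every query and every key,
    # scaled by sqrt(d_k) so the softmax stays well-conditioned.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # attention distribution per query
    return weights @ v                        # weighted sum of value vectors

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention: project, split into heads,
    attend in parallel, concatenate, and project back."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # Learned projections, reshaped to (batch, heads, seq_len, d_head).
        q = self.q_proj(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        out = scaled_dot_product_attention(q, k, v)
        # Concatenate the heads and mix them with a final linear layer.
        out = out.transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(out)

# Example: a batch of 2 sequences of 5 tokens, model width 64, 8 heads.
attn = MultiHeadAttention(d_model=64, num_heads=8)
y = attn(torch.randn(2, 5, 64))
print(y.shape)  # torch.Size([2, 5, 64])
```

Each head sees a lower-dimensional slice of the model width, so the total cost stays comparable to a single full-width attention operation while the heads learn to specialize.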
Attention blocks became foundational to the transformer architecture introduced by Vaswani et al. in 2017, which replaced recurrent and convolutional layers entirely with stacked attention and feed-forward blocks. This shift unlocked massive parallelism during training, since attention over a sequence can be computed in a single matrix operation rather than step-by-step. The result was dramatically faster training and superior performance on sequence modeling tasks, catalyzing breakthroughs in natural language processing and, later, computer vision and multimodal learning.
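The parallelism argument can be seen in a toy comparison, assuming a deliberately simplified recurrent update and using the input itself as queries, keys, and values; the tensors and sizes below are illustrative only.

```python
import torch

seq_len, d = 6, 16
x = torch.randn(seq_len, d)
w_h = torch.randn(d, d) * 0.1

# Recurrent-style pass: each position waits for the previous hidden state,
# so training cannot be parallelized over the sequence dimension.
h = torch.zeros(d)
hidden_states = []
for x_t in x:                                  # sequential loop over time steps
    h = torch.tanh(x_t + h @ w_h)
    hidden_states.append(h)

# Attention-style pass: every pairwise score comes from one matrix product,
# so all positions are processed at once.
scores = (x @ x.T) / d ** 0.5                  # (seq_len, seq_len) in one matmul
context = torch.softmax(scores, dim=-1) @ x    # all positions updated together
```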
Today, attention blocks are among the most widely deployed components in deep learning. They form the backbone of large language models like GPT and BERT, vision transformers, and cross-modal architectures. Their ability to flexibly route information across arbitrary positions in a sequence — without inductive biases like locality or fixed receptive fields — makes them highly general and adaptable across domains, from protein structure prediction to code generation.