A neural network that dynamically weights input elements to capture relevant context.
An attention network is a neural network architecture that learns to selectively focus on different parts of its input when producing each element of its output. Rather than compressing all input information into a single fixed-length representation, attention mechanisms compute a weighted distribution over input positions, allowing the model to dynamically retrieve the most relevant information at each step. These weights are typically derived by measuring the compatibility between a query vector and a set of key vectors (the attention scores), normalizing the scores into a distribution, usually with a softmax, and using that distribution to form a weighted sum of value vectors. This query-key-value formulation generalizes across text, images, audio, and multimodal data.
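As a concrete illustration, the sketch below implements scaled dot-product attention in NumPy under the standard query/key/value formulation; the array shapes and toy inputs are illustrative assumptions rather than details of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v) -> (n_queries, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compatibility of each query with each key
    weights = softmax(scores, axis=-1)  # one distribution over input positions per query
    return weights @ V                  # weighted sum of value vectors

# Toy usage: 3 queries attending over 5 input positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))
out = attention(Q, K, V)  # shape (3, 16)
```

Each row of `weights` sums to one, which is what lets the model softly retrieve the most relevant value vectors for that query rather than making a hard selection.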
The most influential variant is self-attention, in which a sequence attends to itself to capture internal dependencies regardless of positional distance. This is the core operation in the Transformer architecture, where multiple attention heads run in parallel, each learning to track different types of relationships within the data. Unlike recurrent networks, which process sequences step-by-step and struggle with long-range dependencies, self-attention operates over the entire sequence simultaneously, making it both more expressive and more parallelizable on modern hardware.
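A minimal sketch of multi-head self-attention follows, assuming standard Transformer-style projections; the weight matrices are random placeholders standing in for learned parameters, and the head count and dimensions are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model). The sequence attends to itself with n_heads parallel heads."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # project the sequence into queries/keys/values
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)      # each head works on its own slice of the projections
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, sl])      # every position attends over the whole sequence
    return np.concatenate(heads, axis=-1) @ W_o       # recombine heads with an output projection

# Toy usage: a sequence of 6 tokens with d_model = 16 and 4 heads.
rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)  # shape (6, 16)
```

Because each head's scores come from a single matrix product over the entire sequence, no position is processed step by step, which is the source of the parallelism contrasted with recurrent networks above.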
Attention networks became central to machine learning following Bahdanau et al.'s 2014 work on neural machine translation, which showed that allowing a decoder to attend over encoder states dramatically improved translation quality on long sentences. The 2017 Transformer paper by Vaswani et al. then demonstrated that attention alone — without any recurrence or convolution — could achieve state-of-the-art results, triggering a paradigm shift across NLP and beyond. Models like BERT, GPT, and Vision Transformers (ViT) are all built on this foundation.
The practical impact of attention networks is difficult to overstate. They have enabled breakthroughs in machine translation, text generation, question answering, protein structure prediction, and image recognition. Their ability to model arbitrary pairwise relationships within and across sequences makes them a highly general tool, and ongoing research continues to address their main limitations — particularly the quadratic computational cost of full attention over long sequences — through sparse, linear, and hierarchical attention variants.
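To make the quadratic cost concrete, the snippet below simply counts attention-score entries for a few hypothetical sequence lengths; the lengths are arbitrary examples, not figures from the literature.

```python
# Full attention scores every position against every other position,
# so the score matrix alone has n * n entries for a length-n sequence.
for n in (1_000, 4_000, 16_000):  # hypothetical sequence lengths
    print(f"n = {n:>6}: {n * n:>15,} score entries")
# Quadrupling the sequence length multiplies the score count by sixteen,
# which is the growth that sparse, linear, and hierarchical variants aim to avoid.
```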