Attention mechanism that jointly attends to information from multiple representation subspaces at different positions.
Multi-head attention is a central component of the transformer architecture that extends the basic attention mechanism by running multiple attention operations in parallel. Rather than computing a single attention function over the full dimensionality of the input, the model linearly projects queries, keys, and values into several lower-dimensional subspaces — one per "head" — and performs scaled dot-product attention independently within each. The outputs from all heads are then concatenated and projected back to the original dimensionality, producing a rich composite representation. This design allows the model to simultaneously attend to information from different positions and capture diverse relational patterns that a single attention pass might miss.
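The following is a minimal sketch of this project-attend-concatenate-project structure, written here in PyTorch for illustration; the module name MultiHeadAttention and the parameter names d_model and num_heads are illustrative choices, and masking and dropout are omitted for brevity.

```python
# Minimal multi-head attention sketch (illustrative, not a reference implementation).
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads  # per-head dimensionality
        # Linear projections for queries, keys, values, and the final output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        # query/key/value: (batch, seq_len, d_model)
        batch, q_len, _ = query.shape
        k_len = key.shape[1]

        # Project and split into heads: (batch, num_heads, seq_len, d_head).
        q = self.w_q(query).view(batch, q_len, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(key).view(batch, k_len, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(value).view(batch, k_len, self.num_heads, self.d_head).transpose(1, 2)

        # Scaled dot-product attention independently within each head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        context = weights @ v  # (batch, num_heads, q_len, d_head)

        # Concatenate heads and project back to the model dimensionality.
        context = context.transpose(1, 2).contiguous().view(batch, q_len, -1)
        return self.w_o(context)


# Example: self-attention with 8 heads over a 512-dimensional model.
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)
out = mha(x, x, x)
print(out.shape)  # torch.Size([2, 10, 512])
```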
The motivation behind multiple heads is representational diversity. Different heads can specialize in different types of relationships: one head might track syntactic dependencies between words, another might capture semantic similarity, and a third might focus on positional proximity. This specialization emerges through training rather than explicit design, making multi-head attention a flexible and powerful architectural building block. Because the per-head dimensionality is reduced in proportion to the number of heads, the total computational cost stays close to that of single-head attention over the full dimensionality, so the mechanism adds expressive capacity without dramatically increasing compute.
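As a concrete illustration of the cost argument (the numbers are those used as the base configuration in the original transformer paper, given here only as an example): with d_model = 512 and h = 8 heads, each head operates on d_k = d_v = 512 / 8 = 64 dimensions, so the eight reduced-dimension heads together project and attend over the same 512 dimensions as a single full-width head.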
Multi-head attention was introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al., which proposed the transformer as a sequence-to-sequence model relying entirely on attention rather than recurrence or convolution. The mechanism proved highly effective for machine translation and quickly generalized to a vast range of tasks including text summarization, question answering, image recognition, and protein structure prediction. Models such as BERT, GPT, and Vision Transformer (ViT) all build directly on this foundation.
The broader significance of multi-head attention lies in its role in enabling large-scale pretraining. Because attention operations are highly parallelizable across sequence positions, transformers can be trained efficiently on modern hardware accelerators, making it practical to scale models to billions of parameters. This scalability, combined with the mechanism's ability to model long-range dependencies without the vanishing gradient problems of recurrent networks, has made multi-head attention one of the most consequential architectural innovations in modern deep learning.