Attention mechanism that integrates information across two distinct input sequences.
Cross-attention is a neural network mechanism that enables a model to selectively draw information from one sequence while processing another. Unlike self-attention, where a sequence attends to itself, cross-attention computes relationships between two separate inputs. Queries are derived from one source — typically a decoder or target representation — while keys and values come from a different source, such as an encoder output. The resulting attention scores determine how much each element of the query sequence should attend to each element of the key-value sequence, producing a context-aware representation that blends information across the two streams.
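To make the query/key/value split concrete, the following is a minimal single-head sketch in NumPy. The projection matrices, sequence lengths, and dimensions are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def cross_attention(query_src, kv_src, W_q, W_k, W_v):
    """Single-head cross-attention: queries come from one sequence,
    keys and values come from a different sequence."""
    Q = query_src @ W_q                         # (len_q, d_k)
    K = kv_src @ W_k                            # (len_kv, d_k)
    V = kv_src @ W_v                            # (len_kv, d_v)

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (len_q, len_kv) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                          # context blended from kv_src

# Hypothetical shapes: a 4-token query sequence attending over a 6-token key/value sequence
rng = np.random.default_rng(0)
d_model = 8
decoder_states = rng.normal(size=(4, d_model))
encoder_output = rng.normal(size=(6, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(out.shape)   # (4, 8): one context vector per query position
```

Each row of the output mixes the value vectors of the key-value sequence, weighted by how strongly the corresponding query position attends to each key position.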
The mechanism became central to machine learning with the Transformer architecture introduced by Vaswani et al. in 2017. In sequence-to-sequence tasks like machine translation, cross-attention allows the decoder to dynamically consult the encoder's full representation at every generation step, rather than relying on a fixed compressed context vector. This dramatically improved the model's ability to handle long-range dependencies and align output tokens with relevant portions of the input — for example, matching translated words to their source-language counterparts.
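As a sketch of how a decoder layer consults the encoder at generation time, the snippet below uses PyTorch's nn.MultiheadAttention with queries taken from the decoder and keys and values taken from the encoder output. The tensor sizes are made up for illustration and do not correspond to a specific model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: d_model=512, 8 heads, a 10-token source, a 7-token partial target.
d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(1, 10, d_model)   # full encoder representation, fixed during decoding
decoder_states = torch.randn(1, 7, d_model)    # decoder states for the tokens generated so far

# Queries come from the decoder; keys and values come from the encoder output,
# so every generation step can consult the entire source sequence.
context, attn_weights = cross_attn(query=decoder_states,
                                   key=encoder_output,
                                   value=encoder_output)
print(context.shape)       # torch.Size([1, 7, 512])
print(attn_weights.shape)  # torch.Size([1, 7, 10]): target-to-source alignment weights
```

The returned attention weights give a soft alignment between output positions and input positions, which is the mechanism behind matching translated words to their source-language counterparts.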
Beyond language tasks, cross-attention has proven essential in multimodal architectures that must bridge different data modalities. In image captioning, a text decoder can attend over spatial visual features to ground each generated word in a relevant image region. In models like DALL·E and Stable Diffusion, cross-attention layers allow image generation to be conditioned on text embeddings, enabling fine-grained semantic control over visual outputs. Perceiver-style architectures use cross-attention as a general interface for ingesting high-dimensional inputs of arbitrary structure.
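The same pattern underlies text conditioning in diffusion-style image generators: flattened image latents supply the queries, while text-token embeddings supply the keys and values. The block below mirrors that layout in simplified form; the layer names, projection sizes, and shapes are assumptions for illustration, not the actual implementation of any named model.

```python
import torch
import torch.nn as nn

class TextConditionedCrossAttention(nn.Module):
    """Simplified conditioning block: image latents attend over text embeddings."""
    def __init__(self, latent_dim, text_dim, attn_dim, num_heads=8):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, attn_dim)   # queries from image latents
        self.to_k = nn.Linear(text_dim, attn_dim)     # keys from text embeddings
        self.to_v = nn.Linear(text_dim, attn_dim)     # values from text embeddings
        self.attn = nn.MultiheadAttention(attn_dim, num_heads, batch_first=True)
        self.to_out = nn.Linear(attn_dim, latent_dim)

    def forward(self, image_latents, text_embeddings):
        # image_latents: (batch, H*W, latent_dim) -- flattened spatial positions
        # text_embeddings: (batch, num_tokens, text_dim) -- e.g. from a text encoder
        q = self.to_q(image_latents)
        k = self.to_k(text_embeddings)
        v = self.to_v(text_embeddings)
        attended, _ = self.attn(q, k, v)              # each spatial position reads from the text
        return image_latents + self.to_out(attended)  # residual connection back into the latents

# Hypothetical shapes: a 16x16 latent grid conditioned on 77 text tokens
block = TextConditionedCrossAttention(latent_dim=320, text_dim=768, attn_dim=320)
latents = torch.randn(2, 16 * 16, 320)
text = torch.randn(2, 77, 768)
print(block(latents, text).shape)   # torch.Size([2, 256, 320])
```

Because every spatial position computes its own attention distribution over the text tokens, the conditioning is fine-grained: different image regions can latch onto different words in the prompt.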
Cross-attention matters because it provides a principled, learnable mechanism for information fusion across heterogeneous sources. Rather than requiring rigid alignment or hand-crafted fusion rules, the model learns which parts of one input are relevant to each part of another, adapting dynamically to context. This flexibility has made cross-attention a foundational building block in modern large-scale models spanning language, vision, audio, and their combinations.