Envisioning is an emerging technology research institute and advisory.


Cross-Attention

Attention mechanism that integrates information across two distinct input sequences.

Year: 2017
Generality: 756

Cross-attention is a neural network mechanism that enables a model to selectively draw information from one sequence while processing another. Unlike self-attention, where a sequence attends to itself, cross-attention computes relationships between two separate inputs. Queries are derived from one source — typically a decoder or target representation — while keys and values come from a different source, such as an encoder output. The resulting attention scores determine how much each element of the query sequence should attend to each element of the key-value sequence, producing a context-aware representation that blends information across the two streams.
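The computation described above can be sketched in a few lines of NumPy. The shapes and random inputs here are illustrative placeholders, not taken from any particular model:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: (m, d) from the target sequence (e.g. decoder states)
    keys, values: (n, d) from the source sequence (e.g. encoder output)
    Returns an (m, d) context matrix and the (m, n) attention weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (m, n) affinities between the two streams
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over source positions
    return weights @ values, weights

# Toy example: 2 target positions attending over 3 source positions.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))    # query-side sequence
kv = rng.normal(size=(3, 4))   # key/value-side sequence
context, attn = cross_attention(q, kv, kv)
```

Each row of `attn` sums to 1, and each row of `context` is the corresponding weighted blend of the source values; setting `queries = keys = values` would recover self-attention.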

The mechanism became central to machine learning with the Transformer architecture introduced by Vaswani et al. in 2017. In sequence-to-sequence tasks like machine translation, cross-attention allows the decoder to dynamically consult the encoder's full representation at every generation step, rather than relying on a fixed compressed context vector. This dramatically improved the model's ability to handle long-range dependencies and align output tokens with relevant portions of the input — for example, matching translated words to their source-language counterparts.
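The contrast with a fixed context vector can be made concrete in a short sketch (dimensions and random states are illustrative, not a real translation model): the encoder output stays fixed, and every decoding step issues a fresh query against all of it.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
encoder_out = rng.normal(size=(6, d))  # full source representation, fixed during decoding

def attend(query, memory):
    # One cross-attention lookup: a single decoder-state query over all source positions.
    scores = memory @ query / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ memory  # context vector blended from the whole source

# At every generation step the decoder re-queries the *entire* encoder output,
# rather than compressing the source into one fixed summary vector up front.
contexts = [attend(rng.normal(size=d), encoder_out) for _ in range(3)]
```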

Beyond language tasks, cross-attention has proven essential in multimodal architectures that must bridge different data modalities. In image captioning, a text decoder can attend over spatial visual features to ground each generated word in a relevant image region. In models like DALL·E and Stable Diffusion, cross-attention layers allow image generation to be conditioned on text embeddings, enabling fine-grained semantic control over visual outputs. Perceiver-style architectures use cross-attention as a general interface for ingesting high-dimensional inputs of arbitrary structure.
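A minimal sketch of this multimodal conditioning pattern, with hypothetical dimensions and randomly initialized projection matrices (not the actual sizes used by DALL·E or Stable Diffusion): queries come from image-patch features, while keys and values come from text-token embeddings, projected into a shared attention dimension.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative shapes: 16 image patches (dim 8) conditioned on 5 text tokens (dim 6).
patches = rng.normal(size=(16, 8))  # spatial visual features (query side)
tokens = rng.normal(size=(5, 6))    # text embeddings (key/value side)

d_attn = 4  # shared attention dimension
W_q = rng.normal(size=(8, d_attn))  # projects image features to queries
W_k = rng.normal(size=(6, d_attn))  # projects text features to keys
W_v = rng.normal(size=(6, d_attn))  # projects text features to values

q = patches @ W_q
k = tokens @ W_k
v = tokens @ W_v

scores = q @ k.T / np.sqrt(d_attn)                     # (16, 5): patch-to-token affinities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)         # softmax over text tokens
conditioned = weights @ v  # (16, d_attn): each patch now carries text-derived context
```

The learned projections are what let the two modalities meet despite having different native dimensions; in a diffusion model these layers are trained end to end alongside the denoising network.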

Cross-attention matters because it provides a principled, learnable mechanism for information fusion across heterogeneous sources. Rather than requiring rigid alignment or hand-crafted fusion rules, the model learns which parts of one input are relevant to each part of another, adapting dynamically to context. This flexibility has made cross-attention a foundational building block in modern large-scale models spanning language, vision, audio, and their combinations.

Related

  • Attention: A mechanism enabling neural networks to dynamically focus on relevant parts of input. Generality: 875
  • Self-Attention: A mechanism that lets neural networks weigh relationships between all parts of an input simultaneously. Generality: 794
  • Attention Network: A neural network that dynamically weights input elements to capture relevant context. Generality: 796
  • Attention Mechanisms: Neural network components that dynamically weight input elements by their contextual relevance. Generality: 865
  • Attention Mechanism: A neural network technique that dynamically weights input elements by their relevance to the task. Generality: 875
  • Multi-Head Attention: Attention mechanism that jointly attends to information from multiple representation subspaces simultaneously. Generality: 794