
Self-Attention

A mechanism that lets neural networks weigh relationships between all parts of an input simultaneously.

Year: 2017
Generality: 794

Self-attention is a neural network mechanism that computes relationships between every element in a sequence and every other element, allowing the model to dynamically determine how much focus to place on each part of the input when representing any given position. Rather than processing tokens in isolation or relying on fixed local context windows, self-attention produces a weighted combination of all input representations, where the weights reflect learned relevance scores between pairs of elements. This enables the model to capture both short- and long-range dependencies within a single, parallelizable operation.

The mechanism works by projecting each input token into three vectors (queries, keys, and values) using learned weight matrices. Attention scores are computed as scaled dot products between a position's query vector and every key vector, then passed through a softmax to produce a probability distribution. These weights scale the corresponding value vectors, and the weighted sum becomes the new representation for that position. In practice, multiple sets of query-key-value projections are run in parallel (multi-head attention), allowing the model to simultaneously attend to different types of relationships across different representation subspaces.
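
To make the query-key-value computation concrete, here is a minimal single-head sketch in NumPy, computing softmax(Q K^T / sqrt(d_k)) V as described above. The names (W_q, W_k, W_v, self_attention) and toy dimensions are illustrative assumptions, not drawn from any particular library or the original paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) input token representations.
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices.
    Returns: (seq_len, d_k) contextualized representations.
    """
    Q = X @ W_q  # one query vector per position
    K = X @ W_k  # one key vector per position
    V = X @ W_v  # one value vector per position
    d_k = Q.shape[-1]
    # Compare every query against every key: (seq_len, seq_len) scores.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into a probability distribution.
    weights = softmax(scores, axis=-1)
    # Each output position is a weighted sum of all value vectors.
    return weights @ V

# Toy example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Multi-head attention repeats this computation with several independent projection triples and concatenates the per-head outputs before a final learned linear projection.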

Self-attention became central to machine learning with the 2017 paper "Attention Is All You Need" by Vaswani et al., which introduced the Transformer architecture built entirely around this mechanism. Prior sequence models relied on recurrent or convolutional layers that struggled with long-range dependencies and were difficult to parallelize during training. Self-attention addressed both limitations, enabling dramatically faster training on modern hardware and superior modeling of distant contextual relationships.

The impact of self-attention on the field has been profound. It underpins virtually every major language model developed since 2017, including BERT, GPT, T5, and their successors, and has since been applied far beyond NLP to vision, audio, protein structure prediction, and multimodal learning. Its ability to flexibly route information across an entire context window — without the bottlenecks of sequential processing — makes it one of the most consequential architectural innovations in modern deep learning.

Related

Attention
A mechanism enabling neural networks to dynamically focus on relevant parts of input.
Generality: 875

Attention Mechanisms
Neural network components that dynamically weight input elements by their contextual relevance.
Generality: 865

Attention Network
A neural network that dynamically weights input elements to capture relevant context.
Generality: 796

Attention Mechanism
A neural network technique that dynamically weights input elements by their relevance to the task.
Generality: 875

Transformer
A neural network architecture using self-attention to process sequential data in parallel.
Generality: 900

Attention Block
A neural network module that selectively weighs input elements by their contextual relevance.
Generality: 752