
Envisioning is an emerging technology research institute and advisory.



Transformer

A neural network architecture using self-attention to process sequential data in parallel.

Year: 2017 · Generality: 900

The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" that fundamentally changed how models handle sequential data. Rather than processing tokens one at a time as recurrent networks do, the Transformer uses a mechanism called self-attention to compute relationships between all positions in a sequence simultaneously. Each token attends to every other token, producing weighted representations that capture context regardless of distance — a critical advantage over RNNs, which struggle to propagate information across long sequences. The original architecture pairs an encoder, which builds rich contextual representations of the input, with a decoder that generates output token by token while attending to both its own prior outputs and the encoder's representations.
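The encoder-decoder flow described above can be illustrated with a minimal NumPy sketch. This is a deliberate simplification, not a full implementation: the learned query/key/value projections, multi-head splitting, feed-forward sublayers, and layer normalization are all omitted so that the three attention patterns — all-to-all encoder self-attention, causally masked decoder self-attention, and cross-attention to the encoder's memory — stand out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V, mask=None):
    # Scaled dot-product attention; the mask hides future positions in the decoder.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ V

d = 8
rng = np.random.default_rng(1)
src = rng.normal(size=(5, d))   # encoder input: 5 source tokens
tgt = rng.normal(size=(3, d))   # decoder input: 3 tokens generated so far

# Encoder self-attention: every source token attends to every other.
memory = attend(src, src, src)

# Decoder self-attention: a causal (lower-triangular) mask forbids looking ahead.
causal = np.tril(np.ones((3, 3), dtype=bool))
h = attend(tgt, tgt, tgt, mask=causal)

# Cross-attention: decoder states query the encoder's representations.
out = attend(h, memory, memory)
print(out.shape)  # (3, 8)
```

With the causal mask, the first decoder position can only attend to itself, so (absent projections) its output equals its input — a quick sanity check that the decoder cannot peek at future tokens.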

The self-attention mechanism works by projecting each token into three vectors — query, key, and value — then computing dot-product similarities between queries and keys to determine how much each token should "attend" to every other. These attention scores are scaled, softmax-normalized, and used to produce a weighted sum of value vectors. Multiple attention heads run in parallel, each learning to focus on different types of relationships, and their outputs are concatenated and projected. Positional encodings are added to token embeddings to inject sequence order information, since the architecture itself is otherwise permutation-invariant. Stacked layers of multi-head attention and feed-forward sublayers, each wrapped with residual connections and layer normalization, build increasingly abstract representations.
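The mechanism just described — query/key/value projections, scaled dot-product scores, softmax normalization, parallel heads, and sinusoidal positional encodings — can be sketched in NumPy as follows. The weight matrices here are random stand-ins for learned parameters, and residual connections and layer normalization are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def positional_encoding(seq_len, d_model):
    # Sinusoidal encodings from the original paper: even dims sine, odd dims cosine.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    # Project tokens to queries/keys/values, split into heads, attend,
    # then concatenate head outputs and apply the final projection.
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Q, K, V = (M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
               for M in (Q, K, V))                        # (heads, seq, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ V                           # weighted sums of values
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
# Positional encodings are added to embeddings to inject order information.
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (6, 16)
```

Each of the four heads computes its own attention pattern over the same six tokens; concatenating and projecting their outputs is what lets different heads specialize in different relationships.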

The Transformer's impact on machine learning has been extraordinary and continues to expand. It became the backbone of virtually every major language model, including BERT, GPT, T5, and their successors, enabling breakthroughs in translation, summarization, question answering, and code generation. Its parallelism makes it highly amenable to large-scale training on modern GPU and TPU hardware, facilitating the scaling laws that underpin today's large language models. Beyond NLP, the architecture has been successfully adapted for vision (Vision Transformers), audio, protein structure prediction, and reinforcement learning, establishing it as one of the most general and consequential architectural innovations in the history of deep learning.

Related

Encoder-Decoder Transformer

A transformer architecture that encodes input sequences and decodes them into outputs.

Generality: 722
Self-Attention

A mechanism that lets neural networks weigh relationships between all parts of an input simultaneously.

Generality: 794
Transformer Block

A core neural network module combining self-attention and feedforward layers for sequence modeling.

Generality: 820
Attention

A mechanism enabling neural networks to dynamically focus on relevant parts of input.

Generality: 875
Attention Network

A neural network that dynamically weights input elements to capture relevant context.

Generality: 796
Multi-Head Attention

Attention mechanism that jointly attends to information from multiple representation subspaces simultaneously.

Generality: 794