Envisioning is an emerging technology research institute and advisory.

2011 — 2026


Transformer Block

A core neural network module combining self-attention and feedforward layers for sequence modeling.

Year: 2017
Generality: 820

A transformer block is the fundamental repeating unit of the transformer architecture, introduced in the 2017 paper "Attention Is All You Need." Each block consists of two primary sublayers: a multi-head self-attention mechanism and a position-wise feedforward network. Both sublayers are wrapped with residual connections and layer normalization, which stabilize training and allow gradients to flow effectively through deep stacks of these blocks. In practice, a full transformer model is built by chaining many such blocks sequentially, with the depth of the stack being a primary lever for scaling model capacity.
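The wiring described above — two sublayers, each wrapped in a residual connection and layer normalization — can be sketched in plain Python. This is a minimal post-norm sketch (LayerNorm applied after the residual add, as in the original paper): the sublayers are passed in as placeholder callables, and the learned layer-norm scale and shift are omitted.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalize a vector to zero mean and unit variance
    # (the learned scale and shift parameters are omitted for brevity).
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def transformer_block(x, attn, ffn):
    """Post-norm transformer block: LayerNorm(x + Sublayer(x)).

    x is a list of position vectors; `attn` maps the whole sequence
    to a sequence, `ffn` maps a single position's vector to a vector.
    """
    # Sublayer 1: self-attention, wrapped with residual + layer norm.
    a = attn(x)
    x = [layer_norm([xi + ai for xi, ai in zip(xv, av)])
         for xv, av in zip(x, a)]
    # Sublayer 2: position-wise feedforward, wrapped the same way.
    x = [layer_norm([xi + fi for xi, fi in zip(xv, ffn(xv))])
         for xv in x]
    return x

# Wiring check with identity sublayers: the sequence shape is
# preserved and each output vector comes out normalized.
seq = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
out = transformer_block(seq, lambda s: s, lambda v: v)
```

A full model simply feeds the output of one such block into the next; stacking more blocks is the main way capacity is scaled.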

The self-attention sublayer is the defining innovation of the transformer block. It allows every position in a sequence to directly attend to every other position, computing a weighted combination of value vectors based on the compatibility between query and key vectors. This mechanism captures long-range dependencies without the sequential bottleneck that plagued recurrent architectures like LSTMs and GRUs. Multi-head attention extends this by running several attention operations in parallel across different learned subspaces, enabling the model to jointly attend to information from different representational perspectives simultaneously.
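The mechanism just described — queries scoring keys, and the resulting weights averaging the values — fits in a few lines of plain Python. This is an illustrative single-head sketch; the learned projection matrices that produce Q, K, and V from the input, and the multi-head splitting, are left out.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors.

    Each query position receives a weighted average of the value
    vectors, weighted by query-key dot products scaled by sqrt(d).
    """
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Self-attention: the same 3-position sequence serves as Q, K, and V,
# so every position attends directly to every other position.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = attention(seq, seq, seq)
```

Because the softmax weights sum to one, each output is a convex combination of the value vectors — information is routed, not invented.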

The feedforward sublayer applies the same two-layer fully connected network independently to each position, introducing nonlinearity and expanding the model's representational capacity beyond what attention alone provides. Together, the attention and feedforward components create a powerful combination: attention routes and aggregates information across the sequence, while the feedforward network transforms it at each position. This division of labor has proven remarkably effective across domains including natural language processing, computer vision, audio, and protein structure prediction.
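The per-position network described here is just two linear maps with a ReLU between them, applied to each position's vector independently. Below is a sketch with tiny hand-picked weights purely for illustration; in a real model these are learned, and the hidden layer is typically several times wider than the model dimension (4x in the original paper).

```python
def feedforward(v, W1, b1, W2, b2):
    """Position-wise feedforward network: W2 @ relu(W1 @ v + b1) + b2.

    Applied independently to every position in the sequence.
    """
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# Toy weights: model width 2 expanded to hidden width 4 and back.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0] * 4
W2 = [[1.0, 1.0, -1.0, -1.0], [0.5, -0.5, 0.5, -0.5]]
b2 = [0.0] * 2

out = feedforward([2.0, -3.0], W1, b1, W2, b2)  # → [-1.0, -0.5]
```

Note that the same weights are shared across all positions: unlike attention, this sublayer never mixes information between positions.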

The transformer block's impact on modern AI is difficult to overstate. It forms the backbone of virtually every large-scale foundation model, from BERT and GPT to PaLM, LLaMA, and Vision Transformers (ViT). Its design is highly amenable to parallelization on modern hardware accelerators, enabling training at scales previously impractical with recurrent models. Understanding the transformer block is essentially a prerequisite for understanding contemporary deep learning, as it has become the default architectural primitive across nearly every modality and task.

Related

Transformer

A neural network architecture using self-attention to process sequential data in parallel.

Generality: 900
Attention Block

A neural network module that selectively weighs input elements by their contextual relevance.

Generality: 752
Encoder-Decoder Transformer

A transformer architecture that encodes input sequences and decodes them into outputs.

Generality: 722
Multi-Head Attention

Attention mechanism that jointly attends to information from multiple representation subspaces simultaneously.

Generality: 794
Self-Attention

A mechanism that lets neural networks weigh relationships between all parts of an input simultaneously.

Generality: 794
Attention

A mechanism enabling neural networks to dynamically focus on relevant parts of input.

Generality: 875