A core neural network module combining self-attention and feedforward layers for sequence modeling.
A transformer block is the fundamental repeating unit of the transformer architecture, introduced in the 2017 paper "Attention Is All You Need." Each block consists of two primary sublayers: a multi-head self-attention mechanism and a position-wise feedforward network. Both sublayers are wrapped with residual connections and layer normalization, which stabilize training and allow gradients to flow effectively through deep stacks of these blocks. In practice, a full transformer model is built by chaining many such blocks sequentially, with the depth of the stack being a primary lever for scaling model capacity.
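To make the structure concrete, here is a minimal sketch of one block in PyTorch. It is an illustrative implementation rather than the canonical one: the names `TransformerBlock`, `d_model`, `n_heads`, and `d_ff` are assumptions for the example, and it follows the post-norm arrangement of the original paper, where layer normalization is applied after each residual addition.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block: multi-head self-attention plus a position-wise
    feedforward network, each wrapped in a residual connection followed by
    layer normalization (the post-norm arrangement of the 2017 paper)."""

    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Self-attention sublayer: residual connection, then layer norm.
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Feedforward sublayer: residual connection, then layer norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

A full model stacks many such blocks, feeding each block's output into the next. Note that many recent implementations instead apply layer normalization before each sublayer (pre-norm), which is commonly reported to train more stably in very deep stacks.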
The self-attention sublayer is the defining innovation of the transformer block. It allows every position in a sequence to directly attend to every other position, computing a weighted combination of value vectors based on the compatibility between query and key vectors. This mechanism captures long-range dependencies without the sequential bottleneck that plagued recurrent architectures like LSTMs and GRUs. Multi-head attention extends this by running several attention operations in parallel across different learned subspaces, enabling the model to jointly attend to information from different representational perspectives.
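The core computation is scaled dot-product attention, softmax(QKᵀ / √d_k)·V. A single-head sketch (omitting masking, dropout, and the learned projections that multi-head attention adds) might look like this; the function name and shape assumptions are illustrative:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Single-head attention sketch.
    Assumed shapes: q, k are (seq_len, d_k); v is (seq_len, d_v)."""
    d_k = q.size(-1)
    # Pairwise query-key compatibility scores, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Each position gets a probability distribution over all positions.
    weights = torch.softmax(scores, dim=-1)
    # Output is a weighted combination of the value vectors.
    return weights @ v
```

Multi-head attention runs several such computations in parallel, each on its own learned linear projection of the queries, keys, and values, then concatenates the heads and projects the result back to the model dimension.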
The feedforward sublayer applies the same two-layer fully connected network independently to each position, introducing nonlinearity and expanding the model's representational capacity beyond what attention alone provides. Together, the attention and feedforward components create a powerful combination: attention routes and aggregates information across the sequence, while the feedforward network transforms it at each position. This division of labor has proven remarkably effective across domains including natural language processing, computer vision, audio, and protein structure prediction.
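One way to see the "position-wise" property is that the same two linear layers are applied at every position, so running the network on a single position in isolation gives the same result as running it on the full sequence. A small check, using the widths from the original paper (d_model = 512, d_ff = 2048) purely for illustration:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048  # widths from the original paper, used here only as an example
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(4, 10, d_model)                        # (batch, seq_len, d_model)
out_full = ffn(x)                                      # applied across the whole sequence
out_single = ffn(x[:, 3, :])                           # applied to one position in isolation
print(torch.allclose(out_full[:, 3, :], out_single))   # True: same weights at every position
```

The original paper used a ReLU between the two linear layers; many later models substitute GELU or gated variants, but the position-wise structure is unchanged.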
The transformer block's impact on modern AI is difficult to overstate. It forms the backbone of virtually every large-scale foundation model, from BERT and GPT to PaLM, LLaMA, and Vision Transformers (ViT). Its design is highly amenable to parallelization on modern hardware accelerators, enabling training at scales previously impractical with recurrent models. Understanding the transformer block is essentially a prerequisite for understanding contemporary deep learning, as it has become the default architectural primitive across nearly every modality and task.