A neural network architecture using self-attention to process sequential data in parallel.
The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" that fundamentally changed how models handle sequential data. Rather than processing tokens one at a time as recurrent networks do, the Transformer uses a mechanism called self-attention to compute relationships between all positions in a sequence simultaneously. Each token attends to every other token, producing weighted representations that capture context regardless of distance — a critical advantage over RNNs, which struggle to propagate information across long sequences. The original architecture pairs an encoder, which builds rich contextual representations of the input, with a decoder that generates output token by token while attending to both its own prior outputs and the encoder's representations.
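To make that encoder-decoder interaction concrete, here is a toy sketch of the data flow, assuming hypothetical encode() and decode_step() stand-ins for the real attention stacks (the actual mechanism is described in the next paragraph); the source sentence is encoded once in parallel, and the decoder then emits one token at a time, conditioning on both the encoder's output and its own prior tokens.

```python
# Toy sketch of encoder-decoder data flow; encode() and decode_step() are
# illustrative stand-ins, not a real Transformer implementation.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, BOS, EOS = 32, 16, 2, 1

# Toy "parameters" so the sketch runs end to end.
EMBED = rng.normal(size=(VOCAB, D_MODEL))
OUT_PROJ = rng.normal(size=(D_MODEL, VOCAB))

def encode(src_tokens):
    # Stand-in for the encoder: a real model runs stacked self-attention and
    # feed-forward layers, producing one contextual vector per source token.
    return EMBED[src_tokens]

def decode_step(prev_tokens, memory):
    # Stand-in for the decoder: a real model attends over its own prior
    # outputs and over the encoder's representations ("memory") before
    # predicting the next token; here both are reduced to simple averages.
    query = EMBED[prev_tokens].mean(axis=0) + memory.mean(axis=0)
    return query @ OUT_PROJ  # logits over the vocabulary

def greedy_generate(src_tokens, max_len=10):
    memory = encode(np.array(src_tokens))      # whole input encoded in parallel
    out = [BOS]
    for _ in range(max_len):                   # output generated token by token
        logits = decode_step(np.array(out), memory)
        nxt = int(np.argmax(logits))
        out.append(nxt)
        if nxt == EOS:
            break
    return out

print(greedy_generate([5, 9, 13]))
```

Real decoders typically use masked self-attention and decode with beam search or sampling rather than this greedy argmax, but the overall loop structure is the same.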
The self-attention mechanism works by projecting each token into three vectors — query, key, and value — then computing dot-product similarities between queries and keys to determine how much each token should "attend" to every other. These attention scores are scaled, softmax-normalized, and used to produce a weighted sum of value vectors. Multiple attention heads run in parallel, each learning to focus on different types of relationships, and their outputs are concatenated and projected. Positional encodings are added to token embeddings to inject sequence order information, since the architecture itself is otherwise permutation-invariant. Stacked layers of multi-head attention and feed-forward sublayers, each wrapped with residual connections and layer normalization, build increasingly abstract representations.
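The following is a minimal NumPy sketch of that mechanism: scaled dot-product attention following the paper's formulation softmax(QKᵀ / √d_k) · V, several heads run in parallel over slices of the model dimension, and sinusoidal positional encodings added to the inputs. The dimensions, weight initialization, and single-matrix head split are illustrative assumptions rather than the paper's exact implementation, and the residual connections, layer normalization, and feed-forward sublayers mentioned above are omitted to keep the sketch short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # scores[i, j] = how much token i attends to token j
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # dot products, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)     # softmax-normalize each row
    return weights @ V                     # weighted sum of value vectors

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Project each token into queries, keys, and values, split into heads,
    # attend within each head, then concatenate and project the result.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o

def sinusoidal_positional_encoding(seq_len, d_model):
    # Injects order information; without it, the attention above is
    # permutation-invariant.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Tiny usage example with illustrative sizes.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model)) + sinusoidal_positional_encoding(seq_len, d_model)
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (4, 8)
```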
The Transformer's impact on machine learning has been extraordinary and continues to expand. It became the backbone of virtually every major language model, including BERT, GPT, T5, and their successors, enabling breakthroughs in translation, summarization, question answering, and code generation. Its parallelism makes it highly amenable to large-scale training on modern GPU and TPU hardware, facilitating the scaling laws that underpin today's large language models. Beyond NLP, the architecture has been successfully adapted for vision (Vision Transformers), audio, protein structure prediction, and reinforcement learning, establishing it as one of the most general and consequential architectural innovations in the history of deep learning.