A transformer architecture that encodes an input sequence and decodes it into an output sequence.
The encoder-decoder transformer is a neural network architecture built entirely on attention mechanisms, introduced in the landmark 2017 paper "Attention Is All You Need." Unlike earlier sequence-to-sequence models that relied on recurrent neural networks (RNNs), this architecture processes input tokens in parallel rather than sequentially. The encoder stack reads the full input sequence and produces a rich set of contextual representations, while the decoder stack generates the output sequence one token at a time, attending to both its own previously generated tokens and the encoder's representations through a mechanism called cross-attention.
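The overall flow is: encode once, then decode autoregressively. The sketch below illustrates that loop using PyTorch's built-in nn.Transformer. The vocabulary size, BOS/EOS token ids, sequence lengths, and greedy decoding strategy are illustrative assumptions, not details fixed by the architecture, and positional encodings are omitted for brevity (they appear in the next sketch).

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, BOS, EOS = 1000, 512, 1, 2   # hypothetical vocabulary and ids

embed = nn.Embedding(VOCAB, D_MODEL)         # positional encodings omitted here
model = nn.Transformer(d_model=D_MODEL, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)
project = nn.Linear(D_MODEL, VOCAB)          # decoder states -> token logits

src = torch.randint(0, VOCAB, (10, 1))       # (src_len, batch): dummy input
memory = model.encoder(embed(src))           # encoder runs once, over all tokens

ys = torch.tensor([[BOS]])                   # generation starts from BOS
for _ in range(20):                          # greedy decoding, one token per step
    # causal mask: each decoder position may only see earlier positions
    tgt_mask = model.generate_square_subsequent_mask(ys.size(0))
    out = model.decoder(embed(ys), memory, tgt_mask=tgt_mask)
    next_tok = project(out[-1]).argmax(-1, keepdim=True)  # best token at last step
    ys = torch.cat([ys, next_tok], dim=0)    # fed back as input next step
    if next_tok.item() == EOS:
        break
```

Note the asymmetry: the encoder is called exactly once, while the decoder is re-run each step with the growing output prefix, always attending to the same encoder memory.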
The encoder consists of multiple identical layers, each containing a multi-head self-attention sublayer and a feed-forward sublayer. Self-attention allows every position in the input to directly attend to every other position, capturing long-range dependencies that RNNs struggled to maintain. The decoder mirrors this structure, with its self-attention masked so that each position can attend only to earlier output tokens, and adds a cross-attention sublayer that queries the encoder's output, enabling the model to selectively focus on relevant parts of the input when producing each output token. Positional encodings are added to token embeddings to inject sequence order information, since the architecture itself has no inherent notion of position.
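To make the sublayer structure concrete, here is a minimal from-scratch sketch of one decoder layer, plus the sinusoidal positional encoding from the original paper, in PyTorch. The default dimensions follow the paper's base configuration; the class and function names, and the post-norm residual arrangement, are our own framing of the design rather than a reference implementation.

```python
import math
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, cross-attention, feed-forward."""
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, memory, tgt_mask):
        # 1) Masked self-attention: each position attends only to itself
        #    and earlier output positions.
        a, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norms[0](x + a)
        # 2) Cross-attention: queries come from the decoder, keys and
        #    values from the encoder's output ("memory").
        a, _ = self.cross_attn(x, memory, memory)
        x = self.norms[1](x + a)
        # 3) Position-wise feed-forward network, applied independently
        #    at every position. Residual + layer norm around each sublayer.
        return self.norms[2](x + self.ff(x))

def sinusoidal_positions(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # added to token embeddings before the first layer
```

An encoder layer is the same minus the cross-attention sublayer; stacking six of each recovers the paper's base model shape.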
This design excels at tasks requiring a mapping between two sequences of potentially different lengths and structures, making it the dominant architecture for machine translation, abstractive summarization, and speech recognition. The parallelizability of attention over sequence positions dramatically accelerated training on modern hardware compared to recurrent alternatives, enabling models to scale to far larger datasets and parameter counts.
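The source of that parallelism is visible in the attention computation itself. The sketch below shows unmasked, single-head scaled dot-product attention: every output position falls out of the same two matrix multiplies, with no step-by-step recurrence over the sequence, so the whole computation maps directly onto parallel hardware.

```python
import math
import torch

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k). One matmul scores every pair of positions
    # at once; a second mixes the values. Nothing depends on sequence order
    # of computation, unlike an RNN's step-by-step hidden-state update.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    return scores.softmax(dim=-1) @ V

out = attention(torch.randn(10, 64), torch.randn(10, 64), torch.randn(10, 64))
```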
While later architectures like BERT adopted encoder-only designs and GPT adopted decoder-only designs for different task families, the full encoder-decoder structure remains essential wherever explicit conditioning on a complete input sequence is required. Models such as T5, BART, and mT5 demonstrate its continued relevance, applying the encoder-decoder framework to a broad range of language understanding and generation tasks through large-scale pretraining.