Deep learning architectures that compress input into an internal representation and generate output from it.
Encoder-decoder models are a class of neural network architectures built around two cooperating components: an encoder that transforms raw input into a compact internal representation, and a decoder that uses that representation to produce a desired output. The encoder progressively distills the input—whether text, images, or audio—into a latent vector or sequence of vectors that captures its essential structure. The decoder then conditions on this representation to generate output step by step, making the architecture naturally suited to tasks where input and output differ in form, length, or modality.
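The division of labor is easiest to see in code. The sketch below, in PyTorch, pairs a GRU encoder that reduces a token sequence to a single hidden state with a GRU decoder that generates conditioned on that state; the module names, vocabulary size, and dimensions are illustrative assumptions, not a reference implementation.

```python
# A minimal sketch of the encoder-decoder pattern in PyTorch.
# All names and sizes are hypothetical, not from a specific published model.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) token ids -> hidden: (1, batch, hidden_dim)
        _, hidden = self.rnn(self.embed(src))
        return hidden  # the compact latent representation of the input

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden):
        # tgt: (batch, tgt_len); hidden conditions generation on the input
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden  # per-step vocabulary logits

# Wiring: encode once, then decode conditioned on the latent state.
encoder, decoder = Encoder(1000, 64, 128), Decoder(1000, 64, 128)
src = torch.randint(0, 1000, (2, 7))    # toy source batch
tgt = torch.randint(0, 1000, (2, 5))    # teacher-forced target prefix
logits, _ = decoder(tgt, encoder(src))  # logits: (2, 5, 1000)
```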
The architecture gained widespread attention in 2014 with the introduction of sequence-to-sequence learning, which applied recurrent neural networks to tasks like machine translation. In that setting, an RNN encoder reads a source-language sentence token by token, producing a context vector, while an RNN decoder generates the target-language translation autoregressively. A critical refinement came with the addition of attention mechanisms, which allowed the decoder to selectively focus on different parts of the encoder's output at each generation step rather than relying on a single fixed-length vector—dramatically improving performance on long sequences.
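In its simplest dot-product form, the attention step can be written in a few lines: score every encoder state against the current decoder state, normalize with a softmax, and take the weighted sum as a fresh context vector for that step. The sketch below is a hedged illustration with assumed shapes and a shared hidden dimension, not the exact formulation of any one paper.

```python
# Dot-product attention over encoder states (illustrative shapes and names).
import torch
import torch.nn.functional as F

def attention(dec_hidden, enc_outputs):
    """dec_hidden:  (batch, hidden)          decoder state at the current step
    enc_outputs: (batch, src_len, hidden)    one vector per source position"""
    # Score each source position against the current decoder state.
    scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)                                   # attention distribution
    # Weighted sum of encoder states: a per-step, input-dependent context,
    # replacing the single fixed-length vector of the original seq2seq setup.
    context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)    # (batch, hidden)
    return context, weights

context, weights = attention(torch.randn(2, 128),      # toy decoder state
                             torch.randn(2, 7, 128))   # toy encoder outputs
```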
The encoder-decoder paradigm extends well beyond text. In image segmentation, convolutional encoders downsample spatial features while decoders upsample them back to pixel-level predictions. In speech synthesis and recognition, the architecture maps between acoustic signals and linguistic representations. The 2017 Transformer model, built entirely on attention, recast the encoder-decoder framework in a highly parallelizable form and became the backbone of modern large language models, translation systems, and multimodal AI.
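PyTorch ships this attention-only encoder-decoder as torch.nn.Transformer; the minimal call below is a sketch with toy dimensions, omitting the embeddings, positional encodings, and masks a real system would add.

```python
# PyTorch's built-in attention-based encoder-decoder (toy dimensions;
# a real model adds embeddings, positional encodings, and masks).
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
src = torch.randn(2, 10, 64)  # (batch, src_len, d_model) encoder input
tgt = torch.randn(2, 6, 64)   # (batch, tgt_len, d_model) decoder input
out = model(src, tgt)         # (2, 6, 64): one vector per target position
```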
Encoder-decoder models matter because they provide a principled way to handle the fundamental challenge of cross-domain mapping: taking structured information in one space and producing coherent, contextually appropriate output in another. Their modularity also enables transfer learning—pretrained encoders can be paired with task-specific decoders—making them central to contemporary approaches in NLP, computer vision, and generative AI.
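As one concrete example of this modularity, the Hugging Face transformers library (assuming it is installed) can stitch two pretrained checkpoints into a single sequence-to-sequence model; the checkpoint names here are illustrative, and the decoder's cross-attention weights are newly initialized for fine-tuning on the downstream task.

```python
# Pair a pretrained encoder with a decoder via Hugging Face transformers.
# Checkpoint names are examples; any compatible checkpoints could be used.
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
```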