A neural architecture that maps variable-length input sequences to variable-length output sequences.
Sequence-to-sequence (Seq2Seq) is a neural network framework designed to transform one sequential input into another sequential output, where the two sequences may differ in length and structure. The architecture consists of two coupled components: an encoder, which reads the input sequence step by step and compresses it into a fixed-dimensional context vector representing the input's meaning, and a decoder, which uses that context vector as its initial state to generate the output sequence one token at a time. Both components are typically implemented as recurrent networks (originally LSTMs or GRUs), allowing the model to handle variable-length inputs and outputs without requiring a fixed alignment between them. This flexibility made Seq2Seq immediately applicable to machine translation, speech recognition, text summarization, and dialogue systems.
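The encoder-decoder pattern is easiest to see in code. Below is a minimal sketch using PyTorch, one plausible framework choice rather than the original implementation: a GRU encoder compresses the source sequence into a single hidden state, and a GRU decoder generates output tokens from it greedily. The vocabulary sizes, dimensions, and special-token indices are hypothetical placeholders.

```python
# Minimal encoder-decoder sketch (illustrative assumptions: PyTorch, GRUs,
# greedy decoding, toy vocabulary sizes).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len)
        _, hidden = self.rnn(self.embed(src))  # hidden: (1, batch, hidden_dim)
        return hidden                          # the fixed-size context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden):          # token: (batch, 1)
        output, hidden = self.rnn(self.embed(token), hidden)
        return self.out(output), hidden        # logits over the next token

def greedy_decode(encoder, decoder, src, sos_idx, eos_idx, max_len=20):
    """Encode the source once, then emit one token at a time from the context."""
    hidden = encoder(src)                      # compress the input into a context
    token = torch.full((src.size(0), 1), sos_idx, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1)          # pick the most likely next token
        outputs.append(token)
        if (token == eos_idx).all():
            break
    return torch.cat(outputs, dim=1)

# Toy usage with a random "sentence" and hypothetical vocabulary sizes.
enc, dec = Encoder(100, 32, 64), Decoder(120, 32, 64)
src = torch.randint(0, 100, (1, 7))            # one source sequence of length 7
print(greedy_decode(enc, dec, src, sos_idx=1, eos_idx=2).shape)
```

Note that nothing ties the output length to the input length: the decoder keeps generating until it emits an end-of-sequence token or hits a cap, which is what lets the two sequences differ in length.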
The framework was introduced in 2014 by Sutskever, Vinyals, and Le at Google, who demonstrated that stacked LSTMs could learn surprisingly effective input-output mappings across languages. A critical limitation of the original design, however, was the bottleneck imposed by compressing an entire input sequence into a single fixed-size vector — long sequences caused information loss and degraded output quality. This motivated the development of attention mechanisms, which allow the decoder to dynamically query different parts of the encoded input at each generation step rather than relying solely on a single summary vector. Attention-augmented Seq2Seq models substantially improved translation quality and became the dominant paradigm in NLP for several years.
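As a rough illustration of that querying step, the sketch below uses dot-product scoring, one of several possible score functions (the original attention formulation used an additive score). At each output step the decoder compares its current state against every encoder state, turns the scores into weights, and builds a fresh context vector instead of reusing a single fixed summary.

```python
# Illustrative attention step (dot-product scoring assumed for simplicity).
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden).
    Returns a per-step context vector plus the attention weights."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)   # (batch, src_len)
    weights = F.softmax(scores, dim=-1)        # how much to read from each input position
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)          # (batch, hidden)
    return context, weights
```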
Seq2Seq laid the conceptual groundwork for the Transformer architecture, which replaced recurrence entirely with self-attention but preserved the encoder-decoder structure. Modern large language models and translation systems trace their lineage directly to the Seq2Seq framework. Beyond language, the paradigm has been applied to code generation, molecule design, time-series forecasting, and any domain where structured inputs must be mapped to structured outputs of differing length — making it one of the most broadly influential architectural ideas in deep learning.