Generating sequences by predicting each element conditioned on all previous outputs.
Autoregressive generation is a modeling approach in which a sequence is produced one element at a time, with each new element predicted based on all previously generated elements. Formally, this exploits the chain rule of probability: the joint probability of a sequence is decomposed into a product of conditional probabilities, where each token's distribution is conditioned on its predecessors. This decomposition transforms the complex problem of modeling an entire sequence into a series of tractable conditional predictions, making it well-suited for tasks where order and context are essential — including text generation, speech synthesis, music composition, and image generation.
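Written out, this is the standard chain-rule factorization: for a sequence of tokens $x_1, \dots, x_T$,

$$
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}),
$$

so the model never has to represent the full joint distribution directly; it only has to learn the one-step conditionals $p(x_t \mid x_{<t})$.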
In practice, autoregressive models are trained to minimize the negative log-likelihood of each token given its context, with the ground-truth preceding tokens supplied as that context rather than the model's own earlier predictions, a practice known as teacher forcing. At inference time, the model generates tokens sequentially: it samples or selects the next token from the predicted distribution, appends it to the growing context, and repeats until a stopping condition is met, such as emitting an end-of-sequence token or reaching a length limit. This left-to-right (or otherwise ordered) process keeps every output conditioned on the full preceding context, but it also means generation is inherently sequential and cannot be trivially parallelized at inference time, a key computational trade-off.
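As a concrete sketch of both phases, the following PyTorch-style code assumes a hypothetical causal language model `model` that maps a batch of token ids of shape `(batch, seq_len)` to next-token logits of shape `(batch, seq_len, vocab_size)`; all names here are illustrative placeholders rather than any particular library's API.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(model, tokens):
    # Training: one parallel pass scores every position against its true successor.
    logits = model(tokens[:, :-1])                      # (batch, seq_len - 1, vocab)
    targets = tokens[:, 1:]                             # each position's ground-truth next token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))         # mean negative log-likelihood

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, eos_id, temperature=1.0):
    # Inference: sample one token at a time, feeding each back into the context.
    ids = prompt_ids                                    # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                   # logits for the next token only
        probs = F.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)         # append and re-condition
        if next_id.item() == eos_id:                    # stopping condition
            break
    return ids
```

Note that training can score all positions in a single parallel pass precisely because the targets are known in advance, whereas the generation loop is irreducibly sequential; in practice each decoding step is made cheaper by caching the attention keys and values for the already-processed context, but the loop itself remains.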
The approach gained enormous prominence in deep learning with the rise of recurrent neural networks and later transformer-based architectures. Models such as OpenAI's GPT series demonstrated that large-scale autoregressive pretraining on text could yield systems capable of remarkably fluent and contextually appropriate language generation. Beyond NLP, the autoregressive approach underpins PixelCNN for image generation and WaveNet for raw audio synthesis, as well as various multimodal systems, illustrating the paradigm's broad applicability across data modalities.
Autoregressive generation remains central to modern AI because it provides a principled probabilistic framework that scales well with data and model size. Its primary limitation, slow sequential decoding, has spurred active research into acceleration techniques such as speculative decoding, in which a small draft model proposes several tokens that the large model verifies in a single parallel pass, as well as non-autoregressive alternatives. Nevertheless, the autoregressive paradigm continues to underpin many of the most capable generative models in production today, making it one of the most consequential ideas in contemporary machine learning.
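To make the speculative-decoding idea concrete, here is a deliberately simplified greedy variant; `target` and `draft` are hypothetical causal models with the same interface as `model` above, and this sketch omits the rejection-sampling correction the published method uses to exactly preserve the target model's output distribution.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    # The cheap draft model proposes k tokens one at a time.
    proposal = ids                                      # (1, n)
    for _ in range(k):
        next_id = draft(proposal)[:, -1, :].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, next_id], dim=-1)

    # The expensive target model verifies all proposals in ONE forward pass.
    preds = target(proposal).argmax(-1)                 # target's own next-token choices, (1, n + k)
    n = ids.size(1)
    accepted = 0
    for i in range(k):                                  # keep the longest agreeing prefix
        if preds[0, n - 1 + i] == proposal[0, n + i]:
            accepted += 1
        else:
            break
    # At least one token is always gained: the target's own prediction at the
    # first mismatch (or after the full proposal, if everything was accepted).
    correction = preds[:, n - 1 + accepted : n + accepted]
    return torch.cat([proposal[:, :n + accepted], correction], dim=-1)
```

Each step thus yields between one and k + 1 new tokens for a single call to the large model, which is where the speedup comes from when the draft model agrees with the target often.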