A sequence generation strategy that always selects the single most probable next token.
Greedy decoding is an inference-time strategy for autoregressive sequence generation in which a model selects the highest-probability token at each step, conditioned on all previously generated tokens. At every position in the output sequence, the model computes a probability distribution over its entire vocabulary and simply picks the argmax — the single most likely next token — before moving on to the next position. This process repeats until a stop condition is met, such as generating an end-of-sequence token or reaching a maximum length.
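The loop described above can be sketched in a few lines. The `next_token_probs` function below is a hypothetical stand-in for a real model's forward pass, hard-coded over a four-token toy vocabulary; the decoding loop itself is the part that matters.

```python
# Minimal sketch of greedy decoding over a toy next-token model.
# VOCAB and next_token_probs are invented for illustration; in practice
# the distribution would come from a trained autoregressive model.

VOCAB = ["<eos>", "the", "cat", "sat"]

def next_token_probs(prefix):
    # Hypothetical model: strongly prefers "the cat sat", then <eos>.
    table = {
        (): [0.05, 0.80, 0.10, 0.05],
        ("the",): [0.05, 0.05, 0.80, 0.10],
        ("the", "cat"): [0.10, 0.05, 0.05, 0.80],
        ("the", "cat", "sat"): [0.90, 0.03, 0.03, 0.04],
    }
    # Any unseen prefix defaults to ending the sequence.
    return table.get(tuple(prefix), [0.97, 0.01, 0.01, 0.01])

def greedy_decode(max_len=10):
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        # The argmax step: commit to the single most probable token.
        best = max(range(len(VOCAB)), key=lambda i: probs[i])
        if VOCAB[best] == "<eos>":  # stop condition
            break
        tokens.append(VOCAB[best])
    return tokens

print(greedy_decode())  # → ['the', 'cat', 'sat']
```

Note that only one partial sequence is ever kept, which is exactly why the method is fast and memory-light.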
The appeal of greedy decoding lies in its simplicity and speed. Because only one candidate token is tracked at each step, the algorithm runs in linear time with respect to sequence length and requires no additional memory beyond what the model itself needs. This makes it attractive for latency-sensitive applications and large-scale deployments where computational cost is a primary concern. In practice, greedy decoding often produces fluent, coherent output for straightforward generation tasks, particularly when the model is well-trained and the target distribution is relatively peaked.
However, greedy decoding has a fundamental limitation: it is myopic. By committing to the locally optimal token at each step, it ignores how that choice constrains future tokens, and can lock the model into low-quality trajectories that a more globally aware search would avoid. A token that looks highly probable in isolation may lead to an awkward or semantically incoherent continuation. This is why alternatives such as beam search, top-k sampling, nucleus sampling, and temperature scaling were developed — each trades some computational efficiency for greater output quality or diversity.
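The myopia argument can be made concrete with a toy two-step example (all probabilities here are invented for illustration). Greedy takes the locally best first token, but a less probable first token with a confident continuation yields a higher full-sequence probability, which an exhaustive search finds:

```python
# Toy illustration of greedy myopia; the distributions are made up.
from itertools import product

step1 = {"A": 0.6, "B": 0.4}
step2 = {
    "A": {"C": 0.5, "D": 0.5},  # flat continuation after "A"
    "B": {"C": 0.9, "D": 0.1},  # confident continuation after "B"
}

# Greedy: argmax at each step independently.
first = max(step1, key=step1.get)                  # picks "A" (0.6)
second = max(step2[first], key=step2[first].get)
greedy_seq = (first, second)
greedy_prob = step1[first] * step2[first][second]  # 0.6 * 0.5 = 0.30

# Exhaustive search: score every full two-token sequence.
best_seq, best_prob = max(
    (((t1, t2), step1[t1] * step2[t1][t2])
     for t1, t2 in product(step1, step2["A"])),
    key=lambda pair: pair[1],
)
# best_seq is ("B", "C") with probability 0.4 * 0.9 = 0.36 > 0.30,
# even though "B" looked worse than "A" at the first step.
```

Beam search approximates this exhaustive search by keeping several partial sequences alive instead of one, which is how it avoids this particular trap at a modest computational cost.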
Greedy decoding became a standard baseline in neural sequence-to-sequence modeling as encoder-decoder architectures rose to prominence in neural machine translation and text summarization research during the mid-2010s. It remains widely used today as a default decoding mode in many production systems and as a reference point against which more sophisticated decoding strategies are benchmarked. Understanding its trade-offs is essential for anyone designing or evaluating language generation pipelines.