Envisioning is an emerging technology research institute and advisory.


Greedy Decoding

A sequence generation strategy that always selects the single most probable next token.

Year: 2014 · Generality: 601

Greedy decoding is an inference-time strategy for autoregressive sequence generation in which a model selects the highest-probability token at each step, conditioned on all previously generated tokens. At every position in the output sequence, the model computes a probability distribution over its entire vocabulary and simply picks the argmax — the single most likely next token — before moving on to the next position. This process repeats until a stop condition is met, such as generating an end-of-sequence token or reaching a maximum length.
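The loop described above can be sketched in a few lines. This is a minimal illustration, not production code: the toy bigram table and the `next_token_probs` helper stand in for a real model's forward pass, which would return a distribution over a far larger vocabulary.

```python
# Minimal sketch of greedy decoding over a toy bigram "model".
# BIGRAM and next_token_probs are illustrative stand-ins for a real
# autoregressive model's forward pass.

# Toy conditional distributions P(next token | last generated token).
BIGRAM = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.8, "sat": 0.2},
    "cat": {"sat": 0.7, "the": 0.3},
    "sat": {"</s>": 0.95, "the": 0.05},
}

def next_token_probs(tokens):
    """Return a probability distribution over the vocabulary,
    conditioned (here, trivially) on the generated prefix."""
    return BIGRAM[tokens[-1]]

def greedy_decode(max_len=10):
    tokens = ["<s>"]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        # Greedy step: commit to the argmax of the distribution.
        best = max(probs, key=probs.get)
        tokens.append(best)
        if best == "</s>":  # stop condition: end-of-sequence token
            break
    return tokens

print(greedy_decode())  # ['<s>', 'the', 'cat', 'sat', '</s>']
```

Because each step is a deterministic argmax, running this twice always produces the same sequence — a property that makes greedy decoding easy to cache and test, but also means it explores exactly one trajectory.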

The appeal of greedy decoding lies in its simplicity and speed. Because only one candidate token is tracked at each step, the algorithm runs in linear time with respect to sequence length and requires no additional memory beyond what the model itself needs. This makes it attractive for latency-sensitive applications and large-scale deployments where computational cost is a primary concern. In practice, greedy decoding often produces fluent, coherent output for straightforward generation tasks, particularly when the model is well-trained and the target distribution is relatively peaked.

However, greedy decoding has a fundamental limitation: it is myopic. By committing to the locally optimal token at each step, it ignores how that choice constrains future tokens, and can lock the model into low-quality trajectories that a more globally aware search would avoid. A token that looks highly probable in isolation may lead to an awkward or semantically incoherent continuation. This is why alternatives such as beam search, top-k sampling, nucleus sampling, and temperature scaling were developed — each trades some computational efficiency for greater output quality or diversity.
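The contrast between greedy selection and the sampling alternatives can be seen in a single decoding step. The sketch below, with made-up logits and helper names, shows how temperature scaling reshapes the distribution before sampling: greedy always returns the argmax, while sampling draws stochastically, with lower temperatures concentrating probability mass on the top token.

```python
# Sketch contrasting greedy selection with temperature-scaled sampling
# for one decoding step. Tokens and logits are illustrative only.
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally sharpened or
    flattened by a temperature parameter."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(tokens, logits):
    # Greedy: deterministic argmax, ignores the rest of the distribution.
    return tokens[logits.index(max(logits))]

def sample_pick(tokens, logits, temperature=1.0):
    # Sampling: draw from the temperature-reshaped distribution.
    probs = softmax(logits, temperature)
    return random.choices(tokens, weights=probs, k=1)[0]

tokens = ["sat", "ran", "slept"]
logits = [2.0, 1.5, 0.5]

print(greedy_pick(tokens, logits))       # always "sat"
print(sample_pick(tokens, logits, 0.7))  # usually "sat", sometimes not
```

As temperature approaches zero, sampling converges to greedy decoding; raising it spreads probability across more tokens and increases output diversity at the cost of determinism. Top-k and nucleus sampling apply the same idea but first truncate the distribution to its most probable tokens.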

Greedy decoding became a standard baseline in neural sequence-to-sequence modeling as encoder-decoder architectures rose to prominence in neural machine translation and text summarization research during the mid-2010s. It remains widely used today as a default decoding mode in many production systems and as a reference point against which more sophisticated decoding strategies are benchmarked. Understanding its trade-offs is essential for anyone designing or evaluating language generation pipelines.

Related

Speculative Decoding
A technique that accelerates LLM inference by drafting and verifying token sequences in parallel.
Generality: 520

Autoregressive Generation
Generating sequences by predicting each element conditioned on all previous outputs.
Generality: 794

Self-Speculative Decoding
A technique where a single model drafts and verifies tokens to accelerate inference.
Generality: 186

Autoregressive Sequence Generator
A model that predicts each next output using its own previous outputs as inputs.
Generality: 752

Token Speculation Techniques
Methods that predict multiple candidate tokens in parallel to accelerate text generation.
Generality: 450

Encoder-Decoder Models
Deep learning architectures that compress input into a representation and generate output.
Generality: 792