
Envisioning is an emerging technology research institute and advisory.


2011 — 2026


Speculative Decoding

A technique that accelerates LLM inference by drafting and verifying token sequences in parallel.

Year: 2023
Generality: 520

Speculative decoding is an inference acceleration technique for large language models (LLMs) that exploits the asymmetry between generating tokens and verifying them. Rather than producing one token at a time through expensive forward passes of a large model, speculative decoding uses a smaller, faster "draft" model to speculatively generate several candidate tokens ahead. The large "target" model then evaluates all draft tokens in a single parallel forward pass, accepting those that match what it would have produced and discarding the rest. Because verification is cheaper than generation at scale, this approach can yield substantial wall-clock speedups—often 2–3× or more—without changing the output distribution of the target model.
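The draft-then-verify loop described above can be sketched with toy stand-ins for the two models. In this illustration the "models" are simple functions over an integer vocabulary (the function names, the heuristics, and the vocabulary are assumptions made for the example, not a real LLM), and the greedy variant shown accepts the longest prefix of draft tokens the target agrees with, plus one token chosen by the target:

```python
# Toy "models": each maps a prefix (tuple of ints) to a predicted next token.
# In practice these would be a small draft LLM and a large target LLM;
# the names and the arithmetic vocabulary here are illustrative assumptions.

def draft_model(prefix):
    # Cheap heuristic: next token is last token + 1, wrapping at 10.
    return (prefix[-1] + 1) % 10

def target_model(prefix):
    # "Expensive" model: agrees with the draft except after token 7.
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then verify them against the target model.

    Returns the accepted tokens: the longest prefix of the draft that the
    target agrees with, plus one target-chosen token, so at least one
    token is produced per step."""
    # 1. Draft phase: run the small model autoregressively for k tokens.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(tuple(ctx))
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: a real system scores all k positions in ONE
    #    parallel forward pass of the target; we simulate it sequentially.
    accepted, ctx = [], list(prefix)
    for t in draft:
        target_t = target_model(tuple(ctx))
        if target_t == t:
            accepted.append(t)          # target agrees: keep draft token
            ctx.append(t)
        else:
            accepted.append(target_t)   # disagreement: take target's token
            return accepted             # and discard the rest of the draft
    # All k drafts accepted; append one bonus token from the target.
    accepted.append(target_model(tuple(ctx)))
    return accepted

print(speculative_step((0,), k=4))  # → [1, 2, 3, 4, 5]: all drafts accepted
print(speculative_step((6,), k=4))  # → [7, 0]: rejection after the first token
```

When the draft model is accurate, each step emits up to k + 1 tokens for one target forward pass, which is where the wall-clock speedup comes from.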

The mechanics rely on a careful acceptance-rejection scheme that preserves the statistical guarantees of the target model. When the target model agrees with a draft token, it is accepted outright. When it disagrees, a token is sampled from a corrected distribution that ensures the final output is mathematically equivalent to sampling directly from the target model. This lossless property is critical: speculative decoding is not an approximation but an exact method, making it suitable for production systems where output fidelity is non-negotiable. Variants such as speculative sampling, assisted generation, and Medusa extend the core idea using multiple draft heads or retrieval-based candidates.
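The acceptance-rejection rule can be illustrated on a two-token vocabulary. The distributions `p` and `q` below are made-up numbers, and `speculative_sample` is a hypothetical helper implementing the standard scheme: accept a draft token x with probability min(1, p[x]/q[x]), and on rejection resample from the residual distribution proportional to max(p - q, 0). A minimal sketch, not a production implementation:

```python
import random

# Toy two-token vocabulary {0, 1}. p is the target model's distribution
# over the next token, q the draft model's (illustrative numbers only).
p = [0.8, 0.2]   # target
q = [0.5, 0.5]   # draft

def speculative_sample(p, q, rng):
    """Accept or reject one draft token so the output is distributed
    exactly according to the target distribution p (the lossless
    guarantee of speculative decoding)."""
    # Draft proposes token x ~ q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the corrected residual distribution,
    # proportional to max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    return rng.choices(range(len(p)), weights=[r / z for r in residual])[0]

rng = random.Random(0)
counts = [0, 0]
for _ in range(100_000):
    counts[speculative_sample(p, q, rng)] += 1
print([c / 100_000 for c in counts])  # ≈ [0.8, 0.2], matching the target p
```

Working through the two cases shows why this is exact: token 0 is always accepted (p[0]/q[0] > 1), token 1 is accepted with probability 0.4, and every rejection resamples token 0, so P(0) = 0.5 + 0.5 × 0.6 = 0.8, exactly p[0].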

Speculative decoding matters because inference latency is a primary bottleneck in deploying LLMs at scale. As models grow into the hundreds of billions of parameters, generating even a short response can take seconds, limiting real-time applications. By parallelizing token verification across modern GPU architectures, speculative decoding reclaims much of the hardware's idle capacity during memory-bound autoregressive generation. It integrates cleanly with other optimizations like quantization and KV-cache management, making it a practical and widely adopted component of modern LLM serving stacks.

The technique was formalized and brought to prominence in 2023 through concurrent work from Google and other research groups, though the underlying intuition connects to speculative execution in computer architecture. Its rapid adoption across inference frameworks such as Hugging Face's Transformers and vLLM reflects how directly it addresses the cost and latency pressures of serving frontier language models.

Related

Self-Speculative Decoding

A technique where a single model drafts and verifies tokens to accelerate inference.

Generality: 186
Token Speculation Techniques

Methods that predict multiple candidate tokens in parallel to accelerate text generation.

Generality: 450
Speculative Edits

Proactively generating candidate edits before confirmation to reduce latency and improve efficiency.

Generality: 450
Disaggregated Inference

A model serving technique that separates LLM inference into specialized hardware for prompt processing (prefill) and token generation (decode), enabling higher throughput and lower power at the cost of deployment flexibility.

Generality: 710
Greedy Decoding

A sequence generation strategy that always selects the single most probable next token.

Generality: 601
Inference Scaling

Improving model outputs by allocating more compute during inference rather than during training.

Generality: 812