A technique that accelerates LLM inference by drafting and verifying token sequences in parallel.
Speculative decoding is an inference acceleration technique for large language models (LLMs) that exploits the asymmetry between generating tokens and verifying them. Rather than producing one token at a time through expensive forward passes of a large model, speculative decoding uses a smaller, faster "draft" model to speculatively generate several candidate tokens ahead. The large "target" model then evaluates all draft tokens in a single parallel forward pass, accepting those that match what it would have produced and discarding the rest. Because verifying several tokens in one parallel pass costs little more than generating a single token (autoregressive decoding is typically memory-bandwidth-bound rather than compute-bound), this approach can yield substantial wall-clock speedups, often 2–3× or more, without changing the output distribution of the target model.
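To make the draft-then-verify loop concrete, the following is a minimal greedy-decoding sketch in PyTorch. The callables `target_model` and `draft_model`, the lookahead length `k`, and the omission of KV caching are illustrative assumptions for this sketch, not any particular framework's API.

```python
import torch

def speculative_step(target_model, draft_model, input_ids, k=4):
    """One greedy draft-and-verify step (illustrative sketch, no KV cache).

    `target_model` and `draft_model` are assumed to map a token-id tensor of
    shape (1, seq_len) to logits of shape (1, seq_len, vocab_size); `k` is the
    number of tokens drafted ahead.
    """
    prompt_len = input_ids.shape[1]

    # 1. Draft: the small model autoregressively proposes k candidate tokens.
    draft_ids = input_ids
    for _ in range(k):
        next_id = draft_model(draft_ids)[:, -1, :].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)
    drafted = draft_ids[:, prompt_len:]                      # the k proposed tokens

    # 2. Verify: a single parallel forward pass of the large model scores
    #    every drafted position at once.
    target_logits = target_model(draft_ids)
    # Logits at position i predict the token at position i + 1, so this slice
    # yields the target's greedy choice for each drafted position plus one
    # bonus prediction after the last draft token.
    target_pred = target_logits[:, prompt_len - 1:, :].argmax(dim=-1)

    # 3. Accept the longest prefix of draft tokens that matches the target,
    #    then append the target's own token at the first mismatch (or the
    #    bonus token if every draft token was accepted).
    matches = (drafted == target_pred[:, :k])[0].long()
    n_accept = int(matches.cumprod(dim=0).sum())
    accepted = drafted[:, :n_accept]
    correction = target_pred[:, n_accept:n_accept + 1]
    return torch.cat([input_ids, accepted, correction], dim=-1)
```

In the best case a single step emits k + 1 tokens for roughly the cost of one target forward pass; in the worst case it still emits one token, so the method never generates fewer tokens per target pass than plain autoregressive decoding.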
The mechanics rely on a careful acceptance-rejection scheme that preserves the statistical guarantees of the target model. Each draft token x is accepted with probability min(1, p(x)/q(x)), where q is the draft model's distribution and p is the target model's distribution at that position; when a draft token is rejected, a replacement is sampled from the corrected distribution proportional to max(0, p − q), which ensures the final output is mathematically equivalent to sampling directly from the target model. This lossless property is critical: speculative decoding is not an approximation but an exact method, making it suitable for production systems where output fidelity is non-negotiable. Variants such as speculative sampling, assisted generation, and Medusa extend the core idea using multiple draft heads or retrieval-based candidates.
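A minimal sketch of that acceptance-rejection rule for a single position is shown below, assuming the target distribution p and draft distribution q are already available as probability vectors over the vocabulary; the function and variable names are illustrative rather than drawn from any library.

```python
import torch

def accept_or_resample(p, q, draft_token, generator=None):
    """Acceptance-rejection step for one draft token (illustrative sketch).

    `p` and `q` are 1-D probability vectors over the vocabulary from the
    target and draft models at the same position; `draft_token` is the token
    id the draft model sampled from `q`.
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    accept_prob = torch.clamp(p[draft_token] / q[draft_token], max=1.0)
    if torch.rand((), generator=generator) < accept_prob:
        return draft_token, True

    # On rejection, resample from the corrected (residual) distribution
    # proportional to max(0, p - q); this makes the overall output token
    # distributed exactly according to p, the target model.
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    new_token = torch.multinomial(residual, num_samples=1, generator=generator).item()
    return new_token, False
```

In a full decoding loop, the first rejected position ends the step and discards the remaining draft tokens, and when every draft token is accepted one extra token can be sampled from the target distribution at the final position.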
Speculative decoding matters because inference latency is a primary bottleneck in deploying LLMs at scale. As models grow into the hundreds of billions of parameters, generating even a short response can take seconds, limiting real-time applications. By parallelizing token verification across modern GPU architectures, speculative decoding reclaims much of the hardware's idle capacity during memory-bound autoregressive generation. It integrates cleanly with other optimizations like quantization and KV-cache management, making it a practical and widely adopted component of modern LLM serving stacks.
The technique was formalized and brought to prominence in 2023 through concurrent work from Google and other research groups, though the underlying intuition connects to speculative execution in computer architecture. Its rapid adoption across inference frameworks such as Hugging Face's Transformers and vLLM reflects how directly it addresses the cost and latency pressures of serving frontier language models.