Methods that predict multiple candidate tokens in parallel to accelerate text generation.
Token speculation techniques are inference-time strategies that allow language models to generate text faster by predicting and evaluating multiple candidate tokens—or even entire sequences of tokens—simultaneously, rather than producing one token at a time in strict autoregressive fashion. The most prominent instantiation is speculative decoding, in which a smaller, faster "draft" model proposes a sequence of candidate tokens that a larger, more capable "verifier" model then accepts or rejects in parallel. Because the verifier can process the entire draft in a single forward pass, accepted tokens come at a fraction of the usual cost, yielding significant wall-clock speedups without any change to the output distribution.
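The accept/reject rule can be made concrete with a short sketch. The following is a toy illustration rather than a production implementation: `VOCAB`, `draft_probs`, `target_probs`, and the random stand-in distributions are assumptions introduced here, and a real system would score all draft positions with the verifier in a single batched forward pass rather than one query per position.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size (assumption for illustration)

def _toy_dist():
    """Random softmax distribution standing in for a real model's output."""
    logits = rng.standard_normal(VOCAB)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def draft_probs(prefix):
    """Stand-in for the small draft model's next-token distribution."""
    return _toy_dist()

def target_probs(prefix):
    """Stand-in for the large verifier model's next-token distribution."""
    return _toy_dist()

def speculative_step(prefix, k=4):
    """One draft-then-verify round: returns the tokens to append to `prefix`."""
    # 1. Draft phase: the small model proposes k tokens autoregressively.
    drafted, draft_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. Verify phase: accept each draft token with probability
    #    min(1, p_target / p_draft); this rule is what keeps the output
    #    distribution identical to sampling from the target alone.
    accepted = []
    for tok, q in zip(drafted, draft_dists):
        p = target_probs(list(prefix) + accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # First rejection: resample from the residual distribution
            # max(0, p - q), renormalized, then end this round.
            residual = np.maximum(p - q, 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted

    # All k drafts accepted: the same verifier pass also yields one bonus token.
    p = target_probs(list(prefix) + accepted)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return accepted

print(speculative_step([1, 2, 3]))
```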
The mechanics rely on the observation that autoregressive sampling is memory-bandwidth-bound rather than compute-bound for large models: the GPU spends most of its time loading model weights rather than performing arithmetic. By batching the verification of several speculative tokens into one pass, speculation amortizes each expensive weight load over multiple tokens and dramatically improves hardware utilization. Acceptance rates—the fraction of draft tokens the verifier keeps—depend on how well the draft model approximates the target distribution, so choosing or training an appropriate draft model is central to practical performance gains.
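To see why acceptance rates drive the speedup, a rough back-of-envelope helps. The sketch below assumes a constant per-token acceptance probability `alpha` and ignores the draft model's own cost, both simplifications; with those assumptions, each verifier pass yields a geometric-series expectation of accepted tokens plus one bonus token.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per verifier pass with draft length k,
    assuming each draft token is independently accepted with probability alpha."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.6, 0.8, 0.9):
    for k in (2, 4, 8):
        print(f"alpha={alpha:.1f}, k={k}: "
              f"{expected_tokens_per_pass(alpha, k):.2f} tokens per target pass")
```

The numbers illustrate the point in the text: a draft model that matches the target well (high `alpha`) turns each expensive verifier pass into several output tokens, while a poor draft model barely beats plain autoregressive decoding.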
Beyond the classic draft-then-verify paradigm, researchers have explored self-speculation (where the same model generates drafts using early-exit layers), tree-structured speculation (where multiple draft branches are verified simultaneously), and retrieval-augmented speculation (where common n-grams are proposed from a lookup table). Each variant trades off draft quality, memory overhead, and implementation complexity differently, making the choice highly context-dependent.
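As a flavor of the retrieval-augmented variant, the sketch below builds a lookup table from text the model has already seen (here, just the prompt) and proposes the continuation of the most recent n-gram as the draft, in the spirit of prompt-lookup decoding. The function names and parameters are illustrative, not any particular library's API, and word-level "tokens" are used only for readability.

```python
from collections import defaultdict

def build_ngram_table(tokens, n=2, span=4):
    """Map each n-gram in `tokens` to the `span` tokens that followed it."""
    table = defaultdict(list)
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i + n])
        table[key].append(tokens[i + n:i + n + span])
    return table

def propose_draft(context, table, n=2):
    """Return a candidate continuation for the current context, or [] if none."""
    key = tuple(context[-n:])
    continuations = table.get(key, [])
    return continuations[0] if continuations else []

prompt = "the cat sat on the mat and the cat sat on".split()
table = build_ngram_table(prompt)
print(propose_draft(prompt, table))  # ['the', 'mat', 'and', 'the']
```

Drafts proposed this way still go through the same verifier accept/reject step, so the lookup table only has to be cheap and occasionally right, not accurate.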
Token speculation has become increasingly important as large language models are deployed in latency-sensitive applications such as real-time chat, code completion, and interactive agents. Because the technique is lossless—the output distribution is provably identical to standard sampling—it offers a rare free lunch in systems engineering: faster responses with no degradation in generation quality. As model sizes continue to grow, speculation-based inference is widely expected to become a standard component of production serving stacks.