A technique where a single model drafts and verifies tokens to accelerate inference.
Self-speculative decoding is an inference acceleration technique for large language models (LLMs) in which a single model acts as both the draft model and the verification model. Unlike standard speculative decoding, which requires a separate smaller model to generate candidate tokens, self-speculative decoding exploits the target model's own internal representations—typically by skipping certain intermediate layers during the drafting phase—to produce candidate token sequences cheaply before verifying them with a full forward pass. This eliminates the need to maintain and coordinate two separate models while still capturing the core speedup benefit of speculative decoding.
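To make the layer-skipping idea concrete, here is a minimal PyTorch sketch of a toy decoder whose same weights serve both roles: a `skip_layers` flag bypasses every other block to act as the cheap draft mode. The class name `ToyLM`, the flag, and the every-other-layer schedule are illustrative assumptions, not any real model's API; in practice, which layers to skip is typically chosen offline for the specific model.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Toy causal LM in which the SAME weights serve as draft and verifier.
    `skip_layers=True` bypasses every other block as a cheap draft mode.
    Purely illustrative; real systems select the skipped layers offline."""

    def __init__(self, vocab=100, dim=32, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids, skip_layers=False):
        # Causal mask so each position attends only to its past.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.shape[-1])
        h = self.embed(ids.unsqueeze(0))               # (1, seq, dim)
        for i, block in enumerate(self.blocks):
            if skip_layers and i % 2 == 1:             # draft mode: half depth
                continue
            h = block(h, src_mask=mask)
        return self.head(h).squeeze(0)                 # (seq, vocab)
```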
The mechanism works in two stages. First, the model runs in a lightweight "draft" mode, often by bypassing a subset of its transformer layers, to rapidly generate several candidate tokens in sequence. Second, the full model verifies all candidate tokens in a single parallel forward pass, accepting the longest prefix of candidates that agrees with what the complete model would have produced and discarding everything from the first mismatch onward; the verification pass also supplies the full model's own token at that position, so every step makes progress even when all drafts are rejected. Because verification is parallelized across candidates, the approach can yield significant wall-clock speedups, often 1.5× to 2×, without any change to the output distribution, preserving the model's original quality guarantees.
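Below is a hedged sketch of one draft-then-verify step under greedy decoding, using the hypothetical `ToyLM` above; the function name `self_speculative_step` and its interface are assumptions for illustration. It drafts `k` tokens in skip-layer mode, verifies them with one full pass, accepts the longest matching prefix, and appends the full model's token at the first mismatch (or a bonus token when all drafts pass), so each step is guaranteed to advance.

```python
import torch

@torch.no_grad()
def self_speculative_step(model, input_ids, k=4):
    """One draft-then-verify step (greedy). `model` follows the hypothetical
    ToyLM interface above: model(ids, skip_layers=...) -> (seq, vocab) logits."""
    n = input_ids.shape[0]

    # Stage 1: draft k tokens sequentially in cheap skip-layer mode.
    draft_ids = input_ids
    for _ in range(k):
        logits = model(draft_ids, skip_layers=True)
        draft_ids = torch.cat([draft_ids, logits[-1].argmax().view(1)])

    # Stage 2: one full forward pass scores all k drafts in parallel.
    full_logits = model(draft_ids, skip_layers=False)       # (n + k, vocab)
    full_preds = full_logits[n - 1 : n + k - 1].argmax(-1)  # full model's choice
    drafts = draft_ids[n:]                                  # at each drafted slot

    # Accept the longest prefix on which draft and full model agree.
    accepted = int((drafts == full_preds).int().cumprod(0).sum())

    # Guaranteed progress: take the full model's token at the first mismatch,
    # or a "bonus" token when every draft was accepted.
    if accepted < k:
        fix = full_preds[accepted : accepted + 1]
    else:
        fix = full_logits[-1].argmax().view(1)
    return torch.cat([input_ids, drafts[:accepted], fix])
```

A quick usage sketch: create `model = ToyLM().eval()` and a prompt `ids = torch.tensor([3, 17, 42])`, then call `ids = self_speculative_step(model, ids)` in a loop. Because acceptance is exact-match under greedy decoding, the output is identical to running the full model alone; the speedup depends on how many drafts are accepted per full pass.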
Self-speculative decoding matters because deploying a separate draft model introduces practical burdens: the draft model must be carefully matched to the target model's vocabulary and output distribution, and both models must reside in memory simultaneously. By collapsing draft and verification into one model, self-speculative decoding reduces memory overhead and simplifies deployment pipelines, making it especially attractive for resource-constrained inference environments. It also sidesteps the need to train or fine-tune a companion draft model.
The technique emerged from the broader speculative decoding literature around 2023, as researchers sought ways to make LLM inference faster without sacrificing output fidelity. It draws together ideas from early-exit networks, layer-skipping methods, and speculative execution, and has become an increasingly common strategy in production LLM serving systems where latency and hardware efficiency are critical concerns.