Envisioning is an emerging technology research institute and advisory.


Self-Speculative Decoding

A technique where a single model drafts and verifies tokens to accelerate inference.

Year: 2023 · Generality: 186

Self-speculative decoding is an inference acceleration technique for large language models (LLMs) in which a single model acts as both the draft model and the verification model. Unlike standard speculative decoding, which requires a separate smaller model to generate candidate tokens, self-speculative decoding exploits the target model's own internal representations—typically by skipping certain intermediate layers during the drafting phase—to produce candidate token sequences cheaply before verifying them with a full forward pass. This eliminates the need to maintain and coordinate two separate models while still capturing the core speedup benefit of speculative decoding.

The mechanism works in two stages. First, the model runs in a lightweight "draft" mode, often by bypassing a subset of its transformer layers, to rapidly generate several candidate tokens in sequence. Second, the full model verifies all candidate tokens in a single parallel forward pass, accepting those that match what the complete model would have produced and discarding the rest. Because verification is parallelized across candidates, the approach can yield significant wall-clock speedups—often 1.5× to 2×—without any change to the output distribution, preserving the model's original quality guarantees.
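The two-stage loop above can be sketched in a few lines. This is a minimal, illustrative sketch only: a toy deterministic function stands in for the model, `draft_next` plays the role of the layer-skipping draft pass, and `full_next` the full forward pass. All names and the toy token rules are assumptions for demonstration; a real implementation would verify the k draft positions in one batched forward pass rather than a Python loop.

```python
# Toy sketch of self-speculative decoding with greedy acceptance.
# In a real LLM, draft_next would run the transformer with some layers
# skipped and full_next would run every layer; both names are illustrative.

def full_next(seq):
    # "Full model": next token depends on the last two tokens (toy rule).
    return (seq[-1] * 31 + (seq[-2] if len(seq) > 1 else 7)) % 101

def draft_next(seq):
    # "Draft mode": a cheaper approximation, as if layers were skipped.
    # It agrees with the full model except when the last token is large.
    if seq[-1] > 90:
        return (seq[-1] * 31) % 101   # diverges from the full model here
    return full_next(seq)             # agrees with the full model here

def self_speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    target = len(prompt) + n_tokens
    while len(seq) < target:
        # Stage 1: draft k candidate tokens cheaply, one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Stage 2: verify every candidate against the full model. On real
        # hardware this is a single parallel forward pass over k positions.
        accepted = 0
        for i in range(k):
            if full_next(seq + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        seq.extend(draft[:accepted])
        # On a rejection, emit the full model's token at that position, so
        # the output is identical to plain greedy decoding with full_next.
        if accepted < k and len(seq) < target:
            seq.append(full_next(seq))
    return seq[:target]

# Equivalence check: the speculative loop reproduces plain greedy decoding.
out = self_speculative_decode([5, 12], 10)
ref = [5, 12]
while len(ref) < 12:
    ref.append(full_next(ref))
assert out == ref
```

The final assertion is the point of the technique: because accepted tokens are exactly those the full model would have produced, and each rejection is replaced by the full model's own token, the output distribution is unchanged; only the number of full-model calls shrinks.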

Self-speculative decoding matters because deploying a separate draft model introduces practical burdens: the draft model must be carefully matched to the target model's vocabulary and output distribution, and both models must reside in memory simultaneously. By collapsing draft and verification into one model, self-speculative decoding reduces memory overhead and simplifies deployment pipelines, making it especially attractive for resource-constrained inference environments. It also sidesteps the need to train or fine-tune a companion draft model.

The technique emerged from the broader speculative decoding literature around 2023, as researchers sought ways to make LLM inference faster without sacrificing output fidelity. It represents a practical convergence of ideas from early-exit networks, layer-skipping methods, and speculative execution, and has quickly become a relevant strategy for production LLM serving systems, where latency and hardware efficiency are critical concerns.

Related

Speculative Decoding

A technique that accelerates LLM inference by drafting and verifying token sequences in parallel.

Generality: 520
Token Speculation Techniques

Methods that predict multiple candidate tokens in parallel to accelerate text generation.

Generality: 450
Speculative Edits

Proactively generating candidate edits before confirmation to reduce latency and improve efficiency.

Generality: 450
Disaggregated Inference

A model serving technique that separates LLM inference into specialized hardware for prompt processing (prefill) and token generation (decode), enabling higher throughput and lower power at the cost of deployment flexibility.

Generality: 710
Self-Reasoning Token

Specialized tokens that train language models to anticipate and plan for future outputs.

Generality: 104
Greedy Decoding

A sequence generation strategy that always selects the single most probable next token.

Generality: 601