
Disaggregated Inference

A model serving technique that splits LLM inference across specialized hardware for prompt processing (prefill) and token generation (decode), enabling higher throughput and lower power draw at the cost of deployment flexibility.

Year: 2024
Generality: 710

Disaggregated inference is a model serving technique that separates traditional LLM inference into two computationally distinct phases — prompt processing (prefill) and token generation (decode) — each running on purpose-built hardware rather than shared infrastructure. Prefill processes the entire input prompt in parallel, making it compute-intensive but relatively light on memory bandwidth. Decode generates output tokens sequentially, one at a time, making it memory-bandwidth intensive and inherently serial. Because the two phases have opposite hardware profiles, disaggregation lets each run on silicon optimized for its own compute pattern.
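
As a rough illustration, the split can be pictured as two worker pools passing a KV cache between them. The sketch below is a toy model, not a real serving framework: the PrefillWorker and DecodeWorker classes and their internals are hypothetical stand-ins meant only to show the handoff, with prefill building the cache in one parallel pass and decode consuming it one token at a time.

    from dataclasses import dataclass, field

    @dataclass
    class KVCache:
        """Key/value attention state handed off from the prefill pool to the decode pool."""
        prompt_tokens: list[int]
        entries: list[tuple[int, int]] = field(default_factory=list)

    class PrefillWorker:
        """Runs on compute-optimized hardware: processes the whole prompt in parallel."""
        def run(self, prompt_tokens: list[int]) -> KVCache:
            # In a real system this is one large batched forward pass over the prompt.
            cache = KVCache(prompt_tokens)
            cache.entries = [(t, t * 2) for t in prompt_tokens]  # toy stand-in for K/V pairs
            return cache

    class DecodeWorker:
        """Runs on bandwidth-optimized hardware: emits output tokens one at a time."""
        def step(self, cache: KVCache, last_token: int) -> int:
            # Each step reads the whole cache (memory-bandwidth bound) to produce one token.
            next_token = (last_token + len(cache.entries)) % 50_000  # toy decode logic
            cache.entries.append((next_token, next_token * 2))
            return next_token

    def serve(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
        # Phase 1: prefill on its own pool, then ship the KV cache to the decode pool.
        cache = PrefillWorker().run(prompt_tokens)
        # Phase 2: sequential decode on separate, bandwidth-optimized hardware.
        decoder, token, output = DecodeWorker(), prompt_tokens[-1], []
        for _ in range(max_new_tokens):
            token = decoder.step(cache, token)
            output.append(token)
        return output

    print(serve([101, 7, 42], max_new_tokens=5))

In a real deployment the handoff itself matters: the KV cache produced by prefill must be transferred to the decode pool, typically over a fast interconnect, before generation can begin.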

The appeal is throughput and power efficiency. A system tuned for decode — wide memory bandwidth, large caches — can generate tokens faster per watt than a general-purpose chip doing both jobs. Hyperscalers with massive, heterogeneous fleets benefit most: when workload characteristics shift, they can route traffic across processors to maintain utilization. Enterprises and neoclouds with fixed hardware deployments and long depreciation schedules face a harder calculation — if traffic patterns change, their fixed prefill/decode ratio becomes a liability, stranding capacity and inflating costs.

The tradeoff is structural inflexibility. The prefill-to-decode hardware ratio is locked in at deployment time and stays fixed for the five-to-six-year life of the physical infrastructure. For predictable, stable workloads — consistent input/output ratios, stable cache hit rates — disaggregation delivers exceptional value. For rapidly evolving AI workloads, where traffic characteristics shift frequently, the rigidity may outweigh the efficiency gains.
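
To see why the locked ratio matters, a back-of-envelope sizing sketch helps. All figures below (token rates, prompt and output lengths) are illustrative assumptions rather than measurements; the point is only that the hardware ratio which keeps both pools busy moves dramatically when the traffic mix changes.

    def required_ratio(in_tokens: float, out_tokens: float,
                       prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
        """Prefill-to-decode hardware ratio that keeps both pools equally utilized.

        Throughput numbers passed in are illustrative assumptions, not vendor specs.
        """
        prefill_time = in_tokens / prefill_tok_per_s   # seconds of prefill work per request
        decode_time = out_tokens / decode_tok_per_s    # seconds of decode work per request
        return prefill_time / decode_time

    # Fleet sized for a long-prompt, short-answer mix (e.g. retrieval-heavy summarization)...
    print(required_ratio(in_tokens=8000, out_tokens=300,
                         prefill_tok_per_s=40_000, decode_tok_per_s=150))   # ~0.10

    # ...is badly mismatched if traffic shifts toward long generations (e.g. agentic chains).
    print(required_ratio(in_tokens=1000, out_tokens=2000,
                         prefill_tok_per_s=40_000, decode_tok_per_s=150))   # ~0.002

Under these hypothetical numbers, a fleet provisioned for the first mix would leave most of its prefill capacity idle under the second, which is exactly the stranded-capacity risk described above.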

The technology is early. How much of the AI data center market will adopt disaggregation long-term remains genuinely uncertain — the history of computer architecture suggests general-purpose solutions often win out when workloads are volatile, while specialization wins when they stabilize.

Related

  • Inference Scaling: Improving model outputs by allocating more compute during inference rather than during training. (Generality: 812)
  • Speculative Decoding: A technique that accelerates LLM inference by drafting and verifying token sequences in parallel. (Generality: 520)
  • Inference Acceleration: Techniques and hardware that speed up neural network prediction without sacrificing accuracy. (Generality: 694)
  • Self-Speculative Decoding: A technique where a single model drafts and verifies tokens to accelerate inference. (Generality: 186)
  • Inference-Time Reasoning: A trained model's process of applying learned knowledge to generate outputs on new data. (Generality: 751)
  • Inference: Using a trained model to generate predictions or decisions on new, unseen data. (Generality: 875)