A model serving technique that separates LLM inference into specialized hardware for prompt processing (prefill) and token generation (decode), enabling higher throughput and lower power at the cost of deployment flexibility.
Disaggregated inference is a model serving technique that splits LLM inference into two computationally distinct phases, prompt processing (prefill) and token generation (decode), each running on purpose-built hardware rather than shared infrastructure. Prefill processes all prompt tokens in parallel, making it compute-intensive but relatively light on memory bandwidth. Decode generates output tokens one at a time, each step reading the model weights and the accumulated KV cache, making it memory-bandwidth-intensive and inherently serial. Because the two phases have opposite hardware requirements, disaggregation lets each run on silicon optimized for its specific compute pattern.
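The split described above can be sketched as a toy pipeline. This is illustrative Python only: the worker classes, the token arithmetic, and the hand-off shown here are hypothetical stand-ins, not any real serving stack's API.

```python
# Toy model of disaggregated inference: a prefill pool builds per-request
# KV-cache state in one parallel pass, then hands it to a decode pool that
# generates tokens serially. All names and logic here are illustrative.

from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request key/value state produced by prefill, consumed by decode."""
    tokens: list = field(default_factory=list)


class PrefillWorker:
    """Processes the whole prompt in one parallel pass (compute-bound)."""

    def run(self, prompt_tokens: list) -> KVCache:
        # In a real system this is a single batched forward pass over all
        # prompt tokens at once; here we just record them as cached state.
        return KVCache(tokens=list(prompt_tokens))


class DecodeWorker:
    """Generates output tokens one at a time (memory-bandwidth-bound)."""

    def run(self, cache: KVCache, max_new_tokens: int) -> list:
        output = []
        for _ in range(max_new_tokens):
            # Each step reads the full cache (and, in a real model, all the
            # weights) to produce exactly one token -- inherently serial.
            next_token = (sum(cache.tokens) + len(cache.tokens)) % 50_000
            cache.tokens.append(next_token)
            output.append(next_token)
        return output


# The KV cache is the hand-off point: in a disaggregated deployment it is
# transferred over the network from the prefill pool to the decode pool.
prompt = [101, 2054, 2003, 102]
cache = PrefillWorker().run(prompt)
completion = DecodeWorker().run(cache, max_new_tokens=8)
```

The detail worth noticing is the hand-off: the KV cache produced by prefill must move to whichever machine runs decode, which is why interconnect bandwidth between the two pools matters in real deployments.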
The appeal is throughput and power efficiency. A system tuned for decode, with high memory bandwidth and large caches, can generate tokens faster per watt than a general-purpose chip doing both jobs. Hyperscalers with massive, heterogeneous fleets benefit most: when workload characteristics shift, they can route traffic across processor types to maintain utilization. Enterprises and neoclouds with fixed hardware deployments and long depreciation schedules face a harder calculation: if traffic patterns change, their fixed prefill-to-decode ratio becomes a liability, stranding capacity and inflating costs.
The tradeoff is structural inflexibility. The prefill-to-decode hardware ratio is locked in at deployment time and persists for the infrastructure's five-to-six-year life. For predictable, stable workloads, with consistent input/output ratios and stable cache hit rates, disaggregation delivers exceptional value. For rapidly evolving AI workloads, where traffic characteristics shift frequently, the rigidity may outweigh the efficiency gains.
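The ratio lock-in can be made concrete with back-of-envelope arithmetic. All throughput and token-count figures below are hypothetical, chosen only to illustrate how a shift in traffic shape strands capacity.

```python
# Sketch: the prefill:decode chip ratio that keeps both pools fully busy
# depends on per-request time spent in each phase. All numbers hypothetical.

def required_ratio(avg_input_tokens: float, avg_output_tokens: float,
                   prefill_tok_per_sec: float, decode_tok_per_sec: float) -> float:
    """Prefill:decode hardware ratio that balances the two pools."""
    prefill_time = avg_input_tokens / prefill_tok_per_sec   # seconds per request
    decode_time = avg_output_tokens / decode_tok_per_sec    # seconds per request
    return prefill_time / decode_time


# Fleet sized for long-prompt, short-answer traffic (e.g. retrieval-heavy):
# 8000 input tokens, 300 output tokens per request.
deployed = required_ratio(8000, 300, prefill_tok_per_sec=40_000,
                          decode_tok_per_sec=100)

# Traffic later shifts toward long generations (e.g. agentic workloads):
# 2000 input tokens, 2000 output tokens per request.
needed = required_ratio(2000, 2000, prefill_tok_per_sec=40_000,
                        decode_tok_per_sec=100)

# With the ratio fixed at deployment, the prefill pool is now heavily
# oversized relative to decode demand: stranded capacity.
oversize = deployed / needed
```

Under these made-up numbers the deployed prefill pool ends up more than an order of magnitude larger than the new traffic mix requires, which is the "stranded capacity" risk the paragraph above describes.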
The technology is early. How much of the AI data center market will adopt disaggregation long-term remains genuinely uncertain — the history of computer architecture suggests general-purpose solutions often win out when workloads are volatile, while specialization wins when they stabilize.