Storing prompt-response pairs to avoid redundant computation in large language model systems.
Prompt caching is an optimization technique used in large language model (LLM) deployments that stores previously computed prompt-response pairs so that identical or near-identical inputs can be served from memory rather than reprocessed by the model. LLM inference is computationally expensive, requiring billions of floating-point operations for every token processed, so recomputing responses to frequently repeated prompts wastes significant resources. By maintaining a cache of inputs and their corresponding outputs, systems can return stored results almost instantaneously, dramatically reducing latency and infrastructure costs at scale.
The mechanics of prompt caching vary by implementation. The simplest form uses exact-match hashing: a prompt is hashed, and if the hash exists in the cache, the stored output is returned directly. More sophisticated approaches cache intermediate computations, such as the key-value (KV) attention states generated during a forward pass, allowing the model to resume processing from a saved state rather than starting from scratch. This KV cache approach is especially valuable for long system prompts or shared context prefixes that appear across many user queries, since only the unique portion of each request needs fresh computation.
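As a concrete illustration, here is a minimal sketch of the exact-match form in Python. The `generate` callable stands in for the expensive model call, and all names are illustrative rather than any particular library's API:

```python
import hashlib
from typing import Callable, Dict, Optional

class ExactMatchPromptCache:
    """Minimal exact-match prompt cache: hash the full prompt and
    serve repeat prompts from memory instead of re-invoking the model."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate          # the expensive model call
        self._store: Dict[str, str] = {}   # prompt hash -> cached response
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        # Hash the exact prompt bytes: any difference, even whitespace,
        # yields a different key and therefore a cache miss.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        cached: Optional[str] = self._store.get(key)
        if cached is not None:
            self.hits += 1
            return cached                  # served from memory, no forward pass
        self.misses += 1
        response = self._generate(prompt)  # cache miss: run the model
        self._store[key] = response
        return response

# Usage with a stand-in for a real model:
cache = ExactMatchPromptCache(lambda p: f"(model output for: {p!r})")
cache.complete("What are your support hours?")  # miss: model invoked
cache.complete("What are your support hours?")  # hit: served from cache
print(cache.hits, cache.misses)                 # -> 1 1
```

In a real deployment the plain dictionary would be replaced by a bounded store with an eviction policy (for example LRU), so the cache does not grow without limit.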
Prompt caching matters most in production environments where certain prompts recur at high frequency: customer service bots answering common questions, coding assistants with fixed system instructions, or retrieval-augmented generation pipelines with stable context blocks. In these settings, cache hit rates can be substantial, translating directly into lower API costs and faster response times. Cloud providers including Anthropic and OpenAI have introduced prompt caching features in their APIs, typically caching the processed prompt prefix rather than full responses, making the technique accessible without requiring custom infrastructure.
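For example, Anthropic's Messages API exposes caching through a `cache_control` marker on a content block. The sketch below follows the publicly documented pattern; the model name, system prompt, and user question are placeholders, not a definitive recipe:

```python
import anthropic

client = anthropic.Anthropic()  # API key read from the ANTHROPIC_API_KEY env var

# Placeholder for a long, stable system prompt shared across many requests.
LONG_SYSTEM_PROMPT = "You are a support assistant. <several thousand tokens of policies>"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder; any caching-capable model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,              # stable prefix
            "cache_control": {"type": "ephemeral"},  # mark the prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)
```

Subsequent requests that begin with the same system prompt can reuse the cached prefix, so only the new user message incurs full processing cost.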
While caching is a foundational concept in computer science, its application to LLM inference became practically significant around 2022–2023 as model sizes grew and inference costs became a dominant concern for organizations deploying AI at scale. The technique sits at the intersection of systems engineering and ML deployment, and its effectiveness depends heavily on prompt design: developers who structure prompts with stable prefixes followed by variable suffixes can maximize cache reuse and minimize unnecessary computation, as in the sketch below.
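A minimal sketch of that prefix-then-suffix structure, using a hypothetical support-bot prompt:

```python
# Keep the shared context byte-identical at the front; append only the
# per-request portion at the end so the cached prefix stays reusable.

STABLE_PREFIX = (
    "You are a support assistant for ExampleCorp.\n"  # hypothetical deployment
    "Policies:\n"
    "1. Never share account numbers.\n"
    "2. Escalate billing disputes to a human agent.\n"
)

def build_prompt(user_question: str) -> str:
    # Variable data goes strictly after the stable prefix; putting a
    # timestamp or user ID before it would invalidate the cached prefix
    # on every request.
    return f"{STABLE_PREFIX}\nUser question: {user_question}\n"

print(build_prompt("How do I reset my password?"))
```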