Envisioning is an emerging technology research institute and advisory.

2011 — 2026

Prompt Caching

Storing prompt-response pairs to avoid redundant computation in large language model systems.

Year: 2022 · Generality: 339

Prompt caching is an optimization technique used in large language model (LLM) deployments that stores previously computed prompt-response pairs so that identical or near-identical inputs can be served from memory rather than reprocessed by the model. Because LLM inference is computationally expensive—requiring billions of floating-point operations per forward pass—recomputing responses to frequently repeated prompts wastes significant resources. By maintaining a cache of inputs and their corresponding outputs, systems can return stored results almost instantaneously, dramatically reducing latency and infrastructure costs at scale.

The mechanics of prompt caching vary by implementation. The simplest form uses exact-match hashing: a prompt is hashed, and if the hash exists in the cache, the stored output is returned directly. More sophisticated approaches cache intermediate computations, such as the key-value (KV) attention states generated during a forward pass, allowing the model to resume processing from a saved state rather than starting from scratch. This KV cache approach is especially valuable for long system prompts or shared context prefixes that appear across many user queries, since only the unique portion of each request needs fresh computation.
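The exact-match form described above can be sketched in a few lines. The `PromptCache` class and `fake_model` stand-in below are illustrative assumptions, not any provider's implementation:

```python
import hashlib

class PromptCache:
    """Minimal exact-match cache: hash the prompt, store the response.
    (Illustrative sketch; real systems add eviction, TTLs, and versioning.)"""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prompt: str, model_fn):
        # Hash the full prompt so lookups stay O(1) regardless of its length.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]       # served from memory, no model call
        self.misses += 1
        response = model_fn(prompt)       # the expensive forward pass
        self._store[key] = response
        return response

# Hypothetical stand-in for a real LLM call.
def fake_model(prompt: str) -> str:
    return "canned answer for: " + prompt

cache = PromptCache()
first = cache.get_or_compute("What are your hours?", fake_model)
second = cache.get_or_compute("What are your hours?", fake_model)  # cache hit
```

The second call returns the stored response without invoking the model, which is the entire cost saving; the KV-cache variant generalizes this by matching on shared prefixes rather than whole prompts.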

Prompt caching matters most in production environments where certain prompts recur at high frequency—customer service bots answering common questions, coding assistants with fixed system instructions, or retrieval-augmented generation pipelines with stable context blocks. In these settings, cache hit rates can be substantial, translating directly into lower API costs and faster response times. Model providers including Anthropic and OpenAI have introduced explicit prompt caching features in their APIs, making the technique accessible without requiring custom infrastructure.
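As one illustration of a provider-level feature, Anthropic's Messages API lets a request mark a long, stable system block with a `cache_control` field so the provider can cache it across calls (OpenAI instead applies prefix caching automatically, with no per-block marker). The payload below only sketches that shape; the model name and instruction text are placeholders, and current provider documentation should be consulted for exact field names:

```python
# Illustrative request body for Anthropic-style explicit prompt caching.
LONG_STABLE_INSTRUCTIONS = (
    "You are a support agent. (Imagine pages of policy text here.)"
)
user_question = "Where is my order?"

request = {
    "model": "claude-placeholder",      # placeholder, not a real model id
    "max_tokens": 512,
    "system": [
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,        # large, reused prefix
            "cache_control": {"type": "ephemeral"},  # ask the API to cache it
        }
    ],
    "messages": [
        {"role": "user", "content": user_question}   # varies per request
    ],
}
```

Only the `messages` portion changes between requests; the cached system block is billed and processed at a reduced rate on subsequent calls.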

While caching is a foundational concept in computer science, its application to LLM inference became practically significant around 2022–2023 as model sizes grew and inference costs became a dominant concern for organizations deploying AI at scale. The technique sits at the intersection of systems engineering and ML deployment, and its effectiveness depends heavily on prompt design—developers who structure prompts with stable prefixes followed by variable suffixes can maximize cache reuse and minimize unnecessary computation.
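The payoff of that stable-prefix design can be shown with a toy cost model, where whitespace-split words stand in for tokens and a matching cached prefix is assumed to cost nothing to recompute (all names and text below are illustrative):

```python
def tokens_to_compute(prompt: str, cached_prefixes: set) -> int:
    """Toy cost model of KV-prefix reuse: if a cached prefix matches the
    start of the prompt, only the remaining suffix needs fresh computation.
    Whitespace-split words stand in for real tokenizer output."""
    reused = 0
    for prefix in cached_prefixes:
        if prompt.startswith(prefix):
            reused = max(reused, len(prefix.split()))
    return len(prompt.split()) - reused

# A stable system prefix shared by many requests (illustrative text).
SYSTEM = "You are a support agent for Acme Corp. Follow policy strictly."
cached = {SYSTEM}

# Stable prefix first, variable suffix last: only the user's words are new.
cost_good = tokens_to_compute(SYSTEM + " User asks: where is my order", cached)

# Variable text first breaks prefix matching, so everything is recomputed.
cost_bad = tokens_to_compute("User asks: where is my order. " + SYSTEM, cached)
```

Here `cost_good` counts only the six words of the user's question, while `cost_bad` counts the full prompt, which is why providers advise placing volatile content at the end.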

Related

Prompt

A text input given to a language model to elicit a desired response.

Generality: 796
Prompt Engineering

Crafting input text strategically to elicit desired outputs from AI language models.

Generality: 694
System Prompt Learning

Automatically optimizing persistent model instructions to steer behavior without full retraining.

Generality: 520
Super Prompting

Crafting highly specific input prompts to steer AI models toward desired outputs.

Generality: 450
System Prompt

Hidden instructions given to a language model that shape its behavior and persona.

Generality: 620
Prompt Chaining

Linking sequential prompts so each output feeds the next, enabling complex multi-step reasoning.

Generality: 463