Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Prompt caching is a technique for large language models (LLMs) that stores the computed prefixes of repeated prompts to avoid recomputation, significantly reducing both cost and latency on subsequent requests. It works by detecting matching initial prompt segments across API calls and reusing the key-value cache from prior inference instead of regenerating it.
When an LLM processes a prompt, it computes internal representations (the key-value pairs used by the attention mechanism) for every token. If a subsequent request shares the same prefix, prompt caching lets the model skip recomputing those representations and load them from a cache instead. The savings scale with the length of the shared prefix and the frequency of repeated requests.
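The mechanism can be illustrated with a toy sketch. Everything here is hypothetical scaffolding: compute_kv stands in for the real per-token attention computation, and the cache is a plain dictionary rather than GPU memory.

```python
# Toy illustration of prefix-based KV caching; not a real inference engine.
# compute_kv is a hypothetical stand-in for the expensive per-token
# attention computation whose results the cache lets us reuse.

def compute_kv(token):
    """Stand-in for computing one token's key/value attention entries."""
    return ("kv", token)

class PrefixCache:
    def __init__(self):
        self._store = {}  # tuple of prefix tokens -> list of KV entries

    def process(self, tokens):
        """Return KV entries for tokens, recomputing only the uncached suffix."""
        # Find the longest already-cached prefix of this prompt.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._store:
                best = n
                break
        kvs = list(self._store.get(tuple(tokens[:best]), []))
        recomputed = 0
        for token in tokens[best:]:  # only the new suffix is computed
            kvs.append(compute_kv(token))
            recomputed += 1
        # Cache every prefix so future requests can match partial overlaps.
        for n in range(1, len(tokens) + 1):
            self._store[tuple(tokens[:n])] = kvs[:n]
        return kvs, recomputed

cache = PrefixCache()
_, cold = cache.process(["system", "document", "question-1"])  # computes 3
_, warm = cache.process(["system", "document", "question-2"])  # computes 1
```

The second request recomputes only its final token because the ["system", "document"] prefix is already cached, which is exactly why savings grow with prefix length and request frequency.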
Anthropic provides explicit control over caching via the cache_control parameter with a value of {"type": "ephemeral"}, used to mark cacheable sections such as system prompts, large documents, or tool schemas.
Key characteristics:
- Caching is opt-in: cache breakpoints are placed explicitly with cache_control.
- Cache entries are ephemeral, with a short time-to-live (about five minutes by default) that is refreshed on each use.
- Cache writes are billed at a premium over the base input price, while cache reads are billed at a steep discount.
- A minimum cacheable prefix length applies (on the order of 1,024 tokens for the larger Claude models).
A common pattern is to cache a large document in the system prompt while varying the user message across requests.
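A sketch of that pattern as an Anthropic Messages API request body. The model name and document text are placeholders; the point is the placement of cache_control on the large static block, while only the user message changes per request.

```python
# Sketch of an Anthropic Messages API request that caches a large document
# in the system prompt. Model name and document text are placeholders.
LARGE_DOCUMENT = "..."  # e.g. a long reference document

def build_request(user_question):
    """Build a request whose static prefix is marked as cacheable."""
    return {
        "model": "claude-3-5-sonnet-latest",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "You answer questions about the document."},
            {
                "type": "text",
                "text": LARGE_DOCUMENT,
                # Marks the end of the cacheable prefix: everything up to and
                # including this block can be reused across requests.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": user_question}],
    }

# Each call shares the identical cached prefix; only the question varies.
req1 = build_request("What is the refund policy?")
req2 = build_request("Summarize section 2.")
```

With the real SDK this dictionary would be passed to the messages-create endpoint; cache activity is then reported in the response's usage fields.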
OpenAI caches automatically, with no code changes required, for prompts of 1,024 tokens or more, matching prefixes in 128-token increments.
Key characteristics:
- No API parameter is needed; caching applies transparently to qualifying prompts.
- Cached input tokens are billed at a discount, and cache writes carry no surcharge.
- Entries expire after a few minutes of inactivity rather than persisting indefinitely.
- Cache activity is reported per response via the cached_tokens field in the usage details.
In testing, repeating identical requests yields cache hit rates of approximately 50%.
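That hit rate can be measured from the usage block returned with each response. The helper below is a sketch; the field names follow OpenAI's chat completions usage object, but the sample values are illustrative, not real API output.

```python
# Compute the fraction of prompt tokens served from the cache, given the
# usage object returned with an OpenAI chat completion (as a plain dict).

def cached_fraction(usage):
    """Return cached prompt tokens / total prompt tokens (0.0 if none)."""
    total = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / total if total else 0.0

# Illustrative sample: half of a 2,048-token prompt hit the cache.
sample_usage = {
    "prompt_tokens": 2048,
    "prompt_tokens_details": {"cached_tokens": 1024},
    "completion_tokens": 120,
}
print(cached_fraction(sample_usage))  # → 0.5
```

Tracking this ratio per request is a cheap way to confirm that prompts are actually structured with a stable prefix.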
Google Gemini supports context caching for repeated contexts such as long documents, with cache usage tracked via usage_metadata in responses. Caching is automatic for recent Gemini models, and cached tokens are billed at a discount, which makes reuse attractive for production workloads.
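As a sketch, cache reuse can be summarized from that usage metadata. The field names follow the Gemini API's usage_metadata object (represented here as a plain dict with illustrative values, not real API output):

```python
# Summarize cache reuse from a Gemini response's usage metadata.

def cache_summary(usage_metadata):
    """Return (cached tokens, freshly computed tokens) for the prompt."""
    total = usage_metadata.get("prompt_token_count", 0)
    cached = usage_metadata.get("cached_content_token_count", 0)
    return cached, total - cached

# Illustrative sample: a mostly static 5,000-token prompt.
sample = {"prompt_token_count": 5000, "cached_content_token_count": 4200}
cached, fresh = cache_summary(sample)  # 4200 cached, 800 freshly computed
```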
Real-world deployments report 70-90% token cost reduction in systems that combine a static prefix with variable queries, such as RAG pipelines, tool-using agents, and chatbots with long system prompts.
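The arithmetic behind numbers in that range can be sketched directly. The price and the cache-read discount below are assumptions chosen for illustration, not any provider's actual rates.

```python
# Illustrative input-token cost comparison for a workload with a large
# static prefix. Price and discount are assumptions, not real rates.

PRICE_PER_MTOK = 3.00       # assumed base input price, $ per million tokens
CACHE_READ_DISCOUNT = 0.90  # assumed: cached tokens cost 10% of base

def input_cost(requests, prefix_tokens, variable_tokens, cached=True):
    """Total input-token cost for `requests` calls sharing one static prefix."""
    full_rate = PRICE_PER_MTOK / 1_000_000
    prefix_rate = full_rate * (1 - CACHE_READ_DISCOUNT) if cached else full_rate
    # The first request always pays full price to populate the cache.
    first = (prefix_tokens + variable_tokens) * full_rate
    rest = (requests - 1) * (prefix_tokens * prefix_rate
                             + variable_tokens * full_rate)
    return first + rest

# 10,000 requests, an 8,000-token static prefix, 200 variable tokens each.
without = input_cost(10_000, 8_000, 200, cached=False)
with_cache = input_cost(10_000, 8_000, 200, cached=True)
saving = 1 - with_cache / without  # ≈ 0.88 under these assumptions
```

Under these assumed rates, a prefix-heavy workload saves roughly 88% on input tokens; the saving shrinks as the variable portion of the prompt grows.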
Good candidates:
- Long, static system prompts and instructions reused across many requests
- Large reference documents or retrieved context (as in RAG pipelines) paired with varying queries
- Tool and function schemas sent with every call in tool-using agents
- Chatbots whose long system prompt precedes a varying conversation
Poor candidates:
- Short prompts below provider minimums (e.g., under 1,024 tokens for OpenAI)
- Prompts whose beginning changes on every request, such as a timestamp or user ID at the top, which breaks prefix matching
- One-off requests unlikely to repeat within the cache's short lifetime