Prompt Caching
Prompt caching is a technique for large language models that stores the computed state of repeated prompt prefixes to avoid recomputation, significantly reducing both cost and latency on subsequent requests. It works by detecting matching initial prompt segments across API calls and reusing the key-value cache from prior inferences instead of regenerating it.
How Prompt Caching Works
When an LLM processes a prompt, it computes internal representations (key-value pairs in the attention mechanism) for every token. If a subsequent request shares the same prefix, prompt caching allows the model to skip recomputing those representations and instead load them from a cache. The savings scale with the length of the shared prefix and the frequency of repeated requests.
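The prefix-matching mechanism can be illustrated with a toy sketch. Real inference engines cache attention key/value tensors keyed by token prefixes; the class below stands in for that with strings, and all names are hypothetical:

```python
# Toy sketch of prefix-based KV cache reuse (illustrative only; real
# engines store attention key/value tensors, and caching every prefix
# as done here would be far too memory-hungry in practice).

class PrefixKVCache:
    def __init__(self):
        self._cache = {}  # token-prefix tuple -> simulated KV state

    def process(self, tokens):
        """Return (tokens_computed, tokens_reused) for a prompt."""
        tokens = tuple(tokens)
        # Find the longest cached prefix of this prompt.
        reused = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._cache:
                reused = n
                break
        # "Compute" state only for the uncached suffix, caching each
        # new prefix so future requests can pick up where this one left off.
        for n in range(reused + 1, len(tokens) + 1):
            self._cache[tokens[:n]] = f"kv-state-{n}"
        return len(tokens) - reused, reused

cache = PrefixKVCache()
system = ["sys"] * 100                        # shared system-prompt tokens
first = cache.process(system + ["q1"])        # cold: computes all 101 tokens
second = cache.process(system + ["q2"])       # warm: reuses the 100-token prefix
```

The second request recomputes only its single novel token, which is exactly why long shared prefixes and frequent repeats dominate the savings.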
Anthropic Implementation
Anthropic provides explicit control over caching via the cache_control parameter with a value of {"type": "ephemeral"} to mark cacheable sections such as system prompts, large documents, or tool schemas.
Key characteristics:
Cache breakpoints: Up to four per prompt, enabling independent caching of sections processed in order: tools, system messages, then user messages
TTL: Five minutes, refreshing on each cache hit
Pricing: Cache reads are billed at roughly 10% of the base input-token price (about a 90% discount), while cache writes carry a 25% premium over base input tokens
Metrics: Responses include cache_creation_input_tokens (misses), cache_read_input_tokens (hits), and input_tokens for monitoring
Minimum size: Cacheable content must be at least 1,024 tokens for Claude Opus and Sonnet models, and 2,048 tokens for Claude Haiku models
A common pattern is caching a large document in the system prompt while varying user messages across requests.
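That pattern can be sketched as a plain request body for the Messages API (a dict rather than SDK calls, so the cache_control placement is visible; the document text and model string are placeholders):

```python
# Sketch of an Anthropic Messages API request body that caches a large
# document in the system prompt. The cache_control shape follows
# Anthropic's prompt-caching documentation; values are illustrative.
LARGE_DOCUMENT = "..."  # stand-in for a >1,024-token document

def build_request(user_question):
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "You answer questions about the document."},
            {
                "type": "text",
                "text": LARGE_DOCUMENT,
                # Cache breakpoint: everything up to and including this
                # block is cached (ephemeral TTL, refreshed on each hit).
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": user_question}],
    }

# Two requests share the cached system prefix; only the user message varies.
req_a = build_request("Summarize section 2.")
req_b = build_request("List the key dates.")
```

Because the system block is byte-identical across requests, the second call should report its tokens under cache_read_input_tokens rather than cache_creation_input_tokens.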
OpenAI Implementation
OpenAI uses automatic caching with no code changes needed for prompts of 1,024 tokens or more, checking prefixes in 128-token increments.
Key characteristics:
Automatic detection: The system auto-detects matching prefixes with best-effort reuse
Optional controls: prompt_cache_key and prompt_cache_retention parameters for customization
Pricing: 50% discount on cached input tokens
Metrics: prompt_tokens_details.cached_tokens in API responses
Cache window: Short-lived; entries are typically evicted after 5-10 minutes of inactivity, and always within an hour
In testing, repeating identical requests yields approximately 50% cache hit rates.
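Since caching is automatic, the main integration work on the OpenAI side is monitoring. A minimal sketch of reading the cached fraction from a response's usage block (field names follow the API reference; the sample numbers are made up):

```python
# Sketch: computing what fraction of an OpenAI chat completion's prompt
# was served from the cache, using the usage field from the response.
def cached_fraction(usage):
    cached = usage["prompt_tokens_details"]["cached_tokens"]
    return cached / usage["prompt_tokens"]

# Example usage payload as it might appear on a partial cache hit
# (illustrative values; cached_tokens comes back in 128-token increments).
usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
frac = cached_fraction(usage)
```

Tracking this fraction over time is how the ~50% hit rates mentioned above would actually be measured.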
Google Gemini Context Caching
Google Gemini supports context caching for repeated contexts such as long documents, tracked via usage_metadata in responses. Caching is applied automatically for supported Gemini models, with cached tokens billed at a discount to standard input rates, making reuse attractive for production workloads.
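A hedged sketch of inspecting that metadata, using a plain dict in the shape of the API's usageMetadata (field names are my reading of the Gemini API; the counts are invented):

```python
# Sketch: splitting a Gemini prompt into cached vs. freshly processed
# tokens from the response's usage metadata. Field names follow the
# Gemini API's usageMetadata object; values here are illustrative.
def cache_report(usage_metadata):
    cached = usage_metadata.get("cachedContentTokenCount", 0)
    total = usage_metadata["promptTokenCount"]
    return cached, total - cached  # (tokens from cache, tokens recomputed)

usage_metadata = {"promptTokenCount": 10000, "cachedContentTokenCount": 9000}
cached, fresh = cache_report(usage_metadata)
```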
Performance Impact
Latency reduction: Up to 80-85% reduction in time-to-first-token on cache hits for long prompts
Cost savings: 50-90% depending on provider and cache hit rate
Cache hit rates: Up to 100% with Anthropic explicit control; approximately 50% with OpenAI automatic approach
Real-world deployments report 70-90% token cost reduction in systems with static prefixes combined with variable queries, such as RAG pipelines, tool-using agents, and chatbots with long system prompts.
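How hit rates and discounts combine into an overall saving can be worked through with a back-of-envelope estimator. This is my simplification, not a provider formula; it treats every cache miss as a cache write, which overstates write costs slightly:

```python
# Back-of-envelope effective input-token cost under caching.
# hit_rate:       fraction of input tokens read from cache
# read_discount:  price reduction on cached reads (e.g. ~0.9 Anthropic,
#                 0.5 OpenAI, per the figures in this article)
# write_premium:  surcharge on tokens written to cache (e.g. 0.25 for
#                 Anthropic); simplification: every miss is also a write.
def effective_cost_multiplier(hit_rate, read_discount, write_premium=0.0):
    cached = hit_rate * (1 - read_discount)        # discounted cache reads
    fresh = (1 - hit_rate) * (1 + write_premium)   # full price + write surcharge
    return cached + fresh  # fraction of the uncached input cost

# 80% of tokens hitting an Anthropic-style cache (90% read discount,
# 25% write premium on the misses):
m = effective_cost_multiplier(0.8, 0.90, 0.25)
```

Here the effective cost comes out to about a third of the uncached price, consistent with the 70-90% reductions reported above for high-hit-rate workloads.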
When to Use
Good candidates:
High-volume applications with repeated static content (system prompts over 1,024 tokens, documents, conversation history, tool definitions)
Chatbots, RAG pipelines, and agent workflows with consistent prefixes
Scenarios where the same large context is queried repeatedly
Poor candidates:
Short or highly variable prompts under 1,024 tokens
One-off requests with no repeated prefixes
Workloads with unpredictable prompt structures resulting in low hit rates
Best Practices
Place static and cacheable content first in prompts (system instructions, tools), with variable content last
Monitor cache hit rates and aim for rates above 50% for meaningful savings
Structure prompts to maximize shared prefix length
Choose Anthropic for explicit control and predictability; OpenAI for simplicity with automatic caching
Combine with RAG architectures for maximum cost reduction
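The static-first ordering recommended above amounts to a simple assembly discipline. A sketch with hypothetical helper names:

```python
# Sketch of static-first prompt assembly: parts that never change
# (instructions, tool schemas, reference documents) form one stable
# prefix, and per-request content goes last. Helper names are
# hypothetical, not from any provider SDK.
def assemble_prompt(instructions, tool_schemas, documents, user_query):
    static_prefix = [instructions, *tool_schemas, *documents]  # cacheable
    return static_prefix + [user_query]                        # variable tail

# Two requests differ only in their final element, so any prefix-based
# cache can reuse everything before the user query.
p1 = assemble_prompt("Be concise.", ["tool_a"], ["doc"], "What changed?")
p2 = assemble_prompt("Be concise.", ["tool_a"], ["doc"], "Why?")
```

Inverting this order, for example putting a per-request timestamp or user ID first, breaks the shared prefix and drives the hit rate toward zero.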