====== Prompt Caching ======

Prompt caching is a technique for large language models that stores the computed prefix of a repeated prompt so it does not have to be recomputed, significantly reducing both cost and latency on subsequent requests. It works by detecting matching initial prompt segments across API calls and reusing the key-value cache from prior inference instead of regenerating it. ((https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models|PromptHub: Prompt Caching with OpenAI, Anthropic and Google Models))

===== How Prompt Caching Works =====

When an LLM processes a prompt, it computes internal representations (the key-value pairs in the attention mechanism) for every token. If a subsequent request shares the same prefix, prompt caching lets the model skip recomputing those representations and load them from a cache instead. The savings scale with the length of the shared prefix and the frequency of repeated requests. ((https://ngrok.com/blog/prompt-caching|ngrok: Prompt Caching))

===== Anthropic Implementation =====

Anthropic provides explicit control over caching via the ''cache_control'' parameter with a value of ''{"type": "ephemeral"}'' to mark cacheable sections such as system prompts, large documents, or tool schemas.
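As a minimal sketch of this pattern (assuming the official ''anthropic'' Python SDK; the model name and document text below are placeholders, not values from any source):

```python
# Sketch: marking an Anthropic system-prompt block as cacheable.
# Placeholder document standing in for a >1,024-token context.
LARGE_DOCUMENT = "Section 1: ... Section 2: ... " * 500

def build_request(question: str) -> dict:
    """Build a messages.create payload whose large system block is cacheable."""
    return {
        "model": "claude-3-5-sonnet-20241022",  # placeholder model name
        "max_tokens": 512,
        "system": [
            {"type": "text", "text": "Answer questions about the attached document."},
            {
                "type": "text",
                "text": LARGE_DOCUMENT,
                # Cache breakpoint: everything up to and including this block
                # is cached after the first call and reused on later requests.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": question}],
    }

# With an API key and network access, this would be sent as:
#   import anthropic
#   client = anthropic.Anthropic()
#   response = client.messages.create(**build_request("What is section 2 about?"))
#   print(response.usage.cache_read_input_tokens)  # nonzero on a cache hit
```

Because only the ''messages'' list changes between calls, every request after the first reuses the cached system-prompt prefix.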
((https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models|PromptHub: Prompt Caching Guide))

Key characteristics:

  * **Cache breakpoints**: Up to four per prompt, enabling independent caching of sections processed in order: tools, then system messages, then user messages
  * **TTL**: Five minutes, refreshed on each cache hit
  * **Pricing**: Cache reads cost approximately 90% less than regular input tokens, while cache writes incur a small surcharge
  * **Metrics**: Responses report ''cache_creation_input_tokens'' (misses), ''cache_read_input_tokens'' (hits), and ''input_tokens'' for monitoring
  * **Minimum size**: Cacheable content must be at least 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2,048 tokens for Claude 3 Haiku

A common pattern is caching a large document in the system prompt while varying the user message across requests. ((https://www.digitalocean.com/blog/prompt-caching-with-digital-ocean|DigitalOcean: Prompt Caching))

===== OpenAI Implementation =====

OpenAI caches automatically, with no code changes required, for prompts of 1,024 tokens or more; prefixes are matched in 128-token increments. ((https://blog.getbind.co/openai-prompt-caching-how-does-it-compare-to-claude-prompt-caching/|Bind: OpenAI vs Claude Prompt Caching))

Key characteristics:

  * **Automatic detection**: The system detects matching prefixes and reuses them on a best-effort basis
  * **Optional controls**: The ''prompt_cache_key'' and ''prompt_cache_retention'' parameters allow customization
  * **Pricing**: 50% discount on cached input tokens
  * **Metrics**: ''prompt_tokens_details.cached_tokens'' in API responses
  * **Cache window**: Short-lived, typically minutes

In testing, repeating an identical request yields cache hit rates of roughly 50%. ((https://ngrok.com/blog/prompt-caching|ngrok: Prompt Caching))

===== Google Gemini Context Caching =====

Google Gemini supports context caching for repeated contexts such as long documents, with usage tracked via ''usage_metadata'' in responses.
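As a rough sketch of monitoring cache effectiveness (the field names mirror the ''usage_metadata'' object of the ''google-generativeai'' Python SDK; the response object here is a local stand-in, not a real API call):

```python
# Sketch: computing what share of a Gemini prompt was served from cache.
from dataclasses import dataclass

@dataclass
class UsageMetadata:
    """Local stub mirroring the relevant response.usage_metadata fields."""
    prompt_token_count: int
    cached_content_token_count: int

def cached_fraction(usage: UsageMetadata) -> float:
    """Fraction of prompt tokens that were served from the context cache."""
    if usage.prompt_token_count == 0:
        return 0.0
    return usage.cached_content_token_count / usage.prompt_token_count

# A real generate_content() response would expose the same fields on
# response.usage_metadata; here we use illustrative numbers.
usage = UsageMetadata(prompt_token_count=40_000, cached_content_token_count=32_000)
print(f"{cached_fraction(usage):.0%} of prompt tokens were cached")  # 80%
```

Tracking this ratio over time shows whether prompt structure changes are eroding the shared prefix.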
Caching is automatic for current Gemini models, with cached tokens billed at a discount relative to standard Google token rates, which rewards reuse in production workloads. ((https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models|PromptHub: Prompt Caching Guide))

===== Performance =====

  * **Latency reduction**: Up to 80-85% lower time-to-first-token on cache hits for long prompts
  * **Cost savings**: 50-90%, depending on provider and cache hit rate
  * **Cache hit rates**: Up to 100% with Anthropic's explicit control; approximately 50% with OpenAI's automatic approach

Real-world deployments report 70-90% token-cost reductions in systems that combine a static prefix with variable queries, such as RAG pipelines, tool-using agents, and chatbots with long system prompts. ((https://www.requesty.ai/blog/maximize-ai-efficiency-how-prompt-caching-cuts-costs-by-up-to-a-staggering-90|Requesty: Prompt Caching Cost Savings))

===== When to Use =====

**Good candidates:**

  * High-volume applications with repeated static content (system prompts over 1,024 tokens, documents, conversation history, tool definitions)
  * Chatbots, RAG pipelines, and agent workflows with consistent prefixes
  * Scenarios where the same large context is queried repeatedly

**Poor candidates:**

  * Short or highly variable prompts under 1,024 tokens
  * One-off requests with no repeated prefixes
  * Workloads with unpredictable prompt structures, resulting in low hit rates

===== Best Practices =====

  * Place static, cacheable content first in the prompt (system instructions, tools) and variable content last
  * Monitor cache hit rates; aim for rates above 50% for meaningful savings
  * Structure prompts to maximize shared prefix length
  * Choose Anthropic for explicit control and predictability, or OpenAI for simplicity via automatic caching
  * Combine with RAG architectures for maximum cost reduction ((https://www.digitalocean.com/blog/prompt-caching-with-digital-ocean|DigitalOcean: Prompt Caching))

===== See Also =====

  * [[tokenizer_comparison|Tokenizer Comparison]]
  * [[retrieval_strategies|Retrieval Strategies]]

===== References =====