Prompt Caching

Prompt caching is a technique for large language models that stores prefixes of repeated prompts to avoid recomputation, significantly reducing both cost and latency on subsequent requests. It works by detecting matching initial prompt segments across API calls and reusing key-value cache computations from prior inferences instead of regenerating them. 1)

How Prompt Caching Works

When an LLM processes a prompt, it computes internal representations (key-value pairs in the attention mechanism) for every token. If a subsequent request shares the same prefix, prompt caching allows the model to skip recomputing those representations and instead load them from a cache. The savings scale with the length of the shared prefix and the frequency of repeated requests. 2)
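The mechanism can be illustrated with a toy simulation (not a real inference engine): a cache keyed by hashes of token prefixes, where a lookup returns how many leading tokens of a new prompt can skip recomputation.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: maps a hash of each token prefix to its length."""

    def __init__(self):
        self._store = {}  # prefix hash -> number of tokens covered

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def put(self, tokens):
        # Cache every prefix of the token sequence.
        for i in range(1, len(tokens) + 1):
            self._store[self._key(tokens[:i])] = i

    def longest_hit(self, tokens):
        # Length of the longest cached prefix of `tokens` (0 if none).
        for i in range(len(tokens), 0, -1):
            if self._key(tokens[:i]) in self._store:
                return i
        return 0

cache = PrefixCache()
cache.put(["You", "are", "a", "helpful", "assistant."])
# A new prompt sharing the first four tokens skips recomputing them.
print(cache.longest_hit(["You", "are", "a", "helpful", "translator."]))  # -> 4
```

Real implementations cache attention key-value tensors rather than prefix lengths, but the lookup logic is the same: the longer the shared prefix, the more computation is skipped.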

Anthropic Implementation

Anthropic provides explicit control over caching via the cache_control parameter with a value of {"type": "ephemeral"} to mark cacheable sections such as system prompts, large documents, or tool schemas. 3)

Key characteristics:

- Cached content has a short time-to-live (about five minutes by default), refreshed each time it is reused
- Cache writes are billed at a premium over normal input tokens, while cache reads are heavily discounted
- A minimum prompt length (on the order of 1,024 tokens, depending on the model) is required before a segment can be cached

A common pattern is caching a large document in the system prompt while varying user messages across requests. 4)
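A minimal sketch of that pattern, assuming the Messages API request shape; the model name and document text below are placeholders, and the payload is built as a plain dict rather than sent over the network:

```python
# Placeholder for a large reference document reused across requests.
LARGE_DOCUMENT = "...full text of a reference document..."

def build_request(user_question: str) -> dict:
    """Build a Messages API request body with an ephemeral cache breakpoint."""
    return {
        "model": "claude-sonnet-4-20250514",  # illustrative model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LARGE_DOCUMENT,
                # Marks the prompt up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only this part varies from request to request.
        "messages": [{"role": "user", "content": user_question}],
    }

req = build_request("Summarize section 2.")
```

Because the cache matches prefixes, repeated calls reuse the expensive system block while each call pays full price only for its short, varying user message.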

OpenAI Implementation

OpenAI uses automatic caching with no code changes needed for prompts of 1,024 tokens or more, checking prefixes in 128-token increments. 5)

Key characteristics:

- Caching is enabled automatically; there is no opt-in parameter
- Cached input tokens are billed at a discount relative to uncached input tokens
- Cache entries expire after a few minutes of inactivity, with longer retention possible during off-peak periods

In testing, repeating identical requests yields approximately 50% cache hit rates. 6)
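Since there is nothing to configure, the main integration task is monitoring. A small helper can compute the hit ratio from the usage accounting in a response; the dict below mirrors the shape of the Chat Completions usage object, with illustrative token counts:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from the cache (0.0 if none)."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

# Illustrative usage payload from a repeated 2,048-token prompt:
usage = {
    "prompt_tokens": 2048,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
print(cache_hit_ratio(usage))  # -> 0.5
```

Because prefixes are checked in 128-token increments, even identical prompts may show cached token counts below the full prompt length.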

Google Gemini Context Caching

Google Gemini supports context caching for repeated contexts such as long documents, with cache usage reported via usage_metadata in responses. Caching is applied automatically for recent Gemini models, and cached tokens are billed at a discounted rate, which favors reuse in production workloads. 7)
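As a hedged sketch of that accounting: Gemini responses report prompt token counts and cached token counts in usage_metadata, so the cached fraction can be derived directly. The object below is a stand-in dict with illustrative numbers, not the real SDK response type:

```python
def cached_fraction(usage_metadata: dict) -> float:
    """Fraction of prompt tokens that were served from the context cache."""
    cached = usage_metadata.get("cached_content_token_count", 0)
    prompt = usage_metadata.get("prompt_token_count", 0)
    return cached / prompt if prompt else 0.0

# Illustrative metadata: a 50k-token prompt where 48k tokens hit the cache.
meta = {"prompt_token_count": 50_000, "cached_content_token_count": 48_000}
print(round(cached_fraction(meta), 2))  # -> 0.96
```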

Performance

Real-world deployments report 70-90% token cost reduction in systems with static prefixes combined with variable queries, such as RAG pipelines, tool-using agents, and chatbots with long system prompts. 8)
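A back-of-envelope calculation shows where numbers in that range come from. The prices and the cache-read discount below are placeholder assumptions for illustration, not any vendor's actual rates:

```python
def monthly_input_cost(requests: int, prefix_tokens: int, query_tokens: int,
                       price_per_mtok: float, cached: bool) -> float:
    """Estimated monthly input-token cost for a static-prefix workload."""
    cache_read_discount = 0.10  # assumption: cached tokens billed at 10%
    prefix_rate = price_per_mtok * (cache_read_discount if cached else 1.0)
    prefix_cost = requests * prefix_tokens / 1e6 * prefix_rate
    query_cost = requests * query_tokens / 1e6 * price_per_mtok
    return prefix_cost + query_cost

# 1M requests/month, 5,000-token static prefix, 200-token query, $3/MTok input:
without = monthly_input_cost(1_000_000, 5000, 200, 3.0, cached=False)
with_cache = monthly_input_cost(1_000_000, 5000, 200, 3.0, cached=True)
print(f"${without:,.0f} -> ${with_cache:,.0f}")  # roughly 87% input-cost savings
```

The savings grow with the ratio of static prefix to variable suffix, which is why RAG pipelines and long-system-prompt chatbots sit at the high end of the reported range.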

When to Use

Good candidates:

- Long, static system prompts or large documents shared across many requests
- RAG pipelines and tool-using agents that resend the same context or tool schemas
- Multi-turn chatbots where earlier turns stay fixed as the conversation grows

Poor candidates:

- Prompts whose beginning changes on every request (for example, a per-user identifier at the start)
- One-off or low-volume requests that never repeat a prefix
- Prompts shorter than the provider's minimum cacheable length

Best Practices

- Place static content (system prompt, documents, tool definitions) at the start of the prompt and variable content at the end, since caching matches prefixes
- Monitor cached-token counts in API responses to confirm that hits are actually occurring
- Keep request frequency within the cache's expiry window where possible, so entries stay warm

References

1) PromptHub: Prompt Caching with OpenAI, Anthropic and Google Models. https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models