====== Caching Strategies for Agents ======
Caching is the highest-ROI optimization for AI agents. By intercepting repeated or similar requests before they reach the LLM, production systems eliminate **20-45% of API calls** entirely. This guide covers every caching layer -- from exact-match to semantic similarity to tool result caching -- with real architecture patterns and benchmarks.(([[https://dev.to/kuldeep_paul/top-ai-gateways-with-semantic-caching-and-dynamic-routing-2026-guide-4a0g|Top AI Gateways with Semantic Caching]] - Dev.to (2026)))(([[https://levelup.gitconnected.com/burning-money-on-llms-heres-how-to-save-on-bills-with-caching-94f1bba3570b|How Semantic Caching Saves Thousands]] - Level Up Coding (2025)))
===== Why Caching Matters for Agents =====
Agents are expensive by nature: a single user query can trigger 3-8 LLM calls across planning, tool use, and synthesis steps. Without caching, identical or near-identical workflows execute from scratch every time. In production:(([[https://nordicapis.com/caching-strategies-for-ai-agent-traffic/|Caching Strategies for AI Agent Traffic]] - Nordic APIs (2025)))
* **30-50% of agent queries** are semantically similar to previous ones
* Each cached response saves $0.01-0.10 in API costs
* Cache hits return in **1-5ms** vs **1-10 seconds** for LLM calls
* At 100K requests/month, caching saves $500-3,000/month
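The savings estimate is simple arithmetic; a sketch using illustrative mid-range figures from the list above (the hit rate and per-call cost are assumptions, not measurements):

```python
def monthly_savings(requests: int, hit_rate: float, cost_per_call: float) -> float:
    """Dollars saved per month from cache hits that skip the LLM entirely."""
    return requests * hit_rate * cost_per_call

# 100K requests/month, 35% hit rate, $0.05/call -> roughly $1,750/month,
# mid-range of the $500-3,000 estimate above
saved = monthly_savings(100_000, 0.35, 0.05)
print(f"${saved:,.0f}/month")
```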
===== The Caching Layer Stack =====
<code>
graph TB
    A[User Query] --> B[Layer 1: Exact Match Cache]
    B -->|HIT| Z[Return Cached Response]
    B -->|MISS| C[Layer 2: Semantic Cache]
    C -->|HIT| Z
    C -->|MISS| D[Layer 3: KV Cache / Prefix Cache]
    D --> E[LLM Inference]
    E --> F[Layer 4: Tool Result Cache]
    F --> G[Agent Response]
    G --> H[Store in Cache Layers]
    H --> Z
</code>
===== Layer 1: Exact-Match Caching =====
The simplest and fastest cache layer. Hash the prompt and check for identical matches.
**Performance:** Near-zero overhead, instant retrieval, 100% precision on hits.
**Limitation:** Misses paraphrases entirely. "What is the weather?" and "Tell me the weather" are cache misses.
^ Metric ^ Value ^
| Lookup latency | < 1ms |
| Hit rate (typical) | 5-15% for diverse queries, 30-60% for structured queries |
| Precision | 100% (exact match only) |
| Storage cost | Minimal (hash + response) |
<code python>
import hashlib
import json

import redis


class ExactMatchCache:
    """L1 cache: byte-identical (prompt, model) pairs map to the same key."""

    def __init__(self, redis_url="redis://localhost:6379", ttl=3600):
        self.client = redis.from_url(redis_url)
        self.ttl = ttl  # seconds before a cached response expires

    def _hash_key(self, prompt: str, model: str) -> str:
        # sort_keys gives a canonical serialization, so key order never matters
        content = json.dumps({"prompt": prompt, "model": model}, sort_keys=True)
        return f"llm:exact:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, prompt: str, model: str) -> str | None:
        result = self.client.get(self._hash_key(prompt, model))
        return result.decode() if result else None

    def set(self, prompt: str, model: str, response: str):
        # SETEX stores the value and its TTL in one atomic command
        self.client.setex(self._hash_key(prompt, model), self.ttl, response)
</code>
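The 100%-precision/paraphrase-miss tradeoff is easy to verify: canonical hashing is deterministic for identical inputs, but any change in wording produces a different digest. A standalone sketch (`hash_key` mirrors the hashing logic above):

```python
import hashlib
import json

def hash_key(prompt: str, model: str) -> str:
    # Canonical serialization: identical inputs always yield the same digest
    content = json.dumps({"prompt": prompt, "model": model}, sort_keys=True)
    return f"llm:exact:{hashlib.sha256(content.encode()).hexdigest()}"

k1 = hash_key("What is the weather?", "gpt-4o")
k2 = hash_key("What is the weather?", "gpt-4o")
k3 = hash_key("Tell me the weather", "gpt-4o")
assert k1 == k2  # identical inputs hit the same key
assert k1 != k3  # a paraphrase is a miss -- the limitation noted above
```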
===== Layer 2: Semantic Caching =====
The most impactful cache layer for agents. Uses embedding similarity to match semantically equivalent queries, even with different wording.(([[https://redis.io/docs/latest/develop/ai/redisvl/0.7.0/user_guide/llmcache/|Semantic Caching for LLMs]] - Redis Documentation (2026)))
**Production benchmarks:**
* **20-40% hit rate** in AI gateways (Bifrost, LiteLLM, Kong)
* **Similarity threshold:** 0.80-0.85 cosine similarity is the sweet spot
* **Embedding overhead:** ~11 microseconds per query (Bifrost benchmark)
* **Cache lookup:** 1-5ms via Redis vector search
**How it works:**
- Compute embedding of incoming query
- Search vector index for similar cached queries (cosine similarity > threshold)
- If match found, return cached response
- If no match, call LLM, cache query embedding + response
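The four steps above can be sketched without any infrastructure. A toy in-memory version using a linear scan over stored embeddings (a real deployment would use a vector index such as Redis or FAISS, and `embedding` would come from an embedding model rather than being passed in by hand):

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class MiniSemanticCache:
    """Toy semantic cache: linear scan instead of a vector index."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, response) pairs

    def check(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None  # miss: caller should invoke the LLM, then store()

    def store(self, embedding, response):
        self.entries.append((embedding, response))
```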
<code python>
from redisvl.extensions.llmcache import SemanticCache


class AgentSemanticCache:
    def __init__(self, redis_url="redis://localhost:6379", threshold=0.15):
        # Note: redisvl's distance_threshold is a cosine *distance*, so 0.15
        # corresponds to the 0.85 similarity sweet spot discussed above.
        self.cache = SemanticCache(
            name="agent_cache",
            redis_url=redis_url,
            distance_threshold=threshold,
        )
        self.stats = {"hits": 0, "misses": 0}

    def query(self, prompt: str) -> dict:
        results = self.cache.check(prompt=prompt)
        if results:
            self.stats["hits"] += 1
            return {"source": "cache", "response": results[0]["response"]}
        self.stats["misses"] += 1
        return {"source": "miss", "response": None}

    def store(self, prompt: str, response: str, metadata: dict = None):
        self.cache.store(prompt=prompt, response=response, metadata=metadata or {})

    @property
    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0.0


# Usage
cache = AgentSemanticCache(threshold=0.15)
result = cache.query("What is the capital of France?")
if result["source"] == "miss":
    llm_response = call_llm("What is the capital of France?")
    cache.store("What is the capital of France?", llm_response)
# Later: "Tell me France's capital city" -> cache HIT (semantic match)
</code>
===== Layer 3: KV Cache and Prefix Caching =====
Operates at the inference engine level, not the application level. Caches intermediate computation states.
**Prefix caching** reuses KV states when multiple requests share a common prefix (e.g., system prompt). Supported by vLLM, SGLang, and Anthropic's API.
^ Provider/Engine ^ Feature ^ Savings ^
| Anthropic API | Prompt caching | 90% off cached input tokens |
| OpenAI API | Automatic prefix caching | 50% off cached input tokens |
| vLLM | ''--enable-prefix-caching'' | Eliminates recomputation of shared prefixes |
| SGLang | RadixAttention | Automatic prefix tree caching |
**Anthropic prompt caching example:** A 10K-token system prompt cached across requests costs $0.30/M tokens instead of $3.00/M (90% savings on those tokens).
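The 90% figure works out as follows. A sketch of the arithmetic, treating the published prices as assumptions (cache reads at 90% off base input price; the first request writes the cache at a 25% premium):

```python
def prompt_cache_cost(prompt_tokens: int, requests: int, price_per_m: float = 3.00,
                      read_discount: float = 0.90, write_premium: float = 0.25):
    """Input-token cost for a shared system prompt, with and without caching."""
    uncached = prompt_tokens * requests / 1_000_000 * price_per_m
    # first request writes the cache at a premium...
    write = prompt_tokens / 1_000_000 * price_per_m * (1 + write_premium)
    # ...every subsequent request reads it at the discounted rate
    reads = prompt_tokens * (requests - 1) / 1_000_000 * price_per_m * (1 - read_discount)
    return uncached, write + reads

# 10K-token system prompt across 1,000 requests: ~$30 uncached vs ~$3 cached
uncached, cached = prompt_cache_cost(10_000, 1_000)
```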
===== Layer 4: Tool Result Caching =====
Agents call external tools (APIs, databases, search engines) that are often slow and rate-limited. Cache these results independently.
<code python>
import hashlib
import json
import time

import redis


class ToolResultCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.client = redis.from_url(redis_url)
        # TTL per tool type: volatile data gets shorter TTL
        self.ttl_config = {
            "web_search": 3600,      # 1 hour
            "database_query": 300,   # 5 min
            "weather_api": 1800,     # 30 min
            "static_lookup": 86400,  # 24 hours
            "calculation": 604800,   # 7 days - deterministic
        }

    def _cache_key(self, tool_name: str, args: dict) -> str:
        content = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return f"tool:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, tool_name: str, args: dict) -> dict | None:
        result = self.client.get(self._cache_key(tool_name, args))
        if result:
            data = json.loads(result)
            age = time.time() - data["cached_at"]
            return {"result": data["result"], "cached": True, "age_seconds": age}
        return None

    def set(self, tool_name: str, args: dict, result):
        key = self._cache_key(tool_name, args)
        ttl = self.ttl_config.get(tool_name, 3600)  # default: 1 hour
        data = json.dumps({"result": result, "cached_at": time.time()})
        self.client.setex(key, ttl, data)
</code>
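The TTL-per-tool idea is independent of Redis. A minimal in-memory sketch (an assumed illustration, not part of any library; single-process only, so it loses the cross-worker sharing Redis provides):

```python
import time

class InMemoryToolCache:
    """Minimal per-tool TTL cache for tool results."""

    def __init__(self, ttl_config=None, default_ttl=3600):
        self.ttl_config = ttl_config or {}
        self.default_ttl = default_ttl
        self.store = {}  # (tool_name, key) -> (expires_at, result)

    def get(self, tool_name, key):
        entry = self.store.get((tool_name, key))
        if entry and entry[0] > time.time():
            return entry[1]
        self.store.pop((tool_name, key), None)  # drop expired entry, if any
        return None

    def set(self, tool_name, key, result):
        ttl = self.ttl_config.get(tool_name, self.default_ttl)
        self.store[(tool_name, key)] = (time.time() + ttl, result)
```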
===== Layer 5: Embedding Cache =====
If your agent generates embeddings for RAG or semantic search, cache them to avoid recomputation.
* Embedding generation costs $0.02-0.13 per million tokens
* Computation takes 50-200ms per batch
* Cache embeddings keyed by content hash
* Invalidate only when source content changes
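The content-hash keying in the list above can be sketched in a few lines; `embed_fn` is a placeholder for whatever embedding call you use, and the `computed` counter makes the avoided recomputation visible:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by content hash; recompute only on new text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any callable: text -> vector
        self.cache = {}
        self.computed = 0         # number of actual embedding computations

    def get(self, text: str):
        # Keying by content hash means a changed document gets a new key,
        # which handles invalidation automatically
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.embed_fn(text)
            self.computed += 1
        return self.cache[key]
```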
===== Exact-Match vs Semantic Cache: When to Use Which =====
^ Criteria ^ Exact-Match ^ Semantic ^
| Query diversity | Low (templated, structured) | High (natural language, varied) |
| Precision requirement | Must be 100% | 95%+ acceptable |
| Latency budget | < 1ms | 1-5ms |
| Setup complexity | Simple (hash + KV store) | Medium (embeddings + vector DB) |
| Typical hit rate | 5-60% | 20-40% |
| Best for | API tools, structured queries | User-facing chat, search |
**Recommendation:** Use both layers. Exact-match as L1 (fast, precise), semantic as L2 (catches paraphrases).(([[https://redis.io/blog/10-techniques-for-semantic-cache-optimization/|10 Techniques for Semantic Cache Optimization]] - Redis Blog (2025)))
===== Tuning Semantic Cache Thresholds =====
The similarity threshold controls the tradeoff between hit rate and accuracy:
^ Threshold (cosine) ^ Hit Rate ^ False Positive Risk ^ Use Case ^
| > 0.95 | Low (5-10%) | Very low | High-stakes (medical, legal) |
| 0.85-0.95 | Medium (15-25%) | Low | General Q&A |
| 0.80-0.85 | High (25-40%) | Moderate | Customer support, FAQs |
| < 0.80 | Very high (40%+) | High | Only for non-critical, high-volume |
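One practical wrinkle: libraries such as redisvl configure the cache with a cosine *distance* threshold, while tables like the one above are usually stated in cosine *similarity*. Assuming the usual convention distance = 1 - similarity, the conversion is trivial but easy to get backwards:

```python
def to_distance_threshold(cosine_similarity: float) -> float:
    """Convert a cosine-similarity threshold to the cosine-distance
    threshold expected by vector-cache libraries (distance = 1 - similarity)."""
    return round(1.0 - cosine_similarity, 6)

# 0.85 similarity (customer-support sweet spot) -> distance_threshold=0.15
assert to_distance_threshold(0.85) == 0.15
```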
===== Production Architecture =====
<code>
graph LR
    A[Agent Query] --> B[Exact Match L1]
    B -->|MISS| C[Semantic Cache L2]
    C -->|MISS| D[LLM with Prefix Cache L3]
    D --> E[Tool Calls]
    E --> F[Tool Result Cache L4]
    F --> G[Response]
    G -->|Store| B
    G -->|Store| C
    B -->|HIT 10%| H[Response under 1ms]
    C -->|HIT 25%| I[Response in 2-5ms]
    F -->|HIT 40%| J[Skip API Call]
</code>
**Combined cache hit rates from production:**
* L1 exact-match: 10-15% of total requests
* L2 semantic: 20-30% of remaining requests
* L4 tool results: 30-50% of tool calls avoided
* **Net effect:** 40-60% of requests never reach the LLM
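Because the L2 rate is quoted on the requests that already missed L1, the layers compound rather than add. A quick check with illustrative rates from the ranges above (independence between layers is a simplifying assumption):

```python
def share_reaching_llm(l1_hit: float, l2_hit: float) -> float:
    """Fraction of requests that miss both cache layers and hit the LLM.
    l2_hit is the hit rate on requests that already missed L1."""
    return (1 - l1_hit) * (1 - l2_hit)

# 15% exact-match, 30% semantic-on-the-remainder -> ~59.5% reach the LLM,
# i.e. ~40% avoided, the low end of the quoted 40-60% net effect
reaching = share_reaching_llm(0.15, 0.30)
```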
===== Monitoring and Invalidation =====
Critical metrics to track:
* **Hit rate per layer** - target: >20% for semantic, >5% for exact
* **False positive rate** - sample and verify cached responses weekly
* **Cache staleness** - set TTLs appropriate to data volatility
* **Memory usage** - monitor Redis memory, set eviction policies (allkeys-lru)
* **Cost savings** - track (cache_hits * avg_api_cost) monthly
===== See Also =====
* [[how_to_reduce_token_costs|How to Reduce Token Costs]]
* [[how_to_speed_up_agents|How to Speed Up Agents]]
* [[what_is_an_ai_agent|What is an AI Agent]]
===== References =====