Caching is the highest-ROI optimization for AI agents. By intercepting repeated or similar requests before they reach the LLM, production systems eliminate 20-45% of API calls entirely. This guide covers every caching layer – from exact-match to semantic similarity to tool-result caching – with real architecture patterns and benchmarks.
Agents are expensive by nature: a single user query can trigger 3-8 LLM calls across planning, tool use, and synthesis steps. Without caching, identical or near-identical workflows execute from scratch every time, and in production those redundant calls compound across every request.
Exact-match caching is the simplest and fastest cache layer: hash the prompt and check for an identical match.
Performance: Near-zero overhead, instant retrieval, 100% precision on hits.
Limitation: Misses paraphrases entirely. “What is the weather?” and “Tell me the weather” are cache misses.
| Metric | Value |
|---|---|
| Lookup latency | < 1ms |
| Hit rate (typical) | 5-15% for diverse queries, 30-60% for structured queries |
| Precision | 100% (exact match only) |
| Storage cost | Minimal (hash + response) |
```python
import hashlib
import json

import redis


class ExactMatchCache:
    def __init__(self, redis_url="redis://localhost:6379", ttl=3600):
        self.client = redis.from_url(redis_url)
        self.ttl = ttl

    def _hash_key(self, prompt: str, model: str) -> str:
        # Canonical JSON (sorted keys) so equivalent inputs hash identically.
        content = json.dumps({"prompt": prompt, "model": model}, sort_keys=True)
        return f"llm:exact:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, prompt: str, model: str) -> str | None:
        key = self._hash_key(prompt, model)
        result = self.client.get(key)
        return result.decode() if result else None

    def set(self, prompt: str, model: str, response: str):
        key = self._hash_key(prompt, model)
        self.client.setex(key, self.ttl, response)
```
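The key derivation above is deterministic, which is both what makes exact-match caching 100%-precise and what makes it blind to paraphrases. A standalone sketch of the same hashing scheme, with no Redis required (the model name is illustrative):

```python
import hashlib
import json


def exact_key(prompt: str, model: str) -> str:
    # Same scheme as ExactMatchCache._hash_key: canonical JSON, then SHA-256.
    content = json.dumps({"prompt": prompt, "model": model}, sort_keys=True)
    return f"llm:exact:{hashlib.sha256(content.encode()).hexdigest()}"


k1 = exact_key("What is the weather?", "gpt-4o")
k2 = exact_key("What is the weather?", "gpt-4o")
k3 = exact_key("Tell me the weather", "gpt-4o")

# Identical prompts collide on the same key; a paraphrase does not.
assert k1 == k2
assert k1 != k3
```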
Semantic caching is the most impactful cache layer for agents. It uses embedding similarity to match semantically equivalent queries, even when the wording differs.
How it works: the incoming prompt is embedded and compared against the embeddings of previously cached prompts via vector search; if the closest match falls within the distance threshold, the stored response is returned without an LLM call. In production, this typically recovers 20-40% of requests (see the comparison table below).
```python
from redisvl.extensions.llmcache import SemanticCache


class AgentSemanticCache:
    def __init__(self, redis_url="redis://localhost:6379", threshold=0.15):
        self.cache = SemanticCache(
            name="agent_cache",
            redis_url=redis_url,
            distance_threshold=threshold,
        )
        self.stats = {"hits": 0, "misses": 0}

    def query(self, prompt: str) -> dict:
        results = self.cache.check(prompt=prompt)
        if results:
            self.stats["hits"] += 1
            return {"source": "cache", "response": results[0]["response"]}
        self.stats["misses"] += 1
        return {"source": "miss", "response": None}

    def store(self, prompt: str, response: str, metadata: dict = None):
        self.cache.store(
            prompt=prompt,
            response=response,
            metadata=metadata or {},
        )

    @property
    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0.0


# Usage
cache = AgentSemanticCache(threshold=0.15)
result = cache.query("What is the capital of France?")
if result["source"] == "miss":
    llm_response = call_llm("What is the capital of France?")
    cache.store("What is the capital of France?", llm_response)

# Later: "Tell me France's capital city" -> cache HIT (semantic match)
```
KV and prefix caching operate at the inference-engine level, not the application level: they cache intermediate computation states (attention key/value tensors) rather than final responses.
Prefix caching reuses KV states when multiple requests share a common prefix (e.g., system prompt). Supported by vLLM, SGLang, and Anthropic's API.
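On the Anthropic API, prompt caching is opted into per content block. The sketch below builds the request body as a plain dict so the shape is visible; in practice these fields go to the SDK's `client.messages.create(...)`. The model name and prompt text are illustrative, and the `cache_control` field follows Anthropic's prompt caching API as documented at the time of writing.

```python
# Illustrative system prompt standing in for a long (~10K-token) instruction block.
LONG_SYSTEM_PROMPT = "You are a support agent for Acme Corp. " * 300

request_body = {
    "model": "claude-sonnet-4-20250514",  # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks everything up to and including this block as cacheable;
            # subsequent requests sharing the identical prefix read it at
            # the discounted cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

The cache key is the exact token prefix, so the system block must be byte-identical across requests for a cache read to occur.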
| Provider/Engine | Feature | Savings |
|---|---|---|
| Anthropic API | Prompt caching | 90% off cached input tokens |
| OpenAI API | Automatic prefix caching | 50% off cached input tokens |
| vLLM | `--enable-prefix-caching` | Eliminates recomputation of shared prefixes |
| SGLang | RadixAttention | Automatic prefix tree caching |
Anthropic prompt caching example: A 10K-token system prompt cached across requests costs $0.30/M tokens instead of $3.00/M (90% savings on those tokens).
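The savings above can be sanity-checked with a few lines of arithmetic. This sketch assumes the per-token prices quoted in this section ($3.00/M uncached input, $0.30/M cached reads) and a hypothetical workload of 1,000 requests sharing the 10K-token system prompt; cache-write surcharges are ignored for simplicity.

```python
# Back-of-envelope cost comparison for a cached 10K-token system prompt.
PRICE_UNCACHED = 3.00  # $/M tokens, normal input
PRICE_CACHED = 0.30    # $/M tokens, cached read (90% off)

prompt_tokens = 10_000
requests = 1_000  # illustrative workload size

cost_without_cache = requests * prompt_tokens / 1_000_000 * PRICE_UNCACHED
cost_with_cache = requests * prompt_tokens / 1_000_000 * PRICE_CACHED

print(f"Without caching: ${cost_without_cache:.2f}")  # $30.00
print(f"With caching:    ${cost_with_cache:.2f}")     # $3.00
print(f"Savings: {1 - cost_with_cache / cost_without_cache:.0%}")  # 90%
```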
Agents call external tools (APIs, databases, search engines) that are often slow and rate-limited. Cache these results independently.
```python
import hashlib
import json
import time

import redis


class ToolResultCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.client = redis.from_url(redis_url)
        # TTL per tool type: volatile data gets a shorter TTL
        self.ttl_config = {
            "web_search": 3600,       # 1 hour
            "database_query": 300,    # 5 min
            "weather_api": 1800,      # 30 min
            "static_lookup": 86400,   # 24 hours
            "calculation": 604800,    # 7 days - deterministic
        }

    def _cache_key(self, tool_name: str, args: dict) -> str:
        content = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return f"tool:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, tool_name: str, args: dict) -> dict | None:
        key = self._cache_key(tool_name, args)
        result = self.client.get(key)
        if result:
            data = json.loads(result)
            age = time.time() - data["cached_at"]
            return {"result": data["result"], "cached": True, "age_seconds": age}
        return None

    def set(self, tool_name: str, args: dict, result):
        key = self._cache_key(tool_name, args)
        ttl = self.ttl_config.get(tool_name, 3600)
        data = json.dumps({"result": result, "cached_at": time.time()})
        self.client.setex(key, ttl, data)
```
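One way to wire tool caching into an agent loop is a get-or-execute wrapper. This is a sketch: `cache` is any object with the `get`/`set` shape above (duck-typed), and the weather tool below is a hypothetical stand-in so the flow can be exercised without Redis or a real API.

```python
def cached_tool_call(cache, tool_name: str, args: dict, tool_fn):
    """Return a cached tool result if present, otherwise execute and cache."""
    hit = cache.get(tool_name, args)
    if hit is not None:
        return hit["result"]
    result = tool_fn(**args)
    cache.set(tool_name, args, result)
    return result


class DictToolCache:
    """In-memory stand-in with the same get/set signature as ToolResultCache."""

    def __init__(self):
        self.store = {}

    def _key(self, tool_name, args):
        return (tool_name, tuple(sorted(args.items())))

    def get(self, tool_name, args):
        key = self._key(tool_name, args)
        if key in self.store:
            return {"result": self.store[key], "cached": True}
        return None

    def set(self, tool_name, args, result):
        self.store[self._key(tool_name, args)] = result


calls = []


def fake_weather_api(city):  # hypothetical tool
    calls.append(city)
    return {"city": city, "temp_c": 21}


cache = DictToolCache()
first = cached_tool_call(cache, "weather_api", {"city": "Paris"}, fake_weather_api)
second = cached_tool_call(cache, "weather_api", {"city": "Paris"}, fake_weather_api)
# The second call is served from the cache: fake_weather_api ran only once.
```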
If your agent generates embeddings for RAG or semantic search, cache them to avoid recomputation.
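A minimal content-addressed embedding cache can look like the sketch below. It is in-memory for clarity (swap the dict for Redis or disk in production), and the stub embedder stands in for a real model call such as an embeddings API.

```python
import hashlib


class EmbeddingCache:
    """Content-addressed embedding cache: identical text never embeds twice."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any text -> vector function
        self.store = {}
        self.computed = 0  # counts actual model invocations

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:
            self.computed += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]


def stub_embed(text: str) -> list[float]:
    # Toy embedder; a real deployment would call an embedding model here.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(stub_embed)
v1 = cache.embed("cache this document chunk")
v2 = cache.embed("cache this document chunk")  # served from cache
# cache.computed == 1: the embedder ran once for the repeated chunk.
```

This pays off most in RAG ingestion pipelines, where the same document chunks are re-embedded on every re-index.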
| Criteria | Exact-Match | Semantic |
|---|---|---|
| Query diversity | Low (templated, structured) | High (natural language, varied) |
| Precision requirement | Must be 100% | 95%+ acceptable |
| Latency budget | < 1ms | 1-5ms |
| Setup complexity | Simple (hash + KV store) | Medium (embeddings + vector DB) |
| Typical hit rate | 5-60% | 20-40% |
| Best for | API tools, structured queries | User-facing chat, search |
Recommendation: Use both layers. Exact-match as L1 (fast, precise), semantic as L2 (catches paraphrases).
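The two-layer lookup can be sketched as follows. Both layers are duck-typed: `exact` needs `get`/`set` keyed by prompt, `semantic` needs the `query`/`store` shape of the semantic cache shown earlier. The in-memory stand-ins below (including a toy word-overlap "semantic" matcher, where a real deployment would use embedding distance) exist only so the flow is runnable.

```python
class TwoLayerCache:
    def __init__(self, exact, semantic):
        self.exact = exact
        self.semantic = semantic

    def lookup(self, prompt: str):
        hit = self.exact.get(prompt)          # L1: sub-ms, 100% precision
        if hit is not None:
            return {"source": "exact", "response": hit}
        result = self.semantic.query(prompt)  # L2: catches paraphrases
        if result["response"] is not None:
            # Promote the hit so the next identical prompt resolves in L1.
            self.exact.set(prompt, result["response"])
            return {"source": "semantic", "response": result["response"]}
        return {"source": "miss", "response": None}

    def store(self, prompt: str, response: str):
        self.exact.set(prompt, response)
        self.semantic.store(prompt, response)


class DictExact:
    def __init__(self):
        self.d = {}

    def get(self, p):
        return self.d.get(p)

    def set(self, p, r):
        self.d[p] = r


class NaiveSemantic:
    """Toy matcher using lowercased word overlap in place of embeddings."""

    def __init__(self):
        self.entries = []

    def query(self, p):
        words = set(p.lower().split())
        for q, r in self.entries:
            overlap = len(words & set(q.lower().split())) / max(len(words), 1)
            if overlap >= 0.5:
                return {"response": r}
        return {"response": None}

    def store(self, p, r):
        self.entries.append((p, r))


cache = TwoLayerCache(DictExact(), NaiveSemantic())
cache.store("What is the capital of France?", "Paris")

hit = cache.lookup("what is the capital city of France?")   # L2 paraphrase hit
hit2 = cache.lookup("what is the capital city of France?")  # promoted to L1
```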
The similarity threshold controls the tradeoff between hit rate and accuracy:
| Threshold (cosine) | Hit Rate | False Positive Risk | Use Case |
|---|---|---|---|
| > 0.95 | Low (5-10%) | Very low | High-stakes (medical, legal) |
| 0.85-0.95 | Medium (15-25%) | Low | General Q&A |
| 0.80-0.85 | High (25-40%) | Moderate | Customer support, FAQs |
| < 0.80 | Very high (40%+) | High | Only for non-critical, high-volume |
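Note the table is expressed in cosine similarity, while `SemanticCache` in the example earlier takes a `distance_threshold`. Assuming cosine distance (redisvl's default metric), the two are related by distance = 1 - similarity, so a quick converter is:

```python
def similarity_to_distance(similarity: float) -> float:
    """Convert a cosine-similarity threshold to a cosine-distance threshold
    (the form expected by vector caches such as redisvl's SemanticCache)."""
    return round(1.0 - similarity, 6)


# The 0.15 distance threshold used earlier corresponds to ~0.85 similarity,
# i.e. the customer support / FAQ row of the table.
print(similarity_to_distance(0.85))  # 0.15
print(similarity_to_distance(0.95))  # 0.05 -> high-stakes setting
```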
Combined cache hit rates from production:
Critical metrics to track: