====== Caching Strategies for Agents ======

Caching is the highest-ROI optimization for AI agents. By intercepting repeated or similar requests before they reach the LLM, production systems eliminate **20-45% of API calls** entirely. This guide covers every caching layer -- from exact-match to semantic similarity to tool result caching -- with real architecture patterns and benchmarks.(([[https://dev.to/kuldeep_paul/top-ai-gateways-with-semantic-caching-and-dynamic-routing-2026-guide-4a0g|Top AI Gateways with Semantic Caching]] - Dev.to (2026)))(([[https://levelup.gitconnected.com/burning-money-on-llms-heres-how-to-save-on-bills-with-caching-94f1bba3570b|How Semantic Caching Saves Thousands]] - Level Up Coding (2025)))

===== Why Caching Matters for Agents =====

Agents are expensive by nature: a single user query can trigger 3-8 LLM calls across planning, tool use, and synthesis steps. Without caching, identical or near-identical workflows execute from scratch every time. In production:(([[https://nordicapis.com/caching-strategies-for-ai-agent-traffic/|Caching Strategies for AI Agent Traffic]] - Nordic APIs (2025)))

  * **30-50% of agent queries** are semantically similar to previous ones
  * Each cached response saves $0.01-0.10 in API costs
  * Cache hits return in **1-5ms** vs **1-10 seconds** for LLM calls
  * At 100K requests/month, caching saves $500-3,000/month

===== The Caching Layer Stack =====

<code>
graph TB
    A[User Query] --> B[Layer 1: Exact Match Cache]
    B -->|HIT| Z[Return Cached Response]
    B -->|MISS| C[Layer 2: Semantic Cache]
    C -->|HIT| Z
    C -->|MISS| D[Layer 3: KV Cache / Prefix Cache]
    D --> E[LLM Inference]
    E --> F[Layer 4: Tool Result Cache]
    F --> G[Agent Response]
    G --> H[Store in Cache Layers]
    H --> Z
</code>

===== Layer 1: Exact-Match Caching =====

The simplest and fastest cache layer: hash the prompt and check for an identical match.

**Performance:** Near-zero overhead, instant retrieval, 100% precision on hits.

**Limitation:** Misses paraphrases entirely.
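The limitation follows directly from the design: an exact-match key is just a hash of the prompt bytes, so any rewording yields an unrelated key. A minimal sketch (the ''exact_key'' helper is illustrative, not from a library):

```python
import hashlib

def exact_key(prompt: str) -> str:
    # Byte-identical prompts are the only thing this key can match.
    return hashlib.sha256(prompt.encode()).hexdigest()

# Two paraphrases of the same question get unrelated keys:
k1 = exact_key("What's the capital of France?")
k2 = exact_key("Capital of France?")
assert k1 != k2  # guaranteed cache miss despite identical intent
```

Normalizing whitespace and casing before hashing recovers some near-duplicates, but true paraphrases still require the semantic layer described later.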
"What is the weather?" and "Tell me the weather" are cache misses. ^ Metric ^ Value ^ | Lookup latency | < 1ms | | Hit rate (typical) | 5-15% for diverse queries, 30-60% for structured queries | | Precision | 100% (exact match only) | | Storage cost | Minimal (hash + response) | import hashlib import json import redis class ExactMatchCache: def __init__(self, redis_url="redis://localhost:6379", ttl=3600): self.client = redis.from_url(redis_url) self.ttl = ttl def _hash_key(self, prompt: str, model: str) -> str: content = json.dumps({"prompt": prompt, "model": model}, sort_keys=True) return f"llm:exact:{hashlib.sha256(content.encode()).hexdigest()}" def get(self, prompt: str, model: str) -> str | None: key = self._hash_key(prompt, model) result = self.client.get(key) return result.decode() if result else None def set(self, prompt: str, model: str, response: str): key = self._hash_key(prompt, model) self.client.setex(key, self.ttl, response) ===== Layer 2: Semantic Caching ===== The most impactful cache layer for agents. 
It uses embedding similarity to match semantically equivalent queries, even with different wording.(([[https://redis.io/docs/latest/develop/ai/redisvl/0.7.0/user_guide/llmcache/|Semantic Caching for LLMs]] - Redis Documentation (2026)))

**Production benchmarks:**

  * **20-40% hit rate** in AI gateways (Bifrost, LiteLLM, Kong)
  * **Similarity threshold:** 0.80-0.85 cosine similarity is the sweet spot
  * **Embedding overhead:** ~11 microseconds per query (Bifrost benchmark)
  * **Cache lookup:** 1-5ms via Redis vector search

**How it works:**

  - Compute an embedding of the incoming query
  - Search the vector index for similar cached queries (cosine similarity > threshold)
  - If a match is found, return the cached response
  - If not, call the LLM, then cache the query embedding + response

<code python>
from redisvl.extensions.llmcache import SemanticCache


class AgentSemanticCache:
    # Note: redisvl's distance_threshold is a vector *distance*, so lower is
    # stricter. A distance of 0.15 corresponds to ~0.85 cosine similarity.
    def __init__(self, redis_url="redis://localhost:6379", threshold=0.15):
        self.cache = SemanticCache(
            name="agent_cache",
            redis_url=redis_url,
            distance_threshold=threshold,
        )
        self.stats = {"hits": 0, "misses": 0}

    def query(self, prompt: str) -> dict:
        results = self.cache.check(prompt=prompt)
        if results:
            self.stats["hits"] += 1
            return {"source": "cache", "response": results[0]["response"]}
        self.stats["misses"] += 1
        return {"source": "miss", "response": None}

    def store(self, prompt: str, response: str, metadata: dict = None):
        self.cache.store(prompt=prompt, response=response, metadata=metadata or {})

    @property
    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0.0


# Usage (call_llm is a placeholder for your LLM client wrapper)
cache = AgentSemanticCache(threshold=0.15)
result = cache.query("What is the capital of France?")
if result["source"] == "miss":
    llm_response = call_llm("What is the capital of France?")
    cache.store("What is the capital of France?", llm_response)
# Later: "Tell me France's capital city" -> cache HIT (semantic match)
</code>

===== Layer 3: KV Cache and Prefix Caching =====

Operates at the inference engine level,
not the application level, caching intermediate computation states.

**Prefix caching** reuses KV states when multiple requests share a common prefix (e.g., a system prompt). Supported by vLLM, SGLang, and Anthropic's API.

^ Provider/Engine ^ Feature ^ Savings ^
| Anthropic API | Prompt caching | 90% off cached input tokens |
| OpenAI API | Automatic prefix caching | 50% off cached input tokens |
| vLLM | ''--enable-prefix-caching'' | Eliminates recomputation of shared prefixes |
| SGLang | RadixAttention | Automatic prefix tree caching |

**Anthropic prompt caching example:** A 10K-token system prompt cached across requests costs $0.30/M tokens instead of $3.00/M (90% savings on those tokens).

===== Layer 4: Tool Result Caching =====

Agents call external tools (APIs, databases, search engines) that are often slow and rate-limited. Cache these results independently.

<code python>
import hashlib
import json
import time

import redis


class ToolResultCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.client = redis.from_url(redis_url)
        # TTL per tool type: volatile data gets a shorter TTL
        self.ttl_config = {
            "web_search": 3600,       # 1 hour
            "database_query": 300,    # 5 min
            "weather_api": 1800,      # 30 min
            "static_lookup": 86400,   # 24 hours
            "calculation": 604800,    # 7 days - deterministic
        }

    def _cache_key(self, tool_name: str, args: dict) -> str:
        content = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return f"tool:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, tool_name: str, args: dict) -> dict | None:
        key = self._cache_key(tool_name, args)
        result = self.client.get(key)
        if result:
            data = json.loads(result)
            age = time.time() - data["cached_at"]
            return {"result": data["result"], "cached": True, "age_seconds": age}
        return None

    def set(self, tool_name: str, args: dict, result):
        key = self._cache_key(tool_name, args)
        ttl = self.ttl_config.get(tool_name, 3600)
        data = json.dumps({"result": result, "cached_at": time.time()})
        self.client.setex(key, ttl, data)
</code>

===== Layer 5: Embedding Cache =====

If your agent generates embeddings for RAG or semantic search, cache them to avoid recomputation.

  * Embedding generation costs $0.02-0.13 per million tokens
  * Computation takes 50-200ms per batch
  * Cache embeddings keyed by content hash
  * Invalidate only when the source content changes

===== Exact-Match vs Semantic Cache: When to Use Which =====

^ Criteria ^ Exact-Match ^ Semantic ^
| Query diversity | Low (templated, structured) | High (natural language, varied) |
| Precision requirement | Must be 100% | 95%+ acceptable |
| Latency budget | < 1ms | 1-5ms |
| Setup complexity | Simple (hash + KV store) | Medium (embeddings + vector DB) |
| Typical hit rate | 5-60% | 20-40% |
| Best for | API tools, structured queries | User-facing chat, search |

**Recommendation:** Use both layers: exact-match as L1 (fast, precise), semantic as L2 (catches paraphrases).(([[https://redis.io/blog/10-techniques-for-semantic-cache-optimization/|10 Techniques for Semantic Cache Optimization]] - Redis Blog (2025)))

===== Tuning Semantic Cache Thresholds =====

The similarity threshold controls the tradeoff between hit rate and accuracy:

^ Threshold (cosine) ^ Hit Rate ^ False Positive Risk ^ Use Case ^
| > 0.95 | Low (5-10%) | Very low | High-stakes (medical, legal) |
| 0.85-0.95 | Medium (15-25%) | Low | General Q&A |
| 0.80-0.85 | High (25-40%) | Moderate | Customer support, FAQs |
| < 0.80 | Very high (40%+) | High | Only for non-critical, high-volume |

===== Production Architecture =====

<code>
graph LR
    A[Agent Query] --> B[Exact Match L1]
    B -->|MISS| C[Semantic Cache L2]
    C -->|MISS| D[LLM with Prefix Cache L3]
    D --> E[Tool Calls]
    E --> F[Tool Result Cache L4]
    F --> G[Response]
    G -->|Store| B
    G -->|Store| C
    B -->|HIT 10%| H[Response under 1ms]
    C -->|HIT 25%| I[Response in 2-5ms]
    F -->|HIT 40%| J[Skip API Call]
</code>

**Combined cache hit rates from production:**

  * L1 exact-match: 10-15% of total requests
  * L2 semantic: 20-30% of remaining requests
  * L4 tool results: 30-50% of tool calls avoided
  * **Net effect:** 40-60% of requests never reach the LLM

===== Monitoring and Invalidation =====

Critical metrics to track:

  * **Hit rate per layer** - target: >20% for semantic, >5% for exact
  * **False positive rate** - sample and verify cached responses weekly
  * **Cache staleness** - set TTLs appropriate to data volatility
  * **Memory usage** - monitor Redis memory, set eviction policies (''allkeys-lru'')
  * **Cost savings** - track (cache_hits * avg_api_cost) monthly

===== See Also =====

  * [[how_to_reduce_token_costs|How to Reduce Token Costs]]
  * [[how_to_speed_up_agents|How to Speed Up Agents]]
  * [[what_is_an_ai_agent|What is an AI Agent]]

===== References =====