Caching is the highest-ROI optimization for AI agents. By intercepting repeated or similar requests before they reach the LLM, production systems eliminate 20-45% of API calls entirely. This guide covers every caching layer – from exact-match to semantic similarity to tool-result caching – with real architecture patterns and benchmarks.
Agents are expensive by nature: a single user query can trigger 3-8 LLM calls across planning, tool use, and synthesis steps. Without caching, identical or near-identical workflows execute from scratch every time, and in production those redundant calls compound across every request.
Exact-match caching is the simplest and fastest cache layer: hash the prompt and check for an identical match.
Performance: Near-zero overhead, instant retrieval, 100% precision on hits.
Limitation: Misses paraphrases entirely. “What is the weather?” and “Tell me the weather” are cache misses.
| Metric | Value |
|---|---|
| Lookup latency | < 1ms |
| Hit rate (typical) | 5-15% for diverse queries, 30-60% for structured queries |
| Precision | 100% (exact match only) |
| Storage cost | Minimal (hash + response) |
```python
import hashlib
import json

import redis


class ExactMatchCache:
    def __init__(self, redis_url="redis://localhost:6379", ttl=3600):
        self.client = redis.from_url(redis_url)
        self.ttl = ttl

    def _hash_key(self, prompt: str, model: str) -> str:
        # Canonical JSON (sorted keys) so equivalent inputs hash identically.
        content = json.dumps({"prompt": prompt, "model": model}, sort_keys=True)
        return f"llm:exact:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, prompt: str, model: str) -> str | None:
        key = self._hash_key(prompt, model)
        result = self.client.get(key)
        return result.decode() if result else None

    def set(self, prompt: str, model: str, response: str):
        key = self._hash_key(prompt, model)
        self.client.setex(key, self.ttl, response)
```
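The key derivation above is deterministic, which is both what makes exact-match caching 100%-precise and what makes it blind to paraphrases. A standalone sketch of the same hashing scheme, with no Redis required (the model name is illustrative):

```python
import hashlib
import json


def exact_key(prompt: str, model: str) -> str:
    # Same scheme as ExactMatchCache._hash_key: canonical JSON, then SHA-256.
    content = json.dumps({"prompt": prompt, "model": model}, sort_keys=True)
    return f"llm:exact:{hashlib.sha256(content.encode()).hexdigest()}"


k1 = exact_key("What is the weather?", "gpt-4o")
k2 = exact_key("What is the weather?", "gpt-4o")
k3 = exact_key("Tell me the weather", "gpt-4o")

# Identical prompts collide on the same key; a paraphrase does not.
assert k1 == k2
assert k1 != k3
```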
Semantic caching is the most impactful cache layer for agents. It uses embedding similarity to match semantically equivalent queries, even when the wording differs.
How it works: the incoming prompt is embedded and compared against the embeddings of previously cached prompts via vector search; if the closest match falls within the distance threshold, the stored response is returned without an LLM call. In production, this typically recovers 20-40% of requests (see the comparison table below).
```python
from redisvl.extensions.llmcache import SemanticCache


class AgentSemanticCache:
    def __init__(self, redis_url="redis://localhost:6379", threshold=0.15):
        self.cache = SemanticCache(
            name="agent_cache",
            redis_url=redis_url,
            distance_threshold=threshold,
        )
        self.stats = {"hits": 0, "misses": 0}

    def query(self, prompt: str) -> dict:
        results = self.cache.check(prompt=prompt)
        if results:
            self.stats["hits"] += 1
            return {"source": "cache", "response": results[0]["response"]}
        self.stats["misses"] += 1
        return {"source": "miss", "response": None}

    def store(self, prompt: str, response: str, metadata: dict = None):
        self.cache.store(
            prompt=prompt,
            response=response,
            metadata=metadata or {},
        )

    @property
    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0.0


# Usage
cache = AgentSemanticCache(threshold=0.15)
result = cache.query("What is the capital of France?")
if result["source"] == "miss":
    llm_response = call_llm("What is the capital of France?")
    cache.store("What is the capital of France?", llm_response)

# Later: "Tell me France's capital city" -> cache HIT (semantic match)
```
KV and prefix caching operate at the inference-engine level, not the application level: they cache intermediate computation states (attention key/value tensors) rather than final responses.
Prefix caching reuses KV states when multiple requests share a common prefix (e.g., system prompt). Supported by vLLM, SGLang, and Anthropic's API.
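On the Anthropic API, prompt caching is opted into per content block. The sketch below builds the request body as a plain dict so the shape is visible; in practice these fields go to the SDK's `client.messages.create(...)`. The model name and prompt text are illustrative, and the `cache_control` field follows Anthropic's prompt caching API as documented at the time of writing.

```python
# Illustrative system prompt standing in for a long (~10K-token) instruction block.
LONG_SYSTEM_PROMPT = "You are a support agent for Acme Corp. " * 300

request_body = {
    "model": "claude-sonnet-4-20250514",  # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks everything up to and including this block as cacheable;
            # subsequent requests sharing the identical prefix read it at
            # the discounted cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

The cache key is the exact token prefix, so the system block must be byte-identical across requests for a cache read to occur.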
| Provider/Engine | Feature | Savings |
|---|---|---|
| Anthropic API | Prompt caching | 90% off cached input tokens |
| OpenAI API | Automatic prefix caching | 50% off cached input tokens |
| vLLM | `--enable-prefix-caching` | Eliminates recomputation of shared prefixes |
| SGLang | RadixAttention | Automatic prefix tree caching |
Anthropic prompt caching example: A 10K-token system prompt cached across requests costs $0.30/M tokens instead of $3.00/M (90% savings on those tokens).
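The savings above can be sanity-checked with a few lines of arithmetic. This sketch assumes the per-token prices quoted in this section ($3.00/M uncached input, $0.30/M cached reads) and a hypothetical workload of 1,000 requests sharing the 10K-token system prompt; cache-write surcharges are ignored for simplicity.

```python
# Back-of-envelope cost comparison for a cached 10K-token system prompt.
PRICE_UNCACHED = 3.00  # $/M tokens, normal input
PRICE_CACHED = 0.30    # $/M tokens, cached read (90% off)

prompt_tokens = 10_000
requests = 1_000  # illustrative workload size

cost_without_cache = requests * prompt_tokens / 1_000_000 * PRICE_UNCACHED
cost_with_cache = requests * prompt_tokens / 1_000_000 * PRICE_CACHED

print(f"Without caching: ${cost_without_cache:.2f}")  # $30.00
print(f"With caching:    ${cost_with_cache:.2f}")     # $3.00
print(f"Savings: {1 - cost_with_cache / cost_without_cache:.0%}")  # 90%
```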
Agents call external tools (APIs, databases, search engines) that are often slow and rate-limited. Cache these results independently.
```python
import hashlib
import json
import time

import redis


class ToolResultCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.client = redis.from_url(redis_url)
        # TTL per tool type: volatile data gets a shorter TTL
        self.ttl_config = {
            "web_search": 3600,       # 1 hour
            "database_query": 300,    # 5 min
            "weather_api": 1800,      # 30 min
            "static_lookup": 86400,   # 24 hours
            "calculation": 604800,    # 7 days - deterministic
        }

    def _cache_key(self, tool_name: str, args: dict) -> str:
        content = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return f"tool:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, tool_name: str, args: dict) -> dict | None:
        key = self._cache_key(tool_name, args)
        result = self.client.get(key)
        if result:
            data = json.loads(result)
            age = time.time() - data["cached_at"]
            return {"result": data["result"], "cached": True, "age_seconds": age}
        return None

    def set(self, tool_name: str, args: dict, result):
        key = self._cache_key(tool_name, args)
        ttl = self.ttl_config.get(tool_name, 3600)
        data = json.dumps({"result": result, "cached_at": time.time()})
        self.client.setex(key, ttl, data)
```
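One way to wire tool caching into an agent loop is a get-or-execute wrapper. This is a sketch: `cache` is any object with the `get`/`set` shape above (duck-typed), and the weather tool below is a hypothetical stand-in so the flow can be exercised without Redis or a real API.

```python
def cached_tool_call(cache, tool_name: str, args: dict, tool_fn):
    """Return a cached tool result if present, otherwise execute and cache."""
    hit = cache.get(tool_name, args)
    if hit is not None:
        return hit["result"]
    result = tool_fn(**args)
    cache.set(tool_name, args, result)
    return result


class DictToolCache:
    """In-memory stand-in with the same get/set signature as ToolResultCache."""

    def __init__(self):
        self.store = {}

    def _key(self, tool_name, args):
        return (tool_name, tuple(sorted(args.items())))

    def get(self, tool_name, args):
        key = self._key(tool_name, args)
        if key in self.store:
            return {"result": self.store[key], "cached": True}
        return None

    def set(self, tool_name, args, result):
        self.store[self._key(tool_name, args)] = result


calls = []


def fake_weather_api(city):  # hypothetical tool
    calls.append(city)
    return {"city": city, "temp_c": 21}


cache = DictToolCache()
first = cached_tool_call(cache, "weather_api", {"city": "Paris"}, fake_weather_api)
second = cached_tool_call(cache, "weather_api", {"city": "Paris"}, fake_weather_api)
# The second call is served from the cache: fake_weather_api ran only once.
```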
If your agent generates embeddings for RAG or semantic search, cache them to avoid recomputation.
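A minimal content-addressed embedding cache can look like the sketch below. It is in-memory for clarity (swap the dict for Redis or disk in production), and the stub embedder stands in for a real model call such as an embeddings API.

```python
import hashlib


class EmbeddingCache:
    """Content-addressed embedding cache: identical text never embeds twice."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any text -> vector function
        self.store = {}
        self.computed = 0  # counts actual model invocations

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:
            self.computed += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]


def stub_embed(text: str) -> list[float]:
    # Toy embedder; a real deployment would call an embedding model here.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(stub_embed)
v1 = cache.embed("cache this document chunk")
v2 = cache.embed("cache this document chunk")  # served from cache
# cache.computed == 1: the embedder ran once for the repeated chunk.
```

This pays off most in RAG ingestion pipelines, where the same document chunks are re-embedded on every re-index.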
| Criteria | Exact-Match | Semantic |
|---|---|---|
| Query diversity | Low (templated, structured) | High (natural language, varied) |
| Precision requirement | Must be 100% | 95%+ acceptable |
| Latency budget | < 1ms | 1-5ms |
| Setup complexity | Simple (hash + KV store) | Medium (embeddings + vector DB) |
| Typical hit rate | 5-60% | 20-40% |
| Best for | API tools, structured queries | User-facing chat, search |
Recommendation: Use both layers. Exact-match as L1 (fast, precise), semantic as L2 (catches paraphrases).
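The two-layer lookup can be sketched as follows. Both layers are duck-typed: `exact` needs `get`/`set` keyed by prompt, `semantic` needs the `query`/`store` shape of the semantic cache shown earlier. The in-memory stand-ins below (including a toy word-overlap "semantic" matcher, where a real deployment would use embedding distance) exist only so the flow is runnable.

```python
class TwoLayerCache:
    def __init__(self, exact, semantic):
        self.exact = exact
        self.semantic = semantic

    def lookup(self, prompt: str):
        hit = self.exact.get(prompt)          # L1: sub-ms, 100% precision
        if hit is not None:
            return {"source": "exact", "response": hit}
        result = self.semantic.query(prompt)  # L2: catches paraphrases
        if result["response"] is not None:
            # Promote the hit so the next identical prompt resolves in L1.
            self.exact.set(prompt, result["response"])
            return {"source": "semantic", "response": result["response"]}
        return {"source": "miss", "response": None}

    def store(self, prompt: str, response: str):
        self.exact.set(prompt, response)
        self.semantic.store(prompt, response)


class DictExact:
    def __init__(self):
        self.d = {}

    def get(self, p):
        return self.d.get(p)

    def set(self, p, r):
        self.d[p] = r


class NaiveSemantic:
    """Toy matcher using lowercased word overlap in place of embeddings."""

    def __init__(self):
        self.entries = []

    def query(self, p):
        words = set(p.lower().split())
        for q, r in self.entries:
            overlap = len(words & set(q.lower().split())) / max(len(words), 1)
            if overlap >= 0.5:
                return {"response": r}
        return {"response": None}

    def store(self, p, r):
        self.entries.append((p, r))


cache = TwoLayerCache(DictExact(), NaiveSemantic())
cache.store("What is the capital of France?", "Paris")

hit = cache.lookup("what is the capital city of France?")   # L2 paraphrase hit
hit2 = cache.lookup("what is the capital city of France?")  # promoted to L1
```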
The similarity threshold controls the tradeoff between hit rate and accuracy:
| Threshold (cosine) | Hit Rate | False Positive Risk | Use Case |
|---|---|---|---|
| > 0.95 | Low (5-10%) | Very low | High-stakes (medical, legal) |
| 0.85-0.95 | Medium (15-25%) | Low | General Q&A |
| 0.80-0.85 | High (25-40%) | Moderate | Customer support, FAQs |
| < 0.80 | Very high (40%+) | High | Only for non-critical, high-volume |
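Note the table is expressed in cosine similarity, while `SemanticCache` in the example earlier takes a `distance_threshold`. Assuming cosine distance (redisvl's default metric), the two are related by distance = 1 - similarity, so a quick converter is:

```python
def similarity_to_distance(similarity: float) -> float:
    """Convert a cosine-similarity threshold to a cosine-distance threshold
    (the form expected by vector caches such as redisvl's SemanticCache)."""
    return round(1.0 - similarity, 6)


# The 0.15 distance threshold used earlier corresponds to ~0.85 similarity,
# i.e. the customer support / FAQ row of the table.
print(similarity_to_distance(0.85))  # 0.15
print(similarity_to_distance(0.95))  # 0.05 -> high-stakes setting
```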
Combined cache hit rates from production:
Critical metrics to track: