AI Agent Knowledge Base

A shared knowledge base for AI agents


Caching Strategies for Agents

Caching is the highest-ROI optimization for AI agents. By intercepting repeated or similar requests before they reach the LLM, production systems eliminate 20-45% of API calls entirely. This guide covers every caching layer – from exact-match to semantic similarity to tool result caching – with real architecture patterns and benchmarks.1) 2)

Why Caching Matters for Agents

Agents are expensive by nature: a single user query can trigger 3-8 LLM calls across planning, tool use, and synthesis steps. Without caching, identical or near-identical workflows execute from scratch every time. In production:3)

  • 30-50% of agent queries are semantically similar to previous ones
  • Each cached response saves $0.01-0.10 in API costs
  • Cache hits return in 1-5ms vs 1-10 seconds for LLM calls
  • At 100K requests/month, caching saves $500-3,000/month
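
A quick back-of-envelope check of the monthly figure above, using the stated ranges (the hit rate and per-call cost here are illustrative, not measurements from any one system):

```python
def monthly_savings(requests: int, hit_rate: float, cost_per_call: float) -> float:
    """Estimate monthly API cost avoided by cache hits."""
    return requests * hit_rate * cost_per_call

# 100K requests/month: conservative vs optimistic ends of the stated ranges
low = monthly_savings(100_000, 0.30, 0.02)   # $600
high = monthly_savings(100_000, 0.50, 0.06)  # $3,000
```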

The Caching Layer Stack

graph TB
    A[User Query] --> B[Layer 1: Exact Match Cache]
    B -->|HIT| Z[Return Cached Response]
    B -->|MISS| C[Layer 2: Semantic Cache]
    C -->|HIT| Z
    C -->|MISS| D[Layer 3: KV Cache / Prefix Cache]
    D --> E[LLM Inference]
    E --> F[Layer 4: Tool Result Cache]
    F --> G[Agent Response]
    G --> H[Store in Cache Layers]
    H --> Z

Layer 1: Exact-Match Caching

The simplest and fastest cache layer. Hash the prompt and check for identical matches.

Performance: Near-zero overhead, instant retrieval, 100% precision on hits.

Limitation: Misses paraphrases entirely. “What is the weather?” and “Tell me the weather” are cache misses.

Metric | Value
Lookup latency | < 1ms
Hit rate (typical) | 5-15% for diverse queries, 30-60% for structured queries
Precision | 100% (exact match only)
Storage cost | Minimal (hash + response)
import hashlib
import json
import redis

class ExactMatchCache:
    """Layer 1: hash (prompt, model) and look up identical requests in Redis."""

    def __init__(self, redis_url="redis://localhost:6379", ttl=3600):
        self.client = redis.from_url(redis_url)
        self.ttl = ttl  # seconds before an entry expires

    def _hash_key(self, prompt: str, model: str) -> str:
        # sort_keys makes the hash stable regardless of dict ordering
        content = json.dumps({"prompt": prompt, "model": model}, sort_keys=True)
        return f"llm:exact:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, prompt: str, model: str) -> str | None:
        key = self._hash_key(prompt, model)
        result = self.client.get(key)
        return result.decode() if result else None

    def set(self, prompt: str, model: str, response: str):
        key = self._hash_key(prompt, model)
        self.client.setex(key, self.ttl, response)  # SETEX = set with expiry

Layer 2: Semantic Caching

The most impactful cache layer for agents. Uses embedding similarity to match semantically equivalent queries, even with different wording.4)

Production benchmarks:

  • 20-40% hit rate in AI gateways (Bifrost, LiteLLM, Kong)
  • Similarity threshold: 0.80-0.85 cosine similarity is the sweet spot
  • Embedding overhead: ~11 microseconds per query (Bifrost benchmark)
  • Cache lookup: 1-5ms via Redis vector search

How it works:

  1. Compute embedding of incoming query
  2. Search vector index for similar cached queries (cosine similarity > threshold)
  3. If match found, return cached response
  4. If no match, call LLM, cache query embedding + response
from redisvl.extensions.llmcache import SemanticCache
 
class AgentSemanticCache:
    def __init__(self, redis_url="redis://localhost:6379", threshold=0.15):
        # Note: redisvl takes a vector *distance* threshold, not similarity.
        # A distance of 0.15 corresponds to roughly 0.85 cosine similarity,
        # the sweet spot cited above.
        self.cache = SemanticCache(
            name="agent_cache",
            redis_url=redis_url,
            distance_threshold=threshold,
        )
        self.stats = {"hits": 0, "misses": 0}
 
    def query(self, prompt: str) -> dict:
        results = self.cache.check(prompt=prompt)
        if results:
            self.stats["hits"] += 1
            return {"source": "cache", "response": results[0]["response"]}
 
        self.stats["misses"] += 1
        return {"source": "miss", "response": None}
 
    def store(self, prompt: str, response: str, metadata: dict = None):
        self.cache.store(
            prompt=prompt,
            response=response,
            metadata=metadata or {}
        )
 
    @property
    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0.0
 
# Usage
cache = AgentSemanticCache(threshold=0.15)
result = cache.query("What is the capital of France?")
if result["source"] == "miss":
    llm_response = call_llm("What is the capital of France?")
    cache.store("What is the capital of France?", llm_response)
# Later: "Tell me France's capital city" -> cache HIT (semantic match)

Layer 3: KV Cache and Prefix Caching

Operates at the inference engine level, not the application level. Caches intermediate computation states.

Prefix caching reuses KV states when multiple requests share a common prefix (e.g., system prompt). Supported by vLLM, SGLang, and Anthropic's API.

Provider/Engine | Feature | Savings
Anthropic API | Prompt caching | 90% off cached input tokens
OpenAI API | Automatic prefix caching | 50% off cached input tokens
vLLM | --enable-prefix-caching | Eliminates recomputation of shared prefixes
SGLang | RadixAttention | Automatic prefix tree caching

Anthropic prompt caching example: A 10K-token system prompt cached across requests costs $0.30/M tokens instead of $3.00/M (90% savings on those tokens).
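
The arithmetic behind that example can be sketched as follows. This assumes the full system prompt is a cache hit on every request after the first, and ignores the small cache-write premium some providers charge; the rates are the ones quoted above.

```python
def cached_input_cost(prompt_tokens: int, requests: int,
                      rate_per_m: float, cached_rate_per_m: float) -> dict:
    """Compare input-token spend with and without prefix caching."""
    without = requests * prompt_tokens * rate_per_m / 1_000_000
    with_cache = (prompt_tokens * rate_per_m / 1_000_000            # first request populates the cache
                  + (requests - 1) * prompt_tokens * cached_rate_per_m / 1_000_000)
    return {"without": without, "with": with_cache}

# 10K-token system prompt, 1,000 requests: $3.00/M base vs $0.30/M cached reads
costs = cached_input_cost(10_000, 1_000, 3.00, 0.30)
# costs["without"] == 30.0, costs["with"] ~= 3.03 -> ~90% savings on those tokens
```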

Layer 4: Tool Result Caching

Agents call external tools (APIs, databases, search engines) that are often slow and rate-limited. Cache these results independently.

import hashlib
import json
import time
import redis
 
class ToolResultCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.client = redis.from_url(redis_url)
        # TTL per tool type: volatile data gets shorter TTL
        self.ttl_config = {
            "web_search": 3600,       # 1 hour
            "database_query": 300,     # 5 min
            "weather_api": 1800,       # 30 min
            "static_lookup": 86400,    # 24 hours
            "calculation": 604800,     # 7 days - deterministic
        }
 
    def _cache_key(self, tool_name: str, args: dict) -> str:
        content = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return f"tool:{hashlib.sha256(content.encode()).hexdigest()}"
 
    def get(self, tool_name: str, args: dict) -> dict | None:
        key = self._cache_key(tool_name, args)
        result = self.client.get(key)
        if result:
            data = json.loads(result)
            age = time.time() - data["cached_at"]
            return {"result": data["result"], "cached": True, "age_seconds": age}
        return None
 
    def set(self, tool_name: str, args: dict, result):
        key = self._cache_key(tool_name, args)
        ttl = self.ttl_config.get(tool_name, 3600)
        data = json.dumps({"result": result, "cached_at": time.time()})
        self.client.setex(key, ttl, data)

Layer 5: Embedding Cache

If your agent generates embeddings for RAG or semantic search, cache them to avoid recomputation.

  • Embedding generation costs $0.02-0.13 per million tokens
  • Computation takes 50-200ms per batch
  • Cache embeddings keyed by content hash
  • Invalidate only when source content changes
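
A minimal sketch of the pattern in the bullets above: key by content hash, compute only on first sight, invalidate on content change. `embed_fn` is a placeholder for whatever embedding call you use; an in-memory dict stands in for a persistent store.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn                  # your embedding call (API or local model)
        self.store: dict[str, list[float]] = {}

    def get_embedding(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()  # content hash = cache key
        if key not in self.store:
            self.store[key] = self.embed_fn(text)        # compute only on miss
        return self.store[key]

    def invalidate(self, text: str):
        """Drop the entry when the source content changes."""
        self.store.pop(hashlib.sha256(text.encode()).hexdigest(), None)
```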

Exact-Match vs Semantic Cache: When to Use Which

Criteria | Exact-Match | Semantic
Query diversity | Low (templated, structured) | High (natural language, varied)
Precision requirement | Must be 100% | 95%+ acceptable
Latency budget | < 1ms | 1-5ms
Setup complexity | Simple (hash + KV store) | Medium (embeddings + vector DB)
Typical hit rate | 5-60% | 20-40%
Best for | API tools, structured queries | User-facing chat, search

Recommendation: Use both layers. Exact-match as L1 (fast, precise), semantic as L2 (catches paraphrases).5)

Tuning Semantic Cache Thresholds

The similarity threshold controls the tradeoff between hit rate and accuracy:

Threshold (cosine) | Hit Rate | False Positive Risk | Use Case
> 0.95 | Low (5-10%) | Very low | High-stakes (medical, legal)
0.85-0.95 | Medium (15-25%) | Low | General Q&A
0.80-0.85 | High (25-40%) | Moderate | Customer support, FAQs
< 0.80 | Very high (40%+) | High | Only for non-critical, high-volume
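
In practice the threshold is best picked empirically from a small labeled sample of query pairs. A sketch of that evaluation, with a made-up toy sample (the pairs and similarity scores are purely illustrative):

```python
def evaluate_threshold(pairs, threshold):
    """pairs: (cosine_similarity, same_intent) tuples from a labeled sample.
    Returns hit rate (fraction served from cache at this threshold) and
    false-positive rate (fraction of those hits whose intent did not match)."""
    hits = [(sim, same) for sim, same in pairs if sim >= threshold]
    hit_rate = len(hits) / len(pairs)
    fp_rate = sum(1 for _, same in hits if not same) / len(hits) if hits else 0.0
    return hit_rate, fp_rate

# Toy labeled sample: (similarity to nearest cached query, same intent?)
sample = [(0.97, True), (0.91, True), (0.88, True), (0.84, False),
          (0.83, True), (0.78, False), (0.72, False), (0.60, False)]
for t in (0.95, 0.85, 0.80):
    hr, fp = evaluate_threshold(sample, t)   # lower threshold -> more hits, more FPs
```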

Production Architecture

graph LR
    A[Agent Query] --> B[Exact Match L1]
    B -->|MISS| C[Semantic Cache L2]
    C -->|MISS| D[LLM with Prefix Cache L3]
    D --> E[Tool Calls]
    E --> F[Tool Result Cache L4]
    F --> G[Response]
    G -->|Store| B
    G -->|Store| C
    B -->|HIT 10%| H[Response under 1ms]
    C -->|HIT 25%| I[Response in 2-5ms]
    F -->|HIT 40%| J[Skip API Call]

Combined cache hit rates from production:

  • L1 exact-match: 10-15% of total requests
  • L2 semantic: 20-30% of remaining requests
  • L4 tool results: 30-50% of tool calls avoided
  • Net effect: 40-60% of requests never reach the LLM
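
Since each layer sees only the misses of the layer above it, the request-level rates compose multiplicatively. A quick check with the upper-end figures from the list (tool-result caching then trims work on the requests that do reach the LLM):

```python
def combined_hit_rate(layer_rates):
    """Fraction of requests absorbed before the LLM, where each cache layer
    sees only the misses of the layer above it."""
    miss = 1.0
    for rate in layer_rates:
        miss *= (1.0 - rate)
    return 1.0 - miss

# Upper-end figures from above: 15% exact-match, then 30% semantic
rate = combined_hit_rate([0.15, 0.30])   # ~0.405 of requests never reach the LLM
```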

Monitoring and Invalidation

Critical metrics to track:

  • Hit rate per layer - target: >20% for semantic, >5% for exact
  • False positive rate - sample and verify cached responses weekly
  • Cache staleness - set TTLs appropriate to data volatility
  • Memory usage - monitor Redis memory, set eviction policies (allkeys-lru)
  • Cost savings - track (cache_hits * avg_api_cost) monthly
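
A minimal tracker covering the first and last metrics above (per-layer hit rate and the `cache_hits * avg_api_cost` savings estimate); the class and its field names are illustrative, and the default per-call cost is an assumption within the $0.01-0.10 range cited earlier.

```python
from collections import defaultdict

class CacheMetrics:
    """Track per-layer hit rates and estimated API cost avoided."""

    def __init__(self, avg_api_cost: float = 0.05):   # assumed $/avoided call
        self.avg_api_cost = avg_api_cost
        self.counts = defaultdict(lambda: {"hits": 0, "misses": 0})

    def record(self, layer: str, hit: bool):
        self.counts[layer]["hits" if hit else "misses"] += 1

    def hit_rate(self, layer: str) -> float:
        c = self.counts[layer]
        total = c["hits"] + c["misses"]
        return c["hits"] / total if total else 0.0

    def estimated_savings(self) -> float:
        """cache_hits * avg_api_cost, summed across layers."""
        return sum(c["hits"] for c in self.counts.values()) * self.avg_api_cost
```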

References

3) Caching Strategies for AI Agent Traffic - Nordic APIs (2025): https://nordicapis.com/caching-strategies-for-ai-agent-traffic/
4) Semantic Caching for LLMs - Redis Documentation (2026): https://redis.io/docs/latest/develop/ai/redisvl/0.7.0/user_guide/llmcache/
5) 10 Techniques for Semantic Cache Optimization - Redis Blog (2025): https://redis.io/blog/10-techniques-for-semantic-cache-optimization/