Agent cost optimization is the discipline of managing token economics, inference costs, and compute budgets for production LLM agent systems. Agents make 3-10x more LLM calls than simple chatbots — a single user request can trigger planning, tool selection, execution, verification, and response generation, easily consuming 5x the token budget of a direct chat completion. An unconstrained coding agent can cost $5-8 per task in API fees alone.
Standard LLM pricing appears simple (pay per input/output token), but agents introduce compounding cost multipliers: a multi-step loop re-sends the growing conversation history on every call, so input-token cost grows with each step rather than staying flat.
Real production cost data:
| Agent Type | Monthly Operational Cost | Development Cost |
| --- | --- | --- |
| HR onboarding agent | $2,000-$5,000/mo | $50K-$100K |
| Legal document review | $4,000-$10,000/mo | $100K-$200K |
| Supply chain optimization | $5,000-$12,000/mo | $120K-$250K |
| Software engineering agent | $5-$8 per task | Variable |
Prompt caching reuses previously computed key-value (KV) attention tensors for repeated prompt prefixes. For agents that resend the same system prompt, tool definitions, and conversation history across dozens of API calls, caching eliminates 40-90% of redundant computation.
Provider-specific caching:
| Provider | Mechanism | Discount | Cache TTL |
| --- | --- | --- | --- |
| Anthropic | Explicit prefix caching (`cache_control` breakpoints) | 90% on cached input tokens | 5 minutes (default) |
| OpenAI | Automatic prefix caching | 50% on cached input tokens | ~1 hour |
| Google (Gemini) | Context caching API | 75% on cached tokens | Configurable |
Cache-friendly architecture: Keep static content (system prompt, tool definitions, few-shot examples) at the beginning of the prompt. Append dynamic content at the end to maximize prefix overlap.
$$C_{effective} = C_{uncached} \times (1 - hit\_rate \times discount)$$
For a 90% cache hit rate with Anthropic's 90% discount: $C_{effective} = C_{uncached} \times 0.19$ — an 81% reduction.
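The effective-cost formula is a one-liner in code; the $100 base figure below is an arbitrary illustration, not a real price:

```python
# Effective cost under prompt caching:
# C_effective = C_uncached * (1 - hit_rate * discount)

def effective_cost(uncached_cost: float, hit_rate: float, discount: float) -> float:
    """Expected cost after caching, given a cache hit rate and a cached-token discount."""
    return uncached_cost * (1 - hit_rate * discount)

# 90% hit rate with a 90% cached-token discount leaves ~19% of the
# uncached cost -- the 81% reduction described above.
cost = effective_cost(100.0, hit_rate=0.9, discount=0.9)  # ~19.0
```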
Not every agent step requires a frontier model. Model routing classifies tasks by complexity and directs them to the cheapest sufficient model.
```python
# Model routing for agent cost optimization
from dataclasses import dataclass
from enum import Enum

class ModelTier(Enum):
    CHEAP = "gpt-4o-mini"        # $0.15/$0.60 per M tokens
    MID = "claude-3.5-haiku"     # $0.80/$4.00 per M tokens
    PREMIUM = "claude-sonnet"    # $3.00/$15.00 per M tokens
    FRONTIER = "claude-opus"     # $15.00/$75.00 per M tokens

@dataclass
class RoutingDecision:
    model: ModelTier
    reason: str

class AgentRouter:
    def __init__(self, classifier):
        self.classifier = classifier

    def route(self, task, context):
        complexity = self.classifier.classify(task)
        if complexity == "simple":
            return RoutingDecision(ModelTier.CHEAP, "Routine subtask")
        elif complexity == "moderate":
            return RoutingDecision(ModelTier.MID, "Standard reasoning")
        elif complexity == "complex":
            return RoutingDecision(ModelTier.PREMIUM, "Complex reasoning")
        else:
            return RoutingDecision(ModelTier.FRONTIER, "Max capability needed")

    def estimate_savings(self, task_distribution):
        # Input-token prices ($ per M tokens) for each tier
        costs = {
            ModelTier.CHEAP: 0.15,
            ModelTier.MID: 0.80,
            ModelTier.PREMIUM: 3.00,
            ModelTier.FRONTIER: 15.00,
        }
        routed = sum(costs[self.route(t, {}).model] * pct
                     for t, pct in task_distribution.items())
        return 1 - (routed / 15.00)
```
A typical task distribution (60% simple, 25% moderate, 12% complex, 3% frontier) with routing yields up to 80% cost reduction versus routing everything through a frontier model.
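The arithmetic behind that claim can be checked directly. The sketch below uses the input-token prices from the routing example above; since it ignores output tokens and routing errors, the result is an upper bound, which is why the headline figure is stated as "up to 80%":

```python
# Blended cost of a routed workload vs. sending everything to the frontier tier.
# Prices are input-token rates ($/M) only -- an upper-bound estimate.
PRICE = {"simple": 0.15, "moderate": 0.80, "complex": 3.00, "frontier": 15.00}
MIX   = {"simple": 0.60, "moderate": 0.25, "complex": 0.12, "frontier": 0.03}

blended = sum(PRICE[k] * MIX[k] for k in MIX)   # ~$1.10 per M input tokens
savings = 1 - blended / PRICE["frontier"]       # ~0.93 on input tokens alone
```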
Prompt compression reduces token count while preserving semantic content:
$$C_{optimized} = C_{base} \times compression\_ratio \times (1 - cache\_hit \times cache\_discount)$$
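The two levers stack multiplicatively, which a short sketch makes concrete (the ratios below are illustrative assumptions, not measurements):

```python
# Combined effect of prompt compression and caching:
# C_optimized = C_base * compression_ratio * (1 - cache_hit * cache_discount)

def optimized_cost(base: float, compression_ratio: float,
                   cache_hit: float, cache_discount: float) -> float:
    """Cost after compressing prompts and caching the remaining prefix."""
    return base * compression_ratio * (1 - cache_hit * cache_discount)

# 2x compression (ratio 0.5) plus a 90% hit rate at a 90% discount:
cost = optimized_cost(100.0, 0.5, 0.9, 0.9)   # ~9.5, i.e. roughly a 90% reduction
```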
Semantic caching stores LLM responses for similar queries in vector databases, eliminating API calls entirely for 20-40% of repetitive traffic. Unlike exact-match caching, semantic caching uses embedding similarity:
Tools: Redis with vector search, GPTCache, Pinecone-based solutions.
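A minimal sketch of the lookup logic follows. Real deployments use an embedding model and a vector database such as the tools listed above; the bag-of-words "embedding", cosine similarity, and 0.9 threshold here are illustrative stand-ins only:

```python
# Toy semantic cache: store (embedding, response) pairs and serve a cached
# response when a new query is similar enough to a stored one.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Placeholder embedding: word counts. Production systems use a real
    # embedding model instead.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]          # cache hit: the LLM call is skipped
        return None                 # cache miss: call the LLM, then put()

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is our refund policy", "Refunds within 30 days.")
cache.get("what is our refund policy?")   # near-identical query -> cached hit
```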
Production agent cost management requires instrumentation from day one:
Tooling: LangSmith, Braintrust, Helicone, and custom dashboards built on provider usage APIs.
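A hedged sketch of what that instrumentation records: each LLM call is tagged with model, task, and token counts, and costs are aggregated per (model, task) pair. The tier names and per-million-token prices are placeholders; production systems pull actual usage from provider APIs:

```python
# Minimal per-call cost tracker for agent instrumentation.
from collections import defaultdict
from dataclasses import dataclass, field

PRICES = {  # (input $/M tokens, output $/M tokens) -- assumed, not quoted
    "cheap":   (0.15, 0.60),
    "premium": (3.00, 15.00),
}

@dataclass
class CostTracker:
    totals: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, model: str, task: str, in_tok: int, out_tok: int) -> float:
        """Log one LLM call and return its dollar cost."""
        price_in, price_out = PRICES[model]
        cost = (in_tok * price_in + out_tok * price_out) / 1_000_000
        self.totals[(model, task)] += cost
        return cost

    def report(self) -> dict:
        """Accumulated spend keyed by (model, task) for dashboards/alerts."""
        return dict(self.totals)

tracker = CostTracker()
tracker.record("premium", "planning", in_tok=8_000, out_tok=1_000)
tracker.record("cheap", "tool-call", in_tok=2_000, out_tok=200)
```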
| Technique | Cost Reduction | Implementation Effort |
| --- | --- | --- |
| Prompt caching | 40-81% on input tokens | Low (architecture change) |
| Model routing | Up to 80% overall | Medium (classifier needed) |
| Prompt compression | 50-80% on token count | Medium (tooling integration) |
| Semantic caching | 20-40% of calls eliminated | Medium (vector DB setup) |
| Batch processing | 50% on async tasks | Low (API flag) |
Combining all techniques can reduce agent costs by 70-90% compared to naive implementations.