Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Agent cost optimization is the discipline of managing token economics, inference costs, and compute budgets for production LLM agent systems. Agents make 3-10x more LLM calls than simple chatbots — a single user request can trigger planning, tool selection, execution, verification, and response generation, easily consuming 5x the token budget of a direct chat completion. An unconstrained coding agent can cost $5-8 per task in API fees alone.
Standard LLM pricing appears simple (pay per input/output token), but agents introduce compounding cost multipliers: every retry, verification pass, and tool-call loop re-sends the accumulated context, so input token usage grows with each step.
Real production cost data:
| Agent Type | Operational Cost | Dev Cost |
|---|---|---|
| HR onboarding agent | $2,000-$5,000/mo | $50K-$100K |
| Legal document review | $4,000-$10,000/mo | $100K-$200K |
| Supply chain optimization | $5,000-$12,000/mo | $120K-$250K |
| Software engineering agent | $5-$8 per task | Variable |
Prompt caching reuses previously computed key-value (KV) attention tensors for repeated prompt prefixes. For agents that resend the same system prompt, tool definitions, and conversation history across dozens of API calls, caching eliminates 40-90% of redundant computation.
Provider-specific caching:
| Provider | Mechanism | Discount | Cache TTL |
|---|---|---|---|
| Anthropic | Explicit prefix caching (`cache_control` breakpoints) | 90% on cached input tokens | 5 minutes (default) |
| OpenAI | Automatic prefix caching | 50% on cached input tokens | ~1 hour |
| Google (Gemini) | Context caching API | 75% on cached tokens | Configurable |
Cache-friendly architecture: Keep static content (system prompt, tool definitions, few-shot examples) at the beginning of the prompt. Append dynamic content at the end to maximize prefix overlap.
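The static-prefix-first layout can be sketched as a payload builder. The block structure loosely follows Anthropic's explicit prompt caching (`cache_control` on the last static block); the function name and field layout here are illustrative, not a verbatim SDK call, so check them against the provider's API reference.

```python
# Sketch: assembling a cache-friendly request payload. Static content
# goes first so every call in the agent loop shares the same prefix;
# dynamic content is appended last so it never invalidates the cache.

def build_request(system_prompt, tool_defs, few_shot, history, user_msg):
    static_blocks = [
        {"type": "text", "text": system_prompt},
        {"type": "text", "text": tool_defs},
        # Marking the final static block caches everything up to here.
        {"type": "text", "text": few_shot,
         "cache_control": {"type": "ephemeral"}},
    ]
    # Growing conversation history and the new user turn come after
    # the cached prefix.
    messages = history + [{"role": "user", "content": user_msg}]
    return {"system": static_blocks, "messages": messages}

payload = build_request("You are a helpful agent.", "<tool defs>",
                        "<examples>", [], "Summarize this report.")
```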
$$C_{effective} = C_{uncached} \times (1 - hit\_rate \times discount)$$
For a 90% cache hit rate with Anthropic's 90% discount: $C_{effective} = C_{uncached} \times 0.19$ — an 81% reduction.
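The arithmetic above can be checked directly; the function below is just the formula, not a provider API.

```python
def effective_cost(uncached_cost, hit_rate, discount):
    """Effective input-token cost under prefix caching:
    C_effective = C_uncached * (1 - hit_rate * discount)."""
    return uncached_cost * (1 - hit_rate * discount)

# 90% hit rate, 90% discount: only 19% of the uncached cost remains.
print(round(effective_cost(100.0, 0.9, 0.9), 2))  # → 19.0
```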
Not every agent step requires a frontier model. Model routing classifies tasks by complexity and directs them to the cheapest sufficient model.
```python
# Model routing for agent cost optimization
from dataclasses import dataclass
from enum import Enum

class ModelTier(Enum):
    CHEAP = "gpt-4o-mini"       # $0.15/$0.60 per M tokens (input/output)
    MID = "claude-3.5-haiku"    # $0.80/$4.00 per M tokens
    PREMIUM = "claude-sonnet"   # $3.00/$15.00 per M tokens
    FRONTIER = "claude-opus"    # $15.00/$75.00 per M tokens

@dataclass
class RoutingDecision:
    model: ModelTier
    reason: str

class AgentRouter:
    def __init__(self, classifier):
        self.classifier = classifier

    def route(self, task, context):
        complexity = self.classifier.classify(task)
        if complexity == "simple":
            return RoutingDecision(ModelTier.CHEAP, "Routine subtask")
        elif complexity == "moderate":
            return RoutingDecision(ModelTier.MID, "Standard reasoning")
        elif complexity == "complex":
            return RoutingDecision(ModelTier.PREMIUM, "Complex reasoning")
        else:
            return RoutingDecision(ModelTier.FRONTIER, "Max capability needed")

    def estimate_savings(self, task_distribution):
        # Input-token prices per M tokens; savings are measured against
        # sending every task to the frontier model.
        costs = {
            ModelTier.CHEAP: 0.15,
            ModelTier.MID: 0.80,
            ModelTier.PREMIUM: 3.00,
            ModelTier.FRONTIER: 15.00,
        }
        routed = sum(costs[self.route(t, {}).model] * pct
                     for t, pct in task_distribution.items())
        return 1 - (routed / costs[ModelTier.FRONTIER])
```
A typical task distribution (60% simple, 25% moderate, 12% complex, 3% frontier) with routing yields up to 80% cost reduction versus routing everything through a frontier model.
Prompt compression reduces token count while preserving semantic content:
$$C_{optimized} = C_{base} \times compression\_ratio \times (1 - cache\_hit \times cache\_discount)$$
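Plugging illustrative numbers into this combined formula shows how compression and caching stack multiplicatively (the inputs below are hypothetical, not provider figures):

```python
def optimized_cost(base_cost, compression_ratio, cache_hit, cache_discount):
    """C_optimized = C_base * compression_ratio * (1 - cache_hit * cache_discount).
    With compression_ratio = 1 this reduces to the plain caching formula."""
    return base_cost * compression_ratio * (1 - cache_hit * cache_discount)

# $100 of base input cost, 2x compression, 80% hit rate, 90% discount:
print(round(optimized_cost(100.0, 0.5, 0.8, 0.9), 2))  # → 14.0
```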
Semantic caching stores LLM responses for similar queries in vector databases, eliminating API calls entirely for 20-40% of repetitive traffic. Unlike exact-match caching, semantic caching uses embedding similarity:
Tools: Redis with vector search, GPTCache, Pinecone-based solutions.
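The lookup logic behind these tools can be sketched in a few lines. The `embed()` function below is a toy bag-of-words vectorizer standing in for a real embedding model; a production system would use one of the vector databases above.

```python
import math

def embed(text):
    # Toy embedding: word-count vector. Real systems use an embedding model.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def get(self, query):
        # Return a cached response if any stored query is similar enough.
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response
        return None  # cache miss: the caller falls back to a real LLM call

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.8)
cache.put("what is our refund policy", "Refunds within 30 days.")
print(cache.get("what is our refund policy please"))  # near-duplicate → hit
```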
Production agent cost management requires instrumentation from day one: track per-call token counts, spend attributed by model and agent step, and cache hit rates.
Tooling: LangSmith, Braintrust, Helicone, and custom dashboards built on provider usage APIs.
| Technique | Cost Reduction | Implementation Effort |
|---|---|---|
| Prompt caching | 40-81% on input tokens | Low (architecture change) |
| Model routing | Up to 80% overall | Medium (classifier needed) |
| Prompt compression | 50-80% on token count | Medium (tooling integration) |
| Semantic caching | 20-40% calls eliminated | Medium (vector DB setup) |
| Batch processing | 50% on async tasks | Low (API flag) |
Combining all techniques can reduce agent costs by 70-90% compared to naive implementations.