====== Agent Cost Optimization ======

Agent cost optimization is the discipline of managing token economics, inference costs, and compute budgets for production LLM agent systems. Agents make 3-10x more LLM calls than simple chatbots --- a single user request can trigger planning, tool selection, execution, verification, and response generation, easily consuming 5x the token budget of a direct chat completion. An unconstrained coding agent can cost $5-8 per task in API fees alone.

<code>
graph TD
    A[User Query] --> B{Model Router}
    B -->|Simple| C[Cheap Model]
    B -->|Complex| D[Frontier Model]
    C --> E{Cache Hit?}
    D --> E
    E -->|Yes| F[Return Cached Response]
    E -->|No| G[Prompt Compression]
    G --> H[Execute LLM Call]
    H --> I[Track Costs]
    I --> J[Response]
</code>

===== The Real Cost Structure =====

Standard LLM pricing appears simple (pay per input/output token), but agents introduce compounding cost multipliers:

  * **Multi-turn loops:** A ReAct loop running 10 cycles can consume 50x the tokens of a single linear pass
  * **Context accumulation:** A 10-turn interaction costs 5x more with naive history appending
  * **Tool overhead:** Each tool call adds tokens for the tool schema, the call itself, and the result parsing
  * **Multi-agent coordination:** Orchestrator-worker patterns multiply token usage across agents

**Real production cost data:**

| **Agent Type** | **Monthly Operational Cost** | **Dev Cost** |
| HR onboarding agent | $2,000-$5,000/mo | $50K-$100K |
| Legal document review | $4,000-$10,000/mo | $100K-$200K |
| Supply chain optimization | $5,000-$12,000/mo | $120K-$250K |
| Software engineering agent | $5-$8 per task | Variable |

===== Pillar 1: Prompt Caching =====

Prompt caching reuses previously computed key-value (KV) attention tensors for repeated prompt prefixes. For agents that resend the same system prompt, tool definitions, and conversation history across dozens of API calls, caching eliminates 40-90% of redundant computation.
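A minimal sketch of the cache-friendly ordering this pillar relies on: static content goes first so consecutive calls share the longest possible prefix, which is exactly what prefix caching reuses. All names and strings here (''build_prompt'', the example system prompt and tool list) are illustrative assumptions, not a real provider API.

```python
# Sketch: cache-friendly prompt assembly (illustrative only, not a provider API).
# Static content (system prompt, tool schemas) goes first so consecutive calls
# share the longest possible prefix -- the part prefix caching can reuse.

STATIC_SYSTEM = "You are a coding agent."                 # stable across calls
STATIC_TOOLS = '[{"name": "search"}, {"name": "edit"}]'   # stable across calls

def build_prompt(history: list[str], query: str) -> str:
    """Assemble a prompt with static parts first, dynamic parts last."""
    return "\n".join([STATIC_SYSTEM, STATIC_TOOLS] + history + [query])

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two prompts (a proxy for cache reuse)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Two consecutive agent calls: the second appends history but keeps the prefix.
call_1 = build_prompt([], "List files in src/")
call_2 = build_prompt(["List files in src/", "main.py utils.py"], "Open main.py")

# The entire static block (and more) is reusable on the second call.
static_len = len(STATIC_SYSTEM) + 1 + len(STATIC_TOOLS)
assert shared_prefix_len(call_1, call_2) >= static_len
```

Putting dynamic content (history, the new query) anywhere before the static block would shrink the shared prefix to near zero and forfeit the cache discount.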
**Provider-specific caching:**

| **Provider** | **Mechanism** | **Discount** | **Cache TTL** |
| Anthropic | Automatic prefix caching | 90% on cached input tokens | 5 minutes |
| OpenAI | Automatic prefix caching | 50% on cached input tokens | ~1 hour |
| Google | Context caching API | 75% on cached tokens | Configurable |

**Cache-friendly architecture:** Keep static content (system prompt, tool definitions, few-shot examples) at the //beginning// of the prompt. Append dynamic content at the end to maximize prefix overlap.

$$C_{effective} = C_{uncached} \times (1 - hit\_rate \times discount)$$

For a 90% cache hit rate with Anthropic's 90% discount: $C_{effective} = C_{uncached} \times 0.19$ --- an 81% reduction.

===== Pillar 2: Model Routing =====

Not every agent step requires a frontier model. Model routing classifies tasks by complexity and directs them to the cheapest sufficient model.

<code python>
# Model routing for agent cost optimization
from dataclasses import dataclass
from enum import Enum

class ModelTier(Enum):
    CHEAP = "gpt-4o-mini"        # $0.15/$0.60 per M tokens
    MID = "claude-3.5-haiku"     # $0.80/$4.00 per M tokens
    PREMIUM = "claude-sonnet"    # $3.00/$15.00 per M tokens
    FRONTIER = "claude-opus"     # $15.00/$75.00 per M tokens

@dataclass
class RoutingDecision:
    model: ModelTier
    reason: str

class AgentRouter:
    def __init__(self, classifier):
        self.classifier = classifier

    def route(self, task, context):
        """Pick the cheapest model tier sufficient for the task."""
        complexity = self.classifier.classify(task)
        if complexity == "simple":
            return RoutingDecision(ModelTier.CHEAP, "Routine subtask")
        elif complexity == "moderate":
            return RoutingDecision(ModelTier.MID, "Standard reasoning")
        elif complexity == "complex":
            return RoutingDecision(ModelTier.PREMIUM, "Complex reasoning")
        else:
            return RoutingDecision(ModelTier.FRONTIER, "Max capability needed")

    def estimate_savings(self, task_distribution):
        """Fractional input-cost reduction vs. sending everything to the frontier tier."""
        costs = {
            ModelTier.CHEAP: 0.15,
            ModelTier.MID: 0.80,
            ModelTier.PREMIUM: 3.00,
            ModelTier.FRONTIER: 15.00,
        }
        routed = sum(costs[self.route(t, {}).model] * pct
                     for t, pct in task_distribution.items())
        return 1 - (routed / 15.00)
</code>

A typical task distribution (60% simple, 25% moderate, 12% complex, 3% frontier) with routing yields up to 80% cost reduction versus routing everything through a frontier model.

===== Pillar 3: Prompt Compression =====

Prompt compression reduces token count while preserving semantic content:

  * **LLMLingua-2:** Compresses prompts up to 5x by identifying and removing redundant tokens
  * **Incremental summarization:** Replace full conversation history with rolling summaries
  * **Observation masking:** Strip verbose tool outputs to essential fields
  * **Schema pruning:** Include only relevant tool definitions per step, not the full catalog

$$C_{optimized} = C_{base} \times compression\_ratio \times (1 - cache\_hit \times cache\_discount)$$

===== Pillar 4: Semantic Caching =====

Semantic caching stores LLM responses for similar queries in vector databases, eliminating API calls entirely for 20-40% of repetitive traffic. Unlike exact-match caching, semantic caching uses embedding similarity:

  * "What are your business hours?" matches "When are you open?"
  * Threshold tuning balances hit rate against response accuracy

Tools: Redis with vector search, GPTCache, Pinecone-based solutions.

===== Pillar 5: FinOps and Observability =====

Production agent cost management requires instrumentation from day one:

  * **Per-step cost tracking:** Log token usage, model used, and cost for every LLM call
  * **Budget guardrails:** Set per-request and per-user token limits with graceful degradation
  * **Anomaly detection:** Alert on cost spikes from infinite loops or unexpected tool chains
  * **Batch processing:** Route non-interactive tasks to batch APIs for 50% token savings

**Tooling:** LangSmith, Braintrust, Helicone, and custom dashboards built on provider usage APIs.
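The per-step tracking and budget-guardrail practices above can be sketched in a few lines. The class name, prices, and exception behavior below are illustrative assumptions, not a real FinOps library; real per-million-token prices vary by provider and change over time.

```python
# Sketch: per-step cost tracking with a hard budget guardrail.
# Prices are illustrative (input, output) USD per million tokens -- assumptions,
# not authoritative pricing.

PRICE_PER_M = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet": (3.00, 15.00),
}

class CostTracker:
    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.log = []                      # one entry per LLM call

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Log one call's cost and return it; raise once the budget is exceeded."""
        in_price, out_price = PRICE_PER_M[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.spent_usd += cost
        self.log.append({"model": model, "cost": cost})
        if self.spent_usd > self.budget_usd:
            raise RuntimeError(f"Budget exceeded: ${self.spent_usd:.4f}")
        return cost

tracker = CostTracker(budget_usd=0.01)
# 2,000 input tokens at $0.15/M plus 500 output tokens at $0.60/M:
# 0.0003 + 0.0003 = 0.0006 USD for this step.
tracker.record("gpt-4o-mini", input_tokens=2_000, output_tokens=500)
```

In production the raise would typically be replaced by graceful degradation (e.g. routing the remaining steps to a cheaper tier), and the log shipped to whichever observability tool is in use.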
===== Combined Impact =====

| **Technique** | **Cost Reduction** | **Implementation Effort** |
| Prompt caching | 40-81% on input tokens | Low (architecture change) |
| Model routing | Up to 80% overall | Medium (classifier needed) |
| Prompt compression | 50-80% on token count | Medium (tooling integration) |
| Semantic caching | 20-40% of calls eliminated | Medium (vector DB setup) |
| Batch processing | 50% on async tasks | Low (API flag) |

Combining all techniques can reduce agent costs by 70-90% compared to naive implementations.

===== References =====

  * [[https://zylos.ai/research/2026-02-19-ai-agent-cost-optimization-token-economics|AI Agent Cost Optimization: Token Economics and FinOps (Zylos, 2026)]]
  * [[https://zylos.ai/research/2026-02-24-prompt-caching-ai-agents-architecture|Prompt Caching Architecture Patterns for AI Agents (Zylos, 2026)]]
  * [[https://techplustrends.com/agentic-ai-token-economics-cost-engineering/|Token Economics for Agentic AI: The 2026 ROI Playbook]]
  * [[https://neontri.com/blog/ai-agent-development-cost/|AI Agent Development Cost Guide (Neontri, 2026)]]
  * [[https://arxiv.org/abs/2403.12968|LLMLingua-2: Data Distillation for Prompt Compression (Microsoft, 2024)]]

===== See Also =====

  * [[small_language_model_agents]]
  * [[agentic_rpa]]
  * [[multimodal_agent_architectures]]