This is an old revision of the document!

Agent Cost Optimization

Agent cost optimization is the discipline of managing token economics, inference costs, and compute budgets for production LLM agent systems. Agents make 3-10x more LLM calls than simple chatbots — a single user request can trigger planning, tool selection, execution, verification, and response generation, easily consuming 5x the token budget of a direct chat completion. An unconstrained coding agent can cost $5-8 per task in API fees alone.

The Real Cost Structure

Standard LLM pricing appears simple (pay per input/output token), but agents introduce compounding cost multipliers:

Multi-turn loops: A ReAct loop running 10 cycles can consume 50x the tokens of a single linear pass
Context accumulation: A 10-turn interaction costs 5x more with naive history appending
Tool overhead: Each tool call adds tokens for the tool schema, the call, and the result parsing
Multi-agent coordination: Orchestrator-worker patterns multiply token usage across agents

Real production cost data:

Agent Type	Monthly Operational Cost	Dev Cost
HR onboarding agent	$2,000-$5,000/mo	$50K-$100K
Legal document review	$4,000-$10,000/mo	$100K-$200K
Supply chain optimization	$5,000-$12,000/mo	$120K-$250K
Software engineering agent	$5-$8 per task	Variable

Pillar 1: Prompt Caching

Prompt caching reuses previously computed key-value (KV) attention tensors for repeated prompt prefixes. For agents that resend the same system prompt, tool definitions, and conversation history across dozens of API calls, caching eliminates 40-90% of redundant computation.

Provider-specific caching:

Provider	Mechanism	Discount	Cache TTL
Anthropic	Automatic prefix caching	90% on cached input tokens	5 minutes
OpenAI	Automatic prefix caching	50% on cached input tokens	~1 hour
Google	Context caching API	75% on cached tokens	Configurable

Cache-friendly architecture: Keep static content (system prompt, tool definitions, few-shot examples) at the beginning of the prompt. Append dynamic content at the end to maximize prefix overlap.

$$C_{effective} = C_{uncached} \times (1 - hit\_rate \times discount)$$

For a 90% cache hit rate with Anthropic's 90% discount: $C_{effective} = C_{uncached} \times 0.19$ — an 81% reduction.

Pillar 2: Model Routing

Not every agent step requires a frontier model. Model routing classifies tasks by complexity and directs them to the cheapest sufficient model.

# Model routing for agent cost optimization
from dataclasses import dataclass
from enum import Enum
 
class ModelTier(Enum):
    CHEAP = "gpt-4o-mini"       # $0.15/$0.60 per M tokens
    MID = "claude-3.5-haiku"    # $0.80/$4.00 per M tokens
    PREMIUM = "claude-sonnet"   # $3.00/$15.00 per M tokens
    FRONTIER = "claude-opus"    # $15.00/$75.00 per M tokens
 
@dataclass
class RoutingDecision:
    model: ModelTier
    reason: str
 
class AgentRouter:
    def __init__(self, classifier):
        self.classifier = classifier
 
    def route(self, task, context):
        complexity = self.classifier.classify(task)
        if complexity == "simple":
            return RoutingDecision(ModelTier.CHEAP, "Routine subtask")
        elif complexity == "moderate":
            return RoutingDecision(ModelTier.MID, "Standard reasoning")
        elif complexity == "complex":
            return RoutingDecision(ModelTier.PREMIUM, "Complex reasoning")
        else:
            return RoutingDecision(ModelTier.FRONTIER, "Max capability needed")
 
    def estimate_savings(self, task_distribution):
        costs = {
            ModelTier.CHEAP: 0.15, ModelTier.MID: 0.80,
            ModelTier.PREMIUM: 3.00, ModelTier.FRONTIER: 15.00
        }
        routed = sum(costs[self.route(t, {}).model] * pct
                     for t, pct in task_distribution.items())
        return 1 - (routed / 15.00)

A typical task distribution (60% simple, 25% moderate, 12% complex, 3% frontier) with routing yields up to 80% cost reduction versus routing everything through a frontier model.

Pillar 3: Prompt Compression

Prompt compression reduces token count while preserving semantic content:

LLMLingua-2: Compresses prompts up to 5x by identifying and removing redundant tokens
Incremental summarization: Replace full conversation history with rolling summaries
Observation masking: Strip verbose tool outputs to essential fields
Schema pruning: Include only relevant tool definitions per step, not the full catalog

$$C_{optimized} = C_{base} \times compression\_ratio \times (1 - cache\_hit \times cache\_discount)$$

Pillar 4: Semantic Caching

Semantic caching stores LLM responses for similar queries in vector databases, eliminating API calls entirely for 20-40% of repetitive traffic. Unlike exact-match caching, semantic caching uses embedding similarity:

“What are your business hours?” matches “When are you open?”
Threshold tuning balances hit rate against response accuracy

Tools: Redis with vector search, GPTCache, Pinecone-based solutions.

Pillar 5: FinOps and Observability

Production agent cost management requires instrumentation from day one:

Per-step cost tracking: Log token usage, model used, and cost for every LLM call
Budget guardrails: Set per-request and per-user token limits with graceful degradation
Anomaly detection: Alert on cost spikes from infinite loops or unexpected tool chains
Batch processing: Route non-interactive tasks to batch APIs for 50% token savings

Tooling: LangSmith, Braintrust, Helicone, and custom dashboards built on provider usage APIs.

Combined Impact

Technique	Cost Reduction	Implementation Effort
Prompt caching	40-81% on input tokens	Low (architecture change)
Model routing	Up to 80% overall	Medium (classifier needed)
Prompt compression	50-80% on token count	Medium (tooling integration)
Semantic caching	20-40% calls eliminated	Medium (vector DB setup)
Batch processing	50% on async tasks	Low (API flag)

Combining all techniques can reduce agent costs by 70-90% compared to naive implementations.

AI Agent Knowledge Base

Sidebar

Table of Contents

Agent Cost Optimization

The Real Cost Structure

Pillar 1: Prompt Caching

Pillar 2: Model Routing

Pillar 3: Prompt Compression

Pillar 4: Semantic Caching

Pillar 5: FinOps and Observability

Combined Impact

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Agent Cost Optimization

The Real Cost Structure

Pillar 1: Prompt Caching

Pillar 2: Model Routing

Pillar 3: Prompt Compression

Pillar 4: Semantic Caching

Pillar 5: FinOps and Observability

Combined Impact

References

See Also

Page Tools