====== How to Reduce Token Costs ======

Reducing token costs is one of the most impactful optimizations for LLM-powered applications. Production teams report **50-85% cost reductions** by layering techniques such as prompt compression, semantic caching, and intelligent model routing. This guide covers proven strategies with real numbers.(([[https://redis.io/blog/llm-token-optimization-speed-up-apps/|LLM Token Optimization]]))(([[https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/|8 Strategies That Cut API Spend by 80%]]))(([[https://www.glukhov.org/post/2025/11/cost-effective-llm-applications/|Cost-Effective LLM Applications]]))(([[https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering|How to Cut LLM Costs with Metering]]))

===== The Token Cost Problem =====

Every API call to an LLM is billed by token count. A single GPT-4o request processing a 10-page document can cost $0.05-0.15. At scale (100K+ requests/month), this compounds to thousands of dollars per month. The key insight: most of those tokens are wasted.

===== Current API Pricing Landscape (2026) =====

Understanding the pricing tiers is essential for cost optimization:

^ Provider ^ Model ^ Input ($/M tokens) ^ Output ($/M tokens) ^ Notes ^
| OpenAI | GPT-4o | $2.50 | $10.00 | Flagship reasoning |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | ~16x cheaper than flagship |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Strong reasoning |
| Anthropic | Claude Haiku 3.5 | $0.25 | $1.25 | Fast, budget tier |
| Google | Gemini 2.5 Pro | $2.00 | $8.00 | Long-context strength |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest major model |

//Prices verified February 2026.
Always check provider docs for current rates.//

===== Technique 1: Prompt Compression =====

**LLMLingua** (Microsoft Research) compresses prompts by removing redundant tokens while preserving semantic meaning.(([[https://arxiv.org/abs/2310.05736|LLMLingua: Compressing Prompts for Accelerated Inference]]))

**Measured results:**

  * 30-40% token reduction with minimal performance loss
  * LongLLMLingua achieves up to **10x compression** on long contexts
  * 90%+ task performance retention after compression
  * Direct translation to 30-40% cost savings on input tokens

<code python>
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # required when loading an LLMLingua-2 model
    device_map="cpu",
)

def compress_and_query(prompt, context, question, target_ratio=0.5):
    # Compress the context, keeping roughly `target_ratio` of its tokens.
    compressed = compressor.compress_prompt(
        context=[context],
        instruction=prompt,
        question=question,
        rate=target_ratio,
    )
    original_tokens = compressed["origin_tokens"]
    compressed_tokens = compressed["compressed_tokens"]
    savings_pct = (1 - compressed_tokens / original_tokens) * 100
    print(f"Tokens: {original_tokens} -> {compressed_tokens} ({savings_pct:.1f}% saved)")
    return compressed["compressed_prompt"]
</code>

===== Technique 2: Model Routing =====

Route each query to the cheapest model capable of handling it. Production deployments show that **70-80% of queries** can be handled by budget models.
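Before wiring up a router, it is worth quantifying the gap between tiers. A back-of-the-envelope sketch (the 70/25/5 traffic split and per-model prices here are illustrative assumptions, taken from the routing tiers discussed in this section) compares blended input cost against sending everything to the flagship:

<code python>
# Blended input cost per million tokens under tiered routing,
# versus sending all traffic to the flagship model.
# Prices ($/M input tokens) and the 70/25/5 split are illustrative.
TIER_PRICES = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50, "claude-opus-4": 15.00}
TIER_SHARE  = {"gpt-4o-mini": 0.70, "gpt-4o": 0.25, "claude-opus-4": 0.05}

def blended_input_cost() -> float:
    """$/M input tokens when traffic is split across tiers."""
    return sum(TIER_PRICES[m] * TIER_SHARE[m] for m in TIER_PRICES)

blended = blended_input_cost()
savings = 1 - blended / TIER_PRICES["gpt-4o"]
print(f"Blended: ${blended:.2f}/M vs ${TIER_PRICES['gpt-4o']:.2f}/M "
      f"({savings:.0%} saved on input tokens)")
# -> Blended: $1.48/M vs $2.50/M (41% saved on input tokens)
</code>

Even this conservative split (input tokens only, with a pricey premium tier) yields ~40% savings; the larger published numbers come from routing a bigger share of traffic to the budget tier and saving on output tokens as well.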
<code mermaid>
graph TD
    A[Incoming Query] --> B[Complexity Classifier]
    B -->|Simple 70%| C["GPT-4o-mini $0.15/M input"]
    B -->|Medium 25%| D["GPT-4o $2.50/M input"]
    B -->|Complex 5%| E["Claude Opus $15.00/M input"]
    C --> F[Response]
    D --> F
    E --> F
    F --> G{Quality Check}
    G -->|Pass| H[Return Response]
    G -->|Fail| I[Escalate to Next Tier]
    I --> B
</code>

<code python>
import openai
from enum import Enum

class ModelTier(Enum):
    BUDGET = "gpt-4o-mini"     # $0.15/M input
    STANDARD = "gpt-4o"        # $2.50/M input
    PREMIUM = "claude-opus-4"  # $15.00/M input (needs the Anthropic client in practice)

class ModelRouter:
    COMPLEXITY_SIGNALS = {
        "simple": ["summarize", "translate", "extract", "list", "format"],
        "complex": ["analyze", "reason", "compare", "evaluate", "multi-step"],
    }

    def classify(self, query: str) -> ModelTier:
        # Keyword heuristic: complex signals start on the standard tier;
        # everything else starts on budget. Premium is reached only by escalation.
        query_lower = query.lower()
        if any(sig in query_lower for sig in self.COMPLEXITY_SIGNALS["complex"]):
            return ModelTier.STANDARD
        return ModelTier.BUDGET

    async def route(self, query: str, max_tier: ModelTier = ModelTier.PREMIUM):
        tier = self.classify(query)
        try:
            return await self._call_model(tier.value, query)
        except Exception:
            # On failure, escalate one tier (capped at max_tier) and retry once.
            tiers = list(ModelTier)
            next_idx = tiers.index(tier) + 1
            if next_idx < len(tiers) and next_idx <= tiers.index(max_tier):
                return await self._call_model(tiers[next_idx].value, query)
            raise

    async def _call_model(self, model: str, query: str):
        client = openai.AsyncOpenAI()
        return await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
        )
</code>

**Published results from RouteLLM:** up to **85% cost reduction** without quality loss by routing 60% of simple queries to budget models.

===== Technique 3: Semantic Caching =====

Cache responses for semantically similar queries. Production systems report **20-45% cache hit rates**, eliminating LLM calls entirely for those requests.
<code python>
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15,  # lower = stricter matching
)

def cached_query(prompt: str) -> str:
    # Serve a cached response for a semantically similar prompt, if one exists.
    results = cache.check(prompt=prompt)
    if results:
        return results[0]["response"]
    response = call_llm(prompt)  # call_llm: your existing LLM wrapper
    cache.store(
        prompt=prompt,
        response=response,
        metadata={"model": "gpt-4o-mini"},
    )
    return response
</code>

===== Technique 4: Context Window Management =====

Strategies to reduce the tokens sent per request:

  * **Sliding-window context:** keep only the last N messages instead of the full history
  * **Summarize old context:** compress conversation history into summaries
  * **Selective RAG:** retrieve only the most relevant chunks, not entire documents
  * **Output token limits:** set ''max_tokens'' to prevent verbose responses (saves 20-40% on output)

===== Technique 5: Batch API Processing =====

OpenAI's Batch API offers a **50% discount** for non-latency-sensitive workloads. Process overnight analytics, bulk classification, and embedding generation at half cost.
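A batch job is a JSONL file with one request per line in the Batch API's request format. A minimal sketch (the prompts, IDs, and helper name here are illustrative; the upload and submit steps, which require an API key, are shown as comments):

<code python>
import json

def build_batch_file(prompts, path="batch_input.jsonl", model="gpt-4o-mini"):
    """Write one Batch API request per line; return the request dicts."""
    requests = [
        {
            "custom_id": f"task-{i}",  # used to match results to inputs later
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": p}],
                "max_tokens": 256,  # cap output tokens too
            },
        }
        for i, p in enumerate(prompts)
    ]
    with open(path, "w") as f:
        for req in requests:
            f.write(json.dumps(req) + "\n")
    return requests

reqs = build_batch_file(["Classify: 'great product'", "Classify: 'slow shipping'"])

# Then (requires an OpenAI client and API key):
#   batch_file = client.files.create(file=open("batch_input.jsonl", "rb"),
#                                    purpose="batch")
#   client.batches.create(input_file_id=batch_file.id,
#                         endpoint="/v1/chat/completions",
#                         completion_window="24h")
</code>

Results arrive within the completion window as an output file keyed by ''custom_id'', so batching pairs naturally with the overnight workloads mentioned above.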
===== Combined Savings: Real Case Study =====

**Customer support chatbot (100K requests/month):**

^ Strategy ^ Before ^ After ^ Savings ^
| Model routing (80% to mini) | $4,200/mo | $1,260/mo | 70% |
| + Semantic caching (45% hits) | $1,260/mo | $693/mo | 45% |
| + Prompt compression (40%) | $693/mo | $416/mo | 40% |
| **Combined** | **$4,200/mo** | **$416/mo** | **90%** |

===== Decision Framework =====

<code mermaid>
graph TD
    A[Start: High Token Costs] --> B{Query repetition > 20%?}
    B -->|Yes| C[Implement Semantic Cache]
    B -->|No| D{Mixed complexity queries?}
    C --> D
    D -->|Yes| E[Add Model Router]
    D -->|No| F{Long prompts or contexts?}
    E --> F
    F -->|Yes| G[Add Prompt Compression]
    F -->|No| H{Non-realtime workloads?}
    G --> H
    H -->|Yes| I[Use Batch API]
    H -->|No| J[Optimize Context Windows]
    I --> K[Monitor and Iterate]
    J --> K
</code>

===== See Also =====

  * [[caching_strategies_for_agents|Caching Strategies for Agents]]
  * [[how_to_speed_up_agents|How to Speed Up Agents]]
  * [[what_is_an_ai_agent|What is an AI Agent]]

===== References =====