Reducing token costs is one of the most impactful optimizations for LLM-powered applications. Production teams report 50-85% cost reductions by layering techniques like prompt compression, semantic caching, and intelligent model routing. This guide covers proven strategies with real numbers.
Every API call to an LLM is billed by token count. A single GPT-4o request processing a 10-page document can cost $0.05-0.15. At scale (100K+ requests/month), this compounds to thousands of dollars monthly. The key insight: most of those tokens are wasted.
Understanding the pricing tiers is essential for cost optimization:
| Provider | Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | Flagship reasoning |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | ~16x cheaper than flagship |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Strong reasoning |
| Anthropic | Claude Haiku 3.5 | $0.25 | $1.25 | Fast, budget tier |
| Google | Gemini 2.5 Pro | $2.00 | $8.00 | Long context strength |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest major model |
Prices verified February 2026. Always check provider docs for current rates.
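To make the per-request math concrete, here is a small cost-model sketch using the rates from the table above. The workload numbers (4,000 input and 500 output tokens per request, 100K requests/month) are illustrative assumptions, not measurements:

```python
# Cost model from the pricing table above (rates in $ per million tokens).
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Illustrative workload: 4,000 input + 500 output tokens, 100K requests/month.
for model in PRICES:
    per_req = request_cost(model, 4_000, 500)
    print(f"{model}: ${per_req:.4f}/req -> ${per_req * 100_000:,.0f}/mo")
```

At these assumed token counts, GPT-4o runs about $1,500/month while GPT-4o-mini runs about $90/month for the same traffic, which is why routing (below) pays off so quickly.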
LLMLingua (Microsoft Research) compresses prompts by removing redundant tokens while preserving semantic meaning.
Example integration:

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    device_map="cpu",
)

def compress_and_query(prompt, context, question, target_ratio=0.5):
    compressed = compressor.compress_prompt(
        context=[context],
        instruction=prompt,
        question=question,
        rate=target_ratio,
        condition_in_question="after",
    )
    original_tokens = compressed["origin_tokens"]
    compressed_tokens = compressed["compressed_tokens"]
    savings_pct = (1 - compressed_tokens / original_tokens) * 100
    print(f"Tokens: {original_tokens} -> {compressed_tokens} ({savings_pct:.1f}% saved)")
    return compressed["compressed_prompt"]
```
Route queries to the cheapest model capable of handling them. Production deployments show 70-80% of queries can be handled by budget models.
```python
import openai
from enum import Enum

class ModelTier(Enum):
    BUDGET = "gpt-4o-mini"     # $0.15/M input
    STANDARD = "gpt-4o"        # $2.50/M input
    PREMIUM = "claude-opus-4"  # $15.00/M input

class ModelRouter:
    COMPLEXITY_SIGNALS = {
        "simple": ["summarize", "translate", "extract", "list", "format"],
        "complex": ["analyze", "reason", "compare", "evaluate", "multi-step"],
    }

    def classify(self, query: str) -> ModelTier:
        query_lower = query.lower()
        if any(sig in query_lower for sig in self.COMPLEXITY_SIGNALS["complex"]):
            return ModelTier.STANDARD
        return ModelTier.BUDGET

    async def route(self, query: str, max_tier: ModelTier = ModelTier.PREMIUM):
        tier = self.classify(query)
        try:
            return await self._call_model(tier.value, query)
        except Exception:
            # On failure, escalate one tier, but never past max_tier
            tiers = list(ModelTier)
            current_idx = tiers.index(tier)
            if current_idx + 1 <= tiers.index(max_tier):
                return await self._call_model(tiers[current_idx + 1].value, query)
            raise

    async def _call_model(self, model: str, query: str):
        client = openai.AsyncOpenAI()
        return await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
        )
```
Published results from RouteLLM: up to 85% cost reduction without quality loss by routing 60% of simple queries to budget models.
Cache responses for semantically similar queries. Production systems report 20-45% cache hit rates, eliminating LLM calls entirely for those requests.
```python
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15,  # Lower = stricter matching
)

def cached_query(prompt: str) -> str:
    results = cache.check(prompt=prompt)
    if results:
        return results[0]["response"]
    response = call_llm(prompt)  # your underlying LLM call
    cache.store(
        prompt=prompt,
        response=response,
        metadata={"model": "gpt-4o-mini"},
    )
    return response
```
Strategies to reduce tokens sent per request:
- Set `max_tokens` to prevent verbose responses (saves 20-40% on output)

OpenAI's Batch API offers a 50% discount for non-latency-sensitive workloads. Process overnight analytics, bulk classification, and embedding generation at half cost.
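A batch job is submitted as a JSONL file of request objects. The sketch below builds that file; the model, prompts, and `custom_id` scheme are illustrative assumptions, and the commented-out submission calls follow the OpenAI Python SDK's Batch API:

```python
import json

# Build a JSONL batch file for OpenAI's Batch API (50% discount, 24h window).
def make_batch_line(custom_id: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    request = {
        "custom_id": custom_id,  # your key for matching results to inputs later
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,   # cap output spend even in batch mode
        },
    }
    return json.dumps(request)

lines = [make_batch_line(f"doc-{i}", f"Classify document {i}") for i in range(3)]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines))

# Submission (requires an API key; shown for completeness):
# client = openai.OpenAI()
# batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# batch = client.batches.create(
#     input_file_id=batch_file.id,
#     endpoint="/v1/chat/completions",
#     completion_window="24h",
# )
```

Results arrive as a JSONL output file keyed by `custom_id`, so batches pair naturally with bulk classification and embedding jobs where ordering does not matter.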
Customer support chatbot (100K requests/month):
| Strategy | Before | After | Savings |
|---|---|---|---|
| Model routing (80% to mini) | $4,200/mo | $1,260/mo | 70% |
| + Semantic caching (45% hits) | $1,260/mo | $693/mo | 45% |
| + Prompt compression (40%) | $693/mo | $416/mo | 40% |
| Combined | $4,200/mo | $416/mo | 90% |
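The layered savings compound multiplicatively, since each technique applies to the spend left over by the previous one. A quick check of the table's arithmetic:

```python
# Each layered technique reduces whatever spend the previous layer left behind.
baseline = 4200.0
remaining = baseline * (1 - 0.70) * (1 - 0.45) * (1 - 0.40)
print(f"${remaining:.0f}/mo")              # ~ $416/mo
combined_savings = 1 - remaining / baseline
print(f"{combined_savings:.0%} combined")  # ~ 90%
```

This is why stacking three 40-70% techniques yields ~90% rather than 155%: percentages of a shrinking base never sum past 100%.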