====== How to Reduce Token Costs ======
Reducing token costs is one of the most impactful optimizations for LLM-powered applications. Production teams report **50-85% cost reductions** by layering techniques like prompt compression, semantic caching, and intelligent model routing. This guide covers proven strategies with real numbers.(([[https://redis.io/blog/llm-token-optimization-speed-up-apps/|LLM Token Optimization]]))(([[https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/|8 Strategies That Cut API Spend by 80%]]))(([[https://www.glukhov.org/post/2025/11/cost-effective-llm-applications/|Cost-Effective LLM Applications]]))(([[https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering|How to Cut LLM Costs with Metering]]))
===== The Token Cost Problem =====
Every API call to an LLM is billed by token count. A single GPT-4o request processing a 10-page document can cost $0.05-0.15. At scale (100K+ requests/month), this compounds to thousands of dollars monthly. The key insight: most of those tokens are wasted.
===== Current API Pricing Landscape (2026) =====
Understanding the pricing tiers is essential for cost optimization:
^ Provider ^ Model ^ Input ($/M tokens) ^ Output ($/M tokens) ^ Notes ^
| OpenAI | GPT-4o | $2.50 | $10.00 | Flagship reasoning |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 16x cheaper than flagship |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Strong reasoning |
| Anthropic | Claude Haiku 3.5 | $0.25 | $1.25 | Fast, budget tier |
| Google | Gemini 2.5 Pro | $2.00 | $8.00 | Long context strength |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest major model |
//Prices verified February 2026. Always check provider docs for current rates.//
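To see how these rates translate into monthly spend, here is a minimal cost estimator using the table's input/output prices. The tokens-per-page figure (~650, i.e. roughly 500 words at ~1.3 tokens per word) is an assumption for illustration; actual counts vary by document and tokenizer.

<code python>
# Rough per-request cost estimator using the table's rates.
PRICING = {  # model: (input $/M tokens, output $/M tokens)
    "gpt-4o":           (2.50, 10.00),
    "gpt-4o-mini":      (0.15, 0.60),
    "gemini-2.0-flash": (0.10, 0.40),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request at the table's rates."""
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 10-page document (~6,500 input tokens) with a 1,000-token answer:
cost = request_cost("gpt-4o", 6_500, 1_000)
print(f"${cost:.4f} per request, ${cost * 100_000:,.0f} at 100K requests/month")
</code>

Running the same request through ''gpt-4o-mini'' instead costs roughly 16x less, which is the gap the routing technique below exploits.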
===== Technique 1: Prompt Compression =====
**LLMLingua** (Microsoft Research) compresses prompts by removing redundant tokens while preserving semantic meaning.(([[https://arxiv.org/abs/2310.05736|LLMLingua: Compressing Prompts for Accelerated Inference]]))
**Measured Results:**
* 30-40% token reduction with minimal performance loss
* LongLLMLingua achieves up to **10x compression** on long contexts
* 90%+ task performance retention after compression
* Direct translation to 30-40% cost savings on input tokens
<code python>
from llmlingua import PromptCompressor

# LLMLingua-2 checkpoints require use_llmlingua2=True
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
    device_map="cpu"
)

def compress_and_query(prompt, context, question, target_ratio=0.5):
    compressed = compressor.compress_prompt(
        context=[context],
        instruction=prompt,
        question=question,
        rate=target_ratio,              # keep ~50% of tokens by default
        condition_in_question="after"
    )
    original_tokens = compressed["origin_tokens"]
    compressed_tokens = compressed["compressed_tokens"]
    savings_pct = (1 - compressed_tokens / original_tokens) * 100
    print(f"Tokens: {original_tokens} -> {compressed_tokens} ({savings_pct:.1f}% saved)")
    return compressed["compressed_prompt"]
</code>
===== Technique 2: Model Routing =====
Route queries to the cheapest model capable of handling them. Production deployments show **70-80% of queries** can be handled by budget models.
<code>
graph TD
    A[Incoming Query] --> B[Complexity Classifier]
    B -->|Simple 70%| C["GPT-4o-mini $0.15/M input"]
    B -->|Medium 25%| D["GPT-4o $2.50/M input"]
    B -->|Complex 5%| E["Claude Opus $15.00/M input"]
    C --> F[Response]
    D --> F
    E --> F
    F --> G{Quality Check}
    G -->|Pass| H[Return Response]
    G -->|Fail| I[Escalate to Next Tier]
    I --> B
</code>
<code python>
import openai
from enum import Enum

class ModelTier(Enum):
    BUDGET = "gpt-4o-mini"     # $0.15/M input
    STANDARD = "gpt-4o"        # $2.50/M input
    PREMIUM = "claude-opus-4"  # $15.00/M input

class ModelRouter:
    COMPLEXITY_SIGNALS = {
        "simple": ["summarize", "translate", "extract", "list", "format"],
        "complex": ["analyze", "reason", "compare", "evaluate", "multi-step"]
    }

    def classify(self, query: str) -> ModelTier:
        """Keyword heuristic: route complex-sounding queries up a tier."""
        query_lower = query.lower()
        if any(sig in query_lower for sig in self.COMPLEXITY_SIGNALS["complex"]):
            return ModelTier.STANDARD
        return ModelTier.BUDGET

    async def route(self, query: str, max_tier: ModelTier = ModelTier.PREMIUM):
        tier = self.classify(query)
        try:
            return await self._call_model(tier.value, query)
        except Exception:
            # Escalate one tier on failure, but never beyond max_tier
            tiers = list(ModelTier)
            next_idx = tiers.index(tier) + 1
            if next_idx < len(tiers) and next_idx <= tiers.index(max_tier):
                return await self._call_model(tiers[next_idx].value, query)
            raise

    async def _call_model(self, model: str, query: str):
        client = openai.AsyncOpenAI()
        return await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}]
        )
</code>
**Published results from RouteLLM:** up to **85% cost reduction** while retaining roughly 95% of the strong model's quality, achieved by sending the majority of queries to budget models.(([[https://arxiv.org/abs/2406.18665|RouteLLM: Learning to Route LLMs with Preference Data]]))
===== Technique 3: Semantic Caching =====
Cache responses for semantically similar queries. Production systems report **20-45% cache hit rates**, eliminating LLM calls entirely for those requests.
<code python>
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15  # Lower = stricter matching
)

def cached_query(prompt: str) -> str:
    # Return a cached response for any semantically similar prior prompt
    results = cache.check(prompt=prompt)
    if results:
        return results[0]["response"]
    response = call_llm(prompt)  # call_llm: your application's LLM call
    cache.store(prompt=prompt, response=response, metadata={"model": "gpt-4o-mini"})
    return response
</code>
===== Technique 4: Context Window Management =====
Strategies to reduce tokens sent per request:
* **Sliding window context:** Keep only the last N messages instead of full history
* **Summarize old context:** Compress conversation history into summaries
* **Selective RAG:** Retrieve only the most relevant chunks, not entire documents
* **Output token limits:** Set ''max_tokens'' to prevent verbose responses (saves 20-40% on output)
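The sliding-window strategy above can be sketched in a few lines. This is an illustration rather than any specific library's API: keep the system prompt and only the most recent turns, dropping (or summarizing) the rest.

<code python>
# Minimal sliding-window chat history: retain the system prompt
# plus the last N messages. Older turns could instead be replaced
# with a one-message summary before being dropped.
def trim_history(messages, keep_last=6):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

history = [{"role": "system", "content": "You are a support agent."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, keep_last=6)
print(len(trimmed))  # 7: system prompt + last 6 messages
</code>

With 40 history messages trimmed to 6, the per-request input shrinks by an order of magnitude for long-running conversations.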
===== Technique 5: Batch API Processing =====
OpenAI's Batch API offers **50% discount** for non-latency-sensitive workloads. Process overnight analytics, bulk classification, and embedding generation at half cost.
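A batch job starts from a JSONL file with one request per line. The sketch below builds that file; the IDs, file name, and prompts are illustrative, and the upload/submit calls at the end are shown as comments for orientation only.

<code python>
import json

# Prepare an OpenAI Batch API input file (the 50%-discount tier).
def build_batch_file(prompts, path="batch_input.jsonl", model="gpt-4o-mini"):
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"req-{i}",  # used to match results to requests
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(line) + "\n")
    return path

# Then upload and submit (requires an API key and an OpenAI client):
# batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
# client.batches.create(input_file_id=batch_file.id,
#                       endpoint="/v1/chat/completions",
#                       completion_window="24h")
</code>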
===== Combined Savings: Real Case Study =====
**Customer support chatbot (100K requests/month):**
^ Strategy ^ Before ^ After ^ Savings ^
| Model routing (80% to mini) | $4,200/mo | $1,260/mo | 70% |
| + Semantic caching (45% hits) | $1,260/mo | $693/mo | 45% |
| + Prompt compression (40%) | $693/mo | $416/mo | 40% |
| **Combined** | **$4,200/mo** | **$416/mo** | **90%** |
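Note that the rows compound multiplicatively, not additively: each technique applies only to the spend left after the previous one. A quick check of the table's arithmetic:

<code python>
# Stacked savings compound: each rate applies to the remaining cost.
def stacked_cost(base_monthly, savings_rates):
    cost = base_monthly
    for rate in savings_rates:
        cost *= (1 - rate)
    return cost

final = stacked_cost(4200, [0.70, 0.45, 0.40])
print(f"${final:.0f}/mo")  # $416/mo, a ~90% overall reduction
</code>

This is why 70% + 45% + 40% yields 90% overall rather than an impossible 155%.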
===== Decision Framework =====
<code>
graph TD
    A[Start: High Token Costs] --> B{Query repetition > 20%?}
    B -->|Yes| C[Implement Semantic Cache]
    B -->|No| D{Mixed complexity queries?}
    C --> D
    D -->|Yes| E[Add Model Router]
    D -->|No| F{Long prompts or contexts?}
    E --> F
    F -->|Yes| G[Add Prompt Compression]
    F -->|No| H{Non-realtime workloads?}
    G --> H
    H -->|Yes| I[Use Batch API]
    H -->|No| J[Optimize Context Windows]
    I --> K[Monitor and Iterate]
    J --> K
</code>
===== See Also =====
* [[caching_strategies_for_agents|Caching Strategies for Agents]]
* [[how_to_speed_up_agents|How to Speed Up Agents]]
* [[what_is_an_ai_agent|What is an AI Agent]]
===== References =====