Reducing token costs is one of the most impactful optimizations for LLM-powered applications. Production teams report 50-85% cost reductions by layering techniques like prompt compression, semantic caching, and intelligent model routing. This guide covers proven strategies with real numbers.
Every API call to an LLM is billed by token count. A single GPT-4o request processing a 10-page document can cost $0.05-0.15. At scale (100K+ requests/month), this compounds to thousands of dollars monthly. The key insight: most of those tokens are wasted.
Understanding the pricing tiers is essential for cost optimization:
| Provider | Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | Flagship reasoning |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | ~16x cheaper than flagship |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Strong reasoning |
| Anthropic | Claude Haiku 3.5 | $0.25 | $1.25 | Fast, budget tier |
| Google | Gemini 2.5 Pro | $2.00 | $8.00 | Long context strength |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest major model |
Prices verified February 2026. Always check provider docs for current rates.
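To make the per-request math concrete, here is a small cost-model sketch using the rates from the table above. The workload numbers (4,000 input and 500 output tokens per request, 100K requests/month) are illustrative assumptions, not measurements:

```python
# Cost model from the pricing table above (rates in $ per million tokens).
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Illustrative workload: 4,000 input + 500 output tokens, 100K requests/month.
for model in PRICES:
    per_req = request_cost(model, 4_000, 500)
    print(f"{model}: ${per_req:.4f}/req -> ${per_req * 100_000:,.0f}/mo")
```

At these assumed token counts, GPT-4o runs about $1,500/month while GPT-4o-mini runs about $90/month for the same traffic, which is why routing (below) pays off so quickly.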
LLMLingua (Microsoft Research) compresses prompts by removing redundant tokens while preserving semantic meaning.
Example integration:

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    device_map="cpu",
)

def compress_and_query(prompt, context, question, target_ratio=0.5):
    compressed = compressor.compress_prompt(
        context=[context],
        instruction=prompt,
        question=question,
        rate=target_ratio,
        condition_in_question="after",
    )
    original_tokens = compressed["origin_tokens"]
    compressed_tokens = compressed["compressed_tokens"]
    savings_pct = (1 - compressed_tokens / original_tokens) * 100
    print(f"Tokens: {original_tokens} -> {compressed_tokens} ({savings_pct:.1f}% saved)")
    return compressed["compressed_prompt"]
```
Route queries to the cheapest model capable of handling them. Production deployments show 70-80% of queries can be handled by budget models.
```python
import openai
from enum import Enum

class ModelTier(Enum):
    BUDGET = "gpt-4o-mini"     # $0.15/M input
    STANDARD = "gpt-4o"        # $2.50/M input
    PREMIUM = "claude-opus-4"  # $15.00/M input

class ModelRouter:
    COMPLEXITY_SIGNALS = {
        "simple": ["summarize", "translate", "extract", "list", "format"],
        "complex": ["analyze", "reason", "compare", "evaluate", "multi-step"],
    }

    def classify(self, query: str) -> ModelTier:
        query_lower = query.lower()
        if any(sig in query_lower for sig in self.COMPLEXITY_SIGNALS["complex"]):
            return ModelTier.STANDARD
        return ModelTier.BUDGET

    async def route(self, query: str, max_tier: ModelTier = ModelTier.PREMIUM):
        tier = self.classify(query)
        try:
            return await self._call_model(tier.value, query)
        except Exception:
            # On failure, escalate one tier, but never past max_tier
            tiers = list(ModelTier)
            current_idx = tiers.index(tier)
            if current_idx + 1 <= tiers.index(max_tier):
                return await self._call_model(tiers[current_idx + 1].value, query)
            raise

    async def _call_model(self, model: str, query: str):
        client = openai.AsyncOpenAI()
        return await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
        )
```
Published results from RouteLLM: up to 85% cost reduction without quality loss by routing 60% of simple queries to budget models.
Cache responses for semantically similar queries. Production systems report 20-45% cache hit rates, eliminating LLM calls entirely for those requests.
```python
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15,  # Lower = stricter matching
)

def cached_query(prompt: str) -> str:
    results = cache.check(prompt=prompt)
    if results:
        return results[0]["response"]
    response = call_llm(prompt)  # your underlying LLM call
    cache.store(
        prompt=prompt,
        response=response,
        metadata={"model": "gpt-4o-mini"},
    )
    return response
```
Strategies to reduce tokens sent per request:
- Set `max_tokens` to prevent verbose responses (saves 20-40% on output)

OpenAI's Batch API offers a 50% discount for non-latency-sensitive workloads. Process overnight analytics, bulk classification, and embedding generation at half cost.
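A batch job is submitted as a JSONL file of request objects. The sketch below builds that file; the model, prompts, and `custom_id` scheme are illustrative assumptions, and the commented-out submission calls follow the OpenAI Python SDK's Batch API:

```python
import json

# Build a JSONL batch file for OpenAI's Batch API (50% discount, 24h window).
def make_batch_line(custom_id: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    request = {
        "custom_id": custom_id,  # your key for matching results to inputs later
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,   # cap output spend even in batch mode
        },
    }
    return json.dumps(request)

lines = [make_batch_line(f"doc-{i}", f"Classify document {i}") for i in range(3)]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines))

# Submission (requires an API key; shown for completeness):
# client = openai.OpenAI()
# batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# batch = client.batches.create(
#     input_file_id=batch_file.id,
#     endpoint="/v1/chat/completions",
#     completion_window="24h",
# )
```

Results arrive as a JSONL output file keyed by `custom_id`, so batches pair naturally with bulk classification and embedding jobs where ordering does not matter.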
Customer support chatbot (100K requests/month):
| Strategy | Before | After | Savings |
|---|---|---|---|
| Model routing (80% to mini) | $4,200/mo | $1,260/mo | 70% |
| + Semantic caching (45% hits) | $1,260/mo | $693/mo | 45% |
| + Prompt compression (40%) | $693/mo | $416/mo | 40% |
| Combined | $4,200/mo | $416/mo | 90% |
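The layered savings compound multiplicatively, since each technique applies to the spend left over by the previous one. A quick check of the table's arithmetic:

```python
# Each layered technique reduces whatever spend the previous layer left behind.
baseline = 4200.0
remaining = baseline * (1 - 0.70) * (1 - 0.45) * (1 - 0.40)
print(f"${remaining:.0f}/mo")              # ~ $416/mo
combined_savings = 1 - remaining / baseline
print(f"{combined_savings:.0%} combined")  # ~ 90%
```

This is why stacking three 40-70% techniques yields ~90% rather than 155%: percentages of a shrinking base never sum past 100%.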