====== How to Reduce Token Costs ======

Reducing token costs is one of the most impactful optimizations for LLM-powered applications. Production teams report **50-85% cost reductions** by layering techniques such as prompt compression, semantic caching, and intelligent model routing. This guide covers proven strategies with real numbers.(([[https://redis.io/blog/llm-token-optimization-speed-up-apps/|LLM Token Optimization]]))(([[https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/|8 Strategies That Cut API Spend by 80%]]))(([[https://www.glukhov.org/post/2025/11/cost-effective-llm-applications/|Cost-Effective LLM Applications]]))(([[https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering|How to Cut LLM Costs with Metering]]))

===== The Token Cost Problem =====

Every API call to an LLM is billed by token count. A single GPT-4o request processing a 10-page document can cost $0.05-0.15. At scale (100K+ requests/month), this compounds to thousands of dollars per month. The key insight: most of those tokens are wasted.

===== Current API Pricing Landscape (2026) =====

Understanding the pricing tiers is essential for cost optimization:

^ Provider ^ Model ^ Input ($/M tokens) ^ Output ($/M tokens) ^ Notes ^
| OpenAI | GPT-4o | $2.50 | $10.00 | Flagship reasoning |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | ~16x cheaper than flagship |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Strong reasoning |
| Anthropic | Claude Haiku 3.5 | $0.25 | $1.25 | Fast, budget tier |
| Google | Gemini 2.5 Pro | $2.00 | $8.00 | Long-context strength |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest major model |

//Prices verified February 2026.
Always check provider docs for current rates.//

===== Technique 1: Prompt Compression =====

**LLMLingua** (Microsoft Research) compresses prompts by removing redundant tokens while preserving semantic meaning.(([[https://arxiv.org/abs/2310.05736|LLMLingua: Compressing Prompts for Accelerated Inference]]))

**Measured results:**

  * 30-40% token reduction with minimal performance loss
  * LongLLMLingua achieves up to **10x compression** on long contexts
  * 90%+ task performance retention after compression
  * Direct translation to 30-40% cost savings on input tokens

<code python>
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # required when loading an LLMLingua-2 model
    device_map="cpu",
)

def compress_and_query(prompt, context, question, target_ratio=0.5):
    # Compress the context, keeping roughly `target_ratio` of its tokens.
    compressed = compressor.compress_prompt(
        context=[context],
        instruction=prompt,
        question=question,
        rate=target_ratio,
    )
    original_tokens = compressed["origin_tokens"]
    compressed_tokens = compressed["compressed_tokens"]
    savings_pct = (1 - compressed_tokens / original_tokens) * 100
    print(f"Tokens: {original_tokens} -> {compressed_tokens} ({savings_pct:.1f}% saved)")
    return compressed["compressed_prompt"]
</code>

===== Technique 2: Model Routing =====

Route each query to the cheapest model capable of handling it. Production deployments show that **70-80% of queries** can be handled by budget models.
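Before wiring up a router, it is worth quantifying the gap between tiers. A back-of-the-envelope sketch (the 70/25/5 traffic split and per-model prices here are illustrative assumptions, taken from the routing tiers discussed in this section) compares blended input cost against sending everything to the flagship:

<code python>
# Blended input cost per million tokens under tiered routing,
# versus sending all traffic to the flagship model.
# Prices ($/M input tokens) and the 70/25/5 split are illustrative.
TIER_PRICES = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50, "claude-opus-4": 15.00}
TIER_SHARE  = {"gpt-4o-mini": 0.70, "gpt-4o": 0.25, "claude-opus-4": 0.05}

def blended_input_cost() -> float:
    """$/M input tokens when traffic is split across tiers."""
    return sum(TIER_PRICES[m] * TIER_SHARE[m] for m in TIER_PRICES)

blended = blended_input_cost()
savings = 1 - blended / TIER_PRICES["gpt-4o"]
print(f"Blended: ${blended:.2f}/M vs ${TIER_PRICES['gpt-4o']:.2f}/M "
      f"({savings:.0%} saved on input tokens)")
# -> Blended: $1.48/M vs $2.50/M (41% saved on input tokens)
</code>

Even this conservative split (input tokens only, with a pricey premium tier) yields ~40% savings; the larger published numbers come from routing a bigger share of traffic to the budget tier and saving on output tokens as well.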
<code mermaid>
graph TD
    A[Incoming Query] --> B[Complexity Classifier]
    B -->|Simple 70%| C["GPT-4o-mini $0.15/M input"]
    B -->|Medium 25%| D["GPT-4o $2.50/M input"]
    B -->|Complex 5%| E["Claude Opus $15.00/M input"]
    C --> F[Response]
    D --> F
    E --> F
    F --> G{Quality Check}
    G -->|Pass| H[Return Response]
    G -->|Fail| I[Escalate to Next Tier]
    I --> B
</code>

<code python>
import openai
from enum import Enum

class ModelTier(Enum):
    BUDGET = "gpt-4o-mini"     # $0.15/M input
    STANDARD = "gpt-4o"        # $2.50/M input
    PREMIUM = "claude-opus-4"  # $15.00/M input (needs the Anthropic client in practice)

class ModelRouter:
    COMPLEXITY_SIGNALS = {
        "simple": ["summarize", "translate", "extract", "list", "format"],
        "complex": ["analyze", "reason", "compare", "evaluate", "multi-step"],
    }

    def classify(self, query: str) -> ModelTier:
        # Keyword heuristic: complex signals start on the standard tier;
        # everything else starts on budget. Premium is reached only by escalation.
        query_lower = query.lower()
        if any(sig in query_lower for sig in self.COMPLEXITY_SIGNALS["complex"]):
            return ModelTier.STANDARD
        return ModelTier.BUDGET

    async def route(self, query: str, max_tier: ModelTier = ModelTier.PREMIUM):
        tier = self.classify(query)
        try:
            return await self._call_model(tier.value, query)
        except Exception:
            # On failure, escalate one tier (capped at max_tier) and retry once.
            tiers = list(ModelTier)
            next_idx = tiers.index(tier) + 1
            if next_idx < len(tiers) and next_idx <= tiers.index(max_tier):
                return await self._call_model(tiers[next_idx].value, query)
            raise

    async def _call_model(self, model: str, query: str):
        client = openai.AsyncOpenAI()
        return await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
        )
</code>

**Published results from RouteLLM:** up to **85% cost reduction** without quality loss by routing 60% of simple queries to budget models.

===== Technique 3: Semantic Caching =====

Cache responses for semantically similar queries. Production systems report **20-45% cache hit rates**, eliminating LLM calls entirely for those requests.
<code python>
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15,  # lower = stricter matching
)

def cached_query(prompt: str) -> str:
    # Serve a cached response for a semantically similar prompt, if one exists.
    results = cache.check(prompt=prompt)
    if results:
        return results[0]["response"]
    response = call_llm(prompt)  # call_llm: your existing LLM wrapper
    cache.store(
        prompt=prompt,
        response=response,
        metadata={"model": "gpt-4o-mini"},
    )
    return response
</code>

===== Technique 4: Context Window Management =====

Strategies to reduce the tokens sent per request:

  * **Sliding-window context:** keep only the last N messages instead of the full history
  * **Summarize old context:** compress conversation history into summaries
  * **Selective RAG:** retrieve only the most relevant chunks, not entire documents
  * **Output token limits:** set ''max_tokens'' to prevent verbose responses (saves 20-40% on output)

===== Technique 5: Batch API Processing =====

OpenAI's Batch API offers a **50% discount** for non-latency-sensitive workloads. Process overnight analytics, bulk classification, and embedding generation at half cost.
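A batch job is a JSONL file with one request per line in the Batch API's request format. A minimal sketch (the prompts, IDs, and helper name here are illustrative; the upload and submit steps, which require an API key, are shown as comments):

<code python>
import json

def build_batch_file(prompts, path="batch_input.jsonl", model="gpt-4o-mini"):
    """Write one Batch API request per line; return the request dicts."""
    requests = [
        {
            "custom_id": f"task-{i}",  # used to match results to inputs later
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": p}],
                "max_tokens": 256,  # cap output tokens too
            },
        }
        for i, p in enumerate(prompts)
    ]
    with open(path, "w") as f:
        for req in requests:
            f.write(json.dumps(req) + "\n")
    return requests

reqs = build_batch_file(["Classify: 'great product'", "Classify: 'slow shipping'"])

# Then (requires an OpenAI client and API key):
#   batch_file = client.files.create(file=open("batch_input.jsonl", "rb"),
#                                    purpose="batch")
#   client.batches.create(input_file_id=batch_file.id,
#                         endpoint="/v1/chat/completions",
#                         completion_window="24h")
</code>

Results arrive within the completion window as an output file keyed by ''custom_id'', so batching pairs naturally with the overnight workloads mentioned above.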
===== Combined Savings: Real Case Study =====

**Customer support chatbot (100K requests/month):**

^ Strategy ^ Before ^ After ^ Savings ^
| Model routing (80% to mini) | $4,200/mo | $1,260/mo | 70% |
| + Semantic caching (45% hits) | $1,260/mo | $693/mo | 45% |
| + Prompt compression (40%) | $693/mo | $416/mo | 40% |
| **Combined** | **$4,200/mo** | **$416/mo** | **90%** |

===== Decision Framework =====

<code mermaid>
graph TD
    A[Start: High Token Costs] --> B{Query repetition > 20%?}
    B -->|Yes| C[Implement Semantic Cache]
    B -->|No| D{Mixed complexity queries?}
    C --> D
    D -->|Yes| E[Add Model Router]
    D -->|No| F{Long prompts or contexts?}
    E --> F
    F -->|Yes| G[Add Prompt Compression]
    F -->|No| H{Non-realtime workloads?}
    G --> H
    H -->|Yes| I[Use Batch API]
    H -->|No| J[Optimize Context Windows]
    I --> K[Monitor and Iterate]
    J --> K
</code>

===== See Also =====

  * [[caching_strategies_for_agents|Caching Strategies for Agents]]
  * [[how_to_speed_up_agents|How to Speed Up Agents]]
  * [[what_is_an_ai_agent|What is an AI Agent]]

===== References =====