====== Inference Economics ======

Inference economics is the study of the **costs, pricing models, and business dynamics** of running trained AI models in production. While training a model is a one-time investment, inference — generating responses to user queries — is an ongoing operational expense that scales with usage and increasingly dominates AI budgets. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

===== What Makes Inference Expensive =====

Three primary factors drive inference cost:

**GPU compute**: Generating output tokens is inherently sequential — each token depends on all previous tokens. This autoregressive decoding cannot be fully parallelized, requiring sustained GPU utilization throughout generation. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

**Memory**: Model weights must reside in GPU memory during inference. A 70-billion-parameter model in FP16 precision requires approximately 140GB of GPU HBM — often spanning multiple GPUs. Additionally, the **KV cache** (storing attention state for prior tokens) grows with context length, consuming memory proportional to the [[llm_context_window|context window]] size. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

**Energy**: High-end GPUs like the NVIDIA H100 cost $2.85-$3.50 per hour in cloud deployments, and inference workloads keep them running continuously. Energy costs compound at scale.

===== Per-Token Pricing =====

API providers charge by the token, with **output tokens costing 3-5x more** than input tokens. This reflects the computational asymmetry: input tokens are processed in parallel (prefill phase), while output tokens are generated one at a time (decode phase).
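The memory figures above can be sketched numerically. This is a minimal back-of-envelope estimate, assuming FP16 (2 bytes per value) and a hypothetical 80-layer grouped-query-attention configuration; the layer, head, and context numbers are illustrative assumptions, not any specific model's specification.

```python
# Rough GPU-memory estimate for serving a transformer.
# All model-config values below are illustrative assumptions.

def weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights; FP16 = 2 bytes per parameter."""
    return params * bytes_per_param / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_val: int = 2) -> float:
    """KV cache for one sequence: 2 tensors (K and V) per layer,
    growing linearly with context length."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val / 1e9

# A 70B-parameter model in FP16 needs ~140 GB just for weights:
print(weights_gb(70e9))  # 140.0

# Hypothetical 80-layer config, 8 KV heads of dim 128, 128k-token context:
print(round(kv_cache_gb(80, 8, 128, 128_000), 1))  # 41.9
```

The linear `context_len` term is why long-context serving is memory-bound: doubling the context doubles the per-sequence cache even though the weights are unchanged.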
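The hourly GPU rate and per-token prices connect through throughput, and the prefill/decode asymmetry shows up directly in a request's bill. A rough sketch: the $3.00/hr rate is in the H100 range cited above, but the 1,000 tokens/sec aggregate throughput is an assumed figure for a well-batched deployment, not a benchmark, and the request example uses the Claude Sonnet 4 list prices quoted in this article.

```python
# Back-of-envelope serving and API cost arithmetic.
# Throughput figure below is a hypothetical assumption.

def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Provider-side cost to generate 1M tokens on one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """API cost of one request; prices are $ per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# One GPU at $3.00/hr pushing an assumed 1,000 tok/s aggregate:
print(round(cost_per_million_tokens(3.00, 1000), 3))  # 0.833

# A 2,000-token-in / 500-token-out request at $3 in / $15 out per 1M:
print(round(request_cost(2000, 500, 3.00, 15.00), 4))  # 0.0135
```

Note how the 500 output tokens ($0.0075) cost more than the 2,000 input tokens ($0.006): the 5x output premium dominates even at a 4:1 input-to-output ratio.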
Representative pricing as of late 2025:

^ Model ^ Input (per 1M tokens) ^ Output (per 1M tokens) ^
| Gemini Flash-Lite (Google) | $0.075 | $0.30 |
| Llama 3.2 3B (Together.ai) | $0.06 | $0.06 |
| DeepSeek R1 | $0.55 | $2.19 |
| Claude Sonnet 4 (Anthropic) | $3.00 | $15.00 |
| GPT-4 class models | ~$2.50 | ~$10.00 |

Prices have fallen roughly **10x per year** for equivalent performance levels since 2022, driven by competition, optimization, and hardware improvements. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

===== How Model Size Affects Cost =====

Larger models are proportionally more expensive to run:

  * They require more GPU memory (and often more GPUs)
  * Each forward pass involves more computation
  * They generate tokens more slowly

This creates a strong incentive to use the **smallest model that meets quality requirements** for each task — a principle that drives the popularity of [[lora_adapter|LoRA adapters]] and distilled models.

===== Optimization Techniques =====

The industry employs several techniques to reduce inference costs:

  * **Quantization**: Reducing numerical precision (e.g., FP16 to INT4/FP8) cuts memory use and speeds computation. NVIDIA Blackwell GPUs enable up to 4x throughput gains through native low-precision support. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))
  * **Distillation**: Training smaller "student" models to mimic larger "teacher" models, achieving most of the quality at a fraction of the cost.
  * **Speculative decoding**: Using a fast draft model to predict multiple tokens ahead, then verifying with the full model — reducing the number of expensive full-model forward passes.
  * **Batching**: Grouping multiple requests together to improve GPU utilization. Continuous batching dynamically adds requests to in-flight batches.
  * **Prompt caching**: Reusing KV cache entries for common prompt prefixes, dramatically reducing input processing costs for repeated system prompts.
  * **Sparse attention**: Skipping unnecessary attention computations for long contexts, achieving 40-60% savings on [[million_token_context_window|million-token]] workloads. ((Source: [[https://sjramblings.io/inference-tax-nobody-budgeted-for/|SJ Ramblings - Inference Tax]]))

===== The Business Economics =====

AI inference is a challenging business. OpenAI reportedly spent $1.35 for every $1 earned in 2025, with GPU costs outpacing API revenue. ((Source: [[https://aiautomationglobal.com/blog/ai-inference-cost-crisis-openai-economics-2026|AI Automation Global - Inference Cost Crisis]]))

Key dynamics include:

  * **Cross-subsidization**: Cloud providers and investors subsidize current pricing to build market share
  * **Price wars**: Competition from Chinese providers (DeepSeek) and open-weights models drives aggressive undercutting
  * **Commoditization**: Budget-tier models now cost less than $0.10 per million tokens, compressing margins
  * **Inference share growing**: Inference now represents over 55% of total AI compute spend, up from 33% in 2023 ((Source: [[https://sjramblings.io/inference-tax-nobody-budgeted-for/|SJ Ramblings - Inference Tax]]))

===== The Inference Demand Paradox =====

While per-token costs fall by 10x annually, **demand is growing by over 300%**, causing total inference spending to increase even as unit prices drop. This "inference famine" dynamic means organizations often find their AI bills rising despite cheaper per-unit costs. ((Source: [[https://sjramblings.io/inference-tax-nobody-budgeted-for/|SJ Ramblings - Inference Tax]]))

===== Trajectory =====

Hardware improvements (3nm chips, custom AI silicon like AWS Trainium, Google TPU v6e), software optimizations, and competition will continue driving per-token costs down.
However, the shift toward [[reasoning_on_tap|reasoning models]] that consume more inference compute per query may partially offset these gains. The net trajectory is toward cheaper but dramatically more prevalent AI inference. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

===== See Also =====

  * [[reasoning_on_tap|Reasoning-on-Tap]]
  * [[million_token_context_window|Value of 1-Million-Token Context Windows]]
  * [[lora_adapter|What Is a LoRA Adapter]]
  * [[post_training_rl_vs_scaling|Post-Training RL vs Model Scaling]]

===== References =====