====== Inference Economics ======

Inference economics is the study of the **costs, pricing models, and business dynamics** of running trained AI models in production. While training a model is a one-time investment, inference — generating responses to user queries — is an ongoing operational expense that scales with usage and increasingly dominates AI budgets. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

===== What Makes Inference Expensive =====

Three primary factors drive inference cost:

**GPU compute**: Generating output tokens is inherently sequential — each token depends on all previous tokens. This autoregressive decoding cannot be fully parallelized, requiring sustained GPU utilization throughout generation. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

**Memory**: Model weights must reside in GPU memory during inference. A 70-billion-parameter model in FP16 precision requires approximately 140GB of GPU HBM — often spanning multiple GPUs. Additionally, the **KV cache** (storing attention state for prior tokens) grows with context length, consuming memory proportional to the [[llm_context_window|context window]] size. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

**Energy**: High-end GPUs like the NVIDIA H100 cost $2.85-$3.50 per hour in cloud deployments, and inference workloads keep them running continuously. Energy costs compound at scale.

===== Per-Token Pricing =====

API providers charge by the token, with **output tokens costing 3-5x more** than input tokens. This reflects the computational asymmetry: input tokens are processed in parallel (prefill phase), while output tokens are generated one at a time (decode phase).
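The memory figures above can be sketched numerically. This is a minimal back-of-envelope estimate, assuming FP16 (2 bytes per value) and a hypothetical 80-layer grouped-query-attention configuration; the layer, head, and context numbers are illustrative assumptions, not any specific model's specification.

```python
# Rough GPU-memory estimate for serving a transformer.
# All model-config values below are illustrative assumptions.

def weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights; FP16 = 2 bytes per parameter."""
    return params * bytes_per_param / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_val: int = 2) -> float:
    """KV cache for one sequence: 2 tensors (K and V) per layer,
    growing linearly with context length."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val / 1e9

# A 70B-parameter model in FP16 needs ~140 GB just for weights:
print(weights_gb(70e9))  # 140.0

# Hypothetical 80-layer config, 8 KV heads of dim 128, 128k-token context:
print(round(kv_cache_gb(80, 8, 128, 128_000), 1))  # 41.9
```

The linear `context_len` term is why long-context serving is memory-bound: doubling the context doubles the per-sequence cache even though the weights are unchanged.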
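The hourly GPU rate and per-token prices connect through throughput, and the prefill/decode asymmetry shows up directly in a request's bill. A rough sketch: the $3.00/hr rate is in the H100 range cited above, but the 1,000 tokens/sec aggregate throughput is an assumed figure for a well-batched deployment, not a benchmark, and the request example uses the Claude Sonnet 4 list prices quoted in this article.

```python
# Back-of-envelope serving and API cost arithmetic.
# Throughput figure below is a hypothetical assumption.

def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Provider-side cost to generate 1M tokens on one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """API cost of one request; prices are $ per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# One GPU at $3.00/hr pushing an assumed 1,000 tok/s aggregate:
print(round(cost_per_million_tokens(3.00, 1000), 3))  # 0.833

# A 2,000-token-in / 500-token-out request at $3 in / $15 out per 1M:
print(round(request_cost(2000, 500, 3.00, 15.00), 4))  # 0.0135
```

Note how the 500 output tokens ($0.0075) cost more than the 2,000 input tokens ($0.006): the 5x output premium dominates even at a 4:1 input-to-output ratio.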
Representative pricing as of late 2025:

^ Model ^ Input (per 1M tokens) ^ Output (per 1M tokens) ^
| Gemini Flash-Lite (Google) | $0.075 | $0.30 |
| Llama 3.2 3B (Together.ai) | $0.06 | $0.06 |
| DeepSeek R1 | $0.55 | $2.19 |
| Claude Sonnet 4 (Anthropic) | $3.00 | $15.00 |
| GPT-4 class models | ~$2.50 | ~$10.00 |

Prices have fallen roughly **10x per year** for equivalent performance levels since 2022, driven by competition, optimization, and hardware improvements. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

===== How Model Size Affects Cost =====

Larger models are proportionally more expensive to run:

  * They require more GPU memory (and often more GPUs)
  * Each forward pass involves more computation
  * They generate tokens more slowly

This creates a strong incentive to use the **smallest model that meets quality requirements** for each task — a principle that drives the popularity of [[lora_adapter|LoRA adapters]] and distilled models.

===== Optimization Techniques =====

The industry employs several techniques to reduce inference costs:

  * **Quantization**: Reducing numerical precision (e.g., FP16 to INT4/FP8) cuts memory use and speeds computation. NVIDIA Blackwell GPUs enable up to 4x throughput gains through native low-precision support. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))
  * **Distillation**: Training smaller "student" models to mimic larger "teacher" models, achieving most of the quality at a fraction of the cost.
  * **Speculative decoding**: Using a fast draft model to predict multiple tokens ahead, then verifying with the full model — reducing the number of expensive full-model forward passes.
  * **Batching**: Grouping multiple requests together to improve GPU utilization. Continuous batching dynamically adds requests to in-flight batches.
  * **Prompt caching**: Reusing KV cache entries for common prompt prefixes, dramatically reducing input processing costs for repeated system prompts.
  * **Sparse attention**: Skipping unnecessary attention computations for long contexts, achieving 40-60% savings on [[million_token_context_window|million-token]] workloads. ((Source: [[https://sjramblings.io/inference-tax-nobody-budgeted-for/|SJ Ramblings - Inference Tax]]))

===== The Business Economics =====

AI inference is a challenging business. OpenAI reportedly spent $1.35 for every $1 earned in 2025, with GPU costs outpacing API revenue. ((Source: [[https://aiautomationglobal.com/blog/ai-inference-cost-crisis-openai-economics-2026|AI Automation Global - Inference Cost Crisis]]))

Key dynamics include:

  * **Cross-subsidization**: Cloud providers and investors subsidize current pricing to build market share
  * **Price wars**: Competition from Chinese providers (DeepSeek) and open-weights models drives aggressive undercutting
  * **Commoditization**: Budget-tier models now cost less than $0.10 per million tokens, compressing margins
  * **Inference share growing**: Inference now represents over 55% of total AI compute spend, up from 33% in 2023 ((Source: [[https://sjramblings.io/inference-tax-nobody-budgeted-for/|SJ Ramblings - Inference Tax]]))

===== The Inference Demand Paradox =====

While per-token costs fall by 10x annually, **demand is growing by over 300%**, causing total inference spending to increase even as unit prices drop. This "inference famine" dynamic means organizations often find their AI bills rising despite cheaper per-unit costs. ((Source: [[https://sjramblings.io/inference-tax-nobody-budgeted-for/|SJ Ramblings - Inference Tax]]))

===== Trajectory =====

Hardware improvements (3nm chips, custom AI silicon like AWS Trainium, Google TPU v6e), software optimizations, and competition will continue driving per-token costs down.
However, the shift toward [[reasoning_on_tap|reasoning models]] that consume more inference compute per query may partially offset these gains. The net trajectory is toward cheaper but dramatically more prevalent AI inference. ((Source: [[https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide|Introl - Inference Unit Economics]]))

===== See Also =====

  * [[reasoning_on_tap|Reasoning-on-Tap]]
  * [[million_token_context_window|Value of 1-Million-Token Context Windows]]
  * [[lora_adapter|What Is a LoRA Adapter]]
  * [[post_training_rl_vs_scaling|Post-Training RL vs Model Scaling]]

===== References =====