Per-request and token-based pricing represent two distinct approaches to billing for large language model (LLM) and agentic AI services. These models reflect different underlying assumptions about resource consumption and computational cost, with meaningful implications for how developers architect AI applications.
Per-request pricing charges a fixed amount for each API call or interaction, regardless of the complexity, length, or computational intensity of that request. This model treats each invocation as a discrete billing unit, simplifying cost prediction but potentially misaligning charges with actual resource consumption 1).
Token-based pricing charges according to the number of tokens processed, typically measured separately for input tokens (prompt) and output tokens (completion). This approach reflects the actual computational work performed by transformer-based models, where inference cost correlates directly with token quantity and model size 2).
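The arithmetic behind the two models can be sketched in a few lines. All prices below are invented placeholders, not any provider's actual rates:

```python
# Illustrative comparison of the two billing models. The rates here are
# hypothetical placeholders chosen for readability, not real provider prices.

def per_request_cost(num_requests: int, price_per_request: float) -> float:
    """Fixed fee per API call, regardless of request size or complexity."""
    return num_requests * price_per_request

def token_cost(input_tokens: int, output_tokens: int,
               input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Input (prompt) and output (completion) tokens are metered at
    separate rates, reflecting their different inference costs."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# One call with a 2,000-token prompt and a 500-token completion:
flat = per_request_cost(1, 0.04)             # flat fee, regardless of size
metered = token_cost(2000, 500, 0.01, 0.03)  # 0.020 input + 0.015 output
print(f"per-request: ${flat:.3f}, token-based: ${metered:.3f}")
```

For a short call the flat fee overcharges; double the prompt length and the comparison flips, which is exactly the misalignment the per-request model suffers from.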
Per-request pricing emerged as a simpler billing mechanism in early commercial LLM offerings, providing straightforward cost estimation for applications with predictable interaction patterns. However, this approach proved problematic for agentic workflows—autonomous systems that perform multiple reasoning steps, tool calls, and iterative refinements within a single logical request.
GitHub Copilot historically employed per-request pricing, an unusual choice among agent-based services. Its subsequent move toward token-based limits mirrors the industry's broader recognition that agent-based systems exhibit fundamentally different consumption patterns than single-turn interactions 3).
Agentic systems—AI applications that autonomously plan, execute tool calls, and iterate toward task completion—generate substantially higher token consumption than traditional request-response interactions. A single user-level request may trigger multiple internal LLM calls for reasoning, tool invocation decisions, result interpretation, and error recovery. Per-request billing fails to capture this complexity 4).
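The expansion of one user request into many internal calls can be made concrete with a token ledger. The step names and counts below are illustrative, not measurements from any real agent:

```python
# Hypothetical token ledger for a single agentic request. Step names and
# token counts are invented for illustration; real agents vary widely.

from dataclasses import dataclass

@dataclass
class Step:
    name: str
    input_tokens: int   # context sent to the model for this internal call
    output_tokens: int  # tokens the model generated in response

# One user-visible request may expand into several internal LLM calls:
steps = [
    Step("plan",              1200, 300),  # initial reasoning over the task
    Step("tool: search",      1500,  80),  # choose arguments, emit tool call
    Step("tool: read_result", 2200, 120),  # interpret the tool's output
    Step("error recovery",    2400, 150),  # retry after a malformed tool call
    Step("final answer",      3100, 600),  # compose the user-facing response
]

total_in = sum(s.input_tokens for s in steps)
total_out = sum(s.output_tokens for s in steps)
print(f"{len(steps)} internal calls: {total_in} input + {total_out} output tokens")
# Per-request billing counts this as one unit; token billing counts all of it.
```

Note how the input side grows step by step: each internal call re-sends the accumulated context, so token consumption compounds with reasoning depth.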
Token-based pricing aligns financial incentives with actual computational resource consumption, as each reasoning step, tool interaction, and refinement cycle consumes additional tokens that directly impact infrastructure costs. This reflects the underlying economics of LLM inference, where computational requirements scale primarily with token quantity rather than request count 5).
Predictability: Per-request pricing offers simpler cost forecasting for developers unfamiliar with token economics. Token-based pricing requires understanding prompt engineering efficiency and output length implications, though it provides more granular cost control 6).
Alignment with Resources: Token-based pricing directly reflects GPU/TPU utilization, memory bandwidth, and energy consumption. Per-request pricing decouples billing from computational reality, creating a perverse incentive to pack as much internal token consumption as possible into each user-visible request.
Agentic Suitability: Agent-based systems with variable reasoning depth, multiple tool calls, and iterative refinement strongly favor token-based pricing. Per-request billing becomes economically inefficient when agents generate 5-50x more tokens internally than users input directly.
Pricing Transparency: Both models require developer education. Token-based systems demand understanding of input/output token ratios and reasoning overhead. Per-request models can obscure true costs for complex agent implementations.
Major LLM providers including OpenAI, Anthropic, Google, and Meta employ token-based pricing for their commercial offerings, reflecting this model's better fit with computational economics at scale. GitHub Copilot's transition from per-request pricing to token-based usage limits (measured per session and per week) validates this approach specifically for agent-heavy workloads 7).
Some enterprise offerings implement hybrid approaches with minimum per-request charges combined with token overage pricing, attempting to balance simplicity with fairness. However, pure per-request pricing has largely disappeared from production AI services due to its fundamental misalignment with infrastructure costs.
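A hybrid scheme of this kind reduces to a floor plus metered overage. The threshold and rates below are assumptions for the sketch, not any vendor's published terms:

```python
# Sketch of a hybrid bill: a per-request minimum that covers a token
# allowance, plus overage metered by token. All numbers are invented.

def hybrid_charge(tokens_used: int,
                  min_charge: float = 0.01,     # floor charged per request
                  included_tokens: int = 1000,  # tokens covered by the floor
                  overage_per_1k: float = 0.02) -> float:
    overage = max(0, tokens_used - included_tokens)
    return min_charge + (overage / 1000) * overage_per_1k

print(round(hybrid_charge(500), 4))   # under the allowance: just the floor
print(round(hybrid_charge(6000), 4))  # floor plus 5k tokens of overage
```

The floor preserves the simplicity of per-request forecasting for small calls, while the overage term restores cost alignment for token-heavy agentic calls.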
Developers adopting agent-based AI systems should expect token-based pricing and design accordingly: trim unnecessary context from prompts, design efficient tool-calling patterns, cache repeated computations, and monitor token consumption closely during development. Token budgets—enforced limits on session or weekly consumption—create hard constraints that agents must operate within, requiring careful planning of reasoning depth and tool call frequency.
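A budget of this kind can be enforced application-side with a small guard object. This is a minimal sketch assuming the application tracks its own usage; the class, limits, and exception are illustrative, not any provider's API:

```python
# Minimal application-side token-budget guard. Limits, names, and the
# exception type are assumptions for illustration, not a real provider API.

class TokenBudgetExceeded(RuntimeError):
    """Raised when a proposed call would exceed a configured budget."""

class TokenBudget:
    def __init__(self, session_limit: int, weekly_limit: int):
        self.session_limit = session_limit
        self.weekly_limit = weekly_limit
        self.session_used = 0
        self.weekly_used = 0

    def charge(self, tokens: int) -> None:
        """Record usage; refuse the call if either budget would be exceeded."""
        if (self.session_used + tokens > self.session_limit
                or self.weekly_used + tokens > self.weekly_limit):
            raise TokenBudgetExceeded(
                f"call of {tokens} tokens would exceed the remaining budget")
        self.session_used += tokens
        self.weekly_used += tokens

budget = TokenBudget(session_limit=50_000, weekly_limit=1_000_000)
budget.charge(12_000)    # accepted and recorded
# budget.charge(45_000)  # would raise TokenBudgetExceeded (session limit)
```

Checking before the call rather than after lets the agent degrade gracefully, for example by summarizing context or skipping optional tool calls when the remaining budget is tight.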