The per-request pricing model refers to a billing structure for AI services, particularly code generation and assistance tools, where charges are assessed on a per-request basis rather than on resource consumption metrics such as token count or computational time. This approach represents one of several pricing paradigms that have emerged in the commercial AI tools landscape, each with distinct implications for user behavior and service sustainability.
In a per-request pricing model, users or organizations are charged a fixed or variable fee for each individual API call or interaction with an AI service, regardless of the computational resources consumed by that request. This contrasts with token-based pricing, where costs scale directly with the number of tokens processed, and subscription-based models, where users pay a flat periodic fee for unlimited access within defined limits.
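The contrast between these three models can be sketched with a toy billing calculation. All fee values below are hypothetical, chosen only to illustrate how each model's charge responds (or fails to respond) to token consumption:

```python
# Hypothetical rates for illustration only -- not real prices.
PER_REQUEST_FEE = 0.01       # flat fee per API call
PRICE_PER_1K_TOKENS = 0.002  # token-based rate
MONTHLY_SUBSCRIPTION = 20.0  # flat periodic fee

def per_request_charge(num_requests: int, total_tokens: int) -> float:
    """Charge depends only on request count, not on tokens consumed."""
    return num_requests * PER_REQUEST_FEE

def token_based_charge(num_requests: int, total_tokens: int) -> float:
    """Charge scales directly with tokens processed."""
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

def subscription_charge(num_requests: int, total_tokens: int) -> float:
    """Flat fee regardless of usage, within the plan's defined limits."""
    return MONTHLY_SUBSCRIPTION

# Same request count, very different token consumption:
light_tokens = 100 * 500      # 100 simple queries
heavy_tokens = 100 * 50_000   # 100 agentic requests
print(per_request_charge(100, light_tokens),
      per_request_charge(100, heavy_tokens))   # identical charges
print(token_based_charge(100, light_tokens),
      token_based_charge(100, heavy_tokens))   # charges scale 100x
```

The per-request charge is identical for both workloads even though the heavy workload consumes a hundred times the tokens, which is the asymmetry the following sections examine.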
The per-request approach initially appealed to service providers as a straightforward billing mechanism because it created clear transaction boundaries and simplified accounting. However, the model assumes relatively consistent resource consumption across requests, an assumption that became increasingly problematic with the emergence of more complex AI workflows.
The per-request model encountered significant challenges when applied to agentic workflows—autonomous systems where AI models iteratively generate responses, evaluate outcomes, and refine approaches across multiple reasoning steps. These workflows are characterized by token-heavy computation patterns where a single logical request can consume substantially more computational resources than traditional query-response interactions.
For example, an agentic system performing research, code analysis, or decision-making may internally generate many intermediate thoughts, branch explorations, or refinement iterations within what the user perceives as a single request. Under per-request pricing, the service provider bears unbounded costs for that request while revenue remains fixed. Conversely, users lack transparency into the true computational cost of their workflows, creating misaligned incentives: a user may have no way of knowing whether a request will consume 500 tokens or 50,000 tokens.
The unsustainability of per-request pricing for compute-intensive workloads stems from a fundamental mismatch between revenue structure and cost structure. When infrastructure costs scale with token consumption but revenue is fixed per request, service providers face potential losses on high-consumption requests and cannot easily adjust pricing without disrupting user expectations.
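This mismatch can be made concrete with a toy margin calculation. The fee and cost figures below are hypothetical; the point is only the shape of the economics, where revenue is constant per request while cost grows with tokens:

```python
# Hypothetical provider economics under per-request pricing.
REQUEST_FEE = 0.01          # fixed revenue per request
COST_PER_1K_TOKENS = 0.003  # infrastructure cost, scales with tokens

def margin(tokens: int) -> float:
    """Provider profit (or loss) on a single request."""
    return REQUEST_FEE - tokens / 1000 * COST_PER_1K_TOKENS

print(margin(500))     # light query: small positive margin
print(margin(50_000))  # heavy agentic request: a loss many times the fee
```

Under these assumptions the break-even point sits at a few thousand tokens per request; every agentic request beyond it is served at a loss, while the fee cannot be raised without overcharging light users.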
This problem intensifies as AI systems become more capable and complex. Large language models executing multi-step reasoning, retrieval-augmented generation pipelines that query knowledge bases, or tool-using agents that make multiple API calls within a single user-initiated request all exhibit token consumption patterns that were unanticipated when per-request models were originally designed.
Market evolution in AI pricing has generally moved away from pure per-request models toward more granular measurement approaches. Token-based pricing directly measures computational input and output, aligning costs with resource consumption and enabling transparent billing. Subscription models with defined rate limits or monthly token allowances provide budget predictability for users while allowing providers to amortize infrastructure costs. Some services employ hybrid models combining subscription tiers with token overage charges.
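A hybrid plan of the kind described above can be sketched as a flat subscription that includes a monthly token allowance, with usage beyond the allowance metered at an overage rate. The tier values here are hypothetical, not any provider's actual pricing:

```python
# Hypothetical hybrid tier: subscription plus token overage charges.
SUBSCRIPTION_FEE = 20.0      # flat monthly fee
INCLUDED_TOKENS = 5_000_000  # tokens covered by the subscription
OVERAGE_PER_1K = 0.004       # rate for tokens beyond the allowance

def monthly_bill(tokens_used: int) -> float:
    """Flat fee, plus metered charges only for usage beyond the allowance."""
    overage_tokens = max(0, tokens_used - INCLUDED_TOKENS)
    return SUBSCRIPTION_FEE + overage_tokens / 1000 * OVERAGE_PER_1K

print(monthly_bill(3_000_000))  # within allowance: flat fee only
print(monthly_bill(8_000_000))  # 3M tokens over: flat fee plus overage
```

This structure gives users a predictable baseline bill while the overage term restores the alignment between provider cost and revenue for heavy consumption.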
The shift reflects a broader maturation of the AI services market, where pricing mechanisms have become more sophisticated to accommodate diverse usage patterns, from light interactive use to heavy agentic processing.