Budget-Aware Reasoning

Budget-aware reasoning optimizes LLM inference under token, cost, or latency constraints by dynamically allocating compute based on problem complexity, enabling token-efficient reasoning with early stopping and anytime solution delivery.

Overview

Large language models generate increasingly long reasoning traces (Chain-of-Thought, Tree-of-Thoughts, etc.) that improve accuracy but incur significant token costs. Budget-aware reasoning addresses the fundamental question: how can we achieve the best possible answer quality within a fixed computational budget? Key approaches include value tree search under budget constraints, token-budget estimation per problem, and anytime reasoning frameworks that produce improving solutions as more tokens are generated. [1]

Budget-aware value tree search frames reasoning as a search problem where each node expansion costs tokens and the total budget is constrained:

<latex>\max_{\pi} \; V(\pi) \quad \text{s.t.} \quad C(\pi) \leq B</latex>

where $V(\pi)$ is the value (solution quality) of reasoning path $\pi$, $C(\pi)$ is the token cost, and $B$ is the total budget.

The value of a reasoning node $n$ at depth $d$ is estimated as:

<latex>V(n) = R(n) + \gamma \max_{c \in \text{children}(n)} V(c)</latex>

where $R(n)$ is the immediate quality estimate and $\gamma$ is a discount factor that trades off depth vs. breadth exploration.

Budget allocation: Rather than exploring all branches uniformly, the search allocates tokens proportionally to the estimated value of each branch:

<latex>B(n_i) = B_{\text{remaining}} \cdot \frac{V(n_i)}{\sum_j V(n_j)}</latex>
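As a concrete sketch, the proportional split can be computed directly from the branch values (the `allocate` helper and truncation-based rounding are illustrative choices, not from the paper):

```python
import numpy as np

def allocate(values, remaining: int):
    # Clamp values so a zero-valued branch cannot cause division by zero,
    # then split the remaining budget in proportion to branch value.
    v = np.maximum(np.asarray(values, dtype=float), 1e-8)
    return (remaining * v / v.sum()).astype(int)

print(allocate([3.0, 2.0, 1.0], 1000))  # -> [500 333 166]
```

Truncation can leave a few tokens unallocated; a real implementation might hand the remainder to the highest-valued branch.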

Token-Budget Estimation

The TALE framework estimates per-problem token budgets based on reasoning complexity [2]:

<latex>B_{optimal}(x) = f_{estimator}(x, \text{complexity}(x))</latex>

where $x$ is the input problem and $f_{estimator}$ predicts the minimum tokens needed for a correct solution. Including a specific token budget in prompts compresses outputs while maintaining accuracy.
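A minimal sketch of budget-in-prompt compression, assuming a TALE-style instruction appended to the problem (the exact wording and the `budget_prompt` helper are illustrative, not the paper's API):

```python
def budget_prompt(problem: str, budget: int) -> str:
    # The estimated budget is injected into the prompt itself, asking
    # the model to keep its reasoning within that many tokens.
    return (f"{problem}\n"
            f"Let's think step by step and use less than {budget} tokens.")

print(budget_prompt("A train travels 120 km in 1.5 h. What is its speed?", 64))
```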

Anytime Reasoning with Early Stopping

Anytime reasoning produces progressively improving solutions as more tokens are generated, enabling early termination when quality is sufficient or budget is exhausted. [3]

Anytime Index: Quantifies quality improvement per added token:

<latex>\text{AI} = \frac{1}{|B|} \sum_{b \in B} \frac{\Delta Q(b)}{\Delta b}</latex>

where $Q(b)$ is solution quality at budget $b$ (measured in tokens), and the index averages the marginal quality gain per token across budget increments.
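The index can be computed from a trace of (budget, quality) measurements; this `anytime_index` helper is an illustrative sketch of the definition above:

```python
def anytime_index(trace):
    # trace: list of (budget_in_tokens, quality) pairs with increasing budget.
    # AI = mean marginal quality gain per token across budget increments.
    gains = [(q1 - q0) / (b1 - b0)
             for (b0, q0), (b1, q1) in zip(trace, trace[1:])]
    return sum(gains) / len(gains)

trace = [(100, 0.40), (200, 0.60), (300, 0.65), (400, 0.66)]
print(round(anytime_index(trace), 6))  # -> 0.000867
```

A flat curve (index near zero) signals that extra tokens are no longer buying quality, which is exactly the condition the early-stopping criterion below exploits.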

Early Stopping Criterion: Monitor convergence in self-consistency and stop when marginal gains diminish:

<latex>\text{Stop when} \quad \frac{Q(b + \Delta b) - Q(b)}{Q(b)} < \epsilon</latex>

Key finding: Budget-aware self-consistency (CoT-SC) outperforms more complex strategies like multi-agent debate (MAD) or Tree-of-Thoughts (ToT) at equivalent budgets, suggesting that simpler methods with proper budget allocation are more efficient.
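The finding above suggests a simple budget-aware variant of self-consistency: sample chains one at a time and stop once the majority answer stabilizes, rather than always drawing a fixed K. This sketch assumes a `sample_fn` callable that returns one chain's final answer (both names are illustrative):

```python
from collections import Counter

def budget_aware_sc(sample_fn, max_samples: int = 16, window: int = 4):
    # Draw reasoning chains one at a time and stop early once the
    # majority-vote answer has been stable for `window` consecutive draws.
    answers, leader, stable = [], None, 0
    for _ in range(max_samples):
        answers.append(sample_fn())
        new_leader = Counter(answers).most_common(1)[0][0]
        stable = stable + 1 if new_leader == leader else 1
        leader = new_leader
        if stable >= window:
            break
    return leader, len(answers)

# A model that always answers "42" converges after `window` samples:
chains = iter(["42"] * 16)
print(budget_aware_sc(lambda: next(chains)))  # -> ('42', 4)
```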

Code Example

import numpy as np
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    content: str
    token_cost: int
    quality_estimate: float
    children: list = field(default_factory=list)
    depth: int = 0

class BudgetAwareSearch:
    """Value tree search under a total token budget.

    `llm` is assumed to expose `generate_candidates(context, n)`, returning
    ReasoningNode-like objects, and `generate(prompt, max_tokens)`.
    """

    def __init__(self, llm, total_budget: int, gamma: float = 0.9,
                 max_depth: int = 5):
        self.llm = llm
        self.total_budget = total_budget
        self.gamma = gamma            # depth-vs-breadth discount factor
        self.max_depth = max_depth
        self.tokens_used = 0

    def search(self, problem: str) -> str:
        # expand_node already returns the best candidate at the root.
        return self.expand_node(problem, depth=0).content

    def expand_node(self, context: str, depth: int) -> ReasoningNode:
        remaining = self.total_budget - self.tokens_used
        if remaining <= 0 or depth > self.max_depth:
            # Budget exhausted or depth limit hit: score context as a leaf.
            return ReasoningNode(context, 0, self.evaluate(context), depth=depth)

        candidates = self.llm.generate_candidates(context, n=3)
        spent = sum(c.token_cost for c in candidates)
        self.tokens_used += spent

        # Split what is left in proportion to each candidate's value.
        budget_per_child = self.allocate_budget(candidates, remaining - spent)
        for candidate, budget in zip(candidates, budget_per_child):
            if budget > 0:
                candidate.children = [
                    self.expand_node(candidate.content, depth + 1)
                ]
        return max(candidates, key=self.node_value)

    def allocate_budget(self, nodes, remaining):
        values = np.array([n.quality_estimate for n in nodes])
        values = np.maximum(values, 1e-8)  # avoid division by zero
        return (remaining * values / values.sum()).astype(int)

    def node_value(self, node: ReasoningNode) -> float:
        # Discounted best-child backup: V(n) = R(n) + gamma * max_c V(c)
        child_val = max(
            (self.node_value(c) for c in node.children),
            default=0.0,
        )
        return node.quality_estimate + self.gamma * child_val

    def evaluate(self, text: str) -> float:
        # Placeholder quality estimate; replace with a verifier,
        # reward model, or self-evaluation prompt in practice.
        raise NotImplementedError

    def anytime_reason(self, problem: str, epsilon: float = 0.01) -> str:
        # Regenerate with a growing token cap and stop once the relative
        # quality gain falls below epsilon.
        best_answer, prev_quality = "", 0.0
        for budget_step in range(100, self.total_budget + 1, 100):
            answer = self.llm.generate(problem, max_tokens=budget_step)
            quality = self.evaluate(answer)
            improvement = (quality - prev_quality) / max(prev_quality, 1e-8)
            if prev_quality > 0 and improvement < epsilon:
                break
            best_answer, prev_quality = answer, quality
        return best_answer

Architecture

graph TD
    A[Input Problem] --> B[Complexity Estimator]
    B --> C[Token Budget Allocation]
    C --> D[Value Tree Search]
    D --> E[Node Expansion]
    E --> F{Budget Remaining?}
    F -->|Yes| G[Generate Candidate Reasoning Steps]
    G --> H[Estimate Node Values]
    H --> I[Proportional Budget Split]
    I --> E
    F -->|No| J[Select Best Path]
    J --> K[Solution]
    subgraph Anytime Reasoning
        L[Progressive Generation] --> M[Quality Monitor]
        M --> N{Marginal Gain > epsilon?}
        N -->|Yes| L
        N -->|No| O[Early Stop - Return Best]
    end
    subgraph Self-Consistency
        P[Sample K Reasoning Chains] --> Q[Majority Vote]
        Q --> R[Budget-Aware CoT-SC]
    end

Comparative Results

Method | Budget Efficiency | Accuracy | Complexity
CoT-SC (budget-aware) | Best at low budgets | Competitive | Low
Tree-of-Thoughts | Poor at low budgets | Good at high budgets | High
Multi-Agent Debate | Poor at low budgets | Diminishing returns | Very high
TALE token-budget | Good (per-problem) | Slight accuracy drop | Low
Anytime + early stop | Best overall | Progressive improvement | Medium

References

[1] "Budget-Aware Value Tree Search for Token-Efficient LLM Reasoning" (2026). https://arxiv.org/abs/2603.12634
[2] Han & Wang. "TALE: Token-Budget-Aware LLM Reasoning" (2024). https://arxiv.org/abs/2412.18547
[3] "Anytime Reasoning with Budget-Aware Self-Improvement" (2025). https://arxiv.org/abs/2601.11038