Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Code & Software
Safety & Security
Evaluation
Research
Development
Meta
Test-time compute scaling (also called inference-time scaling or TTS) refers to techniques that allocate additional computational resources during inference to improve LLM reasoning and output quality, rather than relying solely on increased pretraining scale. By allowing models to “think longer” at inference time, smaller models can match or exceed the performance of much larger ones on complex tasks.
Traditional scaling laws focus on pretraining: more parameters, more data, more FLOPs. Test-time compute scaling introduces a complementary axis – scaling compute at inference. The key insight from Snell et al. (arXiv:2408.03314) and subsequent work (arXiv:2501.02497) is that there exist compute-optimal strategies for how to spend inference FLOPs, analogous to Chinchilla-optimal training.
The inference-to-pretraining token ratio $R = \frac{\text{inference tokens}}{\text{pretraining tokens}}$ determines which strategy dominates:
Generate $N$ candidate responses in parallel, then select the highest-scoring one using a verifier (typically a Process Reward Model). The expected quality scales as:
$$\mathbb{E}\!\left[\max_{i=1}^{N} r(y_i)\right] \geq \mathbb{E}[r(y)]$$
with diminishing marginal returns as $N$ increases. This provides broad coverage but is compute-intensive for large $N$ and less effective on difficult prompts compared to adaptive methods.
```python
# Simplified best-of-N sampling
import numpy as np

def best_of_n(model, verifier, prompt, n=16):
    candidates = [model.generate(prompt) for _ in range(n)]
    scores = [verifier.score(prompt, c) for c in candidates]
    return candidates[np.argmax(scores)]
```
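The diminishing returns in $N$ can be checked numerically. A quick Monte Carlo sketch, using standard-normal rewards as an arbitrary stand-in for verifier scores:

```python
import random

def expected_max_reward(n, trials=20000, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. N(0,1) rewards]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

# E[max] grows roughly like sqrt(2 ln n), so each doubling of n
# buys less than the previous one.
for n in (1, 2, 4, 8, 16):
    print(n, round(expected_max_reward(n), 2))
```

Doubling $N$ from 1 to 2 improves the expected best score far more than doubling from 8 to 16, which is the diminishing-returns behavior noted above.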
Maintain a beam of top-$k$ candidate reasoning paths (chains-of-thought), iteratively expanding and pruning based on Process Reward Model scores. This sequential refinement outperforms best-of-N by focusing compute where it matters most:
At each step $t$, the beam retains the top-$k$ partial trajectories by cumulative PRM score:
$$\mathcal{B}_t = \text{top-}k\!\left\{\tau_{1:t} : \sum_{i=1}^{t} r(s_i, a_i)\right\}$$
Beam search achieves 4x better efficiency than best-of-N baselines in FLOPs-matched comparisons.
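The expand-and-prune loop above can be sketched in a few lines. `extend_step` and `prm_score` are hypothetical interfaces standing in for the policy model (propose candidate next reasoning steps) and the PRM (score the latest step of a partial trajectory):

```python
# PRM-guided beam search over reasoning steps (sketch).
# Hypothetical interfaces:
#   extend_step(prompt, path, n) -> list of n candidate next steps
#   prm_score(prompt, path)      -> reward for the latest step in path

def beam_search(extend_step, prm_score, prompt, k=4, expand=4, max_steps=8):
    beams = [([], 0.0)]  # (partial trajectory, cumulative PRM score)
    for _ in range(max_steps):
        candidates = []
        for path, score in beams:
            for step in extend_step(prompt, path, n=expand):
                new_path = path + [step]
                candidates.append((new_path, score + prm_score(prompt, new_path)))
        # prune: keep the top-k partial trajectories by cumulative score
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]  # best full trajectory
```

Unlike best-of-N, compute is spent extending only the $k$ most promising prefixes at each step rather than completing every sampled trajectory.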
The compute-optimal approach estimates prompt difficulty (e.g., via pass@1 rate $p$) and allocates compute adaptively. The optimal number of samples $N^*$ for a given compute budget $C$ satisfies:
$$N^*(p, C) = \arg\max_N \; P(\text{at least one correct} \mid N) = \arg\max_N \; \left[1 - (1-p)^N\right] \quad \text{s.t.} \; N \cdot c_{\text{gen}} \leq C$$
where $c_{\text{gen}}$ is the cost per generation. This adaptive allocation yields dramatically better efficiency than spending a uniform compute budget on every prompt.
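Since $1-(1-p)^N$ is monotone in $N$, the interesting decision is how to split a shared budget across prompts of varying difficulty: the marginal gain of one more sample on a prompt that already has $n$ is $p(1-p)^n$, so a greedy allocator can spend each unit of budget where that gain is largest. A minimal sketch, assuming per-prompt difficulty estimates $p$ are available (e.g., from pass@1 rates):

```python
import heapq

def allocate_samples(difficulties, total_budget):
    """Greedily assign `total_budget` samples across prompts.

    Each step goes to the prompt with the highest marginal gain
    p * (1 - p)**n: the added probability that the next sample is
    the first correct one for that prompt.
    """
    n = [0] * len(difficulties)
    # max-heap (negated gains) keyed on each prompt's next-sample gain
    heap = [(-p, i) for i, p in enumerate(difficulties)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)
        n[i] += 1
        p = difficulties[i]
        heapq.heappush(heap, (-(p * (1 - p) ** n[i]), i))
    return n
```

An easy prompt (high $p$) saturates after a sample or two, while hard prompts keep a nonzero marginal gain for longer and so absorb most of the budget.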
OpenAI's o1 and o3 models represent the state-of-the-art in internal test-time scaling: they are trained via RL to produce extended intermediate reasoning before generating final answers.
DeepSeek-R1 achieves similar capabilities through cold-start fine-tuning combined with RL on structured reasoning data, and its reasoning distills well: 7B models trained on R1 outputs beat 32B predecessors.
Key empirical findings from large-scale comparisons (arXiv:2512.02008, 30B+ tokens, 8 LLMs from 7B-235B):