Test-Time Compute Scaling

Test-time compute scaling (also called inference-time scaling or TTS) refers to techniques that allocate additional computational resources during inference to improve LLM reasoning and output quality, rather than relying solely on increased pretraining scale. By allowing models to “think longer” at inference time, smaller models can match or exceed the performance of much larger ones on complex tasks.

Background and Motivation

Traditional scaling laws focus on pretraining: more parameters, more data, more FLOPs. Test-time compute scaling introduces a complementary axis – scaling compute at inference. The key insight from Snell et al. (arXiv:2408.03314) and subsequent work (arXiv:2501.02497) is that there exist compute-optimal strategies for how to spend inference FLOPs, analogous to Chinchilla-optimal training.

The inference-to-pretraining token ratio $R = \frac{\text{inference tokens}}{\text{pretraining tokens}}$ determines which strategy dominates: when $R$ is small (light inference demand), spending extra FLOPs at test time tends to beat pretraining a larger model, whereas at large $R$ the cost of repeated inference favors investing those FLOPs in pretraining instead.

Core Techniques

Best-of-N Sampling

Generate $N$ candidate responses in parallel, then select the highest-scoring one using a verifier (typically a Process Reward Model). The expected quality scales as:

$$\mathbb{E}\!\left[\max_{i=1}^{N} r(y_i)\right] \geq \mathbb{E}[r(y)]$$

with diminishing marginal returns as $N$ increases. This provides broad coverage but is compute-intensive for large $N$ and less effective on difficult prompts compared to adaptive methods.

# Simplified best-of-N sampling: draw N candidates, keep the one
# the verifier (e.g., a Process Reward Model) scores highest.
import numpy as np

def best_of_n(model, verifier, prompt, n=16):
    # Sample N independent completions for the prompt.
    candidates = [model.generate(prompt) for _ in range(n)]
    # Score each candidate with the verifier.
    scores = [verifier.score(prompt, c) for c in candidates]
    # Return the highest-scoring candidate.
    return candidates[np.argmax(scores)]
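The diminishing returns are easy to quantify. Assuming each sample is independently correct with probability $p$ and an oracle verifier (the $p = 0.2$ below is an illustrative value, not from the text), the $n$-th sample raises the success probability by $p(1-p)^{n-1}$:

```python
# Success probability of best-of-N under an oracle verifier, assuming
# each sample is independently correct with rate p (pass@1).
def pass_at_n(p, n):
    return 1.0 - (1.0 - p) ** n

p = 0.2  # illustrative pass@1 rate, not a value from the text
marginal = [pass_at_n(p, n) - pass_at_n(p, n - 1) for n in range(1, 9)]
print([round(m, 3) for m in marginal])
# → [0.2, 0.16, 0.128, 0.102, 0.082, 0.066, 0.052, 0.042]
```

Each additional sample buys less than the last, which is why difficulty-adaptive methods reallocate the budget rather than raising $N$ uniformly.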

Beam Search over Thoughts

Maintain a beam of top-$k$ candidate reasoning paths (chains-of-thought), iteratively expanding and pruning based on Process Reward Model scores. This sequential refinement outperforms best-of-N by focusing compute where it matters most:

  1. Generate initial candidates
  2. Select top 2-4 based on PRM scores after first reasoning step
  3. Expand each, rescore, prune again
  4. Repeat until completion

At each step $t$, the beam retains the top-$k$ partial trajectories by cumulative PRM score:

$$\mathcal{B}_t = \text{top-}k\!\left\{\tau_{1:t} : \sum_{i=1}^{t} r(s_i, a_i)\right\}$$

In FLOPs-matched comparisons, Snell et al. report that this kind of search can match best-of-N performance with roughly 4x less test-time compute.
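The expand-score-prune loop above can be sketched as follows. Here `expand` (a step generator) and `prm_score` (a cumulative process-reward scorer) are hypothetical callables standing in for the model and PRM, not any specific library API:

```python
import heapq

def beam_search(prompt, expand, prm_score, beam_width=4, branch=4, max_steps=8):
    """PRM-guided beam search over partial reasoning trajectories.

    expand(prompt, traj)    -> list of candidate next reasoning steps
    prm_score(prompt, traj) -> cumulative reward of a partial trajectory
    """
    beam = [()]  # start from the empty trajectory
    for _ in range(max_steps):
        candidates = [
            traj + (step,)
            for traj in beam
            for step in expand(prompt, traj)[:branch]
        ]
        if not candidates:  # every trajectory has finished expanding
            break
        # Prune: keep the top-k trajectories by cumulative PRM score.
        beam = heapq.nlargest(beam_width, candidates,
                              key=lambda t: prm_score(prompt, t))
    return max(beam, key=lambda t: prm_score(prompt, t))

# Toy instantiation: steps are integers scored by their sum, so the
# search should recover the all-3s trajectory.
toy_expand = lambda prompt, traj: [0, 1, 2, 3] if len(traj) < 3 else []
toy_score = lambda prompt, traj: sum(traj)
best = beam_search("q", toy_expand, toy_score, beam_width=2)
# best == (3, 3, 3)
```

Because pruning happens after every step, compute concentrates on the trajectories the PRM currently rates highest, rather than being spread uniformly as in best-of-N.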

Internal vs External Scaling

External scaling wraps a fixed model in a search-and-verification procedure driven by an outside verifier (best-of-N, beam search), while internal scaling trains the model itself to produce longer reasoning before answering, as in OpenAI's o1.

Compute-Optimal Strategies

The compute-optimal approach estimates prompt difficulty (e.g., via pass@1 rate $p$) and allocates compute adaptively. The optimal number of samples $N^*$ for a given compute budget $C$ satisfies:

$$N^*(p, C) = \arg\max_N \; P(\text{at least one correct} \mid N) = \arg\max_N \; \left[1 - (1-p)^N\right] \quad \text{s.t.} \; N \cdot c_{\text{gen}} \leq C$$

where $c_{\text{gen}}$ is the cost per generation. Since $1 - (1-p)^N$ increases in $N$ with shrinking marginal gain $p(1-p)^N$, compute is best spent on the prompts where that marginal gain is currently largest. This adaptive allocation yields dramatically better efficiency than a uniform compute budget across all prompts.
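One simple (illustrative) way to realize such adaptive allocation is a greedy scheme that always gives the next sample to the prompt with the largest marginal gain $p_i(1-p_i)^{N_i}$; the pass@1 estimates below are made-up values:

```python
import heapq

def allocate_samples(pass1_rates, total_budget):
    """Greedy difficulty-adaptive allocation of a sampling budget.

    Maximizes the expected number of prompts solved,
    sum_i [1 - (1 - p_i)^{N_i}], subject to sum_i N_i = total_budget.
    The marginal gain of one more sample for prompt i is
    p_i * (1 - p_i)^{N_i}, decreasing in N_i, so greedy is optimal.
    """
    counts = [0] * len(pass1_rates)
    heap = [(-p, i) for i, p in enumerate(pass1_rates)]  # max-heap by gain
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)  # prompt with largest marginal gain
        counts[i] += 1
        p = pass1_rates[i]
        heapq.heappush(heap, (-(p * (1 - p) ** counts[i]), i))
    return counts

# Illustrative pass@1 estimates: one easy, one medium, one hard prompt.
counts = allocate_samples([0.9, 0.5, 0.1], total_budget=12)
# The hard prompt (p = 0.1) receives the most samples.
```

The easy prompt is solved almost surely after a sample or two, so nearly all of the remaining budget flows to the harder prompts, unlike a uniform split.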

The o1/o3 Approach

OpenAI's o1 and o3 models represent the state of the art in internal test-time scaling. These models are trained via RL to produce extended intermediate reasoning before generating final answers. Key properties:

  1. The reasoning happens inside a single generation; no external verifier or sampling scaffold is required.
  2. Accuracy improves as the model is allowed more "thinking" tokens, giving a tunable quality/latency trade-off.
  3. The reasoning behavior is learned through large-scale reinforcement learning rather than prompting alone.

DeepSeek-R1 achieves similar capabilities using cold-start fine-tuning followed by RL on structured reasoning data, and distills successfully: 7B models trained on R1 outputs outperform 32B predecessors.

Scaling Laws at Inference

Key empirical findings come from large-scale comparisons (arXiv:2512.02008) spanning over 30B generated tokens and 8 LLMs ranging from 7B to 235B parameters.
