Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Code & Software
Safety & Security
Evaluation
Research
Development
Meta
Test-time compute scaling (also called inference-time scaling or TTS) refers to techniques that allocate additional computational resources during inference to improve LLM reasoning and output quality, rather than relying solely on increased pretraining scale. By allowing models to “think longer” at inference time, smaller models can match or exceed the performance of much larger ones on complex tasks.
Traditional scaling laws focus on pretraining: more parameters, more data, more FLOPs. Test-time compute scaling introduces a complementary axis – scaling compute at inference. The key insight from Snell et al. (arXiv:2408.03314) and subsequent work (arXiv:2501.02497) is that there exist compute-optimal strategies for how to spend inference FLOPs, analogous to Chinchilla-optimal training.
The inference-to-pretraining token ratio $R = \frac{\text{inference tokens}}{\text{pretraining tokens}}$ determines which strategy dominates:
Generate $N$ candidate responses in parallel, then select the highest-scoring one using a verifier (typically a Process Reward Model). The expected quality scales as:
$$\mathbb{E}\!\left[\max_{i=1}^{N} r(y_i)\right] \geq \mathbb{E}[r(y)]$$
with diminishing marginal returns as $N$ increases. This provides broad coverage but is compute-intensive for large $N$ and less effective on difficult prompts compared to adaptive methods.
```python
# Simplified best-of-N sampling
import numpy as np

def best_of_n(model, verifier, prompt, n=16):
    candidates = [model.generate(prompt) for _ in range(n)]
    scores = [verifier.score(prompt, c) for c in candidates]
    return candidates[np.argmax(scores)]
```
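The diminishing returns in $N$ can be checked numerically. A quick Monte Carlo sketch, using standard-normal rewards as an arbitrary stand-in for verifier scores:

```python
import random

def expected_max_reward(n, trials=20000, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. N(0,1) rewards]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

# E[max] grows roughly like sqrt(2 ln n), so each doubling of n
# buys less than the previous one.
for n in (1, 2, 4, 8, 16):
    print(n, round(expected_max_reward(n), 2))
```

Doubling $N$ from 1 to 2 improves the expected best score far more than doubling from 8 to 16, which is the diminishing-returns behavior noted above.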
Maintain a beam of top-$k$ candidate reasoning paths (chains-of-thought), iteratively expanding and pruning based on Process Reward Model scores. This sequential refinement outperforms best-of-N by focusing compute where it matters most:
At each step $t$, the beam retains the top-$k$ partial trajectories by cumulative PRM score:
$$\mathcal{B}_t = \text{top-}k\!\left\{\tau_{1:t} : \sum_{i=1}^{t} r(s_i, a_i)\right\}$$
Beam search achieves 4x better efficiency than best-of-N baselines in FLOPs-matched comparisons.
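The expand-and-prune loop above can be sketched in a few lines. `extend_step` and `prm_score` are hypothetical interfaces standing in for the policy model (propose candidate next reasoning steps) and the PRM (score the latest step of a partial trajectory):

```python
# PRM-guided beam search over reasoning steps (sketch).
# Hypothetical interfaces:
#   extend_step(prompt, path, n) -> list of n candidate next steps
#   prm_score(prompt, path)      -> reward for the latest step in path

def beam_search(extend_step, prm_score, prompt, k=4, expand=4, max_steps=8):
    beams = [([], 0.0)]  # (partial trajectory, cumulative PRM score)
    for _ in range(max_steps):
        candidates = []
        for path, score in beams:
            for step in extend_step(prompt, path, n=expand):
                new_path = path + [step]
                candidates.append((new_path, score + prm_score(prompt, new_path)))
        # prune: keep the top-k partial trajectories by cumulative score
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]  # best full trajectory
```

Unlike best-of-N, compute is spent extending only the $k$ most promising prefixes at each step rather than completing every sampled trajectory.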
The compute-optimal approach estimates prompt difficulty (e.g., via pass@1 rate $p$) and allocates compute adaptively. The optimal number of samples $N^*$ for a given compute budget $C$ satisfies:
$$N^*(p, C) = \arg\max_N \; P(\text{at least one correct} \mid N) = \arg\max_N \; \left[1 - (1-p)^N\right] \quad \text{s.t.} \; N \cdot c_{\text{gen}} \leq C$$
where $c_{\text{gen}}$ is the cost per generation. This adaptive allocation yields dramatically better efficiency than spending a uniform compute budget on every prompt.
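Since $1-(1-p)^N$ is monotone in $N$, the interesting decision is how to split a shared budget across prompts of varying difficulty: the marginal gain of one more sample on a prompt that already has $n$ is $p(1-p)^n$, so a greedy allocator can spend each unit of budget where that gain is largest. A minimal sketch, assuming per-prompt difficulty estimates $p$ are available (e.g., from pass@1 rates):

```python
import heapq

def allocate_samples(difficulties, total_budget):
    """Greedily assign `total_budget` samples across prompts.

    Each step goes to the prompt with the highest marginal gain
    p * (1 - p)**n: the added probability that the next sample is
    the first correct one for that prompt.
    """
    n = [0] * len(difficulties)
    # max-heap (negated gains) keyed on each prompt's next-sample gain
    heap = [(-p, i) for i, p in enumerate(difficulties)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)
        n[i] += 1
        p = difficulties[i]
        heapq.heappush(heap, (-(p * (1 - p) ** n[i]), i))
    return n
```

An easy prompt (high $p$) saturates after a sample or two, while hard prompts keep a nonzero marginal gain for longer and so absorb most of the budget.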
OpenAI's o1 and o3 models represent the state-of-the-art in internal test-time scaling: they are trained via RL to produce extended intermediate reasoning before generating final answers.
DeepSeek-R1 achieves similar capabilities through cold-start fine-tuning combined with RL on structured reasoning data, and its reasoning distills well: 7B models trained on R1 outputs beat 32B predecessors.
Key empirical findings from large-scale comparisons (arXiv:2512.02008, 30B+ tokens, 8 LLMs from 7B-235B):