====== Test-Time Compute Scaling ======
Traditional scaling laws focus on pretraining: performance improves as parameters, data, and training compute grow together. Test-time compute scaling instead spends additional compute at inference to improve a fixed model's outputs.
The inference-to-pretraining token ratio $R = \frac{\text{inference tokens}}{\text{pretraining tokens}}$ determines which strategy dominates:

  * $R \ll 1$ (few queries): Test-time compute excels; smaller models with heavy test-time scaling can outperform models up to 14x larger
  * $R \gg 1$ (high-volume production): Pretraining dominates; the extra inference cost is paid on every query, so a larger pretrained model amortizes better
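The regime choice can be sketched as a simple threshold rule. The cutoffs below are illustrative placeholders for the $R \ll 1$ and $R \gg 1$ regimes, not values from any fitted scaling law:

<code python>
def choose_strategy(inference_tokens: int, pretraining_tokens: int) -> str:
    """Pick a scaling strategy from the inference-to-pretraining ratio R.

    Thresholds are illustrative, not fitted constants.
    """
    r = inference_tokens / pretraining_tokens
    if r < 1e-2:
        return "test-time scaling"   # few queries: spend compute at inference
    if r > 1e2:
        return "pretraining"         # high volume: a bigger model amortizes
    return "mixed"

# Example: a model pretrained on 10T tokens serving 1B inference tokens (R = 1e-4)
print(choose_strategy(10**9, 10**13))
</code>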
===== Core Techniques =====
==== Best-of-N Sampling ====
Generate $N$ candidate responses in parallel, then select the highest-scoring one using a verifier (typically a Process Reward Model). The expected reward of the selected response satisfies

$$\mathbb{E}\!\left[\max_{i=1}^{N} r(y_i)\right] \geq \mathbb{E}[r(y)]$$

with diminishing marginal returns as $N$ increases. This provides broad coverage but is compute-intensive for large $N$ and less effective on difficult prompts compared to adaptive methods.
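A minimal sketch of the selection loop. Here `generate` and `score` are toy stand-ins for the policy model and the PRM, not a real model API:

<code python>
import random

def best_of_n(generate, score, n: int):
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: candidates are numbers, the mock "PRM" prefers values near 1.0.
random.seed(0)
pick = best_of_n(lambda: random.gauss(0.0, 1.0), lambda y: -abs(y - 1.0), 16)
</code>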
==== Beam Search over Thoughts ====
Maintain a beam of top-$k$ candidate reasoning paths (chains-of-thought), scoring each intermediate step with a process reward model (PRM):
  - Generate initial candidates
  - Score each partial chain with the PRM and prune to the top-$k$
  - Expand each, rescore, prune again
  - Repeat until completion
| + | |||
| + | At each step $t$, the beam retains the top-$k$ partial trajectories by cumulative PRM score: | ||
| + | |||
| + | $$\mathcal{B}_t = \text{top-}k\!\left\{\tau_{1: | ||
Beam search achieves **4x better efficiency** than best-of-N baselines in FLOPs-matched comparisons.
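The expand-score-prune loop can be sketched as follows; `expand` and `prm_score` are hypothetical stand-ins for the step generator and the process reward model:

<code python>
def beam_search(expand, prm_score, k: int, steps: int):
    """Keep the top-k reasoning chains by cumulative PRM score at each step."""
    beam = [[]]                                    # start from one empty chain
    for _ in range(steps):
        # Expand every chain by one step, then prune back to the k best.
        candidates = [c + [s] for c in beam for s in expand(c)]
        candidates.sort(key=lambda c: sum(prm_score(s) for s in c), reverse=True)
        beam = candidates[:k]
    return beam[0]                                 # highest-scoring chain
</code>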
===== Compute-Optimal Strategies =====
The compute-optimal approach estimates prompt difficulty (e.g., via pass@1 rate $p$) and allocates compute adaptively. The optimal number of samples $N^*$ for a given compute budget $C$ satisfies:

$$N^*(p, C) = \arg\max_N \; P(\text{at least one correct} \mid N) = \arg\max_N \; \left[1 - (1-p)^N\right] \quad \text{s.t.} \; N \cdot c_{\text{gen}} \leq C$$

where $c_{\text{gen}}$ is the cost per generation. This yields:
  * **Easy prompts**: Favor iterative self-revision with minimal overhead
  * **Medium prompts**: Use moderate beam search with PRM guidance
  * **Hard prompts**: Deploy full beam search or parallel sampling with maximum budget
This adaptive allocation yields dramatically better efficiency than uniform compute budgets across all prompts.
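The allocation rule can be sketched as below. The pass@1 thresholds are illustrative, and `optimal_samples` simply exhausts the budget because $1 - (1-p)^N$ is increasing in $N$, so the constraint binds:

<code python>
def optimal_samples(budget: float, cost_per_gen: float) -> int:
    """N* under the budget constraint: since 1-(1-p)^N is monotone in N,
    the optimum spends the whole budget."""
    return max(1, int(budget // cost_per_gen))

def allocate(pass_at_1: float) -> str:
    """Map estimated difficulty to a strategy (thresholds illustrative)."""
    if pass_at_1 > 0.5:
        return "self-revision"        # easy: light sequential refinement
    if pass_at_1 > 0.1:
        return "beam search"          # medium: PRM-guided beam search
    return "parallel sampling"        # hard: maximum-budget search
</code>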