====== Post-Training RL vs Model Scaling ======

The AI field is engaged in a fundamental debate: should capability improvements come from **scaling up pre-training** (bigger models, more data, more compute) or from **post-training reinforcement learning** (RL techniques applied after pre-training to teach reasoning and alignment)? Evidence from 2024-2026 increasingly favors post-training RL as a more compute-efficient path to advanced capabilities. ((Source: [[https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1|Epoch AI - What Went Into Training DeepSeek R1]]))

===== The Scaling Laws Debate =====

**Scaling laws** (Kaplan et al. 2020; Chinchilla, 2022) established that LLM performance improves predictably with more parameters, training data, and compute. This drove the race to build ever-larger models such as GPT-4, Gemini Ultra, and Llama 3.1 405B.

However, by 2024-2025, signs of **diminishing returns** emerged. Each doubling of pre-training compute produced smaller gains, and the cost of frontier pre-training runs reached hundreds of millions of dollars, with some estimates for next-generation runs exceeding a billion. Raw scaling was hitting economic and physical limits. ((Source: [[https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1|Epoch AI - DeepSeek R1 Analysis]]))

===== The Rise of Post-Training RL =====

Post-training RL applies reinforcement learning **after** the base model is pre-trained, teaching it to reason through problems step by step rather than simply predicting the next token. Key milestones:

  * **OpenAI o1** (September 2024): Demonstrated that RL with chain-of-thought reasoning could achieve breakthroughs on math, science, and coding benchmarks that pure scaling had not reached.
((Source: [[https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1|Interconnects - DeepSeek R1 Recipe]]))
  * **DeepSeek R1** (January 2025): Matched o1 on major benchmarks (79.8% vs 79.2% on the AIME math competition) at a fraction of the training cost, proving that RL-based reasoning was reproducible and efficient. ((Source: [[https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1|Interconnects - DeepSeek R1]]))
  * **OpenAI o3** (2025): Extended the reasoning-model paradigm with further RL refinements.

===== RL Techniques =====

Several RL approaches are used in post-training:

  * **RLHF** (RL from Human Feedback): Human evaluators rank model outputs; a reward model trained on these rankings guides RL optimization.
  * **RLAIF** (RL from AI Feedback): Another AI model provides the rankings, scaling the process beyond human capacity.
  * **RL from Verifiable Rewards**: The model is rewarded for objectively correct answers (math proofs, passing test cases), eliminating the need for human or AI judges. ((Source: [[https://forum.effectivealtruism.org/posts/PPuojCCajtCWhJR4w/reinforcement-learning-a-non-technical-primer-on-o1-and|EA Forum - RL Primer]]))
  * **Group Relative Policy Optimization** (GRPO): Used by DeepSeek to stabilize RL training; it normalizes rewards within a group of sampled answers to the same prompt, removing the need for a separate critic model.

===== Inference-Time Compute Scaling =====

A key insight behind reasoning models is that compute can be shifted from **training time** to **inference time**. Rather than spending billions on pre-training, reasoning models spend more compute at inference by:

  * Generating multiple candidate reasoning chains
  * Self-evaluating and verifying each chain
  * Selecting the best answer through majority voting or verification

This makes the system's effective intelligence scalable at deployment time: harder problems get more thinking time.
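The generate-and-select loop described in this section can be sketched in a few lines. This is a minimal illustration of majority voting (self-consistency), not any lab's actual pipeline; ''sample_answer'' is a hypothetical stand-in for a model call that returns one reasoning chain's final answer.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the final answer that the most reasoning chains agree on."""
    return Counter(answers).most_common(1)[0][0]

def solve(sample_answer, n=16):
    """Self-consistency: spend extra inference compute by sampling n
    independent reasoning chains and voting over their final answers."""
    return majority_vote([sample_answer() for _ in range(n)])

# Stand-in for a model: a canned stream of per-chain final answers.
chains = iter(["42", "17", "42", "42", "9", "42"])
print(solve(lambda: next(chains), n=6))  # majority answer: 42
```

Raising ''n'' is exactly the "more thinking time" knob: harder problems get more sampled chains, at the cost of more inference compute per query.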
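The GRPO entry under RL Techniques can be illustrated with a short sketch of its core normalization step, as described in the DeepSeekMath paper: each sampled answer's advantage is its reward minus the group mean, divided by the group standard deviation. This sketch covers only that step; the full algorithm also includes a clipped policy-gradient objective and a KL penalty.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's
    reward against the mean and std of its own group, so no learned
    value network (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: uniform group has zero std
    return [(r - mean) / std for r in rewards]

# Verifiable rewards for 4 sampled answers to one prompt: 1 = correct, 0 = wrong.
print(grpo_advantages([1, 0, 1, 0]))  # -> [1.0, -1.0, 1.0, -1.0]
```

The group statistics come for free from the samples already drawn, which is one reason GRPO-style training is so much cheaper than RL setups that must also train a critic.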
((Source: [[https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1|Interconnects - DeepSeek R1]]))

===== Efficiency Comparison =====

The economics strongly favor post-training RL:

  * DeepSeek R1's RL training phases cost an estimated **$1M**, compared to tens of millions of dollars for pre-training the V3 base model.
  * RL delivered **comparable benchmark performance** to models trained with orders of magnitude more pre-training compute.
  * RL excels in **verifiable domains** (math, coding, logic), where correct answers provide clear reward signals. ((Source: [[https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1|Epoch AI - DeepSeek R1 Analysis]]))

===== Limitations =====

Post-training RL is not a universal solution:

  * It works best in domains with **verifiable outcomes** (math, code, logic), where rewards are unambiguous.
  * Creative, subjective, and open-ended tasks lack clear reward signals.
  * RL requires a **strong base model**: the technique amplifies existing capabilities rather than creating new ones from scratch.
  * Base model quality sets the ceiling; RL helps the model reach that ceiling efficiently.

===== Current Consensus =====

The 2025-2026 consensus is that the most capable systems combine **strong pre-trained bases** with **intensive post-training RL**. Pure scaling of pre-training is necessary but insufficient. The highest-performing models use pre-training for broad knowledge and RL for targeted reasoning skills, a hybrid approach that is both more capable and more economical than scaling alone. ((Source: [[https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1|Epoch AI - DeepSeek R1 Analysis]]))

===== See Also =====

  * [[reasoning_on_tap|Reasoning-on-Tap]]
  * [[inference_economics|Inference Economics]]
  * [[ai_self_verification|AI Self-Verification]]
  * [[lora_adapter|What Is a LoRA Adapter]]

===== References =====