====== Post-Training RL vs Model Scaling ======

The AI field is engaged in a fundamental debate: should capability improvements come from **scaling up pre-training** (bigger models, more data, more compute) or from **post-training reinforcement learning** (RL techniques applied after pre-training to teach reasoning and alignment)? Evidence from 2024-2026 increasingly favors post-training RL as a more compute-efficient path to advanced capabilities. ((Source: [[https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1|Epoch AI - What Went Into Training DeepSeek R1]]))

===== The Scaling Laws Debate =====

**Scaling laws** (Kaplan et al. 2020; Chinchilla, 2022) established that LLM performance improves predictably with more parameters, training data, and compute. This drove the race to build ever-larger models such as GPT-4, Gemini Ultra, and Llama 3.1 405B.

However, by 2024-2025, signs of **diminishing returns** emerged. Each doubling of pre-training compute produced smaller gains, and the cost of frontier pre-training runs reached hundreds of millions of dollars, with some estimates for next-generation runs exceeding a billion. Raw scaling was hitting economic and physical limits. ((Source: [[https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1|Epoch AI - DeepSeek R1 Analysis]]))

===== The Rise of Post-Training RL =====

Post-training RL applies reinforcement learning **after** the base model is pre-trained, teaching it to reason through problems step by step rather than simply predicting the next token. Key milestones:

  * **OpenAI o1** (September 2024): Demonstrated that RL with chain-of-thought reasoning could achieve breakthroughs on math, science, and coding benchmarks that pure scaling had not reached.
((Source: [[https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1|Interconnects - DeepSeek R1 Recipe]]))
  * **DeepSeek R1** (January 2025): Matched o1 on major benchmarks (79.8% vs 79.2% on the AIME math competition) at a fraction of the training cost, proving that RL-based reasoning was reproducible and efficient. ((Source: [[https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1|Interconnects - DeepSeek R1]]))
  * **OpenAI o3** (2025): Extended the reasoning-model paradigm with further RL refinements.

===== RL Techniques =====

Several RL approaches are used in post-training:

  * **RLHF** (RL from Human Feedback): Human evaluators rank model outputs; a reward model trained on these rankings guides RL optimization.
  * **RLAIF** (RL from AI Feedback): Another AI model provides the rankings, scaling the process beyond human capacity.
  * **RL from Verifiable Rewards**: The model is rewarded for objectively correct answers (math proofs, passing test cases), eliminating the need for human or AI judges. ((Source: [[https://forum.effectivealtruism.org/posts/PPuojCCajtCWhJR4w/reinforcement-learning-a-non-technical-primer-on-o1-and|EA Forum - RL Primer]]))
  * **Group Relative Policy Optimization** (GRPO): Used by DeepSeek to stabilize RL training; it normalizes rewards within a group of sampled answers to the same prompt, removing the need for a separate critic model.

===== Inference-Time Compute Scaling =====

A key insight behind reasoning models is that compute can be shifted from **training time** to **inference time**. Rather than spending billions on pre-training, reasoning models spend more compute at inference by:

  * Generating multiple candidate reasoning chains
  * Self-evaluating and verifying each chain
  * Selecting the best answer through majority voting or verification

This makes the system's effective intelligence scalable at deployment time: harder problems get more thinking time.
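The generate-and-select loop described in this section can be sketched in a few lines. This is a minimal illustration of majority voting (self-consistency), not any lab's actual pipeline; ''sample_answer'' is a hypothetical stand-in for a model call that returns one reasoning chain's final answer.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the final answer that the most reasoning chains agree on."""
    return Counter(answers).most_common(1)[0][0]

def solve(sample_answer, n=16):
    """Self-consistency: spend extra inference compute by sampling n
    independent reasoning chains and voting over their final answers."""
    return majority_vote([sample_answer() for _ in range(n)])

# Stand-in for a model: a canned stream of per-chain final answers.
chains = iter(["42", "17", "42", "42", "9", "42"])
print(solve(lambda: next(chains), n=6))  # majority answer: 42
```

Raising ''n'' is exactly the "more thinking time" knob: harder problems get more sampled chains, at the cost of more inference compute per query.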
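The GRPO entry under RL Techniques can be illustrated with a short sketch of its core normalization step, as described in the DeepSeekMath paper: each sampled answer's advantage is its reward minus the group mean, divided by the group standard deviation. This sketch covers only that step; the full algorithm also includes a clipped policy-gradient objective and a KL penalty.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's
    reward against the mean and std of its own group, so no learned
    value network (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: uniform group has zero std
    return [(r - mean) / std for r in rewards]

# Verifiable rewards for 4 sampled answers to one prompt: 1 = correct, 0 = wrong.
print(grpo_advantages([1, 0, 1, 0]))  # -> [1.0, -1.0, 1.0, -1.0]
```

The group statistics come for free from the samples already drawn, which is one reason GRPO-style training is so much cheaper than RL setups that must also train a critic.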
((Source: [[https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1|Interconnects - DeepSeek R1]]))

===== Efficiency Comparison =====

The economics strongly favor post-training RL:

  * DeepSeek R1's RL training phases cost an estimated **$1M**, compared to tens of millions of dollars for pre-training the V3 base model.
  * RL delivered **comparable benchmark performance** to models trained with orders of magnitude more pre-training compute.
  * RL excels in **verifiable domains** (math, coding, logic), where correct answers provide clear reward signals. ((Source: [[https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1|Epoch AI - DeepSeek R1 Analysis]]))

===== Limitations =====

Post-training RL is not a universal solution:

  * It works best in domains with **verifiable outcomes** (math, code, logic), where rewards are unambiguous.
  * Creative, subjective, and open-ended tasks lack clear reward signals.
  * RL requires a **strong base model**: the technique amplifies existing capabilities rather than creating new ones from scratch.
  * Base model quality sets the ceiling; RL helps the model reach that ceiling efficiently.

===== Current Consensus =====

The 2025-2026 consensus is that the most capable systems combine **strong pre-trained bases** with **intensive post-training RL**. Pure scaling of pre-training is necessary but insufficient. The highest-performing models use pre-training for broad knowledge and RL for targeted reasoning skills, a hybrid approach that is both more capable and more economical than scaling alone. ((Source: [[https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1|Epoch AI - DeepSeek R1 Analysis]]))

===== See Also =====

  * [[reasoning_on_tap|Reasoning-on-Tap]]
  * [[inference_economics|Inference Economics]]
  * [[ai_self_verification|AI Self-Verification]]
  * [[lora_adapter|What Is a LoRA Adapter]]

===== References =====