Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
The AI field is engaged in a fundamental debate: should capability improvements come from scaling up pre-training (bigger models, more data, more compute) or from post-training reinforcement learning (RL techniques applied after pre-training to teach reasoning and alignment)? Evidence from 2024-2026 increasingly favors post-training RL as a more compute-efficient path to advanced capabilities. 1)
Scaling laws (Kaplan et al. 2020, Chinchilla 2022) established that LLM performance improves predictably with more parameters, training data, and compute. This drove the race to build ever-larger models — GPT-4, Gemini Ultra, Llama 3.1 405B.
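The "predictable improvement" claim can be made concrete with the Chinchilla parametric loss fit, L(N, D) = E + A/N^α + B/D^β. The constants below are approximately the fitted values reported by Hoffmann et al. (2022); treat the exact numbers, and the model sizes in the example, as illustrative:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss from the Chinchilla parametric fit:
    L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and
    D is training tokens. Constants are approximate published values."""
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling model and data keeps helping, but the reducible loss terms
# shrink as power laws while E (the irreducible floor) never moves:
small = chinchilla_loss(70e9, 1.4e12)   # roughly Chinchilla-scale: 70B params, 1.4T tokens
big = chinchilla_loss(400e9, 8e12)      # a hypothetical much larger run
```

The power-law exponents (0.34, 0.28) are why each doubling of compute buys a smaller absolute loss improvement than the last.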
However, by 2024-2025, signs of diminishing returns emerged. Each doubling of pre-training compute produced smaller gains. The cost of frontier pre-training runs reached hundreds of millions of dollars, with some estimates for next-generation runs exceeding a billion. The raw scaling approach was hitting economic and physical limits. 2)
Post-training RL applies reinforcement learning after the base model is pre-trained, teaching it to reason through problems step by step rather than simply predicting the next token.
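A common way to generate the RL signal is to score only the *outcome* of a sampled reasoning trace against a checkable answer, letting the policy-gradient update shape the intermediate steps indirectly. A minimal sketch, assuming a hypothetical `####`-delimited answer format (the function and format here are illustrative, not any specific lab's API):

```python
import re

def outcome_reward(completion: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the completion's final answer
    (after a '####' marker) matches the ground truth, else 0.0.
    Only the outcome is scored, not the reasoning steps themselves."""
    match = re.search(r"####\s*(.+?)\s*$", completion)
    answer = match.group(1) if match else ""
    return 1.0 if answer == ground_truth else 0.0

# A training loop (not shown) would sample many completions per prompt,
# score each with outcome_reward, and reinforce high-reward traces.
```

Because the reward is automatically checkable (math answers, passing unit tests), no human labeling is needed per example, which is part of the compute-efficiency story.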
Key milestones:
Several RL approaches are used in post-training:
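As one illustration of this family, group-relative methods (GRPO-style; whether this matches the approaches intended above is an assumption) replace the learned value critic with a statistic computed over a group of completions sampled for the same prompt:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage estimate: standardize each sampled
    completion's reward against its own sampling group's mean and
    std, avoiding the need for a separate learned value network."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:          # every completion scored the same: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Completions that beat their group average get positive advantage and are reinforced; below-average ones are suppressed.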
A key insight behind reasoning models is that compute can be shifted from training time to inference time. Rather than concentrating all spending in ever-larger pre-training runs, reasoning models allocate additional compute at inference by:
This makes the system's effective intelligence scalable at deployment time — harder problems get more thinking time. 6)
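One simple, well-known instance of this dial is self-consistency: sample n independent reasoning paths and return the majority final answer, choosing a larger n for harder problems. A minimal sketch, where `sample_answer` is a hypothetical stand-in for a model call:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n: int) -> str:
    """Spend more inference compute by drawing n independent answers
    (each from its own sampled reasoning path) and returning the
    majority vote. n is the deployment-time 'thinking budget'."""
    votes = Counter(sample_answer() for _ in range(n))
    return votes.most_common(1)[0][0]
```

The same mechanism underlies best-of-n selection with a verifier in place of the vote; either way, quality scales with inference compute rather than model size.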
The economics strongly favor post-training RL:
Post-training RL is not a universal solution:
The 2025-2026 consensus is that the most capable systems combine strong pre-trained bases with intensive post-training RL. Pure scaling of pre-training is necessary but insufficient. The highest-performing models use pre-training for broad knowledge and RL for targeted reasoning skills — a hybrid approach that is both more capable and more economical than scaling alone. 8)