Reasoning reward models evaluate the quality of AI-generated reasoning processes, providing training signals that guide language models toward sound, step-by-step reasoning rather than rewarding only correct final answers. They are a critical component in reinforcement learning from human feedback (RLHF) pipelines and increasingly in agentic AI systems where multi-step decision quality determines task success.
The two fundamental approaches to rewarding reasoning differ in what they evaluate:
Outcome Reward Models (ORMs): assign a single score based on whether the final answer is correct, regardless of how it was reached.
Process Reward Models (PRMs): assign a score to each intermediate reasoning step, rewarding sound steps even when the final answer is wrong and penalizing flawed steps that happen to land on a correct answer.
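The difference is easiest to see in the shape of the supervision signal each model produces. A minimal sketch, where `score_answer` and `score_step` are hypothetical placeholders for learned scoring functions:

```python
def orm_score(trajectory, score_answer):
    """ORM-style supervision: one scalar, judged from the final answer only."""
    return score_answer(trajectory[-1])

def prm_scores(trajectory, score_step):
    """PRM-style supervision: one scalar per intermediate reasoning step."""
    return [score_step(step) for step in trajectory]
```

An ORM thus yields one training signal per trajectory, while a PRM yields one per step, which is what makes step-level credit assignment possible.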
OpenAI's foundational research ("Let's Verify Step by Step", Lightman et al., 2023) demonstrated that PRMs select correct solutions more effectively than ORMs across varying sample sizes on MATH problems, establishing PRMs as the preferred approach for reasoning-heavy tasks.
Recent research integrates outcome and process signals for more robust training. A combined reward function blends both signals:
$$R_{\text{combined}}(\tau) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{process}}(s_t, a_t)$$
where $\alpha \in [0,1]$ controls the balance between outcome and process supervision.
These combined approaches outperform pure outcome-based or pure process-based methods, achieving superior PRM@N scores (selecting correct solutions from $N$ candidates) while requiring less human annotation.
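The combined reward above reduces to a weighted average of the outcome signal and the mean per-step process signal. A minimal sketch (the function name and signature are illustrative, not from any particular library):

```python
def combined_reward(outcome_reward, step_rewards, alpha=0.5):
    """R_combined = alpha * R_outcome + (1 - alpha) * mean(r_process).

    outcome_reward: trajectory-level signal R_outcome(tau)
    step_rewards:   per-step signals r_process(s_t, a_t), t = 1..T
    alpha:          balance between outcome and process supervision
    """
    process_term = sum(step_rewards) / len(step_rewards)
    return alpha * outcome_reward + (1 - alpha) * process_term
```

With `alpha=1.0` this degenerates to pure outcome supervision, and with `alpha=0.0` to pure process supervision.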
Training effective reasoning reward models involves several key techniques:
Data Generation:
Training Approaches: step scoring is commonly framed as binary classification over step correctness, optimized with a step-level cross-entropy objective:
$$\mathcal{L}_{\text{step}}(\phi) = -\sum_{t=1}^{T} \left[y_t \log r_\phi(s_t) + (1-y_t)\log(1-r_\phi(s_t))\right]$$
where $y_t \in \{0, 1\}$ is the correctness label for step $t$ and $r_\phi(s_t)$ is the model's predicted probability that step $t$ is correct.
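A minimal sketch of this objective for a single trajectory, in plain Python for clarity (the function name is illustrative):

```python
import math

def step_classification_loss(step_scores, step_labels):
    """Step-level binary cross-entropy L_step(phi).

    step_scores: predicted probabilities r_phi(s_t), each in (0, 1)
    step_labels: correctness labels y_t in {0, 1}
    """
    loss = 0.0
    for r, y in zip(step_scores, step_labels):
        loss -= y * math.log(r) + (1 - y) * math.log(1 - r)
    return loss
```

In practice each $r_\phi(s_t)$ would come from a reward-model head over the trajectory prefix, and the loss would be averaged over a batch of labeled trajectories.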
Scaling Laws:
Reasoning reward models are increasingly central to agentic AI systems:
```python
# Example: Process Reward Model for step-level evaluation
class ProcessRewardModel:
    def __init__(self, base_model, tokenizer):
        self.model = base_model  # Fine-tuned for step scoring
        self.tokenizer = tokenizer

    def score_trajectory(self, problem, steps):
        """Score each reasoning step in a trajectory."""
        step_scores = []
        context = f"Problem: {problem}\n"
        for step in steps:
            context += f"Step: {step}\n"
            inputs = self.tokenizer(context, return_tensors="pt")
            score = self.model(**inputs).logits  # Binary: correct/incorrect
            step_scores.append(score.item())
        return step_scores

    def select_best_trajectory(self, problem, trajectories):
        """Select the best reasoning chain from N candidates."""
        scored = []
        for trajectory in trajectories:
            scores = self.score_trajectory(problem, trajectory.steps)
            # Use minimum step score as trajectory quality
            # (one bad step invalidates the chain)
            min_score = min(scores)
            scored.append((min_score, trajectory))
        return max(scored, key=lambda x: x[0])[1]
```