Reasoning Reward Models
Reasoning reward models evaluate the quality of AI-generated reasoning processes, providing training signals that guide language models toward sound, step-by-step reasoning rather than rewarding only correct final answers. They are a critical component in reinforcement learning from human feedback (RLHF) pipelines and increasingly in agentic AI systems where multi-step decision quality determines task success.
Outcome vs Process Reward Models
The two fundamental approaches to rewarding reasoning differ in what they evaluate:
Outcome Reward Models (ORMs):
Assign a single scalar reward $R(x, y)$ based solely on the final answer $y$ to prompt $x$
Optimize end-to-end accuracy via methods like RLHF
Computationally efficient — only one evaluation per trajectory
Risk rewarding flawed reasoning that happens to produce correct answers
Provide sparse reward signals that make credit assignment difficult
Cannot distinguish between lucky guesses and sound reasoning
Process Reward Models (PRMs):
Score each intermediate reasoning step independently: $r(s_t, a_t)$ for step $t$
Provide dense, granular feedback on the full reasoning trajectory
Directly supervise the reasoning process itself, rewarding correctness at each individual step
Produce interpretable outputs — evaluators can identify exactly where reasoning goes wrong
Consistently outperform ORMs in mathematical reasoning benchmarks (e.g., MATH dataset)
More expensive to train due to step-level annotation requirements
OpenAI's foundational research demonstrated that PRMs select correct solutions more effectively than ORMs across varying sample sizes on MATH problems, establishing PRMs as the preferred approach for reasoning-heavy tasks.
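To make the sparse-versus-dense contrast concrete, the following sketch shows how an ORM yields a single scalar that every step inherits, while a PRM returns one score per step, which is what enables step-level credit assignment. The orm and prm callables here are hypothetical placeholders, not a specific library API.

# Example: sparse (ORM) vs dense (PRM) reward signals for one trajectory
# `orm` and `prm` are assumed, hypothetical scoring callables.
from typing import Callable, List

def trajectory_rewards(
    problem: str,
    steps: List[str],
    orm: Callable[[str, str], float],              # (problem, final_answer) -> scalar
    prm: Callable[[str, List[str]], List[float]],  # (problem, steps) -> per-step scores
) -> dict:
    # For this sketch, treat the last step as carrying the final answer.
    final_answer = steps[-1]

    # ORM: one reward for the whole chain; every step receives the same signal,
    # so a lucky guess and a sound derivation are indistinguishable.
    outcome_reward = orm(problem, final_answer)
    orm_per_step = [outcome_reward] * len(steps)

    # PRM: one reward per step; a flawed intermediate step is penalized even
    # if the final answer happens to be correct.
    prm_per_step = prm(problem, steps)

    return {"orm": orm_per_step, "prm": prm_per_step}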
Combined Reward Signals
Recent research integrates outcome and process signals for more robust training. A combined reward function blends both signals:
$$R_{\text{combined}}(\tau) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{process}}(s_t, a_t)$$
where $\alpha \in [0,1]$ controls the balance between outcome and process supervision.
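A direct translation of this formula into code, assuming the per-step process rewards have already been computed, might look like the following minimal sketch.

# Example: blending outcome and process rewards per the combined formula above
from typing import List

def combined_reward(
    outcome_reward: float,         # R_outcome(tau), e.g., 1.0 if the final answer is correct
    process_rewards: List[float],  # r_process(s_t, a_t) for t = 1..T
    alpha: float = 0.5,            # balance between outcome and process supervision
) -> float:
    if not process_rewards:
        # Sketch-level fallback: with no scored steps, only the outcome term applies.
        return alpha * outcome_reward
    mean_process = sum(process_rewards) / len(process_rewards)
    return alpha * outcome_reward + (1.0 - alpha) * mean_process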
Hybrid ORM+PRM — Process rewards guide intermediate steps while outcome rewards ensure end-to-end correctness, combining the strengths of both approaches
Auto-generated process labels — Generate multiple completions per reasoning step; label a step as positive if any completion reaches the correct final answer, correlating strongly with step correctness
Monte Carlo estimation — Estimate step-level rewards by sampling $K$ trajectories from each intermediate state $s_t$ and measuring success rates: $\hat{r}(s_t) = \frac{1}{K}\sum_{k=1}^{K}\mathbf{1}[\text{trajectory}_k \text{ succeeds}]$ (see the sketch after this list)
ReasonRAG — In agentic retrieval-augmented generation, process-level rewards for query generation, evidence extraction, and answer synthesis combine with outcome rewards to enhance stability and efficiency
These combined approaches outperform pure outcome-based or pure process-based methods, achieving superior PRM@N scores (selecting correct solutions from $N$ candidates) while requiring less human annotation.
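The Monte Carlo estimator above can be sketched as follows. The rollout_policy and is_correct helpers are assumed, hypothetical functions: the first samples a completion from the current reasoning state, the second checks a final answer against a reference.

# Example: Monte Carlo estimation of a step-level reward r_hat(s_t)
from typing import Callable, List

def mc_step_reward(
    problem: str,
    prefix_steps: List[str],                         # reasoning steps up to and including s_t
    reference_answer: str,
    rollout_policy: Callable[[str, List[str]], str], # assumed: samples a completion from s_t
    is_correct: Callable[[str, str], bool],          # assumed: checks the final answer
    k: int = 16,                                     # number of sampled trajectories K
) -> float:
    successes = 0
    for _ in range(k):
        final_answer = rollout_policy(problem, prefix_steps)  # sample trajectory_k
        if is_correct(final_answer, reference_answer):
            successes += 1
    # Fraction of successful rollouts is the estimated step reward.
    return successes / k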
Training Methodologies
Training effective reasoning reward models involves several key techniques:
Data Generation:
Generate approximately 15 reasoning trajectories per problem
Sample approximately 16 completions per intermediate step
Label steps positively if any completion from that point reaches the correct answer
Use masking to predict binary rewards (positive/negative) in the context of prior steps
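A sketch of this labeling procedure is shown below. The sample_completions and is_correct helpers are assumed, hypothetical functions; the labeling rule itself (positive if any completion from that point reaches the correct answer) follows the description above.

# Example: auto-generating step-level labels for PRM training data
from typing import Callable, List, Tuple

def label_trajectory(
    problem: str,
    steps: List[str],
    reference_answer: str,
    sample_completions: Callable[[str, List[str], int], List[str]],  # assumed helper
    is_correct: Callable[[str, str], bool],                          # assumed helper
    n_completions: int = 16,   # completions sampled per intermediate step
) -> List[Tuple[str, int]]:
    """Return (context, label) pairs; label is 1 if any completion succeeds."""
    examples = []
    prefix: List[str] = []
    for step in steps:
        prefix.append(step)
        completions = sample_completions(problem, prefix, n_completions)
        label = int(any(is_correct(ans, reference_answer) for ans in completions))
        context = "Problem: " + problem + "\n" + "\n".join("Step: " + s for s in prefix)
        examples.append((context, label))
    return examples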
Training Approaches:
Train the PRM as a step-level binary classifier with a cross-entropy loss over each trajectory:
$$\mathcal{L}_{\text{step}}(\phi) = -\sum_{t=1}^{T} \left[y_t \log r_\phi(s_t) + (1-y_t)\log(1-r_\phi(s_t))\right]$$
where $y_t \in \{0, 1\}$ is the correctness label for step $t$ (a minimal implementation of this loss follows the list below)
Apply RewardTrainer or custom training loops with step-level loss functions
Use completion models to auto-generate high-quality supervision signals, reducing dependence on expensive human annotation
Bin solutions by step count for better supervision of varying-length reasoning chains
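The step-level loss above maps directly onto a standard binary cross-entropy call. The following is a minimal PyTorch sketch, assuming the PRM emits one raw logit per reasoning step so that $r_\phi(s_t)$ is the sigmoid of that logit.

# Example: step-level binary cross-entropy loss matching L_step(phi) above
import torch
import torch.nn.functional as F

def step_level_loss(step_logits: torch.Tensor, step_labels: torch.Tensor) -> torch.Tensor:
    """step_logits: (T,) raw scores, one per step; step_labels: (T,) values in {0, 1}."""
    # binary_cross_entropy_with_logits applies the sigmoid internally, so
    # r_phi(s_t) = sigmoid(step_logits[t]); reduction="sum" reproduces the sum over t.
    return F.binary_cross_entropy_with_logits(
        step_logits, step_labels.float(), reduction="sum"
    )

# Usage inside a training loop (model and optimizer assumed):
#   loss = step_level_loss(step_logits, step_labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()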
Scaling Laws:
Larger annotation datasets (e.g., 1M+ labeled steps) yield significantly better PRMs
Auto-generated labels from completion models produce PRMs that outperform Monte Carlo-based baselines
Signal quality scales with the diversity and capability of the completion model
Application to Agents
Reasoning reward models are increasingly central to agentic AI systems:
Agentic RAG — Process rewards optimize agent policies for autonomous search invocation, query formulation, evidence extraction, and answer synthesis, reducing computational costs and gradient conflicts compared to pure outcome-based RL
Trajectory evaluation — Agents use cumulative step-level reward scores to evaluate and revise multi-step action plans
Inference-time guidance — PRMs enable best-of-N sampling at inference time, selecting the trajectory $\tau^*$ with the highest quality: $\tau^* = \arg\max_{\tau \in \{\tau_1,\ldots,\tau_N\}} \min_{t} r(s_t, a_t)$
RLHF/PPO/DPO pipelines — Reward models provide the training signal for reinforcement learning fine-tuning of reasoning-capable models
Self-correction — Agents use step-level feedback to identify and backtrack from reasoning errors during execution
# Example: Process Reward Model for step-level evaluation
import torch


class ProcessRewardModel:
    def __init__(self, base_model, tokenizer):
        self.model = base_model    # Fine-tuned for step scoring (single-logit classification head)
        self.tokenizer = tokenizer

    def score_trajectory(self, problem, steps):
        """Score each reasoning step in a trajectory."""
        step_scores = []
        context = f"Problem: {problem}\n"
        for step in steps:
            context += f"Step: {step}\n"
            inputs = self.tokenizer(context, return_tensors="pt")
            with torch.no_grad():
                score = self.model(**inputs).logits  # Binary: correct/incorrect
            step_scores.append(score.item())
        return step_scores

    def select_best_trajectory(self, problem, trajectories):
        """Select the best reasoning chain from N candidates."""
        scored = []
        for trajectory in trajectories:
            scores = self.score_trajectory(problem, trajectory.steps)
            # Use minimum step score as trajectory quality
            # (one bad step invalidates the chain)
            min_score = min(scores)
            scored.append((min_score, trajectory))
        return max(scored, key=lambda x: x[0])[1]