Reasoning Reward Models
Reasoning reward models evaluate the quality of AI-generated reasoning processes, providing training signals that guide language models toward sound, step-by-step reasoning rather than rewarding only correct final answers. They are a critical component in reinforcement learning from human feedback (RLHF) pipelines and increasingly in agentic AI systems where multi-step decision quality determines task success.
Outcome vs Process Reward Models
The two fundamental approaches to rewarding reasoning differ in what they evaluate:
Outcome Reward Models (ORMs):
Assign a single scalar reward $R(x, y)$ based solely on the final answer $y$ to prompt $x$
Optimize end-to-end accuracy via methods like RLHF
Computationally efficient — only one evaluation per trajectory
Risk rewarding flawed reasoning that happens to produce correct answers
Provide sparse reward signals that make credit assignment difficult
Cannot distinguish between lucky guesses and sound reasoning
Process Reward Models (PRMs):
Score each intermediate reasoning step independently: $r(s_t, a_t)$ for step $t$
Provide dense, granular feedback on the full reasoning trajectory
Directly supervise the reasoning process itself, rewarding correctness at each individual step
Produce interpretable outputs — evaluators can identify exactly where reasoning goes wrong
Consistently outperform ORMs in mathematical reasoning benchmarks (e.g., MATH dataset)
More expensive to train due to step-level annotation requirements
OpenAI's foundational research demonstrated that PRMs select correct solutions more effectively than ORMs across varying sample sizes on MATH problems, establishing PRMs as the preferred approach for reasoning-heavy tasks.
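To make the sparse-versus-dense contrast concrete, the following sketch shows how an ORM yields a single scalar that every step inherits, while a PRM returns one score per step, which is what enables step-level credit assignment. The orm and prm callables here are hypothetical placeholders, not a specific library API.

# Example: sparse (ORM) vs dense (PRM) reward signals for one trajectory
# `orm` and `prm` are assumed, hypothetical scoring callables.
from typing import Callable, List

def trajectory_rewards(
    problem: str,
    steps: List[str],
    orm: Callable[[str, str], float],              # (problem, final_answer) -> scalar
    prm: Callable[[str, List[str]], List[float]],  # (problem, steps) -> per-step scores
) -> dict:
    # For this sketch, treat the last step as carrying the final answer.
    final_answer = steps[-1]

    # ORM: one reward for the whole chain; every step receives the same signal,
    # so a lucky guess and a sound derivation are indistinguishable.
    outcome_reward = orm(problem, final_answer)
    orm_per_step = [outcome_reward] * len(steps)

    # PRM: one reward per step; a flawed intermediate step is penalized even
    # if the final answer happens to be correct.
    prm_per_step = prm(problem, steps)

    return {"orm": orm_per_step, "prm": prm_per_step}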
Combined Reward Signals
Recent research integrates outcome and process signals for more robust training. A combined reward function blends both signals:
$$R_{\text{combined}}(\tau) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{process}}(s_t, a_t)$$
where $\alpha \in [0,1]$ controls the balance between outcome and process supervision.
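A direct translation of this formula into code, assuming the per-step process rewards have already been computed, might look like the following minimal sketch.

# Example: blending outcome and process rewards per the combined formula above
from typing import List

def combined_reward(
    outcome_reward: float,         # R_outcome(tau), e.g., 1.0 if the final answer is correct
    process_rewards: List[float],  # r_process(s_t, a_t) for t = 1..T
    alpha: float = 0.5,            # balance between outcome and process supervision
) -> float:
    if not process_rewards:
        # Sketch-level fallback: with no scored steps, only the outcome term applies.
        return alpha * outcome_reward
    mean_process = sum(process_rewards) / len(process_rewards)
    return alpha * outcome_reward + (1.0 - alpha) * mean_process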
Hybrid ORM+PRM — Process rewards guide intermediate steps while outcome rewards ensure end-to-end correctness, combining the strengths of both approaches
Auto-generated process labels — Generate multiple completions per reasoning step; label a step as positive if any completion reaches the correct final answer, correlating strongly with step correctness
Monte Carlo estimation — Estimate step-level rewards by sampling $K$ trajectories from each intermediate state $s_t$ and measuring success rates: $\hat{r}(s_t) = \frac{1}{K}\sum_{k=1}^{K}\mathbf{1}[\text{trajectory}_k \text{ succeeds}]$ (see the sketch after this list)
ReasonRAG — In agentic retrieval-augmented generation, process-level rewards for query generation, evidence extraction, and answer synthesis combine with outcome rewards to enhance stability and efficiency
These combined approaches outperform pure outcome-based or pure process-based methods, achieving superior PRM@N scores (selecting correct solutions from $N$ candidates) while requiring less human annotation.
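The Monte Carlo estimator above can be sketched as follows. The rollout_policy and is_correct helpers are assumed, hypothetical functions: the first samples a completion from the current reasoning state, the second checks a final answer against a reference.

# Example: Monte Carlo estimation of a step-level reward r_hat(s_t)
from typing import Callable, List

def mc_step_reward(
    problem: str,
    prefix_steps: List[str],                         # reasoning steps up to and including s_t
    reference_answer: str,
    rollout_policy: Callable[[str, List[str]], str], # assumed: samples a completion from s_t
    is_correct: Callable[[str, str], bool],          # assumed: checks the final answer
    k: int = 16,                                     # number of sampled trajectories K
) -> float:
    successes = 0
    for _ in range(k):
        final_answer = rollout_policy(problem, prefix_steps)  # sample trajectory_k
        if is_correct(final_answer, reference_answer):
            successes += 1
    # Fraction of successful rollouts is the estimated step reward.
    return successes / k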
Training Methodologies
Training effective reasoning reward models involves several key techniques:
Data Generation:
Generate approximately 15 reasoning trajectories per problem
Sample approximately 16 completions per intermediate step
Label steps positively if any completion from that point reaches the correct answer
Use masking to predict binary rewards (positive/negative) in the context of prior steps
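A sketch of this labeling procedure is shown below. The sample_completions and is_correct helpers are assumed, hypothetical functions; the labeling rule itself (positive if any completion from that point reaches the correct answer) follows the description above.

# Example: auto-generating step-level labels for PRM training data
from typing import Callable, List, Tuple

def label_trajectory(
    problem: str,
    steps: List[str],
    reference_answer: str,
    sample_completions: Callable[[str, List[str], int], List[str]],  # assumed helper
    is_correct: Callable[[str, str], bool],                          # assumed helper
    n_completions: int = 16,   # completions sampled per intermediate step
) -> List[Tuple[str, int]]:
    """Return (context, label) pairs; label is 1 if any completion succeeds."""
    examples = []
    prefix: List[str] = []
    for step in steps:
        prefix.append(step)
        completions = sample_completions(problem, prefix, n_completions)
        label = int(any(is_correct(ans, reference_answer) for ans in completions))
        context = "Problem: " + problem + "\n" + "\n".join("Step: " + s for s in prefix)
        examples.append((context, label))
    return examples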
Training Approaches:
Train the PRM as a step-level binary classifier with a cross-entropy loss over each trajectory:
$$\mathcal{L}_{\text{step}}(\phi) = -\sum_{t=1}^{T} \left[y_t \log r_\phi(s_t) + (1-y_t)\log(1-r_\phi(s_t))\right]$$
where $y_t \in \{0, 1\}$ is the correctness label for step $t$ (a minimal implementation of this loss follows the list below)
Apply RewardTrainer or custom training loops with step-level loss functions
Use completion models to auto-generate high-quality supervision signals, reducing dependence on expensive human annotation
Bin solutions by step count for better supervision of varying-length reasoning chains
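The step-level loss above maps directly onto a standard binary cross-entropy call. The following is a minimal PyTorch sketch, assuming the PRM emits one raw logit per reasoning step so that $r_\phi(s_t)$ is the sigmoid of that logit.

# Example: step-level binary cross-entropy loss matching L_step(phi) above
import torch
import torch.nn.functional as F

def step_level_loss(step_logits: torch.Tensor, step_labels: torch.Tensor) -> torch.Tensor:
    """step_logits: (T,) raw scores, one per step; step_labels: (T,) values in {0, 1}."""
    # binary_cross_entropy_with_logits applies the sigmoid internally, so
    # r_phi(s_t) = sigmoid(step_logits[t]); reduction="sum" reproduces the sum over t.
    return F.binary_cross_entropy_with_logits(
        step_logits, step_labels.float(), reduction="sum"
    )

# Usage inside a training loop (model and optimizer assumed):
#   loss = step_level_loss(step_logits, step_labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()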
Scaling Laws:
Larger annotation datasets (e.g., 1M+ labeled steps) yield significantly better PRMs
Auto-generated labels from completion models produce PRMs that outperform Monte Carlo-based baselines
Signal quality scales with the diversity and capability of the completion model
Application to Agents
Reasoning reward models are increasingly central to agentic AI systems:
Agentic RAG — Process rewards optimize agent policies for autonomous search invocation, query formulation, evidence extraction, and answer synthesis, reducing computational costs and gradient conflicts compared to pure outcome-based RL
Trajectory evaluation — Agents use cumulative step-level reward scores to evaluate and revise multi-step action plans
Inference-time guidance — PRMs enable best-of-N sampling at inference time, selecting the trajectory $\tau^*$ with the highest quality: $\tau^* = \arg\max_{\tau \in \{\tau_1,\ldots,\tau_N\}} \min_{t} r(s_t, a_t)$
RLHF/PPO/DPO pipelines — Reward models provide the training signal for reinforcement learning fine-tuning of reasoning-capable models
Self-correction — Agents use step-level feedback to identify and backtrack from reasoning errors during execution
# Example: Process Reward Model for step-level evaluation
import torch


class ProcessRewardModel:
    def __init__(self, base_model, tokenizer):
        self.model = base_model    # Fine-tuned for step scoring (single-logit classification head)
        self.tokenizer = tokenizer

    def score_trajectory(self, problem, steps):
        """Score each reasoning step in a trajectory."""
        step_scores = []
        context = f"Problem: {problem}\n"
        for step in steps:
            context += f"Step: {step}\n"
            inputs = self.tokenizer(context, return_tensors="pt")
            with torch.no_grad():
                score = self.model(**inputs).logits  # Binary: correct/incorrect
            step_scores.append(score.item())
        return step_scores

    def select_best_trajectory(self, problem, trajectories):
        """Select the best reasoning chain from N candidates."""
        scored = []
        for trajectory in trajectories:
            scores = self.score_trajectory(problem, trajectory.steps)
            # Use minimum step score as trajectory quality
            # (one bad step invalidates the chain)
            min_score = min(scores)
            scored.append((min_score, trajectory))
        return max(scored, key=lambda x: x[0])[1]