Reasoning Reward Models

Reasoning reward models evaluate the quality of AI-generated reasoning processes, providing training signals that guide language models toward sound, step-by-step reasoning rather than rewarding only correct final answers. They are a critical component in reinforcement learning from human feedback (RLHF) pipelines and increasingly in agentic AI systems where multi-step decision quality determines task success.

Outcome vs Process Reward Models

The two fundamental approaches to rewarding reasoning differ in what they evaluate:

Outcome Reward Models (ORMs): assign a single scalar reward based only on the final answer. Labels are cheap to obtain (final-answer correctness is often automatically checkable), but the signal is sparse and can reward flawed reasoning that happens to arrive at a correct answer.

Process Reward Models (PRMs): score each intermediate reasoning step, providing dense, fine-grained feedback. They penalize incorrect steps even when the final answer is right, but step-level labels are more expensive to collect.

OpenAI's foundational research demonstrated that PRMs select correct solutions more effectively than ORMs across varying sample sizes on MATH problems, establishing PRMs as the preferred approach for reasoning-heavy tasks.

Combined Reward Signals

Recent research integrates outcome and process signals for more robust training. A combined reward function blends both signals:

$$R_{\text{combined}}(\tau) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{process}}(s_t, a_t)$$

where $\alpha \in [0,1]$ controls the balance between outcome and process supervision.

These combined approaches outperform pure outcome-based or pure process-based methods, achieving superior PRM@N scores (selecting correct solutions from $N$ candidates) while requiring less human annotation.
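The blended reward above can be sketched in a few lines; a minimal version, assuming per-step process scores are already available (the function name and default `alpha` are illustrative, not from a specific paper):

```python
import statistics

def combined_reward(outcome_reward, process_scores, alpha=0.5):
    """Blend a trajectory-level outcome reward with the mean step-level process reward.

    alpha=1.0 recovers pure outcome supervision; alpha=0.0 pure process supervision.
    """
    if not process_scores:
        # No intermediate steps to score: fall back to the outcome term alone.
        return alpha * outcome_reward
    return alpha * outcome_reward + (1 - alpha) * statistics.fmean(process_scores)
```

For example, `combined_reward(1.0, [0.8, 0.6], alpha=0.5)` averages the step scores to 0.7 and blends it equally with the outcome reward of 1.0, yielding 0.85.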

Training Methodologies

Training effective reasoning reward models involves several key techniques:

Data Generation: step-level labels come from human annotators or are estimated automatically, e.g. via Monte Carlo rollouts that continue the solution from each step and use the fraction of completions reaching a correct final answer as a soft correctness label.

Training Approaches: PRMs are typically trained as step-level binary classifiers, minimizing a cross-entropy loss over per-step correctness labels:

$$\mathcal{L}_{\text{step}}(\phi) = -\sum_{t=1}^{T} \left[y_t \log r_\phi(s_t) + (1-y_t)\log(1-r_\phi(s_t))\right]$$

where $y_t \in \{0, 1\}$ is the correctness label for step $t$ and $r_\phi(s_t)$ is the model's predicted probability that step $t$ is correct.
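The step-level loss above is ordinary binary cross-entropy summed over steps; a plain-Python sketch (the function name is illustrative):

```python
import math

def step_level_loss(step_probs, step_labels):
    """Binary cross-entropy over reasoning steps.

    step_probs[t]  = r_phi(s_t), predicted probability that step t is correct
    step_labels[t] = y_t in {0, 1}, ground-truth correctness label
    """
    eps = 1e-12  # clamp probabilities to avoid log(0)
    loss = 0.0
    for p, y in zip(step_probs, step_labels):
        p = min(max(p, eps), 1 - eps)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss
```

A confident correct prediction (`p` near the label) contributes almost nothing, while a confident wrong prediction contributes a large penalty, exactly as in the formula.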

Scaling Laws: empirically, PRM accuracy tends to improve with larger base models and more step-labeled data, which makes annotation cost the main bottleneck for scaling process supervision.

Application to Agents

Reasoning reward models are increasingly central to agentic AI systems:

# Example: Process Reward Model for step-level evaluation
import torch
import torch.nn.functional as F

class ProcessRewardModel:
    def __init__(self, base_model, tokenizer):
        self.model = base_model   # Sequence classifier fine-tuned for step scoring
        self.tokenizer = tokenizer

    def score_trajectory(self, problem, steps):
        """Score each reasoning step in a trajectory."""
        step_scores = []
        context = f"Problem: {problem}\n"
        for step in steps:
            context += f"Step: {step}\n"
            inputs = self.tokenizer(context, return_tensors="pt")
            with torch.no_grad():
                logits = self.model(**inputs).logits  # Shape (1, 2): [incorrect, correct]
            prob_correct = F.softmax(logits, dim=-1)[0, 1]
            step_scores.append(prob_correct.item())
        return step_scores

    def select_best_trajectory(self, problem, trajectories):
        """Select the best reasoning chain from N candidates (PRM@N)."""
        best_score, best_trajectory = float("-inf"), None
        for trajectory in trajectories:
            scores = self.score_trajectory(problem, trajectory.steps)
            # Use the minimum step score as trajectory quality:
            # one bad step invalidates the whole chain.
            score = min(scores) if scores else float("-inf")
            if score > best_score:
                best_score, best_trajectory = score, trajectory
        return best_trajectory
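The min over step scores is only one way to collapse per-step scores into a trajectory score; mean and product aggregation are also common choices. A standalone sketch of the alternatives (the function name is illustrative):

```python
def aggregate_trajectory_score(step_scores, method="min"):
    """Collapse per-step PRM scores into one trajectory-level score.

    "min"  : one bad step sinks the chain (conservative)
    "mean" : average quality, tolerant of a single weak step
    "prod" : treats scores as independent step-correctness probabilities
    """
    if method == "min":
        return min(step_scores)
    if method == "mean":
        return sum(step_scores) / len(step_scores)
    if method == "prod":
        out = 1.0
        for s in step_scores:
            out *= s
        return out
    raise ValueError(f"unknown aggregation method: {method}")
```

Min and product both punish a single bad step harshly, while mean is forgiving; which works best depends on how reliably the PRM scores individual steps.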
