Process Reward Models (PRMs) are reward functions that assign dense, step-level scores to intermediate reasoning steps in a multi-step trajectory, enabling fine-grained supervision for tasks like mathematical reasoning, planning, and agentic decision-making.
The fundamental distinction in reward modeling for reasoning:
| Aspect | Process Reward Model (PRM) | Outcome Reward Model (ORM) |
|---|---|---|
| Reward granularity | Step-wise (dense) | Terminal only (sparse) |
| Credit assignment | Fine-grained, chain-sensitive | Outcome-only, no step attribution |
| Training data | Per-step correctness labels | Final answer correctness only |
| Reward hacking | More robust (detects bad steps) | Vulnerable (right answer, wrong reasoning) |
| Use cases | Reasoning verification, search guidance | Simple pass/fail evaluation |
ORMs provide a single scalar reward based solely on the final result. PRMs evaluate each intermediate step, enabling early detection of reasoning errors and more informative learning signals.
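The contrast can be made concrete with a minimal sketch. The per-step labels here are illustrative stand-ins for model predictions, not a real PRM:

```python
# Contrast of reward signals: an ORM emits one terminal scalar for the whole
# trajectory, while a PRM emits one score per reasoning step.

def orm_reward(final_answer_correct: bool) -> float:
    """Outcome reward: a single sparse scalar based only on the final result."""
    return 1.0 if final_answer_correct else 0.0

def prm_rewards(steps_correct: list) -> list:
    """Process rewards: one dense score per intermediate step."""
    return [1.0 if ok else 0.0 for ok in steps_correct]

# A trajectory with a faulty middle step but a (luckily) correct final answer:
steps = [True, False, True]
orm = orm_reward(True)        # the ORM sees only the successful outcome
prm = prm_rewards(steps)      # the PRM exposes the flawed step
```

This is exactly the reward-hacking gap from the table: the ORM scores the lucky trajectory as a full success, while the PRM's per-step scores flag the bad intermediate step.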
PRMs decompose reward across a trajectory, enabling precise attribution of which steps contributed to success or failure. For a trajectory τ = (s_0, a_0, s_1, a_1, …, s_T, a_T), a PRM outputs a reward r(s_t, a_t) for each step t. This addresses the fundamental credit-assignment problem in multi-step reasoning.
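Given per-step scores, credit assignment becomes a simple lookup; a minimal sketch, with illustrative scores rather than real PRM outputs:

```python
# Step-level credit assignment: given per-step PRM rewards r(s_t, a_t),
# attribute a likely failure to the lowest-scoring step.

def assign_credit(step_rewards):
    """Return (total trajectory reward, index of the weakest step)."""
    worst_t = min(range(len(step_rewards)), key=lambda t: step_rewards[t])
    return sum(step_rewards), worst_t

rewards = [0.9, 0.8, 0.2, 0.7]        # step 2 most likely contains the error
total, worst = assign_credit(rewards)  # worst == 2
```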
Types of step-level reward:
Math-Shepherd trains PRMs for mathematical reasoning using binary step-correctness labels ("+"/"-" tokens), generated automatically by sampling completions from each step's prefix and checking whether they reach the correct final answer, rather than by human annotation. At inference, the resulting PRM scores each step of a candidate solution:
```python
# Simplified PRM scoring for math reasoning
def score_reasoning_steps(prm, question, steps):
    """Score each reasoning step with a Process Reward Model."""
    scores = []
    context = question
    for step in steps:
        context += "\n" + step
        score = prm.predict_correctness(context)
        scores.append(score)
    # Aggregate: minimum score identifies weakest step
    return scores, min(scores)
```
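The step labels that train such a PRM can be produced without human annotators, in the Monte Carlo style Math-Shepherd uses: a step is judged by whether completions sampled from its prefix reach the correct final answer. A sketch, where `sample_completion` and `check_answer` are hypothetical stand-ins for an LLM sampler and an answer checker:

```python
# Monte-Carlo-style automatic step labeling (Math-Shepherd flavor): estimate
# a step's quality from completions sampled after that step's prefix.

def label_step(prefix, gold_answer, sample_completion, check_answer, n=8):
    """Soft label: fraction of n sampled completions reaching the gold answer."""
    hits = sum(
        check_answer(sample_completion(prefix), gold_answer) for _ in range(n)
    )
    # The hard "+"/"-" scheme labels the step "+" iff hits > 0.
    return hits / n
```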
The PRM800K dataset (OpenAI, 2023) contains over 800,000 labeled reasoning steps from mathematical problem-solving traces. Each step is annotated as correct, incorrect, or neutral by human labelers. This dataset enabled the first large-scale training and evaluation of PRMs, demonstrating that step-level supervision significantly outperforms outcome-only supervision for guiding search in mathematical reasoning.
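Training on such data typically reduces to per-step binary classification; a minimal sketch of the objective, assuming correct/incorrect steps carry labels 1/0 and neutral steps are dropped from the loss (the function name and label encoding are illustrative):

```python
import math

# Step-level training objective for PRM800K-style data: mean binary
# cross-entropy over labeled steps, skipping "neutral" annotations.

def prm_step_loss(probs, labels):
    """probs: predicted per-step correctness; labels: 1, 0, or None (neutral)."""
    terms = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
        if y is not None          # neutral steps contribute no gradient
    ]
    return sum(terms) / len(terms)
```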
AgentPRM extends process reward models to agentic environments, providing process supervision for multi-turn planning and complex interaction trajectories. Key contributions:
The PRIME algorithm demonstrates that process rewards can be obtained from a model trained with only outcome labels (as ORMs are), then applied step-wise at inference time, without requiring expensive per-step annotations. Benefits include:
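The implicit per-step reward in this family of methods can be sketched as a scaled log-probability ratio between the trained policy and a frozen reference model; the log-probabilities below are illustrative numbers, not real model outputs, and the exact parameterization is an assumption of this sketch:

```python
# Implicit process reward sketch: r_t = beta * (log pi_theta - log pi_ref),
# computed per step from token/step log-probabilities of two models.

def implicit_step_rewards(policy_logps, ref_logps, beta=0.05):
    """Per-step rewards from policy vs. reference log-probabilities."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

# Steps where the policy assigns more probability than the reference
# receive positive implicit reward:
rewards = implicit_step_rewards([-1.0, -2.0], [-1.5, -2.0], beta=0.1)
```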
PRMs enhance agent planning and reasoning through several mechanisms:
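One widely used mechanism is PRM-guided best-of-N re-ranking; a minimal sketch, where `candidates` is a hypothetical mapping from sampled solutions to their per-step PRM scores:

```python
# PRM-guided best-of-N selection: re-rank sampled solutions by their weakest
# step (max-min aggregation), rather than by outcome alone.

def best_of_n(candidates):
    """Pick the candidate whose lowest-scoring step is highest."""
    return max(candidates, key=lambda sol: min(candidates[sol]))

# "a" has a stronger average but a very weak step; "b" wins under max-min:
picked = best_of_n({"a": [0.9, 0.2], "b": [0.7, 0.6]})
```

Min aggregation reflects that a single flawed step usually invalidates an entire chain of reasoning, which is why it is a common choice for PRM-based re-ranking.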