Process Reward Models (PRMs) are reward functions that assign dense, step-level scores to intermediate reasoning steps in a multi-step trajectory, enabling fine-grained supervision for tasks like mathematical reasoning, planning, and agentic decision-making.
The fundamental distinction in reward modeling for reasoning:
| Aspect | Process Reward Model (PRM) | Outcome Reward Model (ORM) |
|---|---|---|
| Reward granularity | Step-wise (dense) | Terminal only (sparse) |
| Credit assignment | Fine-grained, chain-sensitive | Outcome-only, no step attribution |
| Training data | Per-step correctness labels | Final answer correctness only |
| Reward hacking | More robust (detects bad steps) | Vulnerable (right answer, wrong reasoning) |
| Use cases | Reasoning verification, search guidance | Simple pass/fail evaluation |
ORMs provide a single scalar reward based solely on the final result. PRMs evaluate each intermediate step, enabling early detection of reasoning errors and more informative learning signals.
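The contrast can be made concrete with a minimal sketch. The per-step labels here are illustrative stand-ins for model predictions, not a real PRM:

```python
# Contrast of reward signals: an ORM emits one terminal scalar for the whole
# trajectory, while a PRM emits one score per reasoning step.

def orm_reward(final_answer_correct: bool) -> float:
    """Outcome reward: a single sparse scalar based only on the final result."""
    return 1.0 if final_answer_correct else 0.0

def prm_rewards(steps_correct: list) -> list:
    """Process rewards: one dense score per intermediate step."""
    return [1.0 if ok else 0.0 for ok in steps_correct]

# A trajectory with a faulty middle step but a (luckily) correct final answer:
steps = [True, False, True]
orm = orm_reward(True)        # the ORM sees only the successful outcome
prm = prm_rewards(steps)      # the PRM exposes the flawed step
```

This is exactly the reward-hacking gap from the table: the ORM scores the lucky trajectory as a full success, while the PRM's per-step scores flag the bad intermediate step.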
PRMs decompose reward across a trajectory, enabling precise attribution of which steps contributed to success or failure. For a trajectory τ = (s_0, a_0, s_1, a_1, …, s_T, a_T), a PRM outputs a reward r(s_t, a_t) for each step t. This addresses the fundamental credit-assignment problem in multi-step reasoning.
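Given per-step scores, credit assignment becomes a simple lookup; a minimal sketch, with illustrative scores rather than real PRM outputs:

```python
# Step-level credit assignment: given per-step PRM rewards r(s_t, a_t),
# attribute a likely failure to the lowest-scoring step.

def assign_credit(step_rewards):
    """Return (total trajectory reward, index of the weakest step)."""
    worst_t = min(range(len(step_rewards)), key=lambda t: step_rewards[t])
    return sum(step_rewards), worst_t

rewards = [0.9, 0.8, 0.2, 0.7]        # step 2 most likely contains the error
total, worst = assign_credit(rewards)  # worst == 2
```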
Types of step-level reward:
Math-Shepherd trains PRMs for mathematical reasoning using binary step-correctness labels ("+"/"-" tokens), generated automatically by sampling completions from each step's prefix and checking whether they reach the correct final answer, rather than by human annotation. At inference, the resulting PRM scores each step of a candidate solution:
```python
# Simplified PRM scoring for math reasoning
def score_reasoning_steps(prm, question, steps):
    """Score each reasoning step with a Process Reward Model."""
    scores = []
    context = question
    for step in steps:
        context += "\n" + step
        score = prm.predict_correctness(context)
        scores.append(score)
    # Aggregate: minimum score identifies weakest step
    return scores, min(scores)
```
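The step labels that train such a PRM can be produced without human annotators, in the Monte Carlo style Math-Shepherd uses: a step is judged by whether completions sampled from its prefix reach the correct final answer. A sketch, where `sample_completion` and `check_answer` are hypothetical stand-ins for an LLM sampler and an answer checker:

```python
# Monte-Carlo-style automatic step labeling (Math-Shepherd flavor): estimate
# a step's quality from completions sampled after that step's prefix.

def label_step(prefix, gold_answer, sample_completion, check_answer, n=8):
    """Soft label: fraction of n sampled completions reaching the gold answer."""
    hits = sum(
        check_answer(sample_completion(prefix), gold_answer) for _ in range(n)
    )
    # The hard "+"/"-" scheme labels the step "+" iff hits > 0.
    return hits / n
```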
The PRM800K dataset (OpenAI, 2023) contains over 800,000 labeled reasoning steps from mathematical problem-solving traces. Each step is annotated as correct, incorrect, or neutral by human labelers. This dataset enabled the first large-scale training and evaluation of PRMs, demonstrating that step-level supervision significantly outperforms outcome-only supervision for guiding search in mathematical reasoning.
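Training on such data typically reduces to per-step binary classification; a minimal sketch of the objective, assuming correct/incorrect steps carry labels 1/0 and neutral steps are dropped from the loss (the function name and label encoding are illustrative):

```python
import math

# Step-level training objective for PRM800K-style data: mean binary
# cross-entropy over labeled steps, skipping "neutral" annotations.

def prm_step_loss(probs, labels):
    """probs: predicted per-step correctness; labels: 1, 0, or None (neutral)."""
    terms = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
        if y is not None          # neutral steps contribute no gradient
    ]
    return sum(terms) / len(terms)
```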
AgentPRM extends process reward models to agentic environments, providing process supervision for multi-turn planning and complex interaction trajectories. Key contributions:
The PRIME algorithm demonstrates that process rewards can be obtained from a model trained with only outcome labels (as ORMs are), then applied step-wise at inference time, without requiring expensive per-step annotations. Benefits include:
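The implicit per-step reward in this family of methods can be sketched as a scaled log-probability ratio between the trained policy and a frozen reference model; the log-probabilities below are illustrative numbers, not real model outputs, and the exact parameterization is an assumption of this sketch:

```python
# Implicit process reward sketch: r_t = beta * (log pi_theta - log pi_ref),
# computed per step from token/step log-probabilities of two models.

def implicit_step_rewards(policy_logps, ref_logps, beta=0.05):
    """Per-step rewards from policy vs. reference log-probabilities."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

# Steps where the policy assigns more probability than the reference
# receive positive implicit reward:
rewards = implicit_step_rewards([-1.0, -2.0], [-1.5, -2.0], beta=0.1)
```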
PRMs enhance agent planning and reasoning through several mechanisms:
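One widely used mechanism is PRM-guided best-of-N re-ranking; a minimal sketch, where `candidates` is a hypothetical mapping from sampled solutions to their per-step PRM scores:

```python
# PRM-guided best-of-N selection: re-rank sampled solutions by their weakest
# step (max-min aggregation), rather than by outcome alone.

def best_of_n(candidates):
    """Pick the candidate whose lowest-scoring step is highest."""
    return max(candidates, key=lambda sol: min(candidates[sol]))

# "a" has a stronger average but a very weak step; "b" wins under max-min:
picked = best_of_n({"a": [0.9, 0.2], "b": [0.7, 0.6]})
```

Min aggregation reflects that a single flawed step usually invalidates an entire chain of reasoning, which is why it is a common choice for PRM-based re-ranking.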