====== Process Reward Models ======

**Process Reward Models (PRMs)** are reward functions that assign dense, step-level scores to intermediate reasoning steps in a multi-step trajectory, enabling fine-grained supervision for tasks such as mathematical reasoning, planning, and agentic decision-making.

===== PRM vs ORM =====

The fundamental distinction in reward modeling for reasoning:

^ Aspect ^ Process Reward Model (PRM) ^ Outcome Reward Model (ORM) ^
| Reward granularity | Step-wise (dense) | Terminal only (sparse) |
| Credit assignment | Fine-grained, chain-sensitive | Outcome-only, no step attribution |
| Training data | Per-step correctness labels | Final-answer correctness only |
| Reward hacking | More robust (detects bad steps) | Vulnerable (right answer, wrong reasoning) |
| Use cases | Reasoning verification, search guidance | Simple pass/fail evaluation |

ORMs provide a single scalar reward $R(\tau)$ based solely on the final result. PRMs evaluate each intermediate step, enabling early detection of reasoning errors and more informative learning signals.

===== Step-Level Credit Assignment =====

PRMs decompose rewards across trajectories, enabling precise attribution of which steps contributed to success or failure. For a trajectory $\tau = (s_0, a_0, s_1, \ldots, s_T, a_T)$, a PRM outputs a reward $r(s_t, a_t)$ for each step $t$. This addresses the fundamental **credit assignment problem** in multi-step reasoning.

The total trajectory reward under a PRM can be aggregated as:

$$R_{\text{PRM}}(\tau) = \text{agg}\!\left(\{r(s_t, a_t)\}_{t=0}^{T}\right)$$

where $\text{agg}$ is a sum, mean, or min operation, depending on the application.
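The aggregation choices can be sketched in a few lines of Python. This is an illustrative helper, not part of any named library; the function name and interface are hypothetical.

```python
# Illustrative sketch of trajectory-level aggregation of step rewards.
# `aggregate_step_rewards` is a hypothetical helper, not a library API.

def aggregate_step_rewards(step_rewards, mode="min"):
    """Combine per-step PRM scores into a single trajectory reward.

    mode="sum"  -> total credit accumulated across steps
    mode="mean" -> length-normalized score (avoids favoring long chains)
    mode="min"  -> trajectory is only as strong as its weakest step
    """
    if mode == "sum":
        return sum(step_rewards)
    if mode == "mean":
        return sum(step_rewards) / len(step_rewards)
    if mode == "min":
        return min(step_rewards)
    raise ValueError(f"unknown aggregation mode: {mode}")
```

The min aggregation is common for verification, since a single flawed step invalidates an otherwise correct-looking chain.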
Types of step-level reward:

* **Discriminative PRMs**: classify each step as correct/incorrect, outputting $r(s_t, a_t) \in \{0, 1\}$
* **Generative PRMs**: generate natural-language critiques of each step
* **Implicit PRMs**: derive step rewards without explicit labels (e.g., via self-consistency)
* **Trajectory-level aggregation**: combine step scores via sum, mean, or min operations

===== Math-Shepherd =====

Math-Shepherd trains PRMs for mathematical reasoning using binary step-correctness labels ("+"/"-" tokens). The approach:

  - Annotate intermediate reasoning steps with correctness labels, obtained automatically via completion rollouts rather than human annotation
  - Train the PRM with masked prediction, forcing binary classification at each step
  - Use special-token handling for reward computation
  - Deploy as a deterministic verifier (distinct from value models that estimate future success)

```python
# Simplified PRM scoring for math reasoning.
# `prm.predict_correctness` is a schematic interface standing in for a
# trained step-level classifier.
def score_reasoning_steps(prm, question, steps):
    """Score each reasoning step with a Process Reward Model."""
    scores = []
    context = question
    for step in steps:
        context += "\n" + step
        score = prm.predict_correctness(context)
        scores.append(score)
    # Aggregate: the minimum score identifies the weakest step
    return scores, min(scores)
```

===== PRM800K Dataset =====

The **PRM800K** dataset (OpenAI, 2023) contains over 800,000 labeled reasoning steps from mathematical problem-solving traces. Each step is annotated as correct, incorrect, or neutral by human labelers. This dataset enabled the first large-scale training and evaluation of PRMs, demonstrating that step-level supervision significantly outperforms outcome-only supervision for guiding search in mathematical reasoning.

===== Monte Carlo Estimation of Step Rewards =====

When per-step human labels are unavailable, step-level rewards can be estimated via Monte Carlo rollouts.
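A rollout-based estimator of this kind can be sketched as follows. The callables `sample_completion` and `is_correct` are hypothetical stand-ins for the policy model and the answer checker; they are not from any specific library.

```python
import random

def mc_step_reward(sample_completion, is_correct, prefix, K=8, seed=0):
    """Monte Carlo estimate of a step-level reward.

    Samples K completions from the partial trajectory `prefix` and
    returns the fraction that reach a correct final answer.
    `sample_completion(prefix, rng)` and `is_correct(completion)` are
    hypothetical callables for the policy model and answer checker.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(K):
        completion = sample_completion(prefix, rng)
        hits += int(is_correct(completion))
    return hits / K
```

Larger `K` reduces the variance of the estimate at the cost of more rollouts per step.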
For step $t$ in a trajectory, sample $K$ completions from step $t$ onward and estimate:

$$\hat{r}(s_t, a_t) = \frac{1}{K}\sum_{k=1}^{K} \mathbf{1}\!\left[\text{completion}_k \text{ reaches the correct answer}\right]$$

This estimates the probability that a correct final answer is reachable from step $t$, providing a proxy for step correctness without human annotation.

===== AgentPRM (arXiv:2511.08325) =====

**AgentPRM** extends process reward models to agentic environments, providing process supervision for multi-turn planning and complex interaction trajectories. Key contributions:

* Adapts PRM training to multi-turn agent-environment interactions
* Provides step-level rewards for agent actions (tool calls, planning decisions)
* Addresses the challenge of credit assignment in long-horizon agent trajectories
* Integrates with RL training loops to improve agent planning quality

===== Implicit PRMs =====

The **PRIME** algorithm demonstrates that PRMs can be trained using only outcome labels (like ORMs) yet applied as PRMs at inference time, without requiring expensive per-step annotations. The implicit step reward is derived from the policy model itself as a log-likelihood ratio against a reference model:

$$r_{\text{implicit}}(s_t, a_t) = \beta \left[\log \pi_\theta(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t)\right]$$

where $\beta$ is a scaling coefficient. Benefits include:

* Initializes from the policy model itself
* Supports online updates via on-policy rollouts
* Compatible with PPO, GRPO, or REINFORCE
* Combines outcome and process advantages via RLOO estimation

===== How PRMs Improve Agent Planning =====

PRMs enhance agent planning and reasoning through several mechanisms:

* **Search guidance**: score partial plans to prune unpromising branches early
* **Interpretable feedback**: provide human-readable step scores for debugging
* **Test-time scaling**: enable beam search and best-of-N with fine-grained verification.
Given $N$ candidate trajectories, the PRM selects $\tau^* = \arg\max_{\tau \in \{\tau_1,\ldots,\tau_N\}} R_{\text{PRM}}(\tau)$.

* **RL training signal**: dense step rewards improve policy-gradient estimation
* **Generalization**: step-level rewards transfer better to novel problems than outcome rewards

===== Recent Developments (2025-2026) =====

* Multimodal PRMs extending to vision-language reasoning tasks
* Dynamic PRM modeling adapting reward granularity to task complexity
* Integration with GRPO and DPO for mitigating reward hacking in long-horizon tasks
* PRL (Process Reward Learning) taxonomy formalizing PRMs as MDP Q-value estimation: $r(s_t, a_t) \approx Q^\pi(s_t, a_t)$

===== References =====

* [[https://arxiv.org/abs/2511.08325|arXiv:2511.08325 - AgentPRM: Process Reward Models for Agent Planning]]
* [[https://arxiv.org/abs/2305.20050|arXiv:2305.20050 - Let's Verify Step by Step (PRM800K)]]
* [[https://arxiv.org/abs/2501.07301|arXiv:2501.07301 - Lessons from PRM Development]]

===== See Also =====

* [[test_time_compute_scaling|Test-Time Compute Scaling]] - PRMs as verifiers for inference-time search
* [[agent_rlvr|Agent RLVR]] - RL training using verifiable rewards
* [[agentic_reinforcement_learning|Agentic Reinforcement Learning]] - RL for training LLM agents