====== Process Reward Models ======

**Process Reward Models (PRMs)** are reward functions that assign dense, step-level scores to intermediate reasoning steps in a multi-step trajectory, enabling fine-grained supervision for tasks such as mathematical reasoning, planning, and agentic decision-making.

===== PRM vs ORM =====

The fundamental distinction in reward modeling for reasoning:

^ Aspect ^ Process Reward Model (PRM) ^ Outcome Reward Model (ORM) ^
| Reward granularity | Step-wise (dense) | Terminal only (sparse) |
| Credit assignment | Fine-grained, chain-sensitive | Outcome-only, no step attribution |
| Training data | Per-step correctness labels | Final-answer correctness only |
| Reward hacking | More robust (detects bad steps) | Vulnerable (right answer, wrong reasoning) |
| Use cases | Reasoning verification, search guidance | Simple pass/fail evaluation |

ORMs provide a single scalar reward $R(\tau)$ based solely on the final result. PRMs evaluate each intermediate step, enabling early detection of reasoning errors and more informative learning signals.

===== Step-Level Credit Assignment =====

PRMs decompose rewards across trajectories, enabling precise attribution of which steps contributed to success or failure. For a trajectory $\tau = (s_0, a_0, s_1, \ldots, s_T, a_T)$, a PRM outputs a reward $r(s_t, a_t)$ for each step $t$. This addresses the fundamental **credit assignment problem** in multi-step reasoning.

The total trajectory reward under a PRM can be aggregated as:

$$R_{\text{PRM}}(\tau) = \text{agg}\!\left(\{r(s_t, a_t)\}_{t=0}^{T}\right)$$

where $\text{agg}$ is a sum, mean, or min operation, depending on the application.
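The aggregation choices can be sketched in a few lines of Python. This is an illustrative helper, not part of any named library; the function name and interface are hypothetical.

```python
# Illustrative sketch of trajectory-level aggregation of step rewards.
# `aggregate_step_rewards` is a hypothetical helper, not a library API.

def aggregate_step_rewards(step_rewards, mode="min"):
    """Combine per-step PRM scores into a single trajectory reward.

    mode="sum"  -> total credit accumulated across steps
    mode="mean" -> length-normalized score (avoids favoring long chains)
    mode="min"  -> trajectory is only as strong as its weakest step
    """
    if mode == "sum":
        return sum(step_rewards)
    if mode == "mean":
        return sum(step_rewards) / len(step_rewards)
    if mode == "min":
        return min(step_rewards)
    raise ValueError(f"unknown aggregation mode: {mode}")
```

The min aggregation is common for verification, since a single flawed step invalidates an otherwise correct-looking chain.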
Types of step-level reward:

* **Discriminative PRMs**: classify each step as correct/incorrect, outputting $r(s_t, a_t) \in \{0, 1\}$
* **Generative PRMs**: generate natural-language critiques of each step
* **Implicit PRMs**: derive step rewards without explicit labels (e.g., via self-consistency)
* **Trajectory-level aggregation**: combine step scores via sum, mean, or min operations

===== Math-Shepherd =====

Math-Shepherd trains PRMs for mathematical reasoning using binary step-correctness labels ("+"/"-" tokens). The approach:

  - Annotate intermediate reasoning steps with correctness labels, obtained automatically via completion rollouts rather than human annotation
  - Train the PRM with masked prediction, forcing binary classification at each step
  - Use special-token handling for reward computation
  - Deploy as a deterministic verifier (distinct from value models that estimate future success)

```python
# Simplified PRM scoring for math reasoning.
# `prm.predict_correctness` is a schematic interface standing in for a
# trained step-level classifier.
def score_reasoning_steps(prm, question, steps):
    """Score each reasoning step with a Process Reward Model."""
    scores = []
    context = question
    for step in steps:
        context += "\n" + step
        score = prm.predict_correctness(context)
        scores.append(score)
    # Aggregate: the minimum score identifies the weakest step
    return scores, min(scores)
```

===== PRM800K Dataset =====

The **PRM800K** dataset (OpenAI, 2023) contains over 800,000 labeled reasoning steps from mathematical problem-solving traces. Each step is annotated as correct, incorrect, or neutral by human labelers. This dataset enabled the first large-scale training and evaluation of PRMs, demonstrating that step-level supervision significantly outperforms outcome-only supervision for guiding search in mathematical reasoning.

===== Monte Carlo Estimation of Step Rewards =====

When per-step human labels are unavailable, step-level rewards can be estimated via Monte Carlo rollouts.
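A rollout-based estimator of this kind can be sketched as follows. The callables `sample_completion` and `is_correct` are hypothetical stand-ins for the policy model and the answer checker; they are not from any specific library.

```python
import random

def mc_step_reward(sample_completion, is_correct, prefix, K=8, seed=0):
    """Monte Carlo estimate of a step-level reward.

    Samples K completions from the partial trajectory `prefix` and
    returns the fraction that reach a correct final answer.
    `sample_completion(prefix, rng)` and `is_correct(completion)` are
    hypothetical callables for the policy model and answer checker.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(K):
        completion = sample_completion(prefix, rng)
        hits += int(is_correct(completion))
    return hits / K
```

Larger `K` reduces the variance of the estimate at the cost of more rollouts per step.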
For step $t$ in a trajectory, sample $K$ completions from step $t$ onward and estimate:

$$\hat{r}(s_t, a_t) = \frac{1}{K}\sum_{k=1}^{K} \mathbf{1}\!\left[\text{completion}_k \text{ reaches the correct answer}\right]$$

This estimates the probability that a correct final answer is reachable from step $t$, providing a proxy for step correctness without human annotation.

===== AgentPRM (arXiv:2511.08325) =====

**AgentPRM** extends process reward models to agentic environments, providing process supervision for multi-turn planning and complex interaction trajectories. Key contributions:

* Adapts PRM training to multi-turn agent-environment interactions
* Provides step-level rewards for agent actions (tool calls, planning decisions)
* Addresses the challenge of credit assignment in long-horizon agent trajectories
* Integrates with RL training loops to improve agent planning quality

===== Implicit PRMs =====

The **PRIME** algorithm demonstrates that PRMs can be trained using only outcome labels (like ORMs) yet applied as PRMs at inference time, without requiring expensive per-step annotations. The implicit step reward is derived from the policy model itself as a log-likelihood ratio against a reference model:

$$r_{\text{implicit}}(s_t, a_t) = \beta \left[\log \pi_\theta(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t)\right]$$

where $\beta$ is a scaling coefficient. Benefits include:

* Initializes from the policy model itself
* Supports online updates via on-policy rollouts
* Compatible with PPO, GRPO, or REINFORCE
* Combines outcome and process advantages via RLOO estimation

===== How PRMs Improve Agent Planning =====

PRMs enhance agent planning and reasoning through several mechanisms:

* **Search guidance**: score partial plans to prune unpromising branches early
* **Interpretable feedback**: provide human-readable step scores for debugging
* **Test-time scaling**: enable beam search and best-of-N with fine-grained verification.
Given $N$ candidate trajectories, the PRM selects $\tau^* = \arg\max_{\tau \in \{\tau_1,\ldots,\tau_N\}} R_{\text{PRM}}(\tau)$.

* **RL training signal**: dense step rewards improve policy-gradient estimation
* **Generalization**: step-level rewards transfer better to novel problems than outcome rewards

===== Recent Developments (2025-2026) =====

* Multimodal PRMs extending to vision-language reasoning tasks
* Dynamic PRM modeling adapting reward granularity to task complexity
* Integration with GRPO and DPO for mitigating reward hacking in long-horizon tasks
* PRL (Process Reward Learning) taxonomy formalizing PRMs as MDP Q-value estimation: $r(s_t, a_t) \approx Q^\pi(s_t, a_t)$

===== References =====

* [[https://arxiv.org/abs/2511.08325|arXiv:2511.08325 - AgentPRM: Process Reward Models for Agent Planning]]
* [[https://arxiv.org/abs/2305.20050|arXiv:2305.20050 - Let's Verify Step by Step (PRM800K)]]
* [[https://arxiv.org/abs/2501.07301|arXiv:2501.07301 - Lessons from PRM Development]]

===== See Also =====

* [[test_time_compute_scaling|Test-Time Compute Scaling]] - PRMs as verifiers for inference-time search
* [[agent_rlvr|Agent RLVR]] - RL training using verifiable rewards
* [[agentic_reinforcement_learning|Agentic Reinforcement Learning]] - RL for training LLM agents