Process Reward Models

Process Reward Models (PRMs) are reward functions that assign dense, step-level scores to intermediate reasoning steps in a multi-step trajectory, enabling fine-grained supervision for tasks like mathematical reasoning, planning, and agentic decision-making.

PRM vs ORM

The fundamental distinction in reward modeling for reasoning:

| Aspect | Process Reward Model (PRM) | Outcome Reward Model (ORM) |
|---|---|---|
| Reward granularity | Step-wise (dense) | Terminal only (sparse) |
| Credit assignment | Fine-grained, chain-sensitive | Outcome-only, no step attribution |
| Training data | Per-step correctness labels | Final-answer correctness only |
| Reward hacking | More robust (detects bad steps) | Vulnerable (right answer, wrong reasoning) |
| Use cases | Reasoning verification, search guidance | Simple pass/fail evaluation |

ORMs provide a single scalar reward $R(\tau)$ based solely on the final result. PRMs evaluate each intermediate step, enabling early detection of reasoning errors and more informative learning signals.

Step-Level Credit Assignment

PRMs decompose rewards across trajectories, enabling precise attribution of which steps contributed to success or failure. For a trajectory $\tau = (s_0, a_0, s_1, \ldots, s_T, a_T)$, a PRM outputs rewards $r(s_t, a_t)$ for each step $t$. This addresses the fundamental credit assignment problem in multi-step reasoning.

The total trajectory reward under a PRM can be aggregated as:

$$R_{\text{PRM}}(\tau) = \text{agg}\!\left(\{r(s_t, a_t)\}_{t=0}^{T}\right)$$

where $\text{agg}$ is a sum, mean, or min operation depending on the application.
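As a minimal sketch, the three aggregation modes named above can be implemented over a list of per-step scores (the function name and signature here are illustrative, not from any particular library):

```python
def aggregate_rewards(step_rewards, mode="min"):
    """Aggregate per-step PRM scores r(s_t, a_t) into one trajectory reward."""
    if mode == "sum":
        return sum(step_rewards)
    if mode == "mean":
        return sum(step_rewards) / len(step_rewards)
    if mode == "min":
        # A single bad step caps the trajectory score; common for reranking,
        # since one wrong step invalidates the whole reasoning chain.
        return min(step_rewards)
    raise ValueError(f"unknown aggregation mode: {mode}")
```

The `min` mode is often preferred for verification, because a chain is only as sound as its weakest step.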

The sections below describe the main ways step-level rewards are obtained in practice.

Math-Shepherd

Math-Shepherd trains PRMs for mathematical reasoning using binary step-correctness labels (“+”/“-” tokens). The approach:

  1. Automatically annotate intermediate reasoning steps with correctness labels (via completion rollouts rather than human labeling)
  2. Train the PRM to predict a binary correctness token at each step position, masking the loss elsewhere
  3. Read step scores from the predicted token probabilities at inference time
  4. Deploy as a deterministic verifier (distinct from value models, which estimate future success)
# Simplified PRM scoring for math reasoning
def score_reasoning_steps(prm, question, steps):
    """Score each reasoning step with a Process Reward Model."""
    scores = []
    context = question
    for step in steps:
        context += "\n" + step
        score = prm.predict_correctness(context)
        scores.append(score)
    # Aggregate: minimum score identifies weakest step
    return scores, min(scores)
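As a usage sketch, a toy stand-in verifier shows how the scoring loop above pinpoints a faulty step. The `StubPRM` class and its `predict_correctness` method are hypothetical, mirroring the interface assumed by `score_reasoning_steps`; a real PRM would be a trained classifier.

```python
class StubPRM:
    """Toy stand-in for a trained PRM (hypothetical interface)."""
    def predict_correctness(self, context):
        # Flag any step containing the deliberately wrong claim "2 + 2 = 5".
        return 0.1 if "2 + 2 = 5" in context.splitlines()[-1] else 0.9

prm = StubPRM()
question = "Compute 2 + 2, then double the result."
steps = ["First, 2 + 2 = 5.", "Doubling gives 10."]

scores, context = [], question
for step in steps:                       # same loop as score_reasoning_steps
    context += "\n" + step
    scores.append(prm.predict_correctness(context))

weakest = min(scores)                    # min aggregation flags the bad step
```

Here the first step gets a low score even though the second step is locally consistent with it, which is exactly the chain-sensitive credit assignment an ORM cannot provide.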

PRM800K Dataset

The PRM800K dataset (OpenAI, 2023) contains over 800,000 labeled reasoning steps from mathematical problem-solving traces. Each step is annotated as correct, incorrect, or neutral by human labelers. This dataset enabled the first large-scale training and evaluation of PRMs, demonstrating that step-level supervision significantly outperforms outcome-only supervision for guiding search in mathematical reasoning.

Monte Carlo Estimation of Step Rewards

When per-step human labels are unavailable, step-level rewards can be estimated via Monte Carlo rollouts. For step $t$ in a trajectory, sample $K$ completions from step $t$ onward and estimate:

$$\hat{r}(s_t, a_t) = \frac{1}{K}\sum_{k=1}^{K} \mathbf{1}\!\left[\text{completion}_k \text{ reaches correct answer}\right]$$

This estimates the probability that a correct final answer is reachable from step $t$, providing a proxy for step correctness without human annotation.
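The estimator can be sketched as follows. Both `sample_completion` (a rollout sampler, e.g. an LLM decoding from the partial trajectory) and `is_correct` (a final-answer checker) are assumed interfaces, not calls from any real library:

```python
def mc_step_reward(prefix_steps, sample_completion, is_correct, K=8):
    """Monte Carlo estimate of r(s_t, a_t): the fraction of K rollouts
    from this prefix that reach a correct final answer."""
    hits = 0
    for _ in range(K):
        completion = sample_completion(prefix_steps)  # assumed sampler
        hits += int(is_correct(completion))           # assumed checker
    return hits / K
```

In practice the variance of this estimate shrinks with `K`, at the cost of `K` extra rollouts per labeled step.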

AgentPRM (arXiv:2511.08325)

AgentPRM extends process reward models to agentic environments, providing step-level supervision over multi-turn planning and long-horizon interaction trajectories rather than single-pass reasoning chains.

Implicit PRMs

The PRIME algorithm demonstrates that a reward model trained using only outcome labels (like an ORM) can be applied as a PRM at inference time, without requiring expensive per-step annotations. The implicit step reward is derived from the policy model itself:

$$r_{\text{implicit}}(s_t, a_t) = \log \pi_\theta(a_t | s_t) - \log \pi_{\text{ref}}(a_t | s_t)$$

This removes the cost of per-step annotation while still yielding dense, step-level credit assignment.
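A minimal sketch of the implicit reward, assuming per-token log-probabilities for the step's tokens are available from both the trained policy and a frozen reference model (the function below is illustrative; real implementations extract these from model logits):

```python
def implicit_step_reward(policy_token_logprobs, ref_token_logprobs):
    """PRIME-style implicit step reward:
        r(s_t, a_t) = log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t)
    Each step log-prob is the sum of its token log-probs. Positive values
    mean the trained policy assigns the step more probability than the
    frozen reference model does."""
    return sum(policy_token_logprobs) - sum(ref_token_logprobs)
```

For example, a step whose tokens the policy finds much more likely than the reference does receives a positive implicit reward.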

How PRMs Improve Agent Planning

PRMs enhance agent planning chiefly by guiding search, reranking candidate steps or trajectories (e.g. best-of-N selection or step-level beam search), and by detecting reasoning errors early, before they propagate to the final answer.
