| Use cases | Reasoning verification, |
ORMs provide a single scalar reward based solely on the final result. PRMs evaluate each intermediate step, enabling early detection of reasoning errors and more informative learning signals.
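The interface difference can be sketched as follows. Both functions are illustrative stubs (not from any real library): a real PRM would score each step with a learned model, while here an invented `"ERROR"` marker stands in for a detected mistake.

```python
def orm_reward(trajectory: list[str], final_answer_correct: bool) -> float:
    """Outcome reward model: one scalar for the entire trajectory."""
    return 1.0 if final_answer_correct else 0.0

def prm_rewards(trajectory: list[str]) -> list[float]:
    """Process reward model: one score per intermediate step (stubbed)."""
    # Illustrative stand-in for a learned step classifier.
    return [0.0 if "ERROR" in step else 1.0 for step in trajectory]

steps = ["parse the problem", "ERROR: drop a sign", "conclude"]
print(orm_reward(steps, final_answer_correct=False))  # 0.0 — no hint of where it went wrong
print(prm_rewards(steps))  # [1.0, 0.0, 1.0] — the faulty step is localized
```

The ORM only reports that the trajectory failed; the PRM pinpoints step 2 as the failure, which is what enables the early error detection described above.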
===== Step-Level Credit Assignment =====
PRMs decompose rewards across trajectories, assigning credit to each intermediate step rather than only to the final outcome.

The total trajectory reward under a PRM can be aggregated as:

$$R_{\text{PRM}}(\tau) = \text{agg}\!\left(\{r(s_t, a_t)\}_{t=1}^{T}\right)$$

where $\text{agg}$ is a sum, mean, or min operation depending on the application.
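A minimal sketch of the $\text{agg}$ operator, assuming per-step rewards are already available as floats (the reward values below are invented for illustration):

```python
def aggregate(step_rewards: list[float], mode: str = "min") -> float:
    """Aggregate per-step PRM rewards into one trajectory score."""
    if mode == "sum":
        return sum(step_rewards)
    if mode == "mean":
        return sum(step_rewards) / len(step_rewards)
    if mode == "min":
        # A trajectory is only as good as its weakest step.
        return min(step_rewards)
    raise ValueError(f"unknown aggregation mode: {mode}")

rewards = [0.9, 0.4, 0.8]
print(aggregate(rewards, "min"))  # 0.4
```

Min aggregation is the strictest choice: a single bad step sinks the whole trajectory, which suits verification, while sum or mean suit dense RL training signals.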
Types of step-level reward:
* **Discriminative PRMs**: Classify each step as correct/incorrect
* **Generative PRMs**: Sample critiques of each step
* **Implicit PRMs**: Derive step rewards without explicit labels (e.g., via self-consistency)
The **PRM800K** dataset (OpenAI, 2023) contains over 800,000 labeled reasoning steps from mathematical problem-solving traces. Each step is annotated as correct, incorrect, or neutral by human labelers. This dataset enabled the first large-scale training and evaluation of PRMs, demonstrating that step-level supervision significantly outperforms outcome-only supervision for guiding search in mathematical reasoning.
| + | |||
| + | ===== Monte Carlo Estimation of Step Rewards ===== | ||
| + | |||
| + | When per-step human labels are unavailable, | ||
| + | |||
| + | $$\hat{r}(s_t, | ||
| + | |||
| + | This estimates the probability that a correct final answer is reachable from step $t$, providing a proxy for step correctness without human annotation. | ||
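The estimator above can be sketched as follows. `rollout_is_correct` is a hypothetical stand-in for running the policy to completion and checking the final answer; its success probabilities are invented for illustration.

```python
import random

def rollout_is_correct(step_index: int, rng: random.Random) -> bool:
    """Stub rollout: pretend later steps are harder to recover from."""
    return rng.random() < max(0.0, 1.0 - 0.2 * step_index)

def mc_step_reward(step_index: int, k: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(correct final answer | continue from step t)."""
    rng = random.Random(seed)
    hits = sum(rollout_is_correct(step_index, rng) for _ in range(k))
    return hits / k

print(mc_step_reward(0))  # 1.0 — every rollout from the first step succeeds
print(mc_step_reward(2))  # close to the stub's true success rate of 0.6
```

The estimate converges to the true reachability probability as $K$ grows; in practice $K$ trades annotation-free labels against rollout compute.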
===== AgentPRM (arXiv: …) =====
===== Implicit PRMs =====
The **PRIME algorithm** demonstrates that PRMs can be trained using only outcome labels (like ORMs) but applied as PRMs at inference time -- without requiring expensive per-step annotations. The implicit step reward is the scaled log-likelihood ratio between the trained model and a reference model:

$$r_{\text{implicit}}(s_t, a_t) = \beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$$

Benefits include:
* Initialize from the policy model itself
* Online updates via on-policy rollouts
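The log-ratio reward can be sketched as below. The per-step probabilities are made up; a real implementation would read token or step log-probabilities from the trained policy and the frozen reference model.

```python
import math

def implicit_step_rewards(policy_probs: list[float],
                          ref_probs: list[float],
                          beta: float = 0.05) -> list[float]:
    """Implicit step reward: beta * log(pi_theta / pi_ref) per step."""
    return [beta * (math.log(p) - math.log(q))
            for p, q in zip(policy_probs, ref_probs)]

policy = [0.9, 0.2, 0.8]  # step probabilities under the trained policy (invented)
ref    = [0.5, 0.5, 0.5]  # same steps under the reference model (invented)
print(implicit_step_rewards(policy, ref))
```

Steps the policy upweights relative to the reference get positive reward, downweighted steps get negative reward — so outcome-only training still yields a dense per-step signal.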
* **Search guidance**: Score partial plans to prune unpromising branches early
* **Interpretable feedback**: Provide human-readable step scores for debugging
* **Test-time scaling**: Enable beam search and best-of-N with fine-grained verification. Given $N$ candidate trajectories, select $\tau^* = \arg\max_{i} R_{\text{PRM}}(\tau_i)$
* **RL training signal**: Dense step rewards improve policy gradient estimation
* **Generalization**:
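Best-of-N selection with PRM scoring can be sketched as follows; the candidate trajectories and their per-step scores are invented, and min aggregation (the strictest of the options defined earlier) is used.

```python
# Hypothetical per-step PRM scores for N = 3 candidate trajectories.
candidates = {
    "traj_a": [0.9, 0.9, 0.3],  # one weak step
    "traj_b": [0.7, 0.8, 0.8],  # uniformly decent
    "traj_c": [0.6, 0.5, 0.9],
}

def best_of_n(scored: dict[str, list[float]]) -> str:
    """Pick the trajectory whose weakest step is strongest (min aggregation)."""
    return max(scored, key=lambda name: min(scored[name]))

print(best_of_n(candidates))  # traj_b
```

Note that an ORM ranking only final answers could not distinguish `traj_a` from `traj_b`; the step-level scores are what let the verifier penalize the single weak step.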
* Dynamic PRM modeling adapting reward granularity to task complexity
* Integration with GRPO and DPO for mitigating reward hacking in long-horizon tasks
* PRL (Process Reward Learning) taxonomy formalizing PRMs as MDP Q-value estimation: $r(s_t, a_t) \approx Q^\pi(s_t, a_t)$
===== References =====