reasoning_reward_models — revised 2026/03/24 17:08 (Create page on reasoning reward models (ORM vs PRM)) → 2026/03/24 17:44, current (Add LaTeX math formatting for combined rewards, Monte Carlo estimation, step-level loss)
**Outcome Reward Models (ORMs):**
* Assign a single scalar reward $R_{\text{outcome}}(\tau)$ based solely on the final answer
* Optimize end-to-end accuracy via methods like RLHF
* Computationally efficient — only one evaluation per trajectory
**Process Reward Models (PRMs):**
* Score each intermediate reasoning step independently: $r(s_t, a_t)$ for step $t$
* Provide dense, granular feedback on the full reasoning trajectory
* Directly supervise aligned reasoning chains, rewarding correctness at each step
===== Combined Reward Signals =====
Recent research integrates outcome and process signals for more robust training. A combined reward function blends both signals:

$$R_{\text{combined}}(\tau) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{process}}(s_t, a_t)$$

where $\alpha \in [0,1]$ controls the balance between outcome and process supervision.
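The combined reward above can be sketched in plain Python, assuming a terminal outcome reward and per-step process scores have already been computed; `combined_reward`, `outcome`, and `process_scores` are illustrative names, not from any specific library:

```python
def combined_reward(outcome: float, process_scores: list[float],
                    alpha: float = 0.5) -> float:
    """alpha * R_outcome + (1 - alpha) * mean of the per-step process rewards."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mean_process = sum(process_scores) / len(process_scores)
    return alpha * outcome + (1.0 - alpha) * mean_process

# Example: correct final answer (outcome = 1.0) but one failed
# intermediate step, weighted equally between the two signals.
r = combined_reward(1.0, [1.0, 0.0, 1.0], alpha=0.5)
```

Setting $\alpha = 1$ recovers a pure ORM signal, while $\alpha = 0$ recovers a pure averaged PRM signal.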
* **Hybrid ORM+PRM** — Process rewards guide intermediate steps while outcome rewards ensure end-to-end correctness
* **Auto-generated process labels** — Generate multiple completions per reasoning step; label a step as positive if any completion reaches the correct final answer, correlating strongly with step correctness
* **Monte Carlo estimation** — Estimate step-level rewards by sampling rollouts from each intermediate step and using the empirical success rate as the reward estimate
* **ReasonRAG** — In agentic retrieval-augmented generation, process-level rewards for query generation, evidence extraction, and answer synthesis combine with outcome rewards to enhance stability and efficiency
These combined approaches outperform pure outcome-based or pure process-based methods, achieving superior PRM@N scores (selecting correct solutions from $N$ candidates) while requiring less human annotation.
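The Monte Carlo estimation idea above can be sketched as follows; `rollout` is a hypothetical stand-in for a policy sampler that completes the trajectory from a partial solution and reports whether the final answer was correct:

```python
def mc_step_reward(rollout, n_samples: int = 8) -> float:
    """Empirical success rate over n_samples sampled completions from one step."""
    hits = sum(1 for _ in range(n_samples) if rollout())
    return hits / n_samples

# Example with a stubbed sampler that always reaches a correct answer.
reward = mc_step_reward(lambda: True, n_samples=4)  # → 1.0
```

A real sampler would be stochastic, so the estimate's variance shrinks as `n_samples` grows.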
===== Training Methodologies =====
**Training Approaches:**
* Fine-tune base language models on step-labeled reasoning datasets using a binary cross-entropy loss over step labels:

$$\mathcal{L}_{\text{step}}(\phi) = -\sum_{t=1}^{T} \left[y_t \log r_\phi(s_t) + (1-y_t)\log(1-r_\phi(s_t))\right]$$

where $y_t \in \{0, 1\}$ is the correctness label for step $t$.
* Apply RewardTrainer or custom training loops with step-level loss functions
* Use completion models to auto-generate high-quality supervision signals, reducing dependence on expensive human annotation
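The step-level loss can be written out as a minimal framework-free sketch; `step_probs` stands for the model's predicted step-correctness probabilities $r_\phi(s_t)$ and `labels` for the binary labels $y_t$:

```python
import math

def step_bce_loss(step_probs: list[float], labels: list[int],
                  eps: float = 1e-12) -> float:
    """Binary cross-entropy summed over the T steps of one trajectory."""
    loss = 0.0
    for p, y in zip(step_probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss
```

In practice the same expression is computed batched on the framework's tensors; the clamp mirrors the numerical-stability epsilon most implementations apply.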
* **Agentic RAG** — Process rewards optimize agent policies for autonomous search invocation, query formulation, and evidence extraction
* **Trajectory evaluation** — Agents use cumulative step-level reward scores to evaluate and revise multi-step action plans
* **Inference-time guidance** — PRMs enable best-of-N sampling at inference time, selecting the highest-scoring of $N$ candidate solutions
* **RLHF/RL training** — PRMs supply dense per-step reward signals during policy optimization
* **Self-correction** — Agents use step-level feedback to identify and backtrack from reasoning errors during execution
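Best-of-N selection with a PRM can be sketched by scoring each candidate solution by its mean step-level reward and keeping the top one; `candidates` and its scores are illustrative values, not real model output:

```python
def best_of_n(candidates: dict[str, list[float]]) -> str:
    """Return the candidate whose steps the PRM rates highest on average."""
    return max(candidates, key=lambda c: sum(candidates[c]) / len(candidates[c]))

picked = best_of_n({
    "solution_a": [0.9, 0.8, 0.95],  # consistently strong steps
    "solution_b": [0.9, 0.2, 0.99],  # one weak intermediate step
})
# picked == "solution_a": the weak middle step drags solution_b's mean down
```

Aggregating by the minimum step score instead of the mean is a common variant that penalizes a single bad step more harshly.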