**1. Sparse, non-instructive rewards**: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.
**2. Credit assignment over long horizons**: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult.

$$\text{Var}\!\left[\nabla_\theta \mathcal{L}\right] \propto T \cdot \text{Var}[R]$$

Standard policy gradient estimators have high variance in this setting, motivating the use of dense intermediate rewards and value baselines.
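The linear-in-$T$ scaling above can be checked empirically. The sketch below is an illustrative simulation (not from any cited paper): a one-parameter Bernoulli policy at $\theta = 0$ receives a single noisy terminal reward that is independent of its actions, the worst case for credit assignment, and the variance of the REINFORCE gradient estimate grows roughly linearly with the horizon.

```python
import numpy as np

def reinforce_grad_variance(T, n_traj=20000, seed=0):
    """Empirical Var[g] for g = R * sum_t d/dtheta log pi(a_t),
    with pi(a=1) = sigmoid(theta) at theta = 0 and a terminal
    reward R in {-1, +1} independent of the actions."""
    rng = np.random.default_rng(seed)
    actions = rng.integers(0, 2, size=(n_traj, T))   # a_t ~ Bernoulli(0.5)
    scores = actions - 0.5                           # score: a_t - sigmoid(0)
    rewards = rng.choice([-1.0, 1.0], size=n_traj)   # sparse, non-instructive reward
    grads = rewards * scores.sum(axis=1)             # REINFORCE gradient samples
    return grads.var()

# Variance scales roughly as T * Var[R] / 4 here, i.e. linearly in the horizon
for T in (10, 40, 160):
    print(T, round(reinforce_grad_variance(T), 2))
```

Quadrupling the horizon roughly quadruples the gradient variance, which is why long tool-use trajectories demand baselines or denser rewards.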
===== Progressive Reward Shaping =====
**Progressive reward shaping** (arXiv: …) anneals the reward signal over training:

$$R_{\text{prog}}(\tau, \alpha) = (1 - \alpha) \sum_{t=0}^{T} r_{\text{step}}(s_t, a_t) + \alpha \, R_{\text{outcome}}(\tau)$$

where $\alpha$ increases from 0 to 1 over training, gradually shifting from dense step-level rewards to sparse outcome rewards.
  * Start with dense, easy-to-earn intermediate rewards
  * Gradually shift toward sparse outcome rewards as the agent improves
  * Curriculum over reward complexity mirrors curriculum over task complexity
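The annealing described above can be written in a few lines. This is a minimal sketch of the blending formula, assuming a linear schedule; the function names (`progressive_reward`, `linear_alpha`) are illustrative, not from the paper.

```python
def progressive_reward(step_rewards, outcome_reward, alpha):
    """Blend dense per-step rewards with the sparse outcome reward:
    R_prog = (1 - alpha) * sum_t r_step + alpha * R_outcome,
    with alpha annealed from 0 (dense) to 1 (sparse) over training."""
    return (1.0 - alpha) * sum(step_rewards) + alpha * outcome_reward

def linear_alpha(train_step, total_steps):
    """One possible annealing schedule: a linear ramp from 0 to 1."""
    return min(1.0, max(0.0, train_step / total_steps))

# Early in training the dense signal dominates; late, only the outcome counts
early = progressive_reward([0.2, 0.1, 0.3], outcome_reward=1.0,
                           alpha=linear_alpha(0, 1000))
late = progressive_reward([0.2, 0.1, 0.3], outcome_reward=1.0,
                          alpha=linear_alpha(1000, 1000))
```

Any monotone schedule works in place of the linear ramp; the key property is that the agent never loses reward signal entirely during the transition.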
| </ | </ | ||
The paper introduces **Value-based Sampling Policy Optimization**.
===== Verl-Tool Framework =====
===== Training Architecture =====
A typical agentic RL training loop optimizes the objective:

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$

where $\gamma$ is the discount factor, and the expectation is over trajectories $\tau$ sampled from the policy $\pi_\theta$.
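The quantity inside the expectation is just the discounted return of one trajectory; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """sum_{t=0}^{T} gamma^t * r_t for one trajectory's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Sparse-reward case: a single terminal reward, discounted by the full horizon.
# Long horizons shrink the signal reaching early actions.
sparse = [0.0] * 9 + [1.0]
print(discounted_return(sparse, gamma=0.9))   # equals 0.9 ** 9
```

Note how, under a sparse terminal reward, the discount compounds over the whole trajectory, another reason long tool-use horizons weaken the learning signal.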
| + | |||
| + | The training loop proceeds as: | ||
| - | - **Environment step**: Agent observes state (conversation history, tool results, task description) | + | - **Environment step**: Agent observes state $s_t$ (conversation history, tool results, task description) |
| - | - **Policy step**: LLM generates next action (tool call or text response) | + | - **Policy step**: LLM generates next action |
| - **Execution step**: Tool is executed in sandboxed environment, | - **Execution step**: Tool is executed in sandboxed environment, | ||
| - **Reward step**: At trajectory end, compute reward (verifiable check, or shaped intermediate reward) | - **Reward step**: At trajectory end, compute reward (verifiable check, or shaped intermediate reward) | ||
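The four steps can be sketched as a single rollout function. Everything here is a hypothetical stand-in (`EchoToolEnv`, `policy_act` are illustrative, not the Verl-Tool API): the stub environment fakes sandboxed tool execution and applies a verifiable check at the end.

```python
class EchoToolEnv:
    """Toy stand-in for a sandboxed tool environment (illustrative only)."""
    def __init__(self, target="42"):
        self.target = target
        self.history = []

    def reset(self):
        self.history = ["task: compute the answer"]
        return list(self.history)                      # initial state s_0

    def step(self, action):
        self.history.append(f"tool_result: {action}")  # execution step (sandbox stand-in)
        done = action == self.target
        return list(self.history), done

    def final_reward(self):
        # reward step: verifiable check at trajectory end
        return 1.0 if any(self.target in h for h in self.history) else 0.0

def rollout(policy_act, env, max_turns=8):
    """One trajectory: observe state, act, execute tool, repeat; score at the end."""
    state = env.reset()
    actions = []
    for _ in range(max_turns):
        action = policy_act(state)       # policy step: tool call or text response
        state, done = env.step(action)   # environment/execution step
        actions.append(action)
        if done:
            break
    return actions, env.final_reward()
```

In a real system `policy_act` is an LLM forward pass and `env.step` dispatches to sandboxed tools; the trajectory and final reward then feed the policy-gradient update of $\mathcal{J}(\theta)$.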