Reinforcement Learning (RL) is a machine learning paradigm where autonomous agents learn to make sequential decisions through interaction with an environment. Unlike supervised learning, which relies on labeled training data, RL agents learn through trial-and-error by receiving numerical rewards or penalties that guide behavior toward desired outcomes 1).
The fundamental principle involves an agent observing the current state of an environment, selecting actions, and receiving feedback in the form of rewards. Over time, the agent learns to identify action sequences that maximize cumulative reward, effectively discovering optimal or near-optimal policies without explicit programming of desired behaviors 2).
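This observe-act-reward loop can be sketched in a few lines of Python. The environment below is a hypothetical toy (a chain of states 0–5 where reaching state 5 ends the episode) and the policy is a random placeholder, purely to make the loop concrete:

```python
import random

random.seed(0)  # reproducible run

class ToyEnv:
    """Toy chain environment: states 0..5; reaching state 5 ends the episode."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, self.state + action)   # clip at the left edge
        done = self.state >= 5
        reward = 1.0 if done else -0.1             # step penalty, goal bonus
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
total_reward, done, steps = 0.0, False, 0
while not done and steps < 1000:
    action = random.choice([-1, +1])               # placeholder random policy
    state, reward, done = env.step(action)
    total_reward += reward
    steps += 1
print(steps, round(total_reward, 1))
```

A learning agent would replace the random `choice` with a policy updated from the reward stream; the loop structure itself stays the same.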
A standard RL system comprises four primary elements: the agent (decision maker), the environment (system with which the agent interacts), states (representations of the current situation), and rewards (numerical feedback signals). The agent maintains or learns a policy—a mapping from states to actions—that determines behavior. The Markov Decision Process (MDP) provides the mathematical foundation for most RL algorithms, assuming that future states depend only on the current state and action, not on historical trajectories 3).
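A minimal concrete MDP (a hypothetical two-state example, not from the source) makes these elements explicit; note that transitions depend only on the current state and action, which is exactly the Markov property:

```python
# Hypothetical two-state MDP.
# P[(state, action)] -> list of (probability, next_state, reward)
# Transitions depend only on (state, action), never on earlier history.
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def expected_reward(state, action):
    """Expected immediate reward for taking `action` in `state`."""
    return sum(p * r for p, _, r in P[(state, action)])

print(expected_reward("s0", "go"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```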
Value-based methods estimate the expected cumulative reward for states or state-action pairs. Q-learning, a foundational algorithm, iteratively updates estimates of action values based on observed rewards and bootstrap estimates of future values. Actor-critic methods combine value estimation with policy optimization, maintaining separate networks for value estimation (critic) and policy representation (actor) to improve learning stability and efficiency 4).
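The tabular Q-learning update described above can be sketched as follows (the learning rate and discount are illustrative choices, not prescribed values):

```python
from collections import defaultdict

# Q-learning update:
#   Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
Q = defaultdict(float)          # action-value table, default 0.0
lr, gamma = 0.1, 0.99           # learning rate and discount (illustrative)
actions = ["left", "right"]

def q_update(s, a, r, s_next, done):
    # Bootstrap target: observed reward plus discounted best future value
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += lr * (target - Q[(s, a)])

q_update("s0", "right", 1.0, "s1", done=True)
print(Q[("s0", "right")])  # 0.1 = lr * (1.0 - 0.0)
```

In an actor-critic method, the critic plays the role of this value estimate while a separate actor network is updated toward actions the critic scores highly.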
Agentic Reinforcement Learning (Agentic RL) trains LLM agents as autonomous decision-makers in dynamic environments, extending standard LLM RL from single-turn text generation to multi-turn, partially observable Markov decision processes (POMDPs) with long-horizon planning, tool use, and adaptive behavior.5)
| Aspect | Standard LLM RL | Agentic RL |
|---|---|---|
| Interaction | Single-turn (prompt → response) | Multi-turn (observe → act → observe → …) |
| Observation | Full prompt visible | Partial observability (POMDP) |
| Reward | Immediate (quality of one response) | Delayed, sparse (task completion after many steps) |
| Actions | Token generation | Semantic actions: tool calls, navigation, API requests |
| Planning horizon | Single response | Tens to hundreds of steps |
| State | Stateless per query | Stateful (memory, environment state) |
| Credit assignment | Per-response | Per-step across long trajectories |
The shift from single-turn to multi-turn interactions introduces unique challenges for training and interface design. Single-turn models produce a response to a single prompt with no follow-up, whereas multi-turn agents engage in continuous dialogue where each step builds on prior context and feedback.6) Multi-turn interactions enable agents to iteratively refine solutions, recover from errors, and solve complex tasks incrementally—but they also require post-training strategies that account for extended state management and trajectory-level credit assignment.
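The single-turn versus multi-turn distinction can be sketched as two loops. The `llm` and `tool` functions below are stubs standing in for a real model and a real tool, purely to show the control flow:

```python
# Sketch of single-turn vs multi-turn interaction; `llm` and `tool`
# are hypothetical stubs, not a real API.
def llm(context):
    # Stub policy: call the tool once, then finish
    return ("tool", "lookup") if "result" not in context else ("final", "done")

def tool(query):
    return f"result:{query}"

# Single-turn: one prompt in, one response out, no feedback loop.
def single_turn(prompt):
    return llm(prompt)

# Multi-turn: the agent observes, acts, and folds each observation
# back into its context until it decides to stop.
def multi_turn(prompt, max_steps=10):
    context = prompt
    for _ in range(max_steps):
        kind, payload = llm(context)
        if kind == "final":
            return payload
        context += " " + tool(payload)    # append tool observation
    return None

print(multi_turn("find x"))  # finishes after one tool call
```

The growing `context` string is the stateful element the table above refers to: each action's observation becomes part of the state for every later decision.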
Agentic RL faces two fundamental challenges that standard LLM RL does not:
1. Sparse, non-instructive rewards: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.
2. Credit assignment over long horizons: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult. For a trajectory of $T$ steps with terminal reward $R$, the policy gradient has variance proportional to $T$:
$$\text{Var}\!\left[\nabla_\theta \mathcal{L}\right] \propto T \cdot \text{Var}[R]$$
Standard policy gradient estimators have high variance in this setting, motivating the use of dense intermediate rewards and value baselines.
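A quick Monte Carlo check illustrates this scaling. The toy REINFORCE-style estimator below (i.i.d. unit-variance per-step score terms, fixed terminal reward; all numbers illustrative) shows the estimator's variance growing roughly linearly in the horizon $T$:

```python
import random
import statistics

random.seed(0)

def grad_estimate(T, R=1.0):
    # Score-function (REINFORCE) estimate: terminal reward R times the
    # sum of T i.i.d. unit-variance score terms -> variance ~ R^2 * T.
    return R * sum(random.gauss(0.0, 1.0) for _ in range(T))

def grad_variance(T, n=5000):
    """Empirical variance of the gradient estimate over n rollouts."""
    return statistics.variance(grad_estimate(T) for _ in range(n))

v10, v100 = grad_variance(10), grad_variance(100)
print(round(v100 / v10, 1))  # roughly 10x for a 10x longer horizon
```

Subtracting a value baseline from `R` before multiplying by the score terms is the standard way to shrink this variance without biasing the estimate.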
Progressive reward shaping (arXiv:2512.07478) addresses sparse rewards by building agent capabilities incrementally.7) The reward function evolves across training stages, with a stage-dependent weight $\alpha$ shifting emphasis from dense shaping signals toward the sparse terminal reward:
$$R_{\text{prog}}(\tau, \alpha) = \alpha \, R_{\text{dense}}(\tau) + (1 - \alpha) \, R_{\text{terminal}}(\tau)$$
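One way to realize such staged shaping is to anneal a mixing weight from dense intermediate rewards toward the sparse terminal reward. The linear schedule and convex blend below are assumptions for illustration, not the paper's exact formulation:

```python
# Sketch of progressive reward shaping: anneal a mixing weight alpha
# from dense shaping rewards (early stages) toward the sparse terminal
# reward (late stages). Schedule and blend are illustrative assumptions.
def alpha_schedule(stage, n_stages):
    """Mixing weight: 1.0 at the first stage, 0.0 at the last."""
    return 1.0 - stage / (n_stages - 1)

def progressive_reward(dense, terminal, stage, n_stages=3):
    a = alpha_schedule(stage, n_stages)
    return a * dense + (1.0 - a) * terminal

# Early training leans on dense shaping; late training on task success.
print(progressive_reward(dense=0.5, terminal=1.0, stage=0))  # 0.5
print(progressive_reward(dense=0.5, terminal=1.0, stage=2))  # 1.0
```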
Modern RL has enabled breakthrough achievements across diverse domains. Robotics represents a particularly compelling application area, where RL systems learn motor control and manipulation strategies. Toyota's CUE7 robot demonstrates this capability, employing reinforcement learning combined with hybrid control architectures to achieve near-perfect consistency in basketball shooting. This system learns optimal trajectories and compensates for physical variations through accumulated experience, illustrating how RL addresses real-world control challenges with high precision.
Beyond robotics, RL powers game-playing systems that exceed human performance.