Reinforcement Learning (RL) is a machine learning paradigm where autonomous agents learn to make sequential decisions through interaction with an environment. Unlike supervised learning, which relies on labeled training data, RL agents learn through trial-and-error by receiving numerical rewards or penalties that guide behavior toward desired outcomes 1).
The fundamental principle involves an agent observing the current state of an environment, selecting actions, and receiving feedback in the form of rewards. Over time, the agent learns to identify action sequences that maximize cumulative reward, effectively discovering optimal or near-optimal policies without explicit programming of desired behaviors 2).
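This observe-act-reward loop can be sketched in a few lines of Python. The environment below is a hypothetical toy (a chain of states 0–5 where reaching state 5 ends the episode) and the policy is a random placeholder, purely to make the loop concrete:

```python
import random

random.seed(0)  # reproducible run

class ToyEnv:
    """Toy chain environment: states 0..5; reaching state 5 ends the episode."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, self.state + action)   # clip at the left edge
        done = self.state >= 5
        reward = 1.0 if done else -0.1             # step penalty, goal bonus
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
total_reward, done, steps = 0.0, False, 0
while not done and steps < 1000:
    action = random.choice([-1, +1])               # placeholder random policy
    state, reward, done = env.step(action)
    total_reward += reward
    steps += 1
print(steps, round(total_reward, 1))
```

A learning agent would replace the random `choice` with a policy updated from the reward stream; the loop structure itself stays the same.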
A standard RL system comprises four primary elements: the agent (decision maker), the environment (system with which the agent interacts), states (representations of the current situation), and rewards (numerical feedback signals). The agent maintains or learns a policy—a mapping from states to actions—that determines behavior. The Markov Decision Process (MDP) provides the mathematical foundation for most RL algorithms, assuming that future states depend only on the current state and action, not on historical trajectories 3).
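A minimal concrete MDP (a hypothetical two-state example, not from the source) makes these elements explicit; note that transitions depend only on the current state and action, which is exactly the Markov property:

```python
# Hypothetical two-state MDP.
# P[(state, action)] -> list of (probability, next_state, reward)
# Transitions depend only on (state, action), never on earlier history.
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def expected_reward(state, action):
    """Expected immediate reward for taking `action` in `state`."""
    return sum(p * r for p, _, r in P[(state, action)])

print(expected_reward("s0", "go"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```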
Value-based methods estimate the expected cumulative reward for states or state-action pairs. Q-learning, a foundational algorithm, iteratively updates estimates of action values based on observed rewards and bootstrap estimates of future values. Actor-critic methods combine value estimation with policy optimization, maintaining separate networks for value estimation (critic) and policy representation (actor) to improve learning stability and efficiency 4).
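The tabular Q-learning update described above can be sketched as follows (the learning rate and discount are illustrative choices, not prescribed values):

```python
from collections import defaultdict

# Q-learning update:
#   Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
Q = defaultdict(float)          # action-value table, default 0.0
lr, gamma = 0.1, 0.99           # learning rate and discount (illustrative)
actions = ["left", "right"]

def q_update(s, a, r, s_next, done):
    # Bootstrap target: observed reward plus discounted best future value
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += lr * (target - Q[(s, a)])

q_update("s0", "right", 1.0, "s1", done=True)
print(Q[("s0", "right")])  # 0.1 = lr * (1.0 - 0.0)
```

In an actor-critic method, the critic plays the role of this value estimate while a separate actor network is updated toward actions the critic scores highly.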
Agentic Reinforcement Learning (Agentic RL) trains LLM agents as autonomous decision-makers in dynamic environments, extending standard LLM RL from single-turn text generation to multi-turn, partially observable Markov decision processes (POMDPs) with long-horizon planning, tool use, and adaptive behavior.5)
| Aspect | Standard LLM RL | Agentic RL |
|---|---|---|
| Interaction | Single-turn (prompt → response) | Multi-turn (observe → act → observe → …) |
| Observation | Full prompt visible | Partial observability (POMDP) |
| Reward | Immediate (quality of one response) | Delayed, sparse (task completion after many steps) |
| Actions | Token generation | Semantic actions: tool calls, navigation, API requests |
| Planning horizon | Single response | Tens to hundreds of steps |
| State | Stateless per query | Stateful (memory, environment state) |
| Credit assignment | Per-response | Per-step across long trajectories |
The shift from single-turn to multi-turn interactions introduces unique challenges for training and interface design. Single-turn models produce a response to a single prompt with no follow-up, whereas multi-turn agents engage in continuous dialogue where each step builds on prior context and feedback.6) Multi-turn interactions enable agents to iteratively refine solutions, recover from errors, and solve complex tasks incrementally—but they also require post-training strategies that account for extended state management and trajectory-level credit assignment.
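The single-turn versus multi-turn distinction can be sketched as two loops. The `llm` and `tool` functions below are stubs standing in for a real model and a real tool, purely to show the control flow:

```python
# Sketch of single-turn vs multi-turn interaction; `llm` and `tool`
# are hypothetical stubs, not a real API.
def llm(context):
    # Stub policy: call the tool once, then finish
    return ("tool", "lookup") if "result" not in context else ("final", "done")

def tool(query):
    return f"result:{query}"

# Single-turn: one prompt in, one response out, no feedback loop.
def single_turn(prompt):
    return llm(prompt)

# Multi-turn: the agent observes, acts, and folds each observation
# back into its context until it decides to stop.
def multi_turn(prompt, max_steps=10):
    context = prompt
    for _ in range(max_steps):
        kind, payload = llm(context)
        if kind == "final":
            return payload
        context += " " + tool(payload)    # append tool observation
    return None

print(multi_turn("find x"))  # finishes after one tool call
```

The growing `context` string is the stateful element the table above refers to: each action's observation becomes part of the state for every later decision.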
Agentic RL faces two fundamental challenges that standard LLM RL does not:
1. Sparse, non-instructive rewards: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.
2. Credit assignment over long horizons: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult. For a trajectory of $T$ steps with terminal reward $R$, the policy gradient has variance proportional to $T$:
$$\text{Var}\!\left[\nabla_\theta \mathcal{L}\right] \propto T \cdot \text{Var}[R]$$
Standard policy gradient estimators have high variance in this setting, motivating the use of dense intermediate rewards and value baselines.
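A quick Monte Carlo check illustrates this scaling. The toy REINFORCE-style estimator below (i.i.d. unit-variance per-step score terms, fixed terminal reward; all numbers illustrative) shows the estimator's variance growing roughly linearly in the horizon $T$:

```python
import random
import statistics

random.seed(0)

def grad_estimate(T, R=1.0):
    # Score-function (REINFORCE) estimate: terminal reward R times the
    # sum of T i.i.d. unit-variance score terms -> variance ~ R^2 * T.
    return R * sum(random.gauss(0.0, 1.0) for _ in range(T))

def grad_variance(T, n=5000):
    """Empirical variance of the gradient estimate over n rollouts."""
    return statistics.variance(grad_estimate(T) for _ in range(n))

v10, v100 = grad_variance(10), grad_variance(100)
print(round(v100 / v10, 1))  # roughly 10x for a 10x longer horizon
```

Subtracting a value baseline from `R` before multiplying by the score terms is the standard way to shrink this variance without biasing the estimate.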
Progressive reward shaping (arXiv:2512.07478) addresses sparse rewards by building agent capabilities incrementally.7) The reward function evolves across training stages, with a stage-dependent weight $\alpha$ shifting emphasis from dense shaping signals toward the sparse terminal reward:
$$R_{\text{prog}}(\tau, \alpha) = \alpha \, R_{\text{dense}}(\tau) + (1 - \alpha) \, R_{\text{terminal}}(\tau)$$
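One way to realize such staged shaping is to anneal a mixing weight from dense intermediate rewards toward the sparse terminal reward. The linear schedule and convex blend below are assumptions for illustration, not the paper's exact formulation:

```python
# Sketch of progressive reward shaping: anneal a mixing weight alpha
# from dense shaping rewards (early stages) toward the sparse terminal
# reward (late stages). Schedule and blend are illustrative assumptions.
def alpha_schedule(stage, n_stages):
    """Mixing weight: 1.0 at the first stage, 0.0 at the last."""
    return 1.0 - stage / (n_stages - 1)

def progressive_reward(dense, terminal, stage, n_stages=3):
    a = alpha_schedule(stage, n_stages)
    return a * dense + (1.0 - a) * terminal

# Early training leans on dense shaping; late training on task success.
print(progressive_reward(dense=0.5, terminal=1.0, stage=0))  # 0.5
print(progressive_reward(dense=0.5, terminal=1.0, stage=2))  # 1.0
```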
Modern RL has enabled breakthrough achievements across diverse domains. Robotics represents a particularly compelling application area, where RL systems learn motor control and manipulation strategies. Toyota's CUE7 robot demonstrates this capability, employing reinforcement learning combined with hybrid control architectures to achieve near-perfect consistency in basketball shooting. This system learns optimal trajectories and compensates for physical variations through accumulated experience, illustrating how RL addresses real-world control challenges with high precision.
Beyond robotics, RL powers game-playing systems that exceed human performance.