Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Self-play training enables LLM agents to improve through interaction with copies of themselves, eliminating the dependency on human-curated training data. By having agents act as both task proposer and solver (or as competing players in zero-sum games), self-play generates an autocurriculum of progressively harder challenges that drives capability beyond human demonstration limits. This page covers competitive approaches like SPIRAL, cooperative self-play, and the broader paradigm of self-improvement without human supervision.
SPIRAL (Self-Play on zero-sum games Incentivizes Reasoning via multi-Agent multi-turn reinforcement Learning) by Liu et al. (2025) is a framework where LLMs learn reasoning by playing multi-turn zero-sum games against continuously improving versions of themselves.
Key innovations include training against a continuously updated copy of the model (rather than a fixed opponent) and Role-Conditioned Advantage Estimation (RAE), which computes advantages separately for each player role.
Results:
In competitive (adversarial) self-play, agents play opposing roles in zero-sum settings:
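A minimal sketch of this dynamic on matching pennies, a toy zero-sum game. This is a hypothetical illustration only (SPIRAL's environments are multi-turn language games, not one-shot bandits): a single softmax policy plays both roles against a frozen snapshot of itself, and the snapshot is periodically synced forward, producing the autocurriculum.

```python
# Toy competitive self-play: matching pennies with a softmax policy.
# Hypothetical illustration; not SPIRAL's actual training setup.
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def self_play(episodes=2000, lr=0.05, sync_every=100, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]        # Learner's preference over {heads, tails}
    opponent = list(logits)    # Frozen snapshot of the learner
    for ep in range(episodes):
        probs = softmax(logits)
        a = rng.choices(range(2), weights=probs)[0]
        b = rng.choices(range(2), weights=softmax(opponent))[0]
        # The shared policy alternates roles: matcher on even episodes.
        won = (a == b) if ep % 2 == 0 else (a != b)
        reward = 1.0 if won else -1.0
        # REINFORCE: push the chosen action's logit in the reward direction.
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * reward * grad
        if (ep + 1) % sync_every == 0:
            opponent = list(logits)   # Opponent catches up: autocurriculum
    return softmax(logits)
```

Because the same parameters play both the matcher and mismatcher roles, neither pure strategy is stable, which is exactly why role-conditioned credit assignment (covered below) matters in practice.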
In cooperative self-play, agents collaborate to generate diverse training scenarios:
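A sketch of the proposer/solver pattern under strong simplifying assumptions: `propose_task` and `solve` below are deterministic arithmetic stand-ins for model calls, invented here for illustration. The key structural idea is that the proposer escalates difficulty whenever the solver succeeds, so the curriculum adapts automatically.

```python
# Hypothetical proposer/solver loop for cooperative self-play.
# propose_task and solve are toy stand-ins for LLM calls.
import random

def propose_task(difficulty: int) -> dict:
    """Proposer: emit an arithmetic task scaled to current difficulty."""
    a = random.randint(1, 10 ** difficulty)
    b = random.randint(1, 10 ** difficulty)
    return {"prompt": f"{a} + {b}", "answer": a + b}

def solve(task: dict) -> int:
    """Solver: parse the prompt and answer it (a perfect toy solver)."""
    a, b = (int(x) for x in task["prompt"].split(" + "))
    return a + b

def autocurriculum(steps: int = 100):
    """Raise difficulty after each success, yielding a self-paced curriculum."""
    difficulty, history = 1, []
    for _ in range(steps):
        task = propose_task(difficulty)
        correct = solve(task) == task["answer"]   # Verifiable reward signal
        history.append((difficulty, correct))
        if correct:
            difficulty += 1   # Proposer escalates after success
    return difficulty, history
```

With a real solver that sometimes fails, difficulty plateaus at the frontier of the solver's ability, which is where the training signal is richest.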
```python
# Simplified SPIRAL-style self-play training loop
import copy
from typing import Tuple

import torch


class SelfPlayTrainer:
    def __init__(self, model, games, learning_rate=1e-5):
        self.model = model
        self.opponent = copy.deepcopy(model)  # Frozen self-play partner
        self.games = games  # e.g. ["tictactoe", "kuhn_poker", "negotiation"]
        self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    def play_episode(self, game) -> Tuple[list, float]:
        """Play a multi-turn game; return the trajectory and final reward."""
        state = game.reset()
        trajectory = []
        for turn in range(game.max_turns):
            if turn % 2 == 0:
                action = self.model.generate_action(state, role="player1")
            else:
                action = self.opponent.generate_action(state, role="player2")
            state, done = game.step(action)
            trajectory.append((state, action))
            if done:
                break
        reward = game.get_reward()  # Zero-sum: +1 / -1
        return trajectory, reward

    def get_role(self, turn: int) -> str:
        """Players alternate turns: even turns belong to player1."""
        return "player1" if turn % 2 == 0 else "player2"

    def role_conditioned_advantage(self, trajectory, reward, role):
        """RAE: compute advantages conditioned on the agent's role."""
        advantages = []
        for t, (state, action) in enumerate(trajectory):
            if self.get_role(t) == role:
                baseline = self.model.value_estimate(state, role=role)
                advantages.append(reward - baseline)
        return advantages

    def train_step(self):
        for game in self.games:
            trajectory, reward = self.play_episode(game)
            advantages = self.role_conditioned_advantage(
                trajectory, reward, role="player1"
            )
            loss = self.compute_policy_gradient_loss(trajectory, advantages)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        # Periodically sync the opponent to the current model
        self.opponent.load_state_dict(self.model.state_dict())
```
| Aspect | Self-Play | Supervised Fine-Tuning |
|---|---|---|
| Data source | Self-generated | Human-curated |
| Difficulty scaling | Adaptive (autocurriculum) | Fixed |
| Ceiling | Unbounded (superhuman possible) | Bounded by human demonstrations |
| Reward signal | Verifiable game outcomes | Human labels |
| Cost | Compute-intensive | Annotation-intensive |
Self-play optimizes a minimax objective in the zero-sum setting:
<latex>\max_{\theta} \min_{\phi} \mathbb{E}_{\tau \sim \pi_\theta, \pi_\phi} [R(\tau)]</latex>
where <latex>\pi_\theta</latex> is the learning agent, <latex>\pi_\phi</latex> is the self-play opponent, and <latex>R(\tau)</latex> is the trajectory reward. The autocurriculum emerges because as <latex>\pi_\theta</latex> improves, <latex>\pi_\phi</latex> (updated periodically) presents harder challenges.
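This objective can be checked numerically on matching pennies, reducing each policy to a single heads-probability (a hypothetical toy, invented here for illustration). The inner loop computes the opponent's best response (the min), and the outer loop maximizes over the learner's strategy:

```python
# Numerically evaluating the minimax objective for matching pennies.
# Policies are Bernoulli heads-probabilities p (learner) and q (opponent).
def expected_reward(p: float, q: float) -> float:
    """E[R] for the matcher playing heads w.p. p against heads w.p. q."""
    match = p * q + (1 - p) * (1 - q)
    return match - (1 - match)   # +1 on a match, -1 otherwise

def best_response_value(p: float, grid: int = 101) -> float:
    """Inner min: the opponent picks q to minimize the learner's reward."""
    return min(expected_reward(p, i / (grid - 1)) for i in range(grid))

def minimax_value(grid: int = 101) -> float:
    """Outer max: the learner picks p against a best-responding opponent."""
    return max(best_response_value(i / (grid - 1)) for i in range(grid))
```

The maximum sits at the uniform strategy p = 0.5, where no opponent can push the learner's expected reward below zero; this mixed equilibrium is the fixed point the self-play dynamics orbit.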
SPIRAL's Role-Conditioned Advantage Estimation:
<latex>A^{RAE}(s_t, a_t, r) = Q(s_t, a_t | r) - V(s_t | r)</latex>
where <latex>r</latex> is the role conditioning, ensuring advantages are computed from each player's perspective in the multi-agent game.
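The role-conditioned baseline can be sketched with a running mean return per role standing in for <latex>V(s_t | r)</latex> (a simplification; SPIRAL's RAE uses learned estimates, and the names below are hypothetical):

```python
# Sketch of Role-Conditioned Advantage Estimation with running-mean
# baselines per role. Simplified stand-in for a learned value function.
from collections import defaultdict

class RoleBaseline:
    """Maintains a running mean return per role as a V(. | r) stand-in."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, role: str, ret: float) -> None:
        self.sums[role] += ret
        self.counts[role] += 1

    def value(self, role: str) -> float:
        if self.counts[role] == 0:
            return 0.0
        return self.sums[role] / self.counts[role]

def rae_advantages(returns_by_role: dict, baseline: RoleBaseline) -> dict:
    """A = return - V(r), computed separately for each role."""
    advantages = {}
    for role, ret in returns_by_role.items():
        advantages[role] = ret - baseline.value(role)
        baseline.update(role, ret)
    return advantages
```

Conditioning on the role keeps the zero-sum structure from corrupting the baseline: player1's +1 outcomes and player2's -1 outcomes would otherwise cancel into a meaningless shared average.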