====== Self-Play Training for LLM Agents ======
Self-play training enables LLM agents to improve through interaction with copies of themselves, eliminating the dependency on human-curated training data. By having agents act as both task proposer and solver (or as competing players in zero-sum games), self-play generates an **autocurriculum** of progressively harder challenges that drives capability beyond human demonstration limits. This page covers competitive approaches like SPIRAL, cooperative self-play, and the broader paradigm of self-improvement without human supervision.
===== SPIRAL: Multi-Agent Multi-Turn RL =====
SPIRAL (Self-Play on zero-sum games Incentivizes Reasoning via multi-Agent multi-turn reinforcement Learning) by Liu et al. (2025) is a framework where LLMs learn reasoning by playing multi-turn zero-sum games against continuously improving versions of themselves.
**Key innovations:**
* **Game-based training**: Models play TicTacToe, Kuhn Poker, and Simple Negotiation — games requiring strategic reasoning, bluffing, and planning
* **Role-Conditioned Advantage Estimation (RAE)**: A technique to stabilize multi-agent training by computing advantages conditioned on each agent's role in the game
* **Automatic curriculum**: Opponents improve continuously, generating progressively harder challenges without human problem curation
* **Transfer to reasoning benchmarks**: Despite training only on games (never seeing math equations), SPIRAL improves performance by up to **10%** across 8 reasoning benchmarks on Qwen and Llama model families
**Results:**
* Outperforms supervised fine-tuning on 25,000 expert game trajectories
* Multi-game training yields stronger transfer than single-game training
* Works on both base and instruction-tuned models
* Chain-of-thought analysis reveals games incentivize strategic planning patterns that transfer to mathematical reasoning
===== Competitive Self-Play =====
In competitive (adversarial) self-play, agents play opposing roles in zero-sum settings:
=== Search Self-Play (SSP) ===
  * A single LLM uses multi-turn search tools to both propose and solve search-based question-answering tasks
* Proposer and solver co-evolve via RLVR (RL with verifiable rewards)
* Solver win rates dynamically adjust proposer difficulty
* Achieves +5-15% on QA benchmarks across model scales
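The win-rate-driven difficulty adjustment above can be sketched as a reward-shaping rule for the proposer. This is an illustrative assumption, not SSP's exact formulation: the proposer is rewarded most for tasks at the solver's frontier (solved about half the time) and least for trivial or impossible ones.

```python
# Sketch of SSP-style proposer reward shaping (illustrative; the
# exact reward used in the paper may differ).
def proposer_reward(solver_win_rate: float, target: float = 0.5) -> float:
    """Peak reward when the solver succeeds at the target rate;
    zero reward for tasks the solver always or never solves."""
    return 1.0 - abs(solver_win_rate - target) / max(target, 1.0 - target)
```

Tasks the solver always wins (`win_rate = 1.0`) or always loses (`win_rate = 0.0`) earn the proposer nothing, which pushes generation toward the solver's current frontier.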
=== Self-play SWE-RL (SSR) ===
  * A single agent learns both to inject bugs into real codebases and to repair them
* Bug injection guided by test patches provides verifiable reward signals
* Achieves **+10.4%** on SWE-bench Verified and **+7.8%** on SWE-bench Pro
* Beats human-data baselines on unseen natural language issues
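The verifiable reward signal described above can be sketched with a boolean test runner standing in for the repository's test patch. Function names here are hypothetical, not from the SSR paper:

```python
# Sketch of SSR-style verifiable rewards from a test suite.
# `run_tests` stands in for executing the repo's test patch and
# returns True when all tests pass.
def injector_reward(run_tests, buggy_code) -> float:
    """The injector succeeds only if the injected bug breaks the tests."""
    return 1.0 if not run_tests(buggy_code) else 0.0

def repairer_reward(run_tests, fixed_code) -> float:
    """The repairer succeeds only if the fix makes the tests pass again."""
    return 1.0 if run_tests(fixed_code) else 0.0
```

Because both rewards come from executing tests rather than from human labels, the same agent can play both roles and still receive an objective training signal.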
===== Cooperative Self-Play =====
In cooperative self-play, agents collaborate to generate diverse training scenarios:
* **Multi-agent negotiation**: Iterative self-play in driving simulations learns yielding and signaling behaviors, boosting success from **63% to 98%** via progressive environment diversity
* **Population-based training**: Multiple agent copies explore diverse strategies simultaneously, with successful behaviors shared across the population
* **Joint exploration**: Agents generate environments for each other, expanding the training distribution beyond any fixed dataset
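The population-based sharing step can be sketched as the "exploit" move of population-based training, assuming each member is just a parameter dict with a score; the data layout is illustrative:

```python
# Sketch of population-based strategy sharing: the weakest member
# adopts a copy of the strongest member's parameters (the exploit
# step of population-based training).
def share_best(population: list) -> list:
    """Each member is {'params': dict, 'score': float}."""
    best = max(population, key=lambda m: m["score"])
    worst = min(population, key=lambda m: m["score"])
    worst["params"] = dict(best["params"])  # copy, don't alias
    return population
```

In a full system the copied parameters would also be perturbed (the "explore" step) so the population keeps generating diverse strategies rather than collapsing onto one.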
===== Code Example =====
```python
# Simplified SPIRAL-style self-play training loop (illustrative sketch;
# model, game, and loss objects are assumed interfaces, not a real API)
import copy
from typing import Tuple

import torch


class SelfPlayTrainer:
    def __init__(self, model, games, learning_rate=1e-5):
        self.model = model
        self.opponent = copy.deepcopy(model)  # frozen self-play partner
        self.games = games  # e.g. ["tictactoe", "kuhn_poker", "negotiation"]
        self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    def play_episode(self, game) -> Tuple[list, float]:
        """Play one multi-turn game; return the trajectory and final reward."""
        state = game.reset()
        trajectory = []
        for turn in range(game.max_turns):
            if turn % 2 == 0:  # learner moves on even turns
                action = self.model.generate_action(state, role="player1")
            else:              # frozen opponent moves on odd turns
                action = self.opponent.generate_action(state, role="player2")
            trajectory.append((state, action))  # record the state acted in
            state, done = game.step(action)
            if done:
                break
        reward = game.get_reward()  # zero-sum terminal reward: +1 / -1
        return trajectory, reward

    def get_role(self, turn):
        return "player1" if turn % 2 == 0 else "player2"

    def role_conditioned_advantage(self, trajectory, reward, role):
        """RAE: compute advantages conditioned on the agent's role."""
        advantages = []
        for t, (state, action) in enumerate(trajectory):
            if self.get_role(t) == role:
                baseline = self.model.value_estimate(state, role=role)
                advantages.append(reward - baseline)
        return advantages

    def train_step(self):
        for game in self.games:
            trajectory, reward = self.play_episode(game)
            advantages = self.role_conditioned_advantage(
                trajectory, reward, role="player1"
            )
            loss = self.compute_policy_gradient_loss(trajectory, advantages)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        # Periodically sync the opponent with the current model
        self.opponent.load_state_dict(self.model.state_dict())
```
===== Self-Play vs Supervised Fine-Tuning =====
^ Aspect ^ Self-Play ^ Supervised Fine-Tuning ^
| Data source | Self-generated | Human-curated |
| Difficulty scaling | Adaptive (autocurriculum) | Fixed |
| Ceiling | Unbounded (superhuman possible) | Bounded by human demonstrations |
| Reward signal | Verifiable game outcomes | Human labels |
| Cost | Compute-intensive | Annotation-intensive |
===== Mathematical Framework =====
Self-play optimizes a minimax objective in the zero-sum setting:
\max_{\theta} \min_{\phi} \mathbb{E}_{\tau \sim \pi_\theta, \pi_\phi} [R(\tau)]
where \pi_\theta is the learning agent, \pi_\phi is the self-play opponent, and R(\tau) is the trajectory reward. The autocurriculum emerges because as \pi_\theta improves, \pi_\phi (updated periodically) presents harder challenges.
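For intuition, the minimax value can be computed by brute force on a toy matrix game. This is a deliberately tiny stand-in: SPIRAL optimizes over stochastic LLM policies in multi-turn games, not over pure strategies in a payoff matrix.

```python
# Toy minimax over pure strategies in a 2x2 zero-sum payoff matrix
# (row player maximizes, column player minimizes).
def minimax_value(payoff):
    # The column player best-responds (minimizes) within each row;
    # the row player then picks the row with the best worst case.
    return max(min(row) for row in payoff)
```

For example, in matching pennies (`[[1, -1], [-1, 1]]`) every pure row strategy can be exploited, so the pure-strategy minimax value is -1; the mixed-strategy value of 0 is only reached by randomizing, which is one reason self-play opponents must keep adapting.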
SPIRAL's Role-Conditioned Advantage Estimation:
A^{RAE}(s_t, a_t, r) = Q(s_t, a_t | r) - V(s_t | r)
where r is the role conditioning, ensuring advantages are computed from each player's perspective in the multi-agent game.
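A numeric sketch of the equation above, using made-up fixed baselines: in SPIRAL the baseline V(s | r) is a learned, role-conditioned value estimate, but the per-role subtraction is the same.

```python
# RAE sketch: advantages subtract a per-role baseline V(s | role).
# The baseline values here are hypothetical constants.
def rae_advantage(reward: float, baseline_by_role: dict, role: str) -> float:
    return reward - baseline_by_role[role]

baselines = {"player1": 0.2, "player2": -0.2}  # made-up V(s | r) values
```

The same terminal reward thus yields different advantages for the two players, which is what keeps gradient estimates stable when both roles are trained from shared zero-sum episodes.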
===== References =====
* [[https://arxiv.org/abs/2506.24119|Liu et al. "SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn RL" (arXiv:2506.24119)]]
* [[https://arxiv.org/abs/2512.18552|"Self-play SWE-RL: Bug Injection and Repair in Real Codebases" (SSR)]]
* [[https://arxiv.org/abs/2510.18821|"Search Self-Play for LLM Agents"]]
* [[https://machinelearning.apple.com/research/towards-learning-multi-agent-negotiations-via-self-play|Apple ML: Multi-Agent Negotiations via Self-Play]]
===== See Also =====
* [[camel|CAMEL — Role-playing framework for cooperative agent communication]]
* [[metagpt|MetaGPT — Multi-agent collaboration with SOPs]]
* [[agent_distillation|Agent Distillation — Compressing trained agents into smaller models]]