Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Code & Software
Safety & Security
Evaluation
Research
Development
Meta
Self-Play fIne-tuNing (SPIN) is an iterative alignment method introduced by Chen et al. (2024) that converts weak language models into strong ones through a self-play mechanism, without requiring additional human-annotated preference data beyond an initial supervised fine-tuning (SFT) dataset. Inspired by self-play in game theory (e.g., AlphaGo Zero), SPIN enables an LLM to refine itself by distinguishing its own generated responses from human-written ones across successive iterations.
SPIN frames LLM alignment as a two-player game:

- The **main player** is the model being trained at the current iteration; it learns to distinguish human-written responses from model-generated ones.
- The **opponent** is the model from the previous iteration, which generates responses to the SFT prompts.

At each iteration, the opponent generates responses to SFT prompts, and the main player is trained to prefer the original human responses over these synthetic generations. The process converges when the main player can no longer distinguish between human and self-generated responses, i.e., when the model's output distribution matches the target data distribution.
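One schematic way to write the game (a paraphrase of the paper's formulation, omitting its KL regularization term): the main player learns a function f that scores human responses above opponent generations, and the opponent then updates to close that gap.

<latex> f_{t+1} \approx \arg\max_{f} \; \mathbb{E}_{x \sim q,\; y \sim p_{\text{data}}(\cdot|x),\; y' \sim \pi_t(\cdot|x)} \left[ f(x, y) - f(x, y') \right] </latex>

Here q denotes the prompt distribution, p_data the human (SFT) response distribution, and π_t the opponent policy at iteration t.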
The SPIN objective uses a logistic loss similar to DPO-style preference optimization:
<latex> \mathcal{L}_{\text{SPIN}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right] </latex>
Where:
- y_w is the winning (human/SFT) response
- y_l is the losing (self-generated) response
- π_ref is the opponent model from the previous iteration
- β controls the KL regularization strength
- σ is the sigmoid function

The global optimum is provably achieved only when the LLM policy aligns with the target human data distribution.
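To make the loss concrete, the per-pair term can be evaluated on hypothetical sequence log-probabilities (all numbers below are made up for illustration; this is not the paper's code):

```python
import math

def spin_pair_loss(beta, logp_w_theta, logp_w_ref, logp_l_theta, logp_l_ref):
    """SPIN/DPO-style logistic loss for a single (y_w, y_l) pair."""
    margin = beta * ((logp_w_theta - logp_w_ref) - (logp_l_theta - logp_l_ref))
    # -log sigmoid(margin), written as log1p(exp(-margin)) for stability
    return math.log1p(math.exp(-margin))

# Hypothetical log-probs: the main player already prefers the human
# response (y_w) slightly more than the reference model does.
loss = spin_pair_loss(beta=0.1,
                      logp_w_theta=-12.0, logp_w_ref=-14.0,
                      logp_l_theta=-20.0, logp_l_ref=-18.0)
# margin = 0.1 * ((-12 - (-14)) - (-20 - (-18))) = 0.1 * (2 + 2) = 0.4
print(round(loss, 4))  # ≈ 0.513
```

A larger margin (the model separating y_w from y_l more strongly than the reference does) drives the loss toward zero; a negative margin is penalized heavily.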
SPIN has provable convergence guarantees. The training objective reaches its global optimum when the model's generation distribution matches the human data distribution exactly. In practice, convergence typically occurs within 3 iterations, after which additional training yields diminishing returns. The logistic loss formulation prevents value function divergence, and KL regularization ensures stability between iterations.
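The convergence behavior can be illustrated with a toy calculation (illustrative numbers only, not from the paper): as the model's distribution approaches the human data distribution, the β-scaled log-ratio margin between y_w and y_l shrinks toward zero, and the per-pair loss flattens out at log 2 ≈ 0.693, its value when human and self-generated responses are indistinguishable.

```python
import math

def logistic_loss(margin):
    # -log sigmoid(margin): the SPIN per-pair loss as a function of the
    # beta-scaled log-ratio margin between y_w and y_l
    return math.log1p(math.exp(-margin))

# Margins shrinking across (hypothetical) iterations as self-generated
# responses become harder to distinguish from human data
for margin in [2.0, 1.0, 0.5, 0.1, 0.0]:
    print(f"margin={margin:.1f}  loss={logistic_loss(margin):.4f}")
# At margin 0 the loss is exactly log 2 ≈ 0.6931: the main player can no
# longer tell y_w from y_l, and further iterations yield little gain.
```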
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset, load_dataset
from trl import DPOTrainer, DPOConfig

# SPIN iterative training loop (simplified)
def spin_iteration(model_path, sft_dataset, iteration, beta=0.1):
    # Load current model (opponent) and create a training copy (main player)
    opponent = AutoModelForCausalLM.from_pretrained(model_path)
    main_player = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    preference_pairs = []
    for example in sft_dataset:
        prompt = example["prompt"]
        human_response = example["response"]  # y_w (chosen)

        # Generate a synthetic response from the opponent
        inputs = tokenizer(prompt, return_tensors="pt")
        synthetic = opponent.generate(**inputs, max_new_tokens=512)
        synthetic_response = tokenizer.decode(
            synthetic[0][inputs["input_ids"].shape[1]:],  # strip the prompt tokens
            skip_special_tokens=True,
        )  # y_l (rejected)

        preference_pairs.append({
            "prompt": prompt,
            "chosen": human_response,
            "rejected": synthetic_response,
        })

    # Train the main player with the DPO-style logistic loss on the pairs,
    # using the frozen opponent as the reference model
    config = DPOConfig(beta=beta, output_dir=f"spin_iter_{iteration}")
    trainer = DPOTrainer(
        model=main_player,
        ref_model=opponent,
        train_dataset=Dataset.from_list(preference_pairs),
        tokenizer=tokenizer,
        args=config,
    )
    trainer.train()
    trainer.save_model(f"spin_iter_{iteration}")  # so the next iteration can load it
    return main_player

# Run 3 SPIN iterations, starting from the SFT checkpoint
model_path = "alignment-handbook/zephyr-7b-sft-full"
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
for i in range(3):
    model = spin_iteration(model_path, dataset, iteration=i)
    model_path = f"spin_iter_{i}"
```
| Aspect | SPIN | DPO | RLHF |
|---|---|---|---|
| Data Requirements | Only SFT dataset | Explicit preference pairs | Human preferences + reward model |
| External Feedback | None (self-generated) | Human/GPT-4 preferences | Human annotators |
| Training Method | Iterative self-play with logistic loss | Single-step direct preference optimization | Multi-stage: reward model then PPO |
| Complexity | SFT-level, no RL needed | Simpler than RLHF | Most complex (reward model + PPO) |
| Iterations | Multi-iteration (2-4) | Single training run | Typically a single RL stage after reward-model training |
Starting from zephyr-7b-sft-full, SPIN at iteration 1 already matches or surpasses DPO trained with additional GPT-4-annotated preference data.