====== Constitutional AI ======

**Constitutional AI (CAI)**, introduced by Bai et al. (2022) at Anthropic, is a training methodology that aligns language models to be helpful, harmless, and honest using a set of written principles (a "constitution") and AI-generated feedback. CAI replaces human harmlessness labels with **Reinforcement Learning from AI Feedback (RLAIF)**, enabling scalable alignment without exposing human annotators to harmful content.

<code>
graph TD
    A[Generate Response] --> B[Self-Critique]
    B --> C[Apply Constitutional Principle]
    C --> D[Revise Response]
    D --> E{More Rounds?}
    E -->|Yes| B
    E -->|No| F[SL Fine-Tuning Dataset]
    F --> G[RLAIF Training]
    G --> H[Aligned Model]
</code>

===== Motivation =====

Standard RLHF requires human annotators to label which model outputs are more or less harmful, a process that is expensive, subjective, and psychologically taxing for labelers. CAI asks: can the model itself judge harmfulness, guided by explicit principles? This makes the alignment criteria //transparent// and //auditable// rather than implicit in crowd-worker judgments.

===== The Constitution =====

The constitution is a set of human-written normative principles that define desired model behavior. These principles are drawn from diverse sources:

  * The Universal Declaration of Human Rights
  * Apple's Terms of Service (as an example of corporate guidelines)
  * Principles emphasizing helpfulness, honesty, and harm avoidance
  * Research-specific norms (e.g., "Choose the response that is least likely to be used for illegal activity")

Each principle provides a concrete criterion the model uses to evaluate its own outputs. The constitution is fully transparent: users can inspect exactly what rules govern the model's behavior.
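In the paper, each principle is paired with a critique request and a revision request that are sampled during training. A minimal sketch of how such a constitution might be represented as plain data follows; the field names and principle texts are illustrative paraphrases, not the paper's exact wording:

```python
import random

# Hypothetical sketch: a constitution as a list of principles, each pairing
# a critique request with a matching revision request. Texts are paraphrased
# examples, not the exact principles from Bai et al. (2022).
CONSTITUTION = [
    {
        "critique_request": "Identify specific ways in which the response "
                            "is harmful, unethical, or illegal.",
        "revision_request": "Rewrite the response to remove harmful, "
                            "unethical, or illegal content.",
    },
    {
        "critique_request": "Point out any ways the response could be used "
                            "to facilitate illegal activity.",
        "revision_request": "Rewrite the response so it cannot be used to "
                            "facilitate illegal activity.",
    },
]

def sample_principle(constitution, rng):
    """Pick one principle uniformly at random, as the critique-revise loop does."""
    return rng.choice(constitution)
```

Sampling a different principle on each revision round exposes the model to varied criteria, which the paper found improves the diversity of the resulting fine-tuning data.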
===== The Critique-Revise Loop =====

The first training phase uses **supervised learning on self-revised outputs**:

  - **Generate**: The model produces a response to a potentially harmful prompt
  - **Critique**: The same model critiques its response against a randomly sampled constitutional principle (e.g., "Identify specific ways this response is harmful or unethical")
  - **Revise**: The model generates an improved response incorporating the critique
  - **Iterate**: Steps 2-3 repeat with different principles for multiple rounds

The final revised responses form a supervised fine-tuning dataset, replacing the need for human-written "ideal" responses to harmful prompts.

<code python>
import random

# Simplified CAI critique-revise loop
def critique_revise(model, prompt, constitution, num_rounds=3):
    response = model.generate(prompt)
    for _ in range(num_rounds):
        # Sample a different constitutional principle each round
        principle = random.choice(constitution)
        critique = model.generate(
            f"Critique this response according to the principle: "
            f"'{principle}'\n\nResponse: {response}"
        )
        response = model.generate(
            f"Revise this response based on the critique: "
            f"'{critique}'\n\nOriginal: {response}"
        )
    return response
</code>

===== RLAIF Training =====

The second phase replaces the human feedback in RLHF with **Reinforcement Learning from AI Feedback**:

  - **Generate preference pairs**: For each prompt, the model generates multiple candidate responses
  - **AI evaluation**: The model judges which response better adheres to constitutional principles, producing preference labels
  - **Train reward model**: A reward model is trained on the AI-generated preference data (instead of human labels)
  - **RL optimization**: The policy model is fine-tuned via PPO against the AI-trained reward model

The key insight is that while the model cannot reliably //generate// harmless responses from scratch, it can reliably //compare// two responses and identify which is less harmful, a simpler discrimination task.
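The preference-labeling step above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `model.generate` stands in for any LLM call, and all function names are invented for this example.

```python
# Hypothetical sketch of RLAIF preference labeling. `model` is any object
# with a generate() method standing in for an LLM call; names are
# illustrative, not from the paper's codebase.
def label_preference(model, prompt, response_a, response_b, principle):
    """Ask the model which of two responses better follows a principle.

    Returns 0 if response A is preferred, 1 if response B is preferred.
    Comparing two responses is an easier task for the model than
    generating a harmless response from scratch.
    """
    query = (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Answer with A or B:"
    )
    answer = model.generate(query).strip()
    return 0 if answer.startswith("A") else 1

def build_preference_dataset(model, prompts, principle):
    """Collect (prompt, chosen, rejected) triples for reward-model training."""
    dataset = []
    for prompt in prompts:
        # Sample two candidate responses, then have the model pick a winner
        a = model.generate(prompt)
        b = model.generate(prompt)
        winner = label_preference(model, prompt, a, b, principle)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
```

The resulting `(chosen, rejected)` pairs are a drop-in replacement for human preference labels, so the downstream reward-model and PPO machinery from standard RLHF can be reused unchanged.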
===== Two-Phase Training Summary =====

^ Phase ^ Method ^ Data Source ^
| Phase 1: SL-CAI | Supervised fine-tuning on critique-revised outputs | AI self-critique + revision |
| Phase 2: RL-CAI | RL with AI preference labels (RLAIF) | AI pairwise comparisons |

===== Key Results =====

  * CAI models achieve harmlessness comparable to or exceeding RLHF models trained with human labels
  * Helpfulness is preserved: CAI models do not become evasive or unhelpful
  * Models provide substantive explanations for refusals rather than generic deflections
  * The approach scales without proportional increases in human annotation effort
  * Constitutional principles are auditable, making alignment criteria transparent

===== Significance =====

CAI represents a paradigm shift from //opaque human preferences// to //explicit written principles// for alignment. The constitution can be publicly shared, debated, and revised, enabling democratic oversight of AI behavior. The technique also demonstrates that sufficiently capable models can supervise their own alignment, a form of **scalable oversight** critical for aligning increasingly powerful systems.

===== References =====

  * [[https://arxiv.org/abs/2212.08073|Bai et al. "Constitutional AI: Harmlessness from AI Feedback" (2022)]]
  * [[https://arxiv.org/abs/2204.05862|Bai et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (2022)]]
  * [[https://arxiv.org/abs/2009.01325|Stiennon et al. "Learning to Summarize from Human Feedback" (2020)]]

===== See Also =====

  * [[direct_preference_optimization|Direct Preference Optimization (DPO)]]
  * [[reward_overoptimization|Reward Overoptimization]]
  * [[self_refine|Self-Refine]]