Constitutional AI (CAI), introduced by Bai et al. (2022) at Anthropic, is a training methodology that aligns language models to be helpful, harmless, and honest using a set of written principles (a “constitution”) and AI-generated feedback. CAI replaces human harmlessness labels with Reinforcement Learning from AI Feedback (RLAIF), enabling scalable alignment without exposing human annotators to harmful content.
Standard RLHF requires human annotators to label which model outputs are more or less harmful — a process that is expensive, subjective, and psychologically taxing for labelers. CAI asks: can the model itself judge harmfulness, guided by explicit principles? This makes the alignment criteria transparent and auditable rather than implicit in crowd-worker judgments.
The constitution is a set of human-written normative principles that define desired model behavior. These principles are drawn from diverse sources, including the UN Universal Declaration of Human Rights, industry trust-and-safety practices, principles proposed by other AI labs (such as DeepMind's Sparrow rules), and Anthropic's own research.
Each principle provides a concrete criterion the model uses to evaluate its own outputs. The constitution is fully transparent — users can inspect exactly what rules govern the model's behavior.
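As a concrete illustration, a constitution can be represented as a plain list of natural-language principles. The examples below are paraphrased in the spirit of the paper, not quoted from Anthropic's actual constitution:

```python
import random

# Illustrative constitutional principles (paraphrased examples only;
# not the actual Anthropic constitution).
constitution = [
    "Choose the response that is least likely to be harmful or offensive.",
    "Choose the response that is most honest and avoids deception.",
    "Choose the response that is least likely to assist illegal activity.",
    "Choose the response a wise, ethical person would most approve of.",
]

# During training, a principle is typically sampled at random for each
# critique or comparison step.
principle = random.choice(constitution)
```

Because principles are sampled independently at each step, no single principle dominates, and the constitution can be extended without retraining the sampling logic.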
The first training phase (SL-CAI) uses supervised learning on self-revised outputs:

1. Sample a prompt (often designed to elicit harmful output) and generate an initial response.
2. Ask the model to critique its own response against a randomly sampled constitutional principle.
3. Ask the model to revise the response based on that critique.
4. Repeat the critique–revise cycle for several rounds.
The final revised responses form a supervised fine-tuning dataset, replacing the need for human-written “ideal” responses to harmful prompts.
```python
import random

# Simplified CAI critique-revise loop
def critique_revise(model, prompt, constitution, num_rounds=3):
    response = model.generate(prompt)
    for _ in range(num_rounds):
        # Sample a principle and ask the model to critique its own output
        principle = random.choice(constitution)
        critique = model.generate(
            f"Critique this response according to the principle: "
            f"'{principle}'\n\nResponse: {response}"
        )
        # Ask the model to revise the response in light of the critique
        response = model.generate(
            f"Revise this response based on the critique: "
            f"'{critique}'\n\nOriginal: {response}"
        )
    return response
```
The second phase (RL-CAI) replaces RLHF's human harmlessness labels with Reinforcement Learning from AI Feedback:

1. Generate pairs of responses to harmful prompts using the SL-CAI model.
2. Ask a feedback model which response of each pair better satisfies a randomly sampled principle.
3. Train a preference model on these AI-generated comparison labels.
4. Optimize the policy against the preference model with reinforcement learning.
The key insight is that while the model cannot reliably generate harmless responses from scratch, it can reliably compare two responses and identify which is less harmful — a simpler discrimination task.
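This discrimination step can be sketched as a single prompt that presents both candidates and asks for a verdict. The sketch below assumes a generic `model.generate` text-completion API (hypothetical); real RLAIF implementations typically read the log-probabilities of the "A"/"B" answer tokens rather than parsing free-form text:

```python
import random

def ai_preference_label(model, prompt, response_a, response_b, constitution):
    """Ask the feedback model which of two responses better follows a
    randomly sampled constitutional principle.

    `model.generate` is a hypothetical text-completion interface used for
    illustration; it is not a specific library API.
    """
    principle = random.choice(constitution)
    judgment = model.generate(
        f"Consider the following principle: '{principle}'\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better follows the principle? Answer 'A' or 'B'."
    )
    # Take the first character of the reply as the verdict; default to B
    # if the reply does not start with A.
    return "A" if judgment.strip()[:1].upper() == "A" else "B"
```

The resulting labels serve the same role human comparisons play in RLHF: they train the preference model that the RL phase then optimizes against.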
| Phase | Method | Data Source |
| --- | --- | --- |
| Phase 1: SL-CAI | Supervised fine-tuning on critique-revised outputs | AI self-critique + revision |
| Phase 2: RL-CAI | RL with AI preference labels (RLAIF) | AI pairwise comparisons |
CAI represents a paradigm shift from opaque human preferences to explicit written principles for alignment. The constitution can be publicly shared, debated, and revised, enabling democratic oversight of AI behavior. The technique also demonstrates that sufficiently capable models can supervise their own alignment, a form of scalable oversight critical for aligning increasingly powerful systems.