Constitutional AI (CAI), introduced by Bai et al. (2022) at Anthropic, is a training methodology that aligns language models to be helpful and harmless using a set of explicit principles (a “constitution”) combined with AI-generated feedback. The key innovation is replacing human harmlessness labels with Reinforcement Learning from AI Feedback (RLAIF), where the model critiques and revises its own outputs according to constitutional principles. This eliminates the need for human annotators to evaluate harmful content while producing models that are both safer and more transparent in their refusals.
The constitution is a curated set of normative principles that define acceptable model behavior, drawn from diverse sources.
Each principle provides an explicit, human-readable rule such as: “Choose the response that is least likely to be used for illegal or harmful activities” or “Choose the response that most supports the autonomy and freedom of the user.”
The constitution makes the alignment criteria transparent and auditable, unlike RLHF, where human preferences are opaque and potentially inconsistent.
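In code, a constitution can be represented as a simple list of principle strings that later stages sample from. A minimal sketch using the two example principles quoted above (real constitutions contain many more principles):

```python
# The constitution as plain data: one human-readable rule per entry.
constitution = [
    "Choose the response that is least likely to be used for "
    "illegal or harmful activities",
    "Choose the response that most supports the autonomy and "
    "freedom of the user",
]
```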
The first training phase (supervised learning) generates improved data through an iterative self-critique loop: the model first produces a response to a prompt, is then asked to critique that response against a randomly sampled constitutional principle, and finally revises the response to address the critique.
This critique-revise loop can be repeated multiple times, with different principles sampled each iteration. The final revised responses, paired with the original prompts, form a supervised fine-tuning dataset. The model is then fine-tuned on these (prompt, revised-response) pairs.
# Simplified Constitutional AI critique-revise loop
import random

def constitutional_critique_revise(model, prompt, constitution, n_revisions=3):
    # Initial response to the (possibly harmful) prompt
    response = model.generate(prompt)
    for _ in range(n_revisions):
        # Sample a different constitutional principle each iteration
        principle = random.choice(constitution)
        # Critique step: the model evaluates its own response
        critique = model.generate(
            f"Given the principle: '{principle}' "
            f"Critique this response: {response}"
        )
        # Revise step: the model rewrites the response to address the critique
        response = model.generate(
            f"Based on this critique: {critique} "
            f"Revise the response to better align with: '{principle}' "
            f"Original response: {response}"
        )
    return response
The second phase uses reinforcement learning but replaces human preference labels with AI-generated ones: the fine-tuned model generates pairs of responses, a feedback model selects the response that better satisfies a sampled principle, a reward model is trained on these AI preferences, and the policy is then optimized against it with reinforcement learning.
The preference labeling uses a format like: “Consider the principle: [principle]. Which response is better? (A) or (B)”
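The labeling step can be sketched as follows. This is a hedged sketch: `feedback_model` is a hypothetical callable standing in for an LLM prompted in the format above, and the helper names are illustrative rather than from the paper:

```python
import random

def build_preference_prompt(principle, response_a, response_b):
    """Format the A/B comparison prompt used for AI feedback."""
    return (
        f"Consider the principle: {principle}. "
        f"Which response is better? "
        f"(A) {response_a} (B) {response_b}"
    )

def make_preference_dataset(feedback_model, constitution, pairs):
    """Label (prompt, response_a, response_b) triples with AI preferences.

    `feedback_model` is assumed to return 'A' or 'B'; a random principle
    is sampled for each comparison, mirroring the critique-revise phase.
    """
    dataset = []
    for prompt, a, b in pairs:
        principle = random.choice(constitution)
        choice = feedback_model(build_preference_prompt(principle, a, b))
        chosen, rejected = (a, b) if choice == "A" else (b, a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
```

The resulting (chosen, rejected) pairs play exactly the role human comparisons play in RLHF: they train the reward model used in the RL stage.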
The key differences between RLAIF and traditional RLHF:
| Aspect | RLHF | RLAIF (Constitutional AI) |
|---|---|---|
| Feedback source | Human annotators | AI self-evaluation |
| Harmlessness labels | Required from humans | Generated by AI |
| Transparency | Implicit in annotator judgment | Explicit in constitution |
| Scalability | Limited by human annotation cost | Scales with compute |
| Worker exposure | Annotators see harmful content | No human exposure needed |
The RLAIF objective mirrors standard RLHF. Given a reward model $r_\phi$ trained on AI preferences, the policy $\pi_\theta$ is optimized:
$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) \right] - \beta \, \text{KL}\left[ \pi_\theta \,\|\, \pi_{\text{ref}} \right]$$
where $\beta$ controls the KL penalty against the reference policy $\pi_{\text{ref}}$. The difference is that $r_\phi$ is trained on AI-generated labels rather than human labels.
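As a toy illustration of this objective over a small discrete action space (hypothetical numbers; a real implementation estimates the KL penalty per token from sampled trajectories):

```python
import math

def kl_divergence(p, q):
    """KL[p || q] for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlaif_objective(policy, reference, rewards, beta):
    """Expected reward under `policy`, minus the beta-weighted KL penalty
    that keeps the tuned policy close to the reference policy."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)
```

Raising `beta` makes deviations from the reference policy more costly, so the optimum shifts back toward $\pi_{\text{ref}}$ even when the reward model favors a different response distribution.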