====== Constitutional AI ======
**Constitutional AI (CAI)**, introduced by Bai et al. (2022) at Anthropic, is a training methodology that aligns language models to be helpful, harmless, and honest using a set of written principles (a "constitution") and AI-generated feedback. CAI replaces human harmlessness labels with **Reinforcement Learning from AI Feedback (RLAIF)**, enabling scalable alignment without exposing human annotators to harmful content.
<code>
graph TD
    A[Generate Response] --> B[Self-Critique]
    B --> C[Apply Constitutional Principle]
    C --> D[Revise Response]
    D --> E{More Rounds?}
    E -->|Yes| B
    E -->|No| F[SL Fine-Tuning Dataset]
    F --> G[RLAIF Training]
    G --> H[Aligned Model]
</code>
===== Motivation =====
Standard RLHF requires human annotators to label which model outputs are more or less harmful --- a process that is expensive, subjective, and psychologically taxing for labelers. CAI asks: can the model itself judge harmfulness, guided by explicit principles? This makes the alignment criteria //transparent// and //auditable// rather than implicit in crowd-worker judgments.
===== The Constitution =====
The constitution is a set of human-written normative principles that define desired model behavior. These principles are drawn from diverse sources:
* The Universal Declaration of Human Rights
* Apple's Terms of Service (as an example of corporate guidelines)
* Principles emphasizing helpfulness, honesty, and harm avoidance
* Research-specific norms (e.g., "Choose the response that is least likely to be used for illegal activity")
Each principle provides a concrete criterion the model uses to evaluate its own outputs. The constitution is fully transparent --- users can inspect exactly what rules govern the model's behavior.
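As a minimal sketch, a constitution can be represented as a plain list of principle strings that later training code samples from. Only the first principle below is quoted from the text above; the others are illustrative paraphrases in the same spirit, not verbatim entries from the published constitution.

```python
# Illustrative constitution: a small list of principle strings.
# Only the first entry is quoted from this article; the rest are
# paraphrases for illustration, not the paper's exact wording.
CONSTITUTION = [
    "Choose the response that is least likely to be used for illegal activity.",
    "Choose the response that is most helpful, honest, and harmless.",
    "Choose the response least likely to encourage violence or hatred.",
]
```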
===== The Critique-Revise Loop =====
The first training phase uses **supervised learning on self-revised outputs**:
- **Generate**: The model produces a response to a potentially harmful prompt
- **Critique**: The same model critiques its response against a randomly sampled constitutional principle (e.g., "Identify specific ways this response is harmful or unethical")
- **Revise**: The model generates an improved response incorporating the critique
- **Iterate**: Steps 2-3 repeat with different principles for multiple rounds
The final revised responses form a supervised fine-tuning dataset, replacing the need for human-written "ideal" responses to harmful prompts.
<code python>
import random

# Simplified CAI critique-revise loop
def critique_revise(model, prompt, constitution, num_rounds=3):
    # Initial draft response to the (potentially harmful) prompt
    response = model.generate(prompt)
    for _ in range(num_rounds):
        # Sample a constitutional principle and self-critique against it
        principle = random.choice(constitution)
        critique = model.generate(
            f"Critique this response according to the principle: "
            f"'{principle}'\n\nResponse: {response}"
        )
        # Revise the response in light of the critique
        response = model.generate(
            f"Revise this response based on the critique: "
            f"'{critique}'\n\nOriginal: {response}"
        )
    return response
</code>
===== RLAIF Training =====
The second phase replaces RLHF with **Reinforcement Learning from AI Feedback**:
- **Generate preference pairs**: For each prompt, the model generates multiple candidate responses
- **AI evaluation**: The model scores which response better adheres to constitutional principles, producing preference labels
- **Train reward model**: A reward model is trained on AI-generated preference data (instead of human labels)
- **RL optimization**: The policy model is fine-tuned via PPO against the AI-trained reward model
The key insight is that while the model cannot reliably //generate// harmless responses from scratch, it can reliably //compare// two responses and identify which is less harmful --- a simpler discrimination task.
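The preference-labeling step can be sketched as follows, assuming the same hypothetical `model.generate` interface as the critique-revise example above. The `label_preferences` helper and the comparison prompt wording are illustrative, not taken from the paper.

```python
import random

def label_preferences(model, prompts, constitution, n_candidates=2):
    """Sketch of AI preference labeling for reward-model training.

    For each prompt, sample candidate responses, then ask the model
    which candidate better satisfies a sampled constitutional principle.
    Returns (prompt, chosen, rejected) records for reward-model training.
    """
    pairs = []
    for prompt in prompts:
        # Sample two candidate responses to compare
        a, b = (model.generate(prompt) for _ in range(n_candidates))
        principle = random.choice(constitution)
        # Discrimination task: pick the response that better follows the principle
        choice = model.generate(
            f"Consider this principle: '{principle}'\n"
            f"Which response better follows it?\n"
            f"(A) {a}\n(B) {b}\n"
            f"Answer with A or B."
        )
        chosen, rejected = (a, b) if choice.strip().startswith("A") else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting records take the place of human preference labels when fitting the reward model used for PPO.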
===== Two-Phase Training Summary =====
^ Phase ^ Method ^ Data Source ^
| Phase 1: SL-CAI | Supervised fine-tuning on critique-revised outputs | AI self-critique + revision |
| Phase 2: RL-CAI | RL with AI preference labels (RLAIF) | AI pairwise comparisons |
===== Key Results =====
* CAI models achieve harmlessness comparable to or exceeding RLHF models trained with human labels
* Helpfulness is preserved --- CAI models do not become evasive or unhelpful
* Models provide substantive explanations for refusals rather than generic deflections
* The approach scales without proportional increases in human annotation effort
* Constitutional principles are auditable, making alignment criteria transparent
===== Significance =====
CAI represents a paradigm shift from //opaque human preferences// to //explicit written principles// for alignment. The constitution can be publicly shared, debated, and revised, enabling democratic oversight of AI behavior. The technique also demonstrates that sufficiently capable models can supervise their own alignment, a form of **scalable oversight** critical for aligning increasingly powerful systems.
===== References =====
* [[https://arxiv.org/abs/2212.08073|Bai et al. "Constitutional AI: Harmlessness from AI Feedback" (2022)]]
* [[https://arxiv.org/abs/2204.05862|Bai et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (2022)]]
* [[https://arxiv.org/abs/2009.01325|Stiennon et al. "Learning to Summarize from Human Feedback" (2020)]]
===== See Also =====
* [[direct_preference_optimization|Direct Preference Optimization (DPO)]]
* [[reward_overoptimization|Reward Overoptimization]]
* [[self_refine|Self-Refine]]