====== Constitutional AI ======

**Constitutional AI (CAI)**, introduced by Bai et al. (2022) at Anthropic, is a training methodology that aligns language models to be helpful, harmless, and honest by using a set of explicit written principles (a "constitution") to guide AI-generated feedback, rather than relying on human-labeled harmfulness data.
| + | |||
| + | |||
| + | < | ||
| + | graph TD | ||
| + | A[Generate Response] --> B[Self-Critique] | ||
| + | B --> C[Apply Constitutional Principle] | ||
| + | C --> D[Revise Response] | ||
| + | D --> E{More Rounds?} | ||
| + | E -->|Yes| B | ||
| + | E -->|No| F[SL Fine-Tuning Dataset] | ||
| + | F --> G[RLAIF Training] | ||
| + | G --> H[Aligned Model] | ||
| + | </ | ||
| + | |||

===== Motivation =====

Standard RLHF requires large volumes of human preference labels, which are costly to collect, difficult to audit, and expose annotators to harmful content. CAI replaces the human harmlessness labels with AI feedback steered by an explicit set of written principles, so human oversight enters through the constitution rather than through per-example annotation.

===== The Constitution =====

The constitution is a set of human-written principles drawn from sources such as:

  * The Universal Declaration of Human Rights
  * Apple's terms of service
  * Principles emphasizing helpfulness, honesty, and harmlessness
  * Research-specific principles written by the authors

Each principle provides natural-language instructions that the model uses to critique and revise its own outputs, which makes the alignment criteria transparent and auditable.
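
In the paper, each principle is operationalized as a pair of natural-language prompts: one requesting a critique and one requesting a revision. A minimal Python sketch of this structure follows; the wording is paraphrased, not the paper's exact text.

```python
import random

# Each principle pairs a critique prompt with a matching revision prompt,
# following the structure used in the CAI paper (wording paraphrased).
CONSTITUTION = [
    {
        "critique": "Identify specific ways in which the response is harmful, "
                    "unethical, racist, sexist, toxic, dangerous, or illegal.",
        "revision": "Rewrite the response to remove any harmful, unethical, "
                    "racist, sexist, toxic, dangerous, or illegal content.",
    },
    {
        "critique": "Explain whether the response is evasive or unhelpful, "
                    "and what a more helpful answer would include.",
        "revision": "Rewrite the response to be more helpful while remaining harmless.",
    },
]

def sample_principle(constitution=CONSTITUTION):
    """Sample one (critique, revision) prompt pair, as done each revision round."""
    return random.choice(constitution)

pair = sample_principle()
print(sorted(pair.keys()))  # -> ['critique', 'revision']
```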

===== The Critique-Revise Loop =====

The first training phase uses **supervised learning on self-revised outputs**:

  - **Generate**: The model produces an initial response to a prompt, typically one designed to elicit harmful output
  - **Critique**: The model critiques its own response against a randomly sampled constitutional principle
  - **Revise**: The model generates an improved response incorporating the critique
  - **Iterate**: Steps 2-3 repeat for multiple rounds, with a different principle sampled each time

The final revised responses form a supervised fine-tuning dataset, and fine-tuning on it yields the SL-CAI model.

<code python>
import random

# Simplified sketch of the critique-revise loop
# (illustrative prompts, not the exact wording from the paper)
def critique_revise(model, prompt, constitution, num_rounds):
    response = model.generate(prompt)
    for _ in range(num_rounds):
        principle = random.choice(constitution)
        # Critique step: ask the model to find violations of the principle
        critique = model.generate(
            f"Critique this response according to the principle: "
            f"'{principle}'\n\nResponse: {response}"
        )
        # Revise step: ask the model to rewrite the response accordingly
        response = model.generate(
            f"Revise this response based on the critique: "
            f"'{critique}'\n\nResponse: {response}"
        )
    return response
</code>

===== RLAIF Training =====

The second phase replaces human preference labels with AI-generated ones (Reinforcement Learning from AI Feedback):

  - **Generate response pairs**: The SL-CAI model produces two candidate responses for each prompt
  - **AI evaluation**: The model judges which of the two responses better satisfies a sampled constitutional principle
  - **Train reward model**: A preference model is trained on these AI-generated comparisons
  - **RL optimization**: The policy model is fine-tuned via PPO (Proximal Policy Optimization) against the AI-trained reward model

The key insight is that while the model cannot reliably avoid producing harmful content on its own, it can reliably recognize and judge such content when evaluating against an explicit principle.
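
The AI preference-labeling step can be sketched as below. Here `model.generate` is a hypothetical stand-in for any text-generation call, and the comparison prompt is paraphrased from the paper's multiple-choice format, not quoted from it.

```python
import random

def ai_preference_label(model, prompt, response_a, response_b, principle):
    """Ask the feedback model which response better follows the principle.

    Returns a (chosen, rejected) pair for reward-model training.
    `model` is assumed to expose a .generate(text) -> str method (illustrative).
    """
    question = (
        f"Consider this conversation:\n{prompt}\n\n"
        f"Which response better follows this principle: '{principle}'?\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    )
    answer = model.generate(question).strip()
    if answer.startswith("A"):
        return response_a, response_b
    return response_b, response_a

def build_comparison_dataset(model, prompts, constitution):
    """Collect AI-labeled (chosen, rejected) pairs across a batch of prompts."""
    dataset = []
    for prompt in prompts:
        # Two independent samples from the SL-CAI model
        a, b = model.generate(prompt), model.generate(prompt)
        principle = random.choice(constitution)
        dataset.append(ai_preference_label(model, prompt, a, b, principle))
    return dataset
```

The resulting (chosen, rejected) pairs play exactly the role that human comparison labels play in standard RLHF reward-model training.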

===== Two-Phase Training Summary =====

^ Phase ^ Method ^ Data source ^
| Phase 1: SL-CAI | Supervised fine-tuning on critique-revised outputs | Self-generated revisions |
| Phase 2: RL-CAI | PPO against a reward model trained on AI comparisons | AI preference labels |

===== Key Results =====

  * CAI models are rated as less harmful than RLHF-trained baselines at comparable helpfulness, despite using no human harmlessness labels
  * Helpfulness is preserved: CAI models do not become evasive or unhelpful
  * Models engage with harmful requests by explaining their objections rather than refusing outright
  * The approach reduces the human labeling burden to writing the constitution itself
  * Constitutional principles make the training objective explicit and easy to revise

===== Mathematical Framework =====

The RLAIF objective mirrors standard RLHF. Given a reward model $r_\phi$ trained on AI preferences, the policy $\pi_\theta$ is optimized as:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[ r_\phi(x, y) \right] - \beta \, \text{KL}\left[ \pi_\theta \,\|\, \pi_{\text{ref}} \right]$$

where $\beta$ controls the strength of the KL penalty against the reference policy $\pi_{\text{ref}}$. The only difference from RLHF is that $r_\phi$ is trained on AI-generated labels rather than human labels.

===== Significance =====

CAI represents a shift from //opaque human preferences// to //explicit written principles// as the source of alignment signal, making the criteria inspectable, auditable, and cheap to revise.
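
As a rough numeric illustration of the KL-regularized RLAIF objective, a batch estimate can be computed from rewards and per-sequence log-probabilities, using the standard single-sample KL estimator $\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x)$. All numbers below are invented for illustration.

```python
def rlaif_objective(rewards, logp_policy, logp_ref, beta=0.1):
    """Monte Carlo estimate of E[r(x, y)] - beta * KL(pi_theta || pi_ref).

    KL is approximated per sample as log pi_theta(y|x) - log pi_ref(y|x),
    averaged over the batch (a standard single-sample estimator).
    """
    n = len(rewards)
    mean_reward = sum(rewards) / n
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref)) / n
    return mean_reward - beta * kl_estimate

# Invented toy numbers: 3 sampled responses with their rewards and log-probs
obj = rlaif_objective(
    rewards=[1.2, 0.8, 1.0],
    logp_policy=[-10.0, -12.0, -11.0],
    logp_ref=[-10.5, -12.2, -11.1],
)
print(round(obj, 4))  # -> 0.9733
```

A higher $\beta$ would pull the estimate further below the mean reward whenever the policy drifts from the reference model.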

===== References =====

  * [[https://arxiv.org/abs/2212.08073|Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback]]
  * [[https://arxiv.org/abs/2204.05862|Bai et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback]]

===== See Also =====

  * [[direct_preference_optimization|Direct Preference Optimization]]
  * [[reward_overoptimization|Reward Overoptimization]]
  * [[self_refine|Self-Refine]]