**Direct Preference Optimization (DPO)**, introduced by Rafailov et al. (2023), is an alignment method that fine-tunes language models directly on human preference data using a simple classification loss. DPO eliminates the need for an explicit reward model and reinforcement learning, instead deriving a closed-form mapping from the Bradley-Terry preference model to the optimal policy.
| + | |||
| + | |||
| + | < | ||
| + | graph TD | ||
| + | PREF[Human Preferences] --> RLHF_PATH[RLHF Path] | ||
| + | PREF --> DPO_PATH[DPO Path] | ||
| + | RLHF_PATH --> RM[Train Reward Model] | ||
| + | RM --> PPO[PPO Optimization] | ||
| + | PPO --> POL1[Aligned Policy] | ||
| + | DPO_PATH --> DIRECT[Direct Optimization] | ||
| + | DIRECT --> POL2[Aligned Policy] | ||
| + | </ | ||
===== Motivation =====