Reinforcement Learning from Human Feedback (RLHF) is a training methodology that aligns large language models with human preferences by using human evaluations as a reward signal. The process transforms subjective human judgments about response quality into learnable optimization objectives, enabling models to develop behaviors like helpfulness, honesty, and harmlessness that are difficult to specify through traditional loss functions. RLHF has been instrumental in shaping the behavior of autonomous agents and conversational agents, and has evolved significantly since its introduction in InstructGPT. As tasks have grown more complex and demanded verifiable reward structures, RLHF has increasingly been complemented, and in some domains superseded, by RLVR approaches.
The original RLHF approach, formalized in OpenAI's InstructGPT (Ouyang et al., 2022), follows three stages:

- Supervised fine-tuning (SFT): fine-tune a pretrained model on human-written demonstrations.
- Reward model training: train a model to score responses using human preference comparisons.
- RL optimization: optimize the SFT policy against the reward model with PPO, penalizing divergence from the reference policy.

The third stage maximizes the KL-regularized objective:
$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}_{(x,y)\sim\pi_\theta}\left[r_\phi(x, y) - \beta \, D_{\text{KL}}\!\left(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)\right)\right]$$
where $\pi_\theta$ is the policy being optimized, $\pi_{\text{ref}}$ is the reference (SFT) policy, and $\beta$ controls the strength of the KL penalty.
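The KL-penalized reward shaping above can be sketched in a few lines. This is a minimal illustration (the function name and the scalar per-sequence KL estimate $\log \pi_\theta - \log \pi_{\text{ref}}$ are simplifications; production PPO implementations typically apply the penalty per token):

```python
import torch

def kl_penalized_reward(reward, policy_logprob, ref_logprob, beta=0.1):
    """Shape the reward-model score with a KL penalty toward the reference policy.

    Uses the per-sample estimator log pi_theta(y|x) - log pi_ref(y|x)
    for the KL term, as is common in PPO-based RLHF.
    """
    kl_estimate = policy_logprob - ref_logprob
    return reward - beta * kl_estimate

# Two responses: reward-model scores and sequence log-probs under each policy
reward = torch.tensor([1.2, -0.3])
policy_lp = torch.tensor([-12.0, -15.0])
ref_lp = torch.tensor([-13.0, -14.5])

shaped = kl_penalized_reward(reward, policy_lp, ref_lp)
print(shaped)  # the first response is penalized for drifting from the reference
```

Note how the penalty cuts the first response's reward (its policy log-prob rose above the reference's) and slightly boosts the second.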
This pipeline demonstrated that subjective qualities such as “helpful,” “accurate,” and “harmless” could be distilled into learnable signals, producing ChatGPT and other aligned models.
The reward model is trained on human preference data using the Bradley-Terry model of pairwise comparisons. Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (dispreferred), the loss is:
$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
where $\sigma$ is the sigmoid function. This trains the reward model to assign higher scores to human-preferred responses.
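The Bradley-Terry loss above maps directly to code. A minimal sketch (the function name and the toy score tensors are illustrative, standing in for outputs of a scalar reward head):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    # Bradley-Terry pairwise loss: -log sigma(r(x, y_w) - r(x, y_l)).
    # logsigmoid is the numerically stable form of log(sigmoid(.)).
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy scalar scores for two preference pairs
chosen = torch.tensor([2.0, 0.5])    # r(x, y_w)
rejected = torch.tensor([1.0, 1.5])  # r(x, y_l)

loss = reward_model_loss(chosen, rejected)
print(f"{loss.item():.4f}")
```

The second pair is mis-ranked (the rejected response scores higher), so it contributes a larger loss term, pushing the model to widen the correct margin.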
DPO (Rafailov et al., 2023) simplifies RLHF by eliminating the explicit reward model. Instead, it directly optimizes the policy on preference data using a loss function that implicitly learns the reward:
$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
Key advantages:

- No separate reward model to train, store, or serve.
- A simple classification-style loss replaces an RL loop with online sampling.
- Fewer hyperparameters and generally more stable training.
DPO has become the preferred alignment method for many open-source model trainers due to its simplicity, though debate continues about whether it matches PPO-based RLHF on the most demanding alignment tasks.
RLAIF (Reinforcement Learning from AI Feedback) replaces human annotators with AI-generated preferences, typically from a stronger model evaluating a weaker model's outputs. This approach:

- Scales preference collection far beyond what human annotation budgets allow.
- Reduces the cost and latency of the labeling pipeline.
- Trades human judgment for the judgment, and potential biases, of the evaluating model.
Google and Anthropic have demonstrated that RLAIF can produce alignment quality competitive with human feedback for many tasks.
Anthropic's Constitutional AI (Bai et al., 2022) extends RLAIF by having the model self-critique its outputs against a written “constitution”, a set of principles defining desired behavior (e.g., “be helpful but avoid harm”). The process:

- The model generates a response, critiques it against the constitution's principles, and revises it; the revised responses are used for supervised fine-tuning.
- AI-generated preference labels, judged against the constitution, then drive an RLAIF training stage.
This method reduces reliance on human labeling while providing explicit, auditable alignment criteria.
Group Relative Policy Optimization (GRPO), used in training DeepSeek R1 (DeepSeek-AI et al., 2025), is a recent advancement that improves RLHF stability and efficiency for large reasoning models. For a prompt $x$, GRPO samples a group of $G$ responses $\{y_1, \ldots, y_G\}$ and computes group-normalized advantages:
$$\hat{A}_i = \frac{r(y_i) - \text{mean}(\{r(y_j)\}_{j=1}^G)}{\text{std}(\{r(y_j)\}_{j=1}^G)}$$
The policy is updated via a clipped surrogate objective:
$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}\hat{A}_i,\;\text{clip}\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}, 1-\epsilon, 1+\epsilon\right)\hat{A}_i\right)\right]$$
Key properties:

- No separate value (critic) network: the group mean serves as the baseline, substantially reducing memory requirements versus PPO.
- Group normalization stabilizes training when reward scales vary widely across prompts.
- The clipped ratio, as in PPO, bounds the size of each policy update.
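The group-normalized advantage and clipped surrogate above can be sketched as follows (function names are illustrative; real GRPO implementations work with per-token log-probs and batch many prompt groups together):

```python
import torch

def group_advantages(rewards, eps=1e-6):
    # Normalize rewards within the group of G responses to one prompt;
    # the group mean acts as the baseline, so no critic network is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Clipped surrogate objective (negated, since optimizers minimize).
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# G = 4 sampled responses to one prompt, scored by a reward function
rewards = torch.tensor([1.0, 0.0, 2.0, 1.0])
adv = group_advantages(rewards)

old_logprobs = torch.tensor([-10.0, -12.0, -9.0, -11.0])
logprobs = old_logprobs + 0.1  # policy has shifted slightly since sampling
loss = grpo_loss(logprobs, old_logprobs, adv)
print(adv, loss.item())
```

Responses above the group mean get positive advantages and are reinforced; those below are suppressed, regardless of the absolute reward scale.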
A critical distinction in modern RLHF concerns what gets evaluated:

- Outcome reward models (ORMs) score only the final answer or end state of a trajectory.
- Process reward models (PRMs) score each intermediate reasoning step, providing denser and more targeted feedback.
PRMs are increasingly used for training chain-of-thought agents and autonomous agents where the quality of the reasoning process matters as much as the outcome.
Reinforcement Learning from Verifiable Rewards (RLVR) is an emerging technique that uses objective, programmatically verifiable signals instead of subjective human preferences:

- Code generation rewarded by unit-test pass rates.
- Mathematical reasoning rewarded by checking final answers against ground truth.
- Constrained or structured outputs verified with rule-based programmatic checks.
RLVR eliminates subjective bias entirely and addresses reward hacking by using ground-truth evaluation, though it is limited to domains where automated verification is possible.
A persistent challenge in RLHF is reward hacking, where agents learn to exploit flaws in the reward model rather than genuinely satisfying human preferences:

- Generating verbose or sycophantic responses that the reward model overrates.
- Exploiting formatting or stylistic quirks the reward model happens to favor.
- Producing confident-sounding but incorrect answers that fool the evaluator.
Mitigation strategies include KL divergence penalties, reward model ensembles, adversarial training, and hybrid approaches combining learned rewards with verifiable signals (RLVR).
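One of these mitigations, reward model ensembles, can be sketched in a few lines. This assumes a minimum-over-members aggregate (a conservative choice; variants such as mean-minus-std are also used):

```python
import torch

def ensemble_reward(scores):
    # scores: (n_models, batch) outputs from an ensemble of reward models.
    # Taking the per-sample minimum is conservative: a response must score
    # well under every member, so exploiting one model's quirks is not enough.
    return scores.min(dim=0).values

# Two reward models scoring the same three responses
scores = torch.tensor([[1.0, 2.0, -0.5],
                       [0.5, 3.0,  0.0]])
robust = ensemble_reward(scores)
print(robust)
```

The second response still scores well because both members agree on it; disagreements are resolved toward the pessimistic member.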
RLHF directly influences how agents behave in production:

- Refusal and safety behavior, i.e., when an agent declines a request.
- Conversational tone, helpfulness, and formatting habits.
- Failure modes such as sycophancy that trace back to the preference data.
Comprehensive educational materials on RLHF have been developed to support practitioners and researchers. A dedicated book and accompanying resources provide technical foundations covering reward modeling, policy gradient algorithms, rejection sampling, and practical implementation. These resources are supported by a website, YouTube lecture series, and open-source codebase, making RLHF techniques more accessible to the broader AI community.
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logprobs_chosen, policy_logprobs_rejected,
             ref_logprobs_chosen, ref_logprobs_rejected, beta=0.1):
    """Compute the Direct Preference Optimization loss.

    Args:
        policy_logprobs_chosen: Log-probs of chosen responses under the policy model.
        policy_logprobs_rejected: Log-probs of rejected responses under the policy model.
        ref_logprobs_chosen: Log-probs of chosen responses under the reference model.
        ref_logprobs_rejected: Log-probs of rejected responses under the reference model.
        beta: Temperature parameter controlling deviation from the reference policy.

    Returns:
        Scalar DPO loss value.
    """
    chosen_rewards = beta * (policy_logprobs_chosen - ref_logprobs_chosen)
    rejected_rewards = beta * (policy_logprobs_rejected - ref_logprobs_rejected)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Simulated log-probabilities for a batch of 4 preference pairs
batch_size = 4
policy_chosen = torch.randn(batch_size, requires_grad=True)
policy_rejected = torch.randn(batch_size, requires_grad=True)
ref_chosen = torch.randn(batch_size)
ref_rejected = torch.randn(batch_size)

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(f"DPO Loss: {loss.item():.4f}")
```