
Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a training methodology that aligns large language models with human preferences by using human evaluations as a reward signal. The process transforms subjective human judgments about response quality into learnable optimization objectives, enabling models to develop behaviors like helpfulness, honesty, and harmlessness that are difficult to specify through traditional loss functions. RLHF has been instrumental in shaping the behavior of conversational and autonomous agents1), and has evolved significantly since its formalization in InstructGPT. As tasks became more complex and required verifiable reward structures, RLHF has gradually been superseded by RLVR approaches2).

graph LR
  subgraph Phase 1: SFT
    PT[Pretrained LLM] --> SFT[Supervised Fine-Tuning]
    HD[Human Demos] --> SFT
  end
  subgraph Phase 2: Reward Model
    SFT --> Gen[Generate Response Pairs]
    Gen --> HP[Human Preferences]
    HP --> RM[Train Reward Model]
  end
  subgraph Phase 3: RL Optimization
    SFT --> Policy[Policy Model]
    RM --> PPO[PPO Training]
    Policy --> PPO
    PPO --> Aligned[Aligned Model]
  end

The Classic RLHF Pipeline

The original RLHF approach, formalized in OpenAI's InstructGPT3), follows three stages:

1. Supervised fine-tuning (SFT): a pretrained model is fine-tuned on human-written demonstrations of desired behavior.
2. Reward model training: human annotators compare pairs of model responses, and a reward model $r_\phi$ is trained to predict their preferences.
3. RL optimization: the SFT policy $\pi_\theta$ is optimized against the reward model with PPO, with a KL penalty keeping it close to the reference policy:

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}_{(x,y)\sim\pi_\theta}\left[r_\phi(x, y) - \beta \, D_{\text{KL}}\!\left(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)\right)\right]$$

where $\pi_\theta$ is the policy being optimized, $\pi_{\text{ref}}$ is the reference (SFT) policy, and $\beta$ controls the strength of the KL penalty.
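As a rough, sequence-level sketch of the quantity being maximized (token-level implementations distribute the KL term across tokens, and the function name and tensor values below are illustrative):

import torch

def kl_penalized_reward(rm_score, policy_logprob, ref_logprob, beta=0.1):
    # Shaped reward: r_phi(x, y) minus beta times an approximate per-sequence KL term,
    # here taken as log pi_theta(y|x) - log pi_ref(y|x) for the sampled response.
    approx_kl = policy_logprob - ref_logprob
    return rm_score - beta * approx_kl

# Hypothetical values for a batch of 3 sampled responses
rm_score = torch.tensor([0.8, -0.2, 1.1])              # reward model scores r_phi(x, y)
policy_logprob = torch.tensor([-42.0, -57.5, -38.2])   # log pi_theta(y|x)
ref_logprob = torch.tensor([-41.0, -58.0, -40.0])      # log pi_ref(y|x)

print(kl_penalized_reward(rm_score, policy_logprob, ref_logprob))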

This pipeline demonstrated that subjective qualities (“helpful,” “accurate,” “harmless”) could be distilled into learnable signals, producing ChatGPT and other aligned models.

The Reward Model

The reward model is trained on human preference data using the Bradley-Terry model of pairwise comparisons. Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (dispreferred), the loss is:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

where $\sigma$ is the sigmoid function. This trains the reward model to assign higher scores to human-preferred responses.
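A minimal PyTorch sketch of this pairwise loss, assuming the reward model's forward pass has already produced a scalar score per prompt-response pair:

import torch
import torch.nn.functional as F

def reward_model_loss(scores_chosen, scores_rejected):
    # Bradley-Terry pairwise loss: -log sigma(r(x, y_w) - r(x, y_l)),
    # pushing preferred responses to score higher than dispreferred ones.
    return -F.logsigmoid(scores_chosen - scores_rejected).mean()

# Simulated scalar scores for a batch of 4 preference pairs
scores_chosen = torch.randn(4, requires_grad=True)
scores_rejected = torch.randn(4, requires_grad=True)

loss = reward_model_loss(scores_chosen, scores_rejected)
loss.backward()
print(f"Reward model loss: {loss.item():.4f}")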

DPO: Direct Preference Optimization

DPO (Rafailov et al., 2023)5) simplifies RLHF by eliminating the explicit reward model. Instead, it directly optimizes the policy on preference data using a loss function that implicitly learns the reward:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

Key advantages:

- No separate reward model has to be trained, stored, or queried during training.
- There is no on-policy sampling or RL loop; training reduces to a supervised-style objective on static preference data (a worked implementation appears in the code example at the end of this article).
- It is typically more stable and cheaper to run than PPO-based RLHF.

DPO has become the preferred alignment method for many open-source model trainers due to its simplicity, though debate continues about whether it matches PPO-based RLHF on the most demanding alignment tasks.

RLAIF: AI-Generated Feedback

RLAIF (Reinforcement Learning from AI Feedback) replaces human annotators with AI-generated preferences, typically from a stronger model evaluating a weaker model's outputs. This approach:

- Scales preference collection far more cheaply and quickly than human annotation.
- Produces more consistent labels, since a single model applies the same criteria across the dataset.
- Lets labeling criteria be revised by editing prompts rather than retraining human annotators.

Google and Anthropic have demonstrated that RLAIF can produce alignment quality competitive with human feedback for many tasks.

Constitutional AI

Anthropic's Constitutional AI (Bai et al., 2022)6) extends RLAIF by having the model self-critique its outputs against a written “constitution”: a set of principles defining desired behavior (e.g., “be helpful but avoid harm”). The process:

- Supervised phase: the model generates responses, critiques them against the constitution's principles, and revises them; the revised responses are used for supervised fine-tuning.
- RL phase: an AI labeler compares pairs of responses against the constitution to produce preference labels, which train a preference model used for reinforcement learning.

This method reduces reliance on human labeling while providing explicit, auditable alignment criteria.

GRPO and DeepSeek's Approach

Group Relative Policy Optimization (GRPO), used in training DeepSeek R1 (DeepSeek-AI et al., 2025)7), is a recent advancement that improves RLHF stability and efficiency for large reasoning models. For a prompt $x$, GRPO samples a group of $G$ responses $\{y_1, \ldots, y_G\}$ and computes group-normalized advantages:

$$\hat{A}_i = \frac{r(y_i) - \text{mean}(\{r(y_j)\}_{j=1}^G)}{\text{std}(\{r(y_j)\}_{j=1}^G)}$$

The policy is updated via a clipped surrogate objective:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}\hat{A}_i,\;\text{clip}\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}, 1-\epsilon, 1+\epsilon\right)\hat{A}_i\right)\right]$$
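A sequence-level sketch of the advantage normalization and clipped loss for a single prompt; real implementations work at the token level and usually add a KL term to a reference policy, both omitted here:

import torch

def grpo_advantages(rewards, eps=1e-8):
    # Group-normalized advantages: (r_i - mean) / std over the G responses.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    # Clipped surrogate objective over a group of responses to the same prompt.
    advantages = grpo_advantages(rewards)
    ratio = torch.exp(logprobs_new - logprobs_old)            # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# A group of G=8 sampled responses with sequence-level log-probs and rewards
G = 8
logprobs_new = torch.randn(G, requires_grad=True)
logprobs_old = logprobs_new.detach() + 0.1 * torch.randn(G)
rewards = torch.randn(G)

loss = grpo_loss(logprobs_new, logprobs_old, rewards)
loss.backward()
print(f"GRPO loss: {loss.item():.4f}")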

Key properties:

- No separate value (critic) network is required; the group mean serves as the baseline, substantially reducing memory cost relative to PPO.
- Advantages are computed relative to other responses to the same prompt, which stabilizes training when reward scales vary across prompts.
- It pairs naturally with rule-based and verifiable rewards, as used in DeepSeek R1's reasoning training.

Process vs. Outcome Reward Models

A critical distinction in modern RLHF concerns what gets evaluated:

- Outcome Reward Models (ORMs) score only the final answer or completed response.
- Process Reward Models (PRMs) score each intermediate step of the reasoning trace, providing denser, more targeted feedback.

PRMs are increasingly used for training chain-of-thought reasoning models and autonomous agents, where the quality of the reasoning process matters as much as the outcome.
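A toy illustration of the difference; the per-step scores and the choice of aggregating process rewards by taking the minimum are illustrative (published systems also use products or sums over steps):

import torch

def outcome_reward(step_scores):
    # ORM-style signal: only the final result is judged (the last step's score here).
    return step_scores[-1]

def process_reward(step_scores):
    # PRM-style signal: every intermediate step is judged; taking the minimum
    # means a single bad step penalizes the whole reasoning trace.
    return step_scores.min()

# Hypothetical per-step scores for a 5-step chain-of-thought trace
step_scores = torch.tensor([0.9, 0.8, 0.2, 0.7, 0.95])
print(f"Outcome reward: {outcome_reward(step_scores):.2f}")  # high: the ending looks fine
print(f"Process reward: {process_reward(step_scores):.2f}")  # low: step 3 was weak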

RLVR: Verifiable Rewards

Reinforcement Learning from Verifiable Rewards (RLVR) is an emerging technique that uses objective, programmatically verifiable signals instead of subjective human preferences:

- Code generation: candidate programs are executed against unit tests, and the pass/fail result provides the reward.
- Mathematics: final answers are checked by exact match or symbolic equivalence.
- Structured output: formats, schemas, and constraints are validated automatically.

Because the signal is ground truth rather than a learned proxy, RLVR avoids subjective labeling bias and is far more resistant to reward hacking, though it is limited to domains where automated verification is possible.
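A minimal example of a verifiable reward for math answers, assuming (purely for illustration) that the model marks its final answer with a "#### " prefix:

import re

def math_verifiable_reward(model_output: str, ground_truth: str) -> float:
    # Binary reward from an automated check: 1.0 if the extracted final answer
    # matches the ground truth exactly, 0.0 otherwise.
    match = re.search(r"####\s*(.+)\s*$", model_output.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(math_verifiable_reward("Adding the two terms gives 42.\n#### 42", "42"))  # 1.0
print(math_verifiable_reward("The total is 41.\n#### 41", "42"))                # 0.0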

Reward Hacking

A persistent challenge in RLHF is reward hacking, where agents learn to exploit flaws in the reward model rather than genuinely satisfying human preferences:

- Length bias: the reward model favors longer, more verbose responses regardless of substance.
- Sycophancy: agreeing with or flattering the user scores higher than being correct.
- Confident-sounding but unsupported claims that preference data tends to rate highly.
- Formatting tricks (lists, headings, citations) that pattern-match to well-rated answers.

Mitigation strategies include KL divergence penalties, reward model ensembles, adversarial training, and hybrid approaches combining learned rewards with verifiable signals (RLVR).
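As one example of these mitigations, a reward-model ensemble can be combined conservatively by penalizing responses the models disagree on; the penalty coefficient and aggregation below are illustrative choices, not a standard recipe:

import torch

def ensemble_reward(scores, disagreement_coef=1.0):
    # scores: (num_models, batch) tensor of each reward model's score per response.
    # Conservative reward: ensemble mean minus a penalty proportional to the
    # ensemble's standard deviation, discouraging responses that only some
    # reward models rate highly (a likely sign of reward hacking).
    return scores.mean(dim=0) - disagreement_coef * scores.std(dim=0)

# Hypothetical scores from 3 reward models for 4 candidate responses
scores = torch.tensor([[0.90, 0.10, 0.70, 0.40],
                       [0.80, 0.20, 0.10, 0.50],
                       [0.95, 0.00, 0.90, 0.45]])
print(ensemble_reward(scores))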

How RLHF Shapes Agent Behavior

RLHF directly influences how agents behave in production:

- Conversational tone, formatting habits, and the degree of hedging or caveating in responses.
- Willingness to follow instructions versus refusing unsafe or out-of-scope requests.
- Failure modes such as sycophancy and excessive verbosity, which trace back to biases in the preference data the reward model was trained on.

Learning Resources

Comprehensive educational materials on RLHF have been developed to support practitioners and researchers. A dedicated book and accompanying resources8) provide technical foundations covering reward modeling, policy gradient algorithms, rejection sampling, and practical implementation. These resources are supported by a website, YouTube lecture series, and open-source codebase, making RLHF techniques more accessible to the broader AI community.

Code Example: DPO Loss Computation

import torch
import torch.nn.functional as F
 
 
def dpo_loss(policy_logprobs_chosen, policy_logprobs_rejected,
             ref_logprobs_chosen, ref_logprobs_rejected, beta=0.1):
    """Compute Direct Preference Optimization loss.
 
    Args:
        policy_logprobs_chosen: Log-probs of chosen responses under policy model.
        policy_logprobs_rejected: Log-probs of rejected responses under policy model.
        ref_logprobs_chosen: Log-probs of chosen responses under reference model.
        ref_logprobs_rejected: Log-probs of rejected responses under reference model.
        beta: Temperature parameter controlling deviation from reference policy.
 
    Returns:
        Scalar DPO loss value.
    """
    chosen_rewards = beta * (policy_logprobs_chosen - ref_logprobs_chosen)
    rejected_rewards = beta * (policy_logprobs_rejected - ref_logprobs_rejected)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
 
 
# Simulated log-probabilities for a batch of 4 preference pairs
batch_size = 4
policy_chosen = torch.randn(batch_size, requires_grad=True)
policy_rejected = torch.randn(batch_size, requires_grad=True)
ref_chosen = torch.randn(batch_size)
ref_rejected = torch.randn(batch_size)
 
loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(f"DPO Loss: {loss.item():.4f}")

See Also

References

1)
Christiano, P. et al. - Deep Reinforcement Learning from Human Preferences (2017). arxiv.org/abs/1706.03741