
Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a training methodology that aligns large language models with human preferences by using human evaluations as a reward signal. The process transforms subjective human judgments about response quality into learnable optimization objectives, enabling models to develop behaviors like helpfulness, honesty, and harmlessness that are difficult to specify through traditional loss functions. RLHF has been instrumental in shaping the behavior of conversational and autonomous agents1), and has evolved significantly since its formalization in InstructGPT. As tasks became more complex and required verifiable reward structures, RLHF has gradually been superseded by RLVR approaches2).

graph LR
  subgraph Phase 1: SFT
    PT[Pretrained LLM] --> SFT[Supervised Fine-Tuning]
    HD[Human Demos] --> SFT
  end
  subgraph Phase 2: Reward Model
    SFT --> Gen[Generate Response Pairs]
    Gen --> HP[Human Preferences]
    HP --> RM[Train Reward Model]
  end
  subgraph Phase 3: RL Optimization
    SFT --> Policy[Policy Model]
    RM --> PPO[PPO Training]
    Policy --> PPO
    PPO --> Aligned[Aligned Model]
  end

The Classic RLHF Pipeline

The original RLHF approach, formalized in OpenAI's InstructGPT3), follows three stages:

1. Supervised fine-tuning (SFT): a pretrained model is fine-tuned on human-written demonstrations of desired behavior.
2. Reward model training: human annotators compare pairs of model responses, and a reward model $r_\phi$ is trained to predict their preferences.
3. RL optimization: the SFT policy $\pi_\theta$ is optimized against the reward model with PPO, with a KL penalty keeping it close to the reference policy:

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}_{(x,y)\sim\pi_\theta}\left[r_\phi(x, y) - \beta \, D_{\text{KL}}\!\left(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)\right)\right]$$

where $\pi_\theta$ is the policy being optimized, $\pi_{\text{ref}}$ is the reference (SFT) policy, and $\beta$ controls the strength of the KL penalty.
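As a rough, sequence-level sketch of the quantity being maximized (token-level implementations distribute the KL term across tokens, and the function name and tensor values below are illustrative):

import torch

def kl_penalized_reward(rm_score, policy_logprob, ref_logprob, beta=0.1):
    # Shaped reward: r_phi(x, y) minus beta times an approximate per-sequence KL term,
    # here taken as log pi_theta(y|x) - log pi_ref(y|x) for the sampled response.
    approx_kl = policy_logprob - ref_logprob
    return rm_score - beta * approx_kl

# Hypothetical values for a batch of 3 sampled responses
rm_score = torch.tensor([0.8, -0.2, 1.1])              # reward model scores r_phi(x, y)
policy_logprob = torch.tensor([-42.0, -57.5, -38.2])   # log pi_theta(y|x)
ref_logprob = torch.tensor([-41.0, -58.0, -40.0])      # log pi_ref(y|x)

print(kl_penalized_reward(rm_score, policy_logprob, ref_logprob))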

This pipeline demonstrated that subjective qualities (“helpful,” “accurate,” “harmless”) could be distilled into learnable signals, producing ChatGPT and other aligned models.

The Reward Model

The reward model is trained on human preference data using the Bradley-Terry model of pairwise comparisons. Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (dispreferred), the loss is:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

where $\sigma$ is the sigmoid function. This trains the reward model to assign higher scores to human-preferred responses.
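A minimal PyTorch sketch of this pairwise loss, assuming the reward model's forward pass has already produced a scalar score per prompt-response pair:

import torch
import torch.nn.functional as F

def reward_model_loss(scores_chosen, scores_rejected):
    # Bradley-Terry pairwise loss: -log sigma(r(x, y_w) - r(x, y_l)),
    # pushing preferred responses to score higher than dispreferred ones.
    return -F.logsigmoid(scores_chosen - scores_rejected).mean()

# Simulated scalar scores for a batch of 4 preference pairs
scores_chosen = torch.randn(4, requires_grad=True)
scores_rejected = torch.randn(4, requires_grad=True)

loss = reward_model_loss(scores_chosen, scores_rejected)
loss.backward()
print(f"Reward model loss: {loss.item():.4f}")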

DPO: Direct Preference Optimization

DPO (Rafailov et al., 2023)5) simplifies RLHF by eliminating the explicit reward model. Instead, it directly optimizes the policy on preference data using a loss function that implicitly learns the reward:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

Key advantages:

- No separate reward model has to be trained, stored, or queried during training.
- There is no on-policy sampling or RL loop; training reduces to a supervised-style objective on static preference data (a worked implementation appears in the code example at the end of this article).
- It is typically more stable and cheaper to run than PPO-based RLHF.

DPO has become the preferred alignment method for many open-source model trainers due to its simplicity, though debate continues about whether it matches PPO-based RLHF on the most demanding alignment tasks.

RLAIF: AI-Generated Feedback

RLAIF (Reinforcement Learning from AI Feedback) replaces human annotators with AI-generated preferences, typically from a stronger model evaluating a weaker model's outputs. This approach:

- Scales preference collection far more cheaply and quickly than human annotation.
- Produces more consistent labels, since a single model applies the same criteria across the dataset.
- Lets labeling criteria be revised by editing prompts rather than retraining human annotators.

Google and Anthropic have demonstrated that RLAIF can produce alignment quality competitive with human feedback for many tasks.

Constitutional AI

Anthropic's Constitutional AI (Bai et al., 2022)6) extends RLAIF by having the model self-critique its outputs against a written “constitution”: a set of principles defining desired behavior (e.g., “be helpful but avoid harm”). The process:

- Supervised phase: the model generates responses, critiques them against the constitution's principles, and revises them; the revised responses are used for supervised fine-tuning.
- RL phase: an AI labeler compares pairs of responses against the constitution to produce preference labels, which train a preference model used for reinforcement learning.

This method reduces reliance on human labeling while providing explicit, auditable alignment criteria.

GRPO and DeepSeek's Approach

Group Relative Policy Optimization (GRPO), used in training DeepSeek R1 (DeepSeek-AI et al., 2025)7), is a recent advancement that improves RLHF stability and efficiency for large reasoning models. For a prompt $x$, GRPO samples a group of $G$ responses $\{y_1, \ldots, y_G\}$ and computes group-normalized advantages:

$$\hat{A}_i = \frac{r(y_i) - \text{mean}(\{r(y_j)\}_{j=1}^G)}{\text{std}(\{r(y_j)\}_{j=1}^G)}$$

The policy is updated via a clipped surrogate objective:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}\hat{A}_i,\;\text{clip}\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}, 1-\epsilon, 1+\epsilon\right)\hat{A}_i\right)\right]$$
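A sequence-level sketch of the advantage normalization and clipped loss for a single prompt; real implementations work at the token level and usually add a KL term to a reference policy, both omitted here:

import torch

def grpo_advantages(rewards, eps=1e-8):
    # Group-normalized advantages: (r_i - mean) / std over the G responses.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    # Clipped surrogate objective over a group of responses to the same prompt.
    advantages = grpo_advantages(rewards)
    ratio = torch.exp(logprobs_new - logprobs_old)            # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# A group of G=8 sampled responses with sequence-level log-probs and rewards
G = 8
logprobs_new = torch.randn(G, requires_grad=True)
logprobs_old = logprobs_new.detach() + 0.1 * torch.randn(G)
rewards = torch.randn(G)

loss = grpo_loss(logprobs_new, logprobs_old, rewards)
loss.backward()
print(f"GRPO loss: {loss.item():.4f}")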

Key properties:

- No separate value (critic) network is required; the group mean serves as the baseline, substantially reducing memory cost relative to PPO.
- Advantages are computed relative to other responses to the same prompt, which stabilizes training when reward scales vary across prompts.
- It pairs naturally with rule-based and verifiable rewards, as used in DeepSeek R1's reasoning training.

Process vs. Outcome Reward Models

A critical distinction in modern RLHF concerns what gets evaluated:

- Outcome Reward Models (ORMs) score only the final answer or completed response.
- Process Reward Models (PRMs) score each intermediate step of the reasoning trace, providing denser, more targeted feedback.

PRMs are increasingly used for training chain-of-thought reasoning models and autonomous agents, where the quality of the reasoning process matters as much as the outcome.
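A toy illustration of the difference; the per-step scores and the choice of aggregating process rewards by taking the minimum are illustrative (published systems also use products or sums over steps):

import torch

def outcome_reward(step_scores):
    # ORM-style signal: only the final result is judged (the last step's score here).
    return step_scores[-1]

def process_reward(step_scores):
    # PRM-style signal: every intermediate step is judged; taking the minimum
    # means a single bad step penalizes the whole reasoning trace.
    return step_scores.min()

# Hypothetical per-step scores for a 5-step chain-of-thought trace
step_scores = torch.tensor([0.9, 0.8, 0.2, 0.7, 0.95])
print(f"Outcome reward: {outcome_reward(step_scores):.2f}")  # high: the ending looks fine
print(f"Process reward: {process_reward(step_scores):.2f}")  # low: step 3 was weak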

RLVR: Verifiable Rewards

Reinforcement Learning from Verifiable Rewards (RLVR) is an emerging technique that uses objective, programmatically verifiable signals instead of subjective human preferences:

- Code generation: candidate programs are executed against unit tests, and the pass/fail result provides the reward.
- Mathematics: final answers are checked by exact match or symbolic equivalence.
- Structured output: formats, schemas, and constraints are validated automatically.

Because the signal is ground truth rather than a learned proxy, RLVR avoids subjective labeling bias and is far more resistant to reward hacking, though it is limited to domains where automated verification is possible.
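A minimal example of a verifiable reward for math answers, assuming (purely for illustration) that the model marks its final answer with a "#### " prefix:

import re

def math_verifiable_reward(model_output: str, ground_truth: str) -> float:
    # Binary reward from an automated check: 1.0 if the extracted final answer
    # matches the ground truth exactly, 0.0 otherwise.
    match = re.search(r"####\s*(.+)\s*$", model_output.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(math_verifiable_reward("Adding the two terms gives 42.\n#### 42", "42"))  # 1.0
print(math_verifiable_reward("The total is 41.\n#### 41", "42"))                # 0.0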

Reward Hacking

A persistent challenge in RLHF is reward hacking, where agents learn to exploit flaws in the reward model rather than genuinely satisfying human preferences:

- Length bias: the reward model favors longer, more verbose responses regardless of substance.
- Sycophancy: agreeing with or flattering the user scores higher than being correct.
- Confident-sounding but unsupported claims that preference data tends to rate highly.
- Formatting tricks (lists, headings, citations) that pattern-match to well-rated answers.

Mitigation strategies include KL divergence penalties, reward model ensembles, adversarial training, and hybrid approaches combining learned rewards with verifiable signals (RLVR).
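As one example of these mitigations, a reward-model ensemble can be combined conservatively by penalizing responses the models disagree on; the penalty coefficient and aggregation below are illustrative choices, not a standard recipe:

import torch

def ensemble_reward(scores, disagreement_coef=1.0):
    # scores: (num_models, batch) tensor of each reward model's score per response.
    # Conservative reward: ensemble mean minus a penalty proportional to the
    # ensemble's standard deviation, discouraging responses that only some
    # reward models rate highly (a likely sign of reward hacking).
    return scores.mean(dim=0) - disagreement_coef * scores.std(dim=0)

# Hypothetical scores from 3 reward models for 4 candidate responses
scores = torch.tensor([[0.90, 0.10, 0.70, 0.40],
                       [0.80, 0.20, 0.10, 0.50],
                       [0.95, 0.00, 0.90, 0.45]])
print(ensemble_reward(scores))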

How RLHF Shapes Agent Behavior

RLHF directly influences how agents behave in production:

- Conversational tone, formatting habits, and the degree of hedging or caveating in responses.
- Willingness to follow instructions versus refusing unsafe or out-of-scope requests.
- Failure modes such as sycophancy and excessive verbosity, which trace back to biases in the preference data the reward model was trained on.

Learning Resources

Comprehensive educational materials on RLHF have been developed to support practitioners and researchers. A dedicated book and accompanying resources8) provide technical foundations covering reward modeling, policy gradient algorithms, rejection sampling, and practical implementation. These resources are supported by a website, YouTube lecture series, and open-source codebase, making RLHF techniques more accessible to the broader AI community.

Code Example: DPO Loss Computation

import torch
import torch.nn.functional as F
 
 
def dpo_loss(policy_logprobs_chosen, policy_logprobs_rejected,
             ref_logprobs_chosen, ref_logprobs_rejected, beta=0.1):
    """Compute Direct Preference Optimization loss.
 
    Args:
        policy_logprobs_chosen: Log-probs of chosen responses under policy model.
        policy_logprobs_rejected: Log-probs of rejected responses under policy model.
        ref_logprobs_chosen: Log-probs of chosen responses under reference model.
        ref_logprobs_rejected: Log-probs of rejected responses under reference model.
        beta: Temperature parameter controlling deviation from reference policy.
 
    Returns:
        Scalar DPO loss value.
    """
    chosen_rewards = beta * (policy_logprobs_chosen - ref_logprobs_chosen)
    rejected_rewards = beta * (policy_logprobs_rejected - ref_logprobs_rejected)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
 
 
# Simulated log-probabilities for a batch of 4 preference pairs
batch_size = 4
policy_chosen = torch.randn(batch_size, requires_grad=True)
policy_rejected = torch.randn(batch_size, requires_grad=True)
ref_chosen = torch.randn(batch_size)
ref_rejected = torch.randn(batch_size)
 
loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(f"DPO Loss: {loss.item():.4f}")

See Also

References

1)
Christiano, P. et al. - Deep Reinforcement Learning from Human Preferences (2017). arxiv.org/abs/1706.03741