====== Reinforcement Learning from Human Feedback ======

Reinforcement Learning from Human Feedback (RLHF) is a training methodology that aligns large language models with human preferences by using human evaluations as a reward signal. The process transforms subjective human judgments about response quality into learnable optimization objectives, enabling models to develop behaviors like helpfulness, honesty, and harmlessness that are difficult to specify through traditional loss functions. RLHF has been instrumental in shaping the behavior of [[autonomous_agents|autonomous agents]] and [[conversational_agents|conversational agents]](([[https://arxiv.org/abs/1706.03741|Christiano, P. et al. - Deep Reinforcement Learning from Human Preferences (2017)]])), and has evolved significantly since it was popularized by InstructGPT. As tasks became more complex and required verifiable reward structures, RLHF has gradually been superseded by RLVR approaches(([[https://www.interconnects.ai/p/reading-todays-open-closed-performance|Interconnects - RLHF (Reinforcement Learning from Human Feedback)]])).

<code>
graph LR
  subgraph Phase 1: SFT
    PT[Pretrained LLM] --> SFT[Supervised Fine-Tuning]
    HD[Human Demos] --> SFT
  end
  subgraph Phase 2: Reward Model
    SFT --> Gen[Generate Response Pairs]
    Gen --> HP[Human Preferences]
    HP --> RM[Train Reward Model]
  end
  subgraph Phase 3: RL Optimization
    SFT --> Policy[Policy Model]
    RM --> PPO[PPO Training]
    Policy --> PPO
    PPO --> Aligned[Aligned Model]
  end
</code>

===== The Classic RLHF Pipeline =====

The original RLHF approach, formalized in [[openai|OpenAI]]'s InstructGPT(([[https://arxiv.org/abs/2203.02155|Ouyang, L. et al. - Training Language Models to Follow Instructions with Human Feedback (2022)]])), follows three stages:

  * **Supervised Fine-Tuning (SFT)**: A pre-trained LLM is fine-tuned on human-written prompt-response pairs to establish baseline instruction-following behavior.
  * **Reward Model Training**: Human annotators compare pairs of model outputs and indicate which is preferred. These preference rankings train a separate reward model $r_\phi(x, y)$ that predicts a scalar score representing human approval for prompt $x$ and response $y$.
  * **RL Optimization with PPO**: The SFT model is further optimized using Proximal Policy Optimization (PPO)(([[https://arxiv.org/abs/1707.06347|Schulman, J. et al. - Proximal Policy Optimization Algorithms (2017)]])) to maximize the expected reward from the trained reward model while staying close to the SFT distribution via a KL penalty.

The PPO objective is:

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}_{(x,y)\sim\pi_\theta}\left[r_\phi(x, y) - \beta \, D_{\text{KL}}\!\left(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)\right)\right]$$

where $\pi_\theta$ is the policy being optimized, $\pi_{\text{ref}}$ is the reference (SFT) policy, and $\beta$ controls the strength of the KL penalty.

This pipeline demonstrated that subjective qualities such as "helpful," "accurate," and "harmless" could be distilled into learnable signals, producing [[chatgpt|ChatGPT]] and other aligned models.

===== The Reward Model =====

The reward model is trained on human preference data using the Bradley-Terry model of pairwise comparisons. Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (dispreferred), the loss is:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

where $\sigma$ is the sigmoid function. This trains the reward model to assign higher scores to human-preferred responses.
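The pairwise objective above translates directly into a few lines of code. The following is a minimal sketch of the Bradley-Terry reward-model loss, assuming the reward model has already mapped each (prompt, response) pair to a scalar score; the function name and the toy scores are illustrative, not taken from any particular library.

<code python>
import torch
import torch.nn.functional as F

def reward_model_loss(scores_preferred, scores_dispreferred):
    """Bradley-Terry pairwise loss for reward model training (illustrative sketch).

    Args:
        scores_preferred: Scalar scores r_phi(x, y_w) for preferred responses, shape (batch,).
        scores_dispreferred: Scalar scores r_phi(x, y_l) for dispreferred responses, shape (batch,).

    Returns:
        Scalar loss: -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch.
    """
    return -F.logsigmoid(scores_preferred - scores_dispreferred).mean()

# Simulated scores for a batch of 3 preference pairs
scores_w = torch.tensor([1.2, 0.4, 2.0], requires_grad=True)
scores_l = torch.tensor([0.3, 0.9, -0.5])

loss = reward_model_loss(scores_w, scores_l)
loss.backward()
print(f"Reward model loss: {loss.item():.4f}")
</code>

Minimizing this loss pushes the score of the preferred response above that of the dispreferred one, which is exactly the signal the PPO stage later maximizes.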
===== DPO: Direct Preference Optimization =====

DPO ([[https://arxiv.org/abs/2305.18290|Rafailov et al., 2023]])(([[https://arxiv.org/abs/2305.18290|Rafailov, R. et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv:2305.18290, 2023.]])) simplifies RLHF by eliminating the explicit reward model. Instead, it directly optimizes the policy on preference data using a loss function that implicitly learns the reward:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

Key advantages:

  * No separate reward model training required
  * Eliminates PPO instability and hyperparameter sensitivity
  * Computationally cheaper and simpler to implement
  * Produces comparable alignment quality to full RLHF in many settings

DPO has become the preferred alignment method for many open-source model trainers due to its simplicity, though debate continues about whether it matches PPO-based RLHF on the most demanding alignment tasks.

===== RLAIF: AI-Generated Feedback =====

RLAIF ([[reinforcement_learning|Reinforcement Learning]] from AI Feedback) replaces human annotators with AI-generated preferences, typically from a stronger model evaluating a weaker model's outputs. This approach:

  * Scales feedback generation dramatically compared to human annotation
  * Reduces costs by 90%+ while approximating human preference quality
  * Enables continuous alignment iteration without human bottlenecks
  * Can incorporate structured evaluation criteria that humans might apply inconsistently

[[google|Google]] and [[anthropic|Anthropic]] have demonstrated that RLAIF can produce alignment quality competitive with human feedback for many tasks.

===== Constitutional AI =====

[[anthropic|Anthropic]]'s Constitutional AI ([[https://arxiv.org/abs/2212.08073|Bai et al., 2022]])(([[https://arxiv.org/abs/2212.08073|Bai, Y. et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073, 2022.]])) extends RLAIF by having the model self-critique its outputs against a written "constitution": a set of principles defining desired behavior (e.g., "be helpful but avoid harm"). The process:

  * The model generates responses, then critiques them against constitutional principles
  * Revised responses are generated incorporating the critique
  * The revised pairs serve as training data for preference optimization

This method reduces reliance on human labeling while providing explicit, auditable alignment criteria.
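A minimal sketch of this critique-and-revise loop appears below. The ''generate_fn'' callable is a hypothetical placeholder for whatever generation backend is available, and the prompt templates are illustrative only; a real constitutional AI pipeline samples critiques and revisions from the model with carefully engineered prompts.

<code python>
def make_constitutional_pairs(prompts, principles, generate_fn):
    """Generate (chosen, rejected) preference pairs via self-critique and revision.

    Args:
        prompts: List of user prompts.
        principles: List of constitutional principles (plain strings).
        generate_fn: Callable taking a prompt string and returning a completion string.
            This is a placeholder, not part of any specific API.

    Returns:
        List of dicts with 'prompt', 'chosen' (revised), and 'rejected' (original)
        responses, usable as training data for DPO or reward-model training.
    """
    pairs = []
    constitution = "\n".join(f"- {p}" for p in principles)
    for prompt in prompts:
        # Step 1: generate an initial response
        original = generate_fn(prompt)
        # Step 2: critique the response against the constitution
        critique = generate_fn(
            f"Response to critique:\n{original}\n\n"
            f"Identify ways this response violates any of these principles:\n{constitution}"
        )
        # Step 3: revise the response using the critique
        revised = generate_fn(
            f"Original response:\n{original}\n\nCritique:\n{critique}\n\n"
            "Rewrite the response so it follows the principles while staying helpful."
        )
        pairs.append({"prompt": prompt, "chosen": revised, "rejected": original})
    return pairs

# Toy usage with a stub generator (a real setup would call an actual model)
demo_pairs = make_constitutional_pairs(
    prompts=["How do I pick a strong password?"],
    principles=["Be helpful but avoid harm", "Do not encourage unsafe behavior"],
    generate_fn=lambda text: f"[model output for: {text[:40]}...]",
)
print(demo_pairs[0]["chosen"])
</code>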
===== GRPO and DeepSeek's Approach =====

Group Relative Policy Optimization (GRPO), used in training DeepSeek R1 ([[https://arxiv.org/abs/2501.12948|DeepSeek-AI et al., 2025]])(([[https://arxiv.org/abs/2501.12948|DeepSeek-AI et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.]])), represents a 2025 advancement that enhances RLHF stability for large [[reasoning_models|reasoning models]].

For a prompt $x$, GRPO samples a group of $G$ responses $\{y_1, \ldots, y_G\}$ and computes group-normalized advantages:

$$\hat{A}_i = \frac{r(y_i) - \text{mean}(\{r(y_j)\}_{j=1}^G)}{\text{std}(\{r(y_j)\}_{j=1}^G)}$$

The policy is updated via a clipped surrogate objective:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}\hat{A}_i,\;\text{clip}\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}, 1-\epsilon, 1+\epsilon\right)\hat{A}_i\right)\right]$$

Key properties:

  * Uses relative ranking within groups of generated outputs rather than absolute reward scores
  * Better suited for training [[chain_of_thought_agents|chain-of-thought reasoning]] capabilities
  * No critic network needed, saving ~50% memory over PPO
  * Contributed to [[deepseek|DeepSeek]] R1's strong performance on math and reasoning benchmarks
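As a rough illustration of this update rule, the sketch below computes group-normalized advantages from the rewards of a group of sampled responses and evaluates the clipped surrogate objective on per-response probability ratios. Names and shapes are illustrative; a practical implementation works with per-token log-probabilities and batches of prompts, and (as here) typically adds a small constant to the standard deviation for numerical stability.

<code python>
import torch

def grpo_objective(rewards, policy_logprobs, old_logprobs, epsilon=0.2):
    """Group Relative Policy Optimization surrogate objective (illustrative sketch).

    Args:
        rewards: Rewards r(y_i) for a group of G sampled responses, shape (G,).
        policy_logprobs: Log-probs of each response under the current policy, shape (G,).
        old_logprobs: Log-probs of each response under the policy that sampled it, shape (G,).
        epsilon: Clipping range for the probability ratio.

    Returns:
        Scalar objective to maximize (negate it to use as a loss).
    """
    # Group-normalized advantages: (r_i - mean) / std over the group
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio pi_theta(y_i|x) / pi_old(y_i|x)
    ratio = torch.exp(policy_logprobs - old_logprobs)

    # Clipped surrogate: take the minimum of the unclipped and clipped terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return torch.min(unclipped, clipped).mean()

# Toy group of G = 4 responses to one prompt
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
policy_lp = torch.tensor([-4.2, -5.1, -4.8, -4.0], requires_grad=True)
old_lp = torch.tensor([-4.5, -5.0, -4.9, -4.3])

objective = grpo_objective(rewards, policy_lp, old_lp)
(-objective).backward()  # maximize the objective by minimizing its negative
print(f"GRPO surrogate objective: {objective.item():.4f}")
</code>

Because the advantage is computed relative to the group, no learned value function (critic) is needed, which is where the memory savings over PPO come from.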
===== Process vs. Outcome Reward Models =====

A critical distinction in modern RLHF concerns what gets evaluated:

  * **Outcome Reward Models (ORMs)**: Score only the final output, rewarding correct answers regardless of the reasoning path. Simple but can reward lucky guesses.
  * **[[process_reward_models|Process Reward Models]] (PRMs)**: Evaluate intermediate reasoning steps, rewarding sound methodology even when final answers are wrong. For a trajectory $\tau = (s_0, a_0, \ldots, s_T, a_T)$, a PRM assigns per-step rewards $r(s_t, a_t)$ for each step $t$. Better suited for training agents that need reliable reasoning.

PRMs are increasingly used for training [[chain_of_thought_agents|chain-of-thought agents]] and [[autonomous_agents|autonomous agents]] where the quality of the reasoning process matters as much as the outcome.

===== RLVR: Verifiable Rewards =====

[[reinforcement_learning|Reinforcement Learning]] from Verifiable Rewards (RLVR) is an emerging technique that uses objective, programmatically verifiable signals instead of subjective human preferences:

  * Math problems verified by checking the answer
  * Code evaluated by running test suites
  * Factual claims verified against databases

RLVR eliminates subjective bias entirely and addresses reward hacking by using ground-truth evaluation, though it is limited to domains where automated verification is possible.

===== Reward Hacking =====

A persistent challenge in RLHF is reward hacking, where agents learn to exploit flaws in the reward model rather than genuinely satisfying human preferences:

  * Generating verbose but unhelpful responses that score well on length-correlated reward models
  * Producing confident-sounding but incorrect answers
  * Gaming specific patterns the reward model has learned to prefer

Mitigation strategies include KL divergence penalties, reward model ensembles, adversarial training, and hybrid approaches combining learned rewards with verifiable signals (RLVR).

===== How RLHF Shapes Agent Behavior =====

RLHF directly influences how agents behave in production:

  * **Helpfulness vs. Safety Tradeoffs**: RLHF determines where models draw the line between being maximally helpful and refusing potentially harmful requests
  * **Tool Use Patterns**: Reward signals shape how agents decide when to use [[tool_using_agents|tools]] versus relying on internal knowledge
  * **Reasoning Quality**: [[process_reward_models|Process reward models]] incentivize thorough [[chain_of_thought_agents|chain-of-thought]] reasoning over shortcut answers
  * **Conversation Style**: RLHF tunes the tone, verbosity, and interaction patterns of [[conversational_agents|conversational agents]]

===== Learning Resources =====

Comprehensive educational materials on RLHF have been developed to support practitioners and researchers. A dedicated book and accompanying resources(([[https://www.interconnects.ai/p/what-ive-been-building-atom-report|RLHF Book - Comprehensive Guide to Post-Training Language Models]])) provide technical foundations covering reward modeling, policy gradient algorithms, rejection sampling, and practical implementation. These resources are supported by a website, YouTube lecture series, and open-source codebase, making RLHF techniques more accessible to the broader AI community.

===== Code Example: DPO Loss Computation =====

<code python>
import torch
import torch.nn.functional as F

def dpo_loss(policy_logprobs_chosen, policy_logprobs_rejected,
             ref_logprobs_chosen, ref_logprobs_rejected, beta=0.1):
    """Compute Direct Preference Optimization loss.

    Args:
        policy_logprobs_chosen: Log-probs of chosen responses under policy model.
        policy_logprobs_rejected: Log-probs of rejected responses under policy model.
        ref_logprobs_chosen: Log-probs of chosen responses under reference model.
        ref_logprobs_rejected: Log-probs of rejected responses under reference model.
        beta: Temperature parameter controlling deviation from reference policy.

    Returns:
        Scalar DPO loss value.
    """
    chosen_rewards = beta * (policy_logprobs_chosen - ref_logprobs_chosen)
    rejected_rewards = beta * (policy_logprobs_rejected - ref_logprobs_rejected)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss

# Simulated log-probabilities for a batch of 4 preference pairs
batch_size = 4
policy_chosen = torch.randn(batch_size, requires_grad=True)
policy_rejected = torch.randn(batch_size, requires_grad=True)
ref_chosen = torch.randn(batch_size)
ref_rejected = torch.randn(batch_size)

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(f"DPO Loss: {loss.item():.4f}")
</code>

===== See Also =====

  * [[reinforcement_learning_llm|Reinforcement Learning for Language Models]]
  * [[reinforcement_learning|Reinforcement Learning]]
  * [[stable_drl_framework|StableDRL Framework]]
  * [[experience_replay_rl|Experience Replay for LLM Reinforcement Learning]]
  * [[agent_rl_training|Agent RL Training: Agent-R1 and RAGEN]]

===== References =====