Reward overoptimization occurs when a language model optimized against a learned reward model achieves high proxy reward scores while actual (ground-truth) quality plateaus or declines. Gao et al. (2022) provided the first systematic study of this phenomenon, discovering predictable scaling laws that govern when and how overoptimization manifests — a concrete instantiation of Goodhart's Law in RLHF.
In RLHF, a reward model $\hat{r}$ trained on human preference data serves as a proxy for true human preferences $r^*$. As optimization pressure against $\hat{r}$ increases, the policy learns to exploit imperfections in the proxy rather than genuinely improving output quality. The proxy reward $\hat{r}$ increases monotonically, but the true reward $r^*$ follows an inverted-U shape — initially rising, then declining.
This is Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.
Gao et al. use a synthetic framework: a large "gold" reward model serves as ground truth and generates the preference labels used to train smaller proxy reward models, which are then the targets of optimization. This enables precise measurement of the proxy-gold divergence across thousands of experiments without expensive human evaluation.
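A minimal sketch of the measurement protocol, using hypothetical linear scorers over random feature vectors as stand-ins for the gold and proxy reward models (all names and numbers are mine, not the paper's). It illustrates only how proxy-selected samples are then scored by the gold model; a linear proxy's independent errors are not systematically exploitable, so this toy shows the rising part of the curve, not the eventual decline:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Hypothetical stand-ins: linear scorers over feature vectors. The proxy
# is a noisy copy of the gold scorer, mimicking a smaller RM trained on
# labels from (an approximation of) the gold RM.
gold_w = rng.normal(size=DIM)
proxy_w = gold_w + rng.normal(scale=0.5, size=DIM)

def best_of_n(n, trials=2000):
    """Sample n candidates, pick the proxy-RM argmax, then score that
    same candidate with the gold RM (the evaluation protocol)."""
    picks = []
    for _ in range(trials):
        xs = rng.normal(size=(n, DIM))      # candidate "outputs"
        pick = xs[np.argmax(xs @ proxy_w)]  # optimize against the proxy
        picks.append(pick @ gold_w)         # evaluate against the gold RM
    return float(np.mean(picks))

for n in (1, 4, 16, 64):
    print(f"n={n:3d}  mean gold reward: {best_of_n(n):.2f}")
```

Because every candidate is scored by both models, the proxy-gold gap can be tracked at any optimization strength without human labels.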
Two optimization methods are studied: best-of-$n$ (BoN) rejection sampling and reinforcement learning (PPO).
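Best-of-$n$ is convenient partly because its KL divergence from the initial policy has an exact closed form, $D_{KL}^{BoN} = \log n - \frac{n-1}{n}$, which puts BoN and RL runs on the same $D_{KL}$ axis. A small helper (the function name is mine):

```python
import numpy as np

def bon_kl(n):
    """KL divergence between the best-of-n distribution and the base
    policy: log n - (n - 1)/n (exact when samples are distinct)."""
    return np.log(n) - (n - 1) / n

for n in (1, 4, 16, 256):
    print(f"n={n:4d}  KL={bon_kl(n):.3f} nats")
```

Note the slow growth: even $n = 256$ corresponds to only about 4.5 nats, which is why BoN probes a much narrower KL range than RL.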
The relationship between gold (true) reward $R$ and KL divergence $D_{KL}$ from the initial policy follows predictable functional forms:
$$R_{RL}(d) = d \left( \alpha_{RL} - \beta_{RL} \log d \right), \qquad d := \sqrt{D_{KL}}$$
$$R_{BoN}(d) = d \left( \alpha_{BoN} - \beta_{BoN} \, d \right)$$
In both, the $\alpha$ term captures beneficial optimization, while the $\beta$ term captures overoptimization: a penalty that grows logarithmically in $d$ for RL and linearly for BoN. For the BoN form (equivalently $\alpha \sqrt{D_{KL}} - \beta D_{KL}$), the peak true reward occurs at:
$$D_{KL}^* = \left(\frac{\alpha}{2\beta}\right)^2$$
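This follows from setting the derivative of the BoN curve $\alpha \sqrt{D_{KL}} - \beta D_{KL}$ with respect to $D_{KL}$ to zero:

$$\frac{\partial}{\partial D_{KL}} \left( \alpha \sqrt{D_{KL}} - \beta D_{KL} \right) = \frac{\alpha}{2\sqrt{D_{KL}}} - \beta = 0 \quad\Longrightarrow\quad \sqrt{D_{KL}^*} = \frac{\alpha}{2\beta}.$$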
Beyond this point, further optimization decreases true quality despite increasing proxy scores.
```python
import numpy as np

def gold_reward_curve(d_kl, alpha, beta):
    """Predicted gold reward vs. KL divergence (best-of-n functional form)."""
    return alpha * np.sqrt(d_kl) - beta * d_kl

def optimal_kl(alpha, beta):
    """KL divergence at peak gold reward."""
    return (alpha / (2 * beta)) ** 2

def peak_reward(alpha, beta):
    """Maximum achievable gold reward."""
    d_star = optimal_kl(alpha, beta)
    return gold_reward_curve(d_star, alpha, beta)

# Larger RM has a better alpha/beta ratio
alpha_sm, beta_sm = 0.5, 0.02
alpha_lg, beta_lg = 0.8, 0.015
print(f"Small RM peak KL: {optimal_kl(alpha_sm, beta_sm):.1f}")
print(f"Large RM peak KL: {optimal_kl(alpha_lg, beta_lg):.1f}")
```
The paper maps findings to four types of Goodhart effects: