Reward overoptimization occurs when a language model optimized against a learned reward model achieves high proxy reward scores while actual (ground-truth) quality plateaus or declines. Gao et al. (2022) provided the first systematic study of this phenomenon, discovering predictable scaling laws that govern when and how overoptimization manifests — a concrete instantiation of Goodhart's Law in RLHF.
In RLHF, a reward model $\hat{r}$ trained on human preference data serves as a proxy for true human preferences $r^*$. As optimization pressure against $\hat{r}$ increases, the policy learns to exploit imperfections in the proxy rather than genuinely improving output quality. The proxy reward $\hat{r}$ increases monotonically, but the true reward $r^*$ follows an inverted-U shape — initially rising, then declining.
This is Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.
Gao et al. use a synthetic framework: a large "gold" reward model serves as ground truth and generates the preference labels used to train smaller proxy reward models, which are then the targets of optimization. This enables precise measurement of the proxy-gold divergence across thousands of experiments without expensive human evaluation.
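A minimal sketch of the measurement protocol, using hypothetical linear scorers over random feature vectors as stand-ins for the gold and proxy reward models (all names and numbers are mine, not the paper's). It illustrates only how proxy-selected samples are then scored by the gold model; a linear proxy's independent errors are not systematically exploitable, so this toy shows the rising part of the curve, not the eventual decline:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Hypothetical stand-ins: linear scorers over feature vectors. The proxy
# is a noisy copy of the gold scorer, mimicking a smaller RM trained on
# labels from (an approximation of) the gold RM.
gold_w = rng.normal(size=DIM)
proxy_w = gold_w + rng.normal(scale=0.5, size=DIM)

def best_of_n(n, trials=2000):
    """Sample n candidates, pick the proxy-RM argmax, then score that
    same candidate with the gold RM (the evaluation protocol)."""
    picks = []
    for _ in range(trials):
        xs = rng.normal(size=(n, DIM))      # candidate "outputs"
        pick = xs[np.argmax(xs @ proxy_w)]  # optimize against the proxy
        picks.append(pick @ gold_w)         # evaluate against the gold RM
    return float(np.mean(picks))

for n in (1, 4, 16, 64):
    print(f"n={n:3d}  mean gold reward: {best_of_n(n):.2f}")
```

Because every candidate is scored by both models, the proxy-gold gap can be tracked at any optimization strength without human labels.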
Two optimization methods are studied: best-of-$n$ (BoN) rejection sampling and reinforcement learning (PPO).
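Best-of-$n$ is convenient partly because its KL divergence from the initial policy has an exact closed form, $D_{KL}^{BoN} = \log n - \frac{n-1}{n}$, which puts BoN and RL runs on the same $D_{KL}$ axis. A small helper (the function name is mine):

```python
import numpy as np

def bon_kl(n):
    """KL divergence between the best-of-n distribution and the base
    policy: log n - (n - 1)/n (exact when samples are distinct)."""
    return np.log(n) - (n - 1) / n

for n in (1, 4, 16, 256):
    print(f"n={n:4d}  KL={bon_kl(n):.3f} nats")
```

Note the slow growth: even $n = 256$ corresponds to only about 4.5 nats, which is why BoN probes a much narrower KL range than RL.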
The relationship between gold (true) reward $R$ and KL divergence $D_{KL}$ from the initial policy follows predictable functional forms:
$$R_{RL}(d) = d \left( \alpha_{RL} - \beta_{RL} \log d \right), \qquad d := \sqrt{D_{KL}}$$
$$R_{BoN}(d) = d \left( \alpha_{BoN} - \beta_{BoN} \, d \right)$$
In both, the $\alpha$ term captures beneficial optimization, while the $\beta$ term captures overoptimization: a penalty that grows logarithmically in $d$ for RL and linearly for BoN. For the BoN form (equivalently $\alpha \sqrt{D_{KL}} - \beta D_{KL}$), the peak true reward occurs at:
$$D_{KL}^* = \left(\frac{\alpha}{2\beta}\right)^2$$
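This follows from setting the derivative of the BoN curve $\alpha \sqrt{D_{KL}} - \beta D_{KL}$ with respect to $D_{KL}$ to zero:

$$\frac{\partial}{\partial D_{KL}} \left( \alpha \sqrt{D_{KL}} - \beta D_{KL} \right) = \frac{\alpha}{2\sqrt{D_{KL}}} - \beta = 0 \quad\Longrightarrow\quad \sqrt{D_{KL}^*} = \frac{\alpha}{2\beta}.$$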
Beyond this point, further optimization decreases true quality despite increasing proxy scores.
```python
import numpy as np

def gold_reward_curve(d_kl, alpha, beta):
    """Predicted gold reward vs. KL divergence (best-of-n functional form)."""
    return alpha * np.sqrt(d_kl) - beta * d_kl

def optimal_kl(alpha, beta):
    """KL divergence at peak gold reward."""
    return (alpha / (2 * beta)) ** 2

def peak_reward(alpha, beta):
    """Maximum achievable gold reward."""
    d_star = optimal_kl(alpha, beta)
    return gold_reward_curve(d_star, alpha, beta)

# Larger RM has a better alpha/beta ratio
alpha_sm, beta_sm = 0.5, 0.02
alpha_lg, beta_lg = 0.8, 0.015
print(f"Small RM peak KL: {optimal_kl(alpha_sm, beta_sm):.1f}")
print(f"Large RM peak KL: {optimal_kl(alpha_lg, beta_lg):.1f}")
```
The paper maps findings to four types of Goodhart effects: