Reinforcement Learning from Human Feedback

Definition

RLHF

Reinforcement Learning from Human Feedback (RLHF) is a three-stage pipeline for aligning a pretrained generative model (typically an LLM) with human preferences. Instead of optimizing a hand-specified reward, RLHF (1) collects pairwise human preference comparisons over model outputs, (2) trains a reward model to predict those preferences, and (3) uses Reinforcement Learning (usually Proximal Policy Optimization) to fine-tune the generative policy to maximize the learned reward while staying close to the original supervised model via a KL penalty.

Intuition

For tasks like helpful dialogue or summarization there is no programmable reward — “quality” lives in human judgement, and absolute scalar ratings are noisy and uncalibrated. RLHF exploits the fact that humans are far more reliable at relative judgements: “response A is better than response B.” This is exactly the pairwise setting from Learning to Rank — the reward model is trained with a RankNet-style logistic loss on score differences. Once a differentiable reward model captures the preference signal, we can score the model’s own generations and push the policy toward outputs the reward model prefers.

The KL penalty is the crucial safety valve: maximizing a learned reward without constraint leads to reward hacking (the policy finds adversarial outputs that fool the reward model but are gibberish to humans). Anchoring the policy to the supervised reference keeps generations fluent and on-distribution.

Mathematical Formulation

Stage 1: Supervised fine-tuning (SFT). Start from a pretrained model and fine-tune on demonstration data to get the reference policy $\pi_{\text{ref}}$.

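Stage 2: Reward model. Given a prompt $x$ and two completions $y_w, y_l$ where humans labelled $y_w$ (winner) preferred over $y_l$ (loser), fit a scalar reward model $r_\phi$ under the Bradley–Terry preference model:

$$\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$

where:

  • $r_\phi(x, y)$: scalar reward (a regression head on top of the LM), read off the final token
  • $\sigma$: logistic sigmoid mapping a score difference to a preference probability
  • $(x, y_w, y_l)$: a comparison in which the human judged $y_w$ better than $y_l$ for prompt $x$
  • $\mathcal{D}$: dataset of human pairwise comparisons

Note this is identical in form to the RankNet logistic loss on the score difference $r_\phi(x, y_w) - r_\phi(x, y_l)$.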
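The Stage 2 loss can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: it assumes the reward model's scalar scores for the winning and losing completions have already been computed, and the function names `log_sigmoid`, `preference_probability`, and `reward_model_loss` are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log sigma(z) = -log(1 + exp(-z)); branch so exp() never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the winner beats the loser."""
    return math.exp(log_sigmoid(r_w - r_l))


def reward_model_loss(winner_scores, loser_scores) -> float:
    """Mean of -log sigma(r_w - r_l) over a batch of comparisons.

    winner_scores / loser_scores are assumed to be precomputed scalar
    reward-model outputs, one pair per human comparison.
    """
    if not winner_scores or len(winner_scores) != len(loser_scores):
        raise ValueError("need equal-length, non-empty score lists")
    total = 0.0
    for r_w, r_l in zip(winner_scores, loser_scores):
        total -= log_sigmoid(r_w - r_l)
    return total / len(winner_scores)
```

When both completions receive the same score the model is indifferent (probability 0.5, loss log 2); the loss shrinks as the winner's margin grows, which is the same behaviour as the RankNet pairwise loss.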

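Stage 3: RL fine-tuning. Optimize the policy $\pi_\theta$ to maximize the learned reward minus a per-token KL penalty to the reference:

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)\Big]$$

equivalently optimized with PPO using a per-token shaped reward (with $T$ the index of the final token):

$$R_t = \mathbb{1}[t = T]\; r_\phi(x, y) - \beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}$$

where:

  • $\pi_\theta$: the trainable policy (LLM), initialized from $\pi_{\text{ref}}$
  • $r_\phi$: frozen reward model from Stage 2, giving a sparse terminal reward
  • $\beta$: KL coefficient controlling how far the policy may drift from $\pi_{\text{ref}}$
  • $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$: penalizes the policy for moving away from the SFT model (prevents reward hacking / mode collapse)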

The MDP framing matches IR-L13 - RL for Reasoning and Search: state = prompt + tokens so far, action = next token, $\pi_\theta$ = the LM’s token distribution; the reward is sparse (delivered at the end of the completion).
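To make the per-token shaped reward concrete, here is a minimal sketch, assuming the log-probabilities of the sampled tokens under the policy and under the reference have already been gathered. The function name `shaped_rewards` and its argument names are hypothetical.

```python
def shaped_rewards(policy_logprobs, ref_logprobs, terminal_reward, beta):
    """Per-token reward: a KL-style penalty on every token, plus the
    reward-model score added on the final token only.

    policy_logprobs[t] = log pi_theta(y_t | x, y_<t)   (assumed precomputed)
    ref_logprobs[t]    = log pi_ref(y_t | x, y_<t)     (assumed precomputed)
    terminal_reward    = reward-model score for the whole completion
    """
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need equal-length, non-empty log-prob lists")
    # Penalize each token by beta times its log-ratio against the reference.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    # The learned reward is sparse: it arrives only at the last token.
    rewards[-1] += terminal_reward
    return rewards


# Usage: three generated tokens, reward-model score 2.0, beta = 0.1.
example = shaped_rewards([-1.0, -0.5, -2.0], [-1.0, -1.0, -1.5], 2.0, 0.1)
```

Summing the per-token rewards recovers the sequence-level quantity: the reward-model score minus beta times the summed log-ratio, so the two formulations agree in expectation.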

Key Properties / Variants

  • Why pairwise, not absolute: relative preferences are cheaper and more consistent to elicit from annotators than calibrated scalar scores — directly the Pairwise Learning to Rank argument.
  • PPO is the standard RL optimizer (InstructGPT, ChatGPT, Claude). Requires four models in memory: policy, reference, reward model, and critic (value network).
  • KL penalty is load-bearing: without it the policy reward-hacks $r_\phi$; with too-large $\beta$ it never moves from the SFT model.
  • GRPO variant: Group Relative Policy Optimization drops the critic and replaces the advantage with a group-normalized z-score over a group of completions sampled for the same prompt; used in DeepSeek-R1 and SEARCH-R1. It is cheaper than PPO, and with verifiable rewards it can skip the learned reward model entirely (RL from verifiable reward).
  • DPO (Direct Preference Optimization): reparameterizes the Stage-2/Stage-3 objective into a single supervised loss, optimizing the preference objective directly on $\pi_\theta$ with no explicit reward model or RL rollouts.
  • Failure modes: reward over-optimization (Goodhart), sycophancy, distribution shift between RM training data and policy generations, annotator disagreement.
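Two of the variants listed above reduce to short arithmetic, sketched here under stated assumptions. `grpo_advantages` computes a group-normalized z-score over the rewards of completions sampled for one prompt; using the population standard deviation and a small `eps` in the denominator is an assumption of this sketch. `dpo_loss` computes the standard DPO loss for one preference pair from precomputed sequence log-probabilities. Both function names are hypothetical.

```python
import math
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Z-score each completion's reward within its group (no critic needed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """-log sigma(beta * [(log-ratio of winner) - (log-ratio of loser)]).

    Each *_logp_* argument is the summed log-probability of a whole
    completion under the policy or the frozen reference (assumed precomputed).
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigma(m) = log(1 + exp(-m)), evaluated in a stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With the policy equal to the reference the DPO margin is zero and the loss is log 2; raising the winner's likelihood relative to the loser's lowers it, with no reward model or rollout involved.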

Algorithm: RLHF (PPO variant)
──────────────────────────────────────────────
Stage 1 — SFT:
  π_ref ← fine-tune pretrained LM on demonstration data
  π_θ   ← copy of π_ref   (the trainable policy)
 
Stage 2 — Reward Model:
  Collect comparisons: for prompt x, humans label y_w ≻ y_l
  Fit r_φ by minimizing:
    L_RM = - E[ log σ( r_φ(x, y_w) - r_φ(x, y_l) ) ]
 
Stage 3 — RL fine-tuning (PPO):
  Loop:
    Sample prompts x ~ D
    Generate completions y ~ π_θ(·|x)             # rollouts
    Compute reward:
      R(x,y) = r_φ(x,y) - β · Σ_t log[ π_θ(y_t|·) / π_ref(y_t|·) ]
    Estimate advantages Â_t (GAE, via critic V_ψ)
    For K epochs:
      r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
      maximize L^CLIP = E[ min( r_t Â_t,
                                clip(r_t, 1-ε, 1+ε) Â_t ) ]
    θ_old ← θ
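The clipped surrogate in the PPO step can be evaluated directly once log-probabilities and advantages are available. The sketch below assumes those are precomputed per token and only computes the objective value (no gradients); `ppo_clip_objective` is a hypothetical name and `eps=0.2` is an illustrative default.

```python
import math


def ppo_clip_objective(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    new_logprobs[t] = log pi_theta(a_t | s_t)      (current policy)
    old_logprobs[t] = log pi_theta_old(a_t | s_t)  (policy that made the rollout)
    advantages[t]   = advantage estimate for token t
    """
    n = len(advantages)
    if n == 0 or len(new_logprobs) != n or len(old_logprobs) != n:
        raise ValueError("need equal-length, non-empty inputs")
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(lp_new - lp_old)              # probability ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the min makes the objective a pessimistic bound: gains
        # from pushing the ratio outside the clip range are ignored.
        total += min(ratio * adv, clipped * adv)
    return total / n
```

For a positive advantage, a ratio of 2 is credited only as 1 + eps; for a negative advantage the unclipped term is kept, so moving probability toward a bad action is never discounted.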

Connections

Appears In