Group Relative Policy Optimization (GRPO)


A policy gradient algorithm that estimates advantages through group-relative comparisons rather than a learned value function. For each prompt, multiple responses are sampled and their rewards are normalized within the group to compute advantages.

Motivation

Standard PPO requires a critic network to estimate the value function for advantage computation. For large language models:

  • Adding a critic of comparable size roughly doubles the parameter count and memory footprint
  • Training the critic is difficult: its regression target shifts as the policy updates
  • Critic quality directly impacts gradient quality

GRPO eliminates the critic by using within-group statistics.

Algorithm

For each prompt $q$, sample a group of $G$ responses $\{o_1, \ldots, o_G\}$ from the current policy $\pi_{\theta_{\mathrm{old}}}$, and score each with the reward model to obtain rewards $\{r_1, \ldots, r_G\}$.

GRPO Advantage

$$A_i = \frac{r_i - \mu}{\sigma}$$

where:

  • $\mu = \frac{1}{G} \sum_{j=1}^{G} r_j$ — group mean reward
  • $\sigma = \sqrt{\frac{1}{G} \sum_{j=1}^{G} (r_j - \mu)^2}$ — group standard deviation of the rewards
  • The advantage $A_i$ is the z-score of response $i$'s reward within its group
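
The advantage above is just a per-group z-score. A minimal NumPy sketch (the function name and the `eps` guard against zero-variance groups are my own, not from the source):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group (one group = G responses to one prompt)."""
    r = np.asarray(rewards, dtype=np.float64)
    # Group mean serves as the baseline; group std normalizes the gradient scale.
    return (r - r.mean()) / (r.std() + eps)

# Four sampled responses to the same prompt, scored by a reward model:
adv = group_advantages([2.0, 0.5, 1.0, 0.5])
# Above-average responses get positive advantage, below-average negative.
```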

Gradient update (a PPO-style clipped surrogate, with $A_i$ in place of a critic-based advantage and a KL penalty toward a reference policy $\pi_{\mathrm{ref}}$):

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \min\left(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\right)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right], \quad \text{where } \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}$$
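
The update is commonly implemented as a clipped surrogate over the group's advantages. A plain-NumPy sketch over per-response sequence log-probabilities (the function name, argument names, and default coefficients are illustrative; the optional KL term uses the non-negative estimator $\exp(x) - x - 1$):

```python
import numpy as np

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2,
              kl_coef=0.04, logp_ref=None):
    """Negative clipped surrogate over one group of G responses (to minimize).

    logp_new / logp_old: per-response sequence log-probs under the current
    and sampling policies; advantages: group-normalized rewards A_i.
    """
    ratio = np.exp(logp_new - logp_old)                      # rho_i
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    loss = -surrogate.mean()
    if logp_ref is not None:
        # KL penalty toward the reference policy, estimated per response as
        # exp(x) - x - 1 with x = logp_ref - logp_new (zero when policies agree).
        x = logp_ref - logp_new
        loss += kl_coef * (np.exp(x) - x - 1.0).mean()
    return loss
```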

Comparison with PPO

| Aspect | PPO | GRPO |
|---|---|---|
| Advantage source | Learned value function | Group statistics |
| Additional networks | Critic | None |
| Samples per prompt | 1 | Multiple (typically 4–16) |
| Memory | High | Low |
| Implementation | Complex | Simple |

Intuition

Why Group Normalization Works

  • Responses are compared against each other, not against an absolute value estimate
  • Good responses get positive advantage, bad ones get negative
  • Automatically adapts to reward scale
  • Handles sparse rewards (0/1) naturally
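
The scale-adaptation and sparse-reward points can be checked numerically. In this illustrative snippet, 0/1 rewards produce a clean ±1 learning signal, and an affine rescaling of the rewards leaves the advantages unchanged:

```python
import numpy as np

def z(r, eps=1e-8):
    """Group-relative advantage: within-group z-score of the rewards."""
    r = np.asarray(r, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

binary = z([1, 1, 0, 0])                                # sparse 0/1 rewards -> +-1 signal
rescaled = z(10 * np.array([1.0, 1.0, 0.0, 0.0]) + 5)   # r -> 10r + 5: same advantages
```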

Properties

  • No critic training: Simpler optimization landscape
  • Automatic baseline: Group mean serves as baseline
  • Variance normalization: Group std normalizes gradient scale
  • Sample efficient: Each prompt contributes G reward comparisons from a single round of sampling
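
Putting the pieces together: a single GRPO loss evaluation for one prompt needs only the group's rewards and log-probabilities, with no value network anywhere. A self-contained sketch with mock numbers (all names and the toy log-prob values are illustrative):

```python
import numpy as np

def grpo_step_loss(rewards, logp_new, logp_old, clip_eps=0.2, eps=1e-8):
    """Loss for one prompt's group of G responses: z-score advantages
    plugged into the PPO-style clipped surrogate."""
    r = np.asarray(rewards, dtype=np.float64)
    adv = (r - r.mean()) / (r.std() + eps)        # automatic baseline + scale
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -np.minimum(ratio * adv, clipped * adv).mean()

# Mock group of G = 8 responses: toy sequence log-probs, binary rewards.
rng = np.random.default_rng(0)
logp_old = rng.normal(-20.0, 2.0, size=8)            # log-probs at sampling time
logp_new = logp_old + rng.normal(0.0, 0.05, size=8)  # policy has drifted slightly
rewards = rng.integers(0, 2, size=8)
loss = grpo_step_loss(rewards, logp_new, logp_old)
```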
