Group Relative Policy Optimization (GRPO)
A policy gradient algorithm that estimates advantages through group-relative comparisons rather than a learned value function. For each prompt, multiple responses are sampled and their rewards are normalized within the group to compute advantages.
Motivation
Standard PPO requires a critic network to estimate the value function for advantage computation. For large language models:
- Adding a critic roughly doubles the parameter count
- Training the critic is difficult: its target (the value of the continually updated policy) keeps moving
- Critic estimation errors propagate directly into the policy gradient
GRPO eliminates the critic by using within-group statistics.
Algorithm
For each prompt $q$, sample $G$ responses $\{o_1, \dots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$.
GRPO Advantage

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}$$

where:
- $\mathrm{mean}(\{r_1, \dots, r_G\})$ — group mean reward
- $\mathrm{std}(\{r_1, \dots, r_G\})$ — group standard deviation
- The advantage $\hat{A}_i$ is the z-score of response $o_i$'s reward within its group
Gradient update (clipped surrogate objective, maximized over $\theta$):

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\, \{o_i\} \sim \pi_{\theta_{\text{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\,\hat{A}_i,\ \operatorname{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\right]$$
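The advantage computation above reduces to a per-group z-score. A minimal sketch in Python (the function name `grpo_advantages` and the `eps` guard are illustrative, not from any particular implementation):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group of G sampled responses.

    `eps` guards against division by zero when all rewards in the
    group are identical (every advantage is then zero).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four responses to one prompt with binary rewards:
# mean = 0.5, std = 0.5, so advantages are [+1, -1, -1, +1]
# (up to the eps guard)
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that all statistics come from the sampled group itself; no value network is consulted at any point.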
Comparison with PPO
| Aspect | PPO | GRPO |
|---|---|---|
| Advantage source | Learned | Group statistics |
| Additional networks | Critic | None |
| Samples per prompt | 1 | $G$ (typically 4-16) |
| Memory | High | Low |
| Implementation | Complex | Simple |
Intuition
Why Group Normalization Works
- Responses are compared against each other, not against an absolute value estimate
- Good responses get positive advantage, bad ones get negative
- Automatically adapts to reward scale
- Handles sparse rewards (0/1) naturally
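The scale-adaptation and sparse-reward points above can be checked directly: because the advantage is a z-score, multiplying every reward in a group by a constant leaves the advantages (numerically) unchanged. A small self-contained illustration:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: z-score of each reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Sparse 0/1 rewards, and the same pattern scaled by 100:
# normalization absorbs the scale, so the advantages match
sparse = grpo_advantages([0.0, 1.0, 0.0, 0.0])
scaled = grpo_advantages([0.0, 100.0, 0.0, 0.0])
```

The single successful response gets a positive advantage and the failures get negative ones, regardless of how the reward is scaled.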
Properties
- No critic training: Simpler optimization landscape
- Automatic baseline: Group mean serves as baseline
- Variance normalization: Group std normalizes gradient scale
- Sample efficient: Each prompt contributes $G$ responses to the gradient estimate
Connections
- Simplifies PPO by removing critic
- Used in DeepSeek-R1 and SEARCH-R1
- Related to REINFORCE with baseline
- Alternative to Actor-Critic methods
Appears In
- IR-L13 - RL for Reasoning and Search
- DeepSeek-R1 paper (2025)
- SEARCH-R1 paper (2025)