Group Relative Policy Optimization (GRPO)
A policy gradient algorithm that estimates advantages through group-relative comparisons rather than a learned value function. For each prompt, multiple responses are sampled and their rewards are normalized within the group to compute advantages.
Motivation
Standard PPO requires a critic network to estimate the value function for advantage computation. For large language models:
- Adding a critic roughly doubles the parameter count
- Training the critic is difficult: its target (the value of the continually updated policy) keeps moving
- Critic estimation errors propagate directly into the policy gradient
GRPO eliminates the critic by using within-group statistics.
Algorithm
For each prompt $q$, sample $G$ responses $\{o_1, \dots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$.
GRPO Advantage

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}$$

where:
- $\mathrm{mean}(\{r_1, \dots, r_G\})$ — group mean reward
- $\mathrm{std}(\{r_1, \dots, r_G\})$ — group standard deviation
- The advantage $\hat{A}_i$ is the z-score of response $o_i$'s reward within its group
Gradient update (clipped surrogate objective, maximized over $\theta$):

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\, \{o_i\} \sim \pi_{\theta_{\text{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\,\hat{A}_i,\ \operatorname{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\right]$$
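The advantage computation above reduces to a per-group z-score. A minimal sketch in Python (the function name `grpo_advantages` and the `eps` guard are illustrative, not from any particular implementation):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group of G sampled responses.

    `eps` guards against division by zero when all rewards in the
    group are identical (every advantage is then zero).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four responses to one prompt with binary rewards:
# mean = 0.5, std = 0.5, so advantages are [+1, -1, -1, +1]
# (up to the eps guard)
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that all statistics come from the sampled group itself; no value network is consulted at any point.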
Comparison with PPO
| Aspect | PPO | GRPO |
|---|---|---|
| Advantage source | Learned | Group statistics |
| Additional networks | Critic | None |
| Samples per prompt | 1 | $G$ (typically 4-16) |
| Memory | High | Low |
| Implementation | Complex | Simple |
Intuition
Why Group Normalization Works
- Responses are compared against each other, not against an absolute value estimate
- Good responses get positive advantage, bad ones get negative
- Automatically adapts to reward scale
- Handles sparse rewards (0/1) naturally
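The scale-adaptation and sparse-reward points above can be checked directly: because the advantage is a z-score, multiplying every reward in a group by a constant leaves the advantages (numerically) unchanged. A small self-contained illustration:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: z-score of each reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Sparse 0/1 rewards, and the same pattern scaled by 100:
# normalization absorbs the scale, so the advantages match
sparse = grpo_advantages([0.0, 1.0, 0.0, 0.0])
scaled = grpo_advantages([0.0, 100.0, 0.0, 0.0])
```

The single successful response gets a positive advantage and the failures get negative ones, regardless of how the reward is scaled.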
Properties
- No critic training: Simpler optimization landscape
- Automatic baseline: Group mean serves as baseline
- Variance normalization: Group std normalizes gradient scale
- Sample efficient: Each prompt contributes $G$ responses to the gradient estimate
Connections
- Simplifies PPO by removing critic
- Used in DeepSeek-R1 and SEARCH-R1
- Related to REINFORCE with baseline
- Alternative to Actor-Critic methods
Appears In
- IR-L13 - RL for Reasoning and Search
- DeepSeek-R1 paper (2025)
- SEARCH-R1 paper (2025)