Policy Gradient Theorem
Definition
The Policy Gradient Theorem is a fundamental result in reinforcement learning that expresses the gradient of the expected return with respect to the policy parameters $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
This equation is the foundation for all policy gradient methods. It says: to increase expected return, increase the log-probability of actions in high-return trajectories.
Intuition
Why This Makes Sense
Imagine sampling episodes (trajectories) from your current policy:
- Episodes with high total reward should be reinforced
- Episodes with low total reward should be deprioritized
- The log-probability gradient acts as a “handle” to adjust the policy
The algorithm works by:
- Sample an episode and measure its return $R(\tau)$
- Compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for each action taken
- Update: $\theta \leftarrow \theta + \alpha \, R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Result: Good actions become more likely, bad actions less likely.
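A minimal sketch of this loop on a toy 3-armed bandit with a softmax policy (the bandit, its mean rewards, and the hyperparameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit (made-up mean rewards; arm 2 is best).
true_means = np.array([0.2, 0.5, 0.8])
theta = np.zeros(3)   # softmax policy parameters, one per arm
alpha = 0.1           # step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)           # sample an action (a one-step "episode")
    r = rng.normal(true_means[a], 0.1)   # observe its return
    grad_log_pi = -probs                 # grad log pi(a) for softmax: one_hot(a) - probs
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi     # REINFORCE: theta += alpha * R * grad log pi

print(softmax(theta))   # probability mass should concentrate on arm 2
```

Because the one-step return of arm 2 is highest, its log-probability is pushed up most often, and the policy concentrates on it.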
The Log-Derivative Trick
The key technical insight is the log-derivative trick:

$$\nabla_\theta p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)$$

This allows us to move the gradient inside an expectation with respect to a distribution that depends on the parameters:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}\left[f(x) \, \nabla_\theta \log p_\theta(x)\right]$$
This is why we work with log-probabilities: they make the gradient tractable.
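The trick can be checked numerically. For a toy Bernoulli variable with $p = \sigma(\theta)$, the score-function estimate of $\nabla_\theta \mathbb{E}[f(x)]$ should match the analytic gradient (the distribution and $f$ below are illustrative choices, not anything prescribed by the theorem):

```python
import numpy as np

rng = np.random.default_rng(1)

# x ~ Bernoulli(p) with p = sigmoid(theta); f is an arbitrary "reward" function.
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))
f = lambda x: 3.0 * x + 1.0   # f(0) = 1, f(1) = 4

# Analytic gradient of E[f(x)] = p*f(1) + (1-p)*f(0):
analytic = p * (1 - p) * (f(1) - f(0))   # dE/dtheta = p(1-p)(f(1) - f(0))

# Score-function estimate: grad_theta log P(x) = x - p for this parameterization.
x = rng.binomial(1, p, size=200_000)
estimate = np.mean(f(x) * (x - p))

print(analytic, estimate)   # the two agree up to Monte Carlo noise
```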
Mathematical Derivation
Starting Point
For an episodic task with trajectories $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] = \int p_\theta(\tau) \, R(\tau) \, d\tau$$
Applying the Log-Derivative Trick
Using the log-derivative trick:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau)$$

Therefore:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) \, R(\tau) \, d\tau = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau) \, R(\tau)\right]$$
Factoring the Trajectory Probability
The trajectory probability factors as:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$$

The log:

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]$$

Gradient w.r.t. $\theta$ (only the policy term depends on $\theta$):

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Final Result
Substituting back:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
Key Properties
Unbiasedness
The estimator is unbiased: sampling one trajectory gives an unbiased estimate of the true gradient.
Consistency
With enough samples, the empirical average converges to the true gradient (by law of large numbers).
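A quick illustration of consistency, reusing the toy Bernoulli setup ($f(x) = x$, $p = \sigma(\theta)$; all constants are illustrative): the error of the Monte Carlo score-function estimate shrinks as the sample count grows.

```python
import numpy as np

rng = np.random.default_rng(4)

# x ~ Bernoulli(p), f(x) = x, so dE[f]/dtheta = p*(1-p) with p = sigmoid(theta).
theta = 0.0
p = 1.0 / (1.0 + np.exp(-theta))   # p = 0.5
true_grad = p * (1 - p)            # 0.25

for n in [100, 10_000, 1_000_000]:
    x = rng.binomial(1, p, size=n)
    est = np.mean(x * (x - p))      # f(x) * grad_theta log P(x)
    print(n, abs(est - true_grad))  # error shrinks (on average) as n grows
```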
On-Policy
The theorem requires samples from the current policy $\pi_\theta$. Using samples from a different (behavior) policy, i.e. off-policy learning, requires an importance sampling correction.
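A sketch of the importance-sampling idea on a one-step toy problem (the target and behavior distributions and the per-action returns are made up): weighting each sample by the probability ratio recovers the on-policy expectation from off-policy samples.

```python
import numpy as np

rng = np.random.default_rng(2)

p = np.array([0.7, 0.2, 0.1])    # target policy over 3 actions
q = np.array([1/3, 1/3, 1/3])    # behavior policy (uniform)
f = np.array([1.0, 5.0, 10.0])   # per-action return

true_value = np.dot(p, f)                     # on-policy expectation: 2.7
a = rng.choice(3, size=100_000, p=q)          # sample actions off-policy from q
is_estimate = np.mean((p[a] / q[a]) * f[a])   # importance-weighted average

print(true_value, is_estimate)   # the estimate matches up to sampling noise
```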
Direct Dependency on Dynamics is Not Needed
Crucially, the dynamics terms $p(s_{t+1} \mid s_t, a_t)$ drop out of the gradient, since they do not depend on $\theta$. We don't need to know or learn the environment dynamics!
Variations and Extensions
Causality-Aware Version
We can reduce variance by noting that the action at time $t$ only affects rewards from time $t$ onward:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right]$$

where $G_t = \sum_{t'=t}^{T} r_{t'}$ is the reward-to-go.
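The reward-to-go $G_t$ (the sum of rewards from step $t$ onward) is typically computed with a single backward pass over the episode's rewards; a minimal sketch with made-up rewards:

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 3.0])   # toy episode rewards
G = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):    # backward pass: G[t] = r[t] + G[t+1]
    running += rewards[t]
    G[t] = running

print(G)   # [6. 5. 5. 3.]
```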
With Baseline
Subtracting any action-independent baseline $b(s_t)$ preserves unbiasedness:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(G_t - b(s_t)\right)\right]$$

A common choice is $b(s_t) \approx V^\pi(s_t)$, which turns the weight into an advantage estimate.
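The effect of a baseline can be seen in a one-step toy example (a uniform softmax policy over three actions with a large common reward offset; all numbers are illustrative): the mean of the gradient estimate is unchanged, but its variance drops sharply.

```python
import numpy as np

rng = np.random.default_rng(3)

probs = np.array([1/3, 1/3, 1/3])      # uniform softmax policy over 3 actions
rewards = np.array([10.0, 11.0, 12.0]) # returns share a large common offset
baseline = rewards.mean()              # any action-independent baseline works

a = rng.choice(3, size=100_000, p=probs)
# grad_theta log pi(a) with respect to the first logit: 1[a == 0] - probs[0]
score = (a == 0).astype(float) - probs[0]

g_plain = score * rewards[a]                # no baseline
g_base = score * (rewards[a] - baseline)    # with baseline

print(g_plain.mean(), g_base.mean())   # same expectation (still unbiased)
print(g_plain.var(), g_base.var())     # variance is far lower with the baseline
```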
Continuous Time / Continuing Tasks
The theorem extends to continuing MDPs with discounted returns:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right], \qquad G_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$$
State Value Version
Can also express as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)\right]$$

where $d^{\pi_\theta}$ is the state visitation distribution and $Q^{\pi_\theta}$ is the action-value function.
Practical Implications
Algorithm Design
The theorem motivates:
- REINFORCE: Use Monte Carlo sample of return directly
- Actor-Critic: Use a learned value estimate ($V(s)$ or $Q(s, a)$) instead of the full return $R(\tau)$
- PPO: Efficient trust-region variant of policy gradients
- A2C: Parallel actor-critic with baseline
Gradient Variance
The theorem shows variance comes from:
- Return sampling: Monte Carlo returns have high variance
- Policy stochasticity: Exploration adds noise
Variance reduction techniques:
- Baselines: Subtract an estimate of the expected return, $b(s_t) \approx V(s_t)$
- Advantage estimates: Use $A(s, a) = Q(s, a) - V(s)$ instead of raw returns
- Function approximation: Smooth out noisy returns
Connections
- Foundation of: Policy Gradient Methods, Actor-Critic, PPO
- Related to: Log derivative trick, Gradient ascent
- Assumes: Policy is differentiable w.r.t. parameters
- Versus: Bellman equation (basis of value-based methods)
- Enables: Model-free learning (no dynamics needed)
Appears In
- Policy Gradient Methods — Core theoretical foundation
- REINFORCE — Direct application
- Actor-Critic — Extends with learned value
- Advantage Actor-Critic (A2C) — With baselines
- PPO — Trust-region variant
- Deep Reinforcement Learning — When using neural network policies