Policy Gradient Theorem
Definition
The Policy Gradient Theorem is a fundamental result in reinforcement learning that expresses the gradient of the expected return with respect to the policy parameters $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
This equation is the foundation for all policy gradient methods. It says: to increase expected return, increase the log-probability of actions in high-return trajectories.
Intuition
Why This Makes Sense
Imagine sampling episodes (trajectories) from your current policy:
- Episodes with high total reward should be reinforced
- Episodes with low total reward should be deprioritized
- The log-probability gradient acts as a “handle” to adjust the policy
The algorithm works by:
- Sample an episode and measure its return $R(\tau)$
- Compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for each action taken
- Update: $\theta \leftarrow \theta + \alpha \, R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Result: Good actions become more likely, bad actions less likely.
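A minimal sketch of this loop on a toy 3-armed bandit with a softmax policy (the bandit, its mean rewards, and the hyperparameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit (made-up mean rewards; arm 2 is best).
true_means = np.array([0.2, 0.5, 0.8])
theta = np.zeros(3)   # softmax policy parameters, one per arm
alpha = 0.1           # step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)           # sample an action (a one-step "episode")
    r = rng.normal(true_means[a], 0.1)   # observe its return
    grad_log_pi = -probs                 # grad log pi(a) for softmax: one_hot(a) - probs
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi     # REINFORCE: theta += alpha * R * grad log pi

print(softmax(theta))   # probability mass should concentrate on arm 2
```

Because the one-step return of arm 2 is highest, its log-probability is pushed up most often, and the policy concentrates on it.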
The Log-Derivative Trick
The key technical insight is the log-derivative trick:

$$\nabla_\theta p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)$$

This allows us to move the gradient inside an expectation with respect to a distribution that depends on the parameters:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}\left[f(x) \, \nabla_\theta \log p_\theta(x)\right]$$
This is why we work with log-probabilities: they make the gradient tractable.
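The trick can be checked numerically. For a toy Bernoulli variable with $p = \sigma(\theta)$, the score-function estimate of $\nabla_\theta \mathbb{E}[f(x)]$ should match the analytic gradient (the distribution and $f$ below are illustrative choices, not anything prescribed by the theorem):

```python
import numpy as np

rng = np.random.default_rng(1)

# x ~ Bernoulli(p) with p = sigmoid(theta); f is an arbitrary "reward" function.
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))
f = lambda x: 3.0 * x + 1.0   # f(0) = 1, f(1) = 4

# Analytic gradient of E[f(x)] = p*f(1) + (1-p)*f(0):
analytic = p * (1 - p) * (f(1) - f(0))   # dE/dtheta = p(1-p)(f(1) - f(0))

# Score-function estimate: grad_theta log P(x) = x - p for this parameterization.
x = rng.binomial(1, p, size=200_000)
estimate = np.mean(f(x) * (x - p))

print(analytic, estimate)   # the two agree up to Monte Carlo noise
```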
Mathematical Derivation
Starting Point
For an episodic task with trajectories $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] = \int p_\theta(\tau) \, R(\tau) \, d\tau$$
Applying the Log-Derivative Trick
Using the log-derivative trick:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau)$$

Therefore:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) \, R(\tau) \, d\tau = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau) \, R(\tau)\right]$$
Factoring the Trajectory Probability
The trajectory probability factors as:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$$

The log:

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]$$

Gradient w.r.t. $\theta$ (only the policy term depends on $\theta$):

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Final Result
Substituting back:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
Key Properties
Unbiasedness
The estimator is unbiased: sampling one trajectory gives an unbiased estimate of the true gradient.
Consistency
With enough samples, the empirical average converges to the true gradient (by law of large numbers).
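A quick illustration of consistency, reusing the toy Bernoulli setup ($f(x) = x$, $p = \sigma(\theta)$; all constants are illustrative): the error of the Monte Carlo score-function estimate shrinks as the sample count grows.

```python
import numpy as np

rng = np.random.default_rng(4)

# x ~ Bernoulli(p), f(x) = x, so dE[f]/dtheta = p*(1-p) with p = sigmoid(theta).
theta = 0.0
p = 1.0 / (1.0 + np.exp(-theta))   # p = 0.5
true_grad = p * (1 - p)            # 0.25

for n in [100, 10_000, 1_000_000]:
    x = rng.binomial(1, p, size=n)
    est = np.mean(x * (x - p))      # f(x) * grad_theta log P(x)
    print(n, abs(est - true_grad))  # error shrinks (on average) as n grows
```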
On-Policy
The theorem requires samples from the current policy $\pi_\theta$. Using samples from a different (behavior) policy, i.e. off-policy learning, requires an importance sampling correction.
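A sketch of the importance-sampling idea on a one-step toy problem (the target and behavior distributions and the per-action returns are made up): weighting each sample by the probability ratio recovers the on-policy expectation from off-policy samples.

```python
import numpy as np

rng = np.random.default_rng(2)

p = np.array([0.7, 0.2, 0.1])    # target policy over 3 actions
q = np.array([1/3, 1/3, 1/3])    # behavior policy (uniform)
f = np.array([1.0, 5.0, 10.0])   # per-action return

true_value = np.dot(p, f)                     # on-policy expectation: 2.7
a = rng.choice(3, size=100_000, p=q)          # sample actions off-policy from q
is_estimate = np.mean((p[a] / q[a]) * f[a])   # importance-weighted average

print(true_value, is_estimate)   # the estimate matches up to sampling noise
```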
Direct Dependency on Dynamics is Not Needed
Crucially, the dynamics terms $p(s_{t+1} \mid s_t, a_t)$ drop out of the gradient, since they do not depend on $\theta$. We don't need to know or learn the environment dynamics!
Variations and Extensions
Causality-Aware Version
We can reduce variance by noting that the action at time $t$ only affects rewards from time $t$ onward:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right]$$

where $G_t = \sum_{t'=t}^{T} r_{t'}$ is the reward-to-go.
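The reward-to-go $G_t$ (the sum of rewards from step $t$ onward) is typically computed with a single backward pass over the episode's rewards; a minimal sketch with made-up rewards:

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 3.0])   # toy episode rewards
G = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):    # backward pass: G[t] = r[t] + G[t+1]
    running += rewards[t]
    G[t] = running

print(G)   # [6. 5. 5. 3.]
```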
With Baseline
Subtracting any action-independent baseline $b(s_t)$ preserves unbiasedness:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(G_t - b(s_t)\right)\right]$$

A common choice is $b(s_t) \approx V^\pi(s_t)$, which turns the weight into an advantage estimate.
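The effect of a baseline can be seen in a one-step toy example (a uniform softmax policy over three actions with a large common reward offset; all numbers are illustrative): the mean of the gradient estimate is unchanged, but its variance drops sharply.

```python
import numpy as np

rng = np.random.default_rng(3)

probs = np.array([1/3, 1/3, 1/3])      # uniform softmax policy over 3 actions
rewards = np.array([10.0, 11.0, 12.0]) # returns share a large common offset
baseline = rewards.mean()              # any action-independent baseline works

a = rng.choice(3, size=100_000, p=probs)
# grad_theta log pi(a) with respect to the first logit: 1[a == 0] - probs[0]
score = (a == 0).astype(float) - probs[0]

g_plain = score * rewards[a]                # no baseline
g_base = score * (rewards[a] - baseline)    # with baseline

print(g_plain.mean(), g_base.mean())   # same expectation (still unbiased)
print(g_plain.var(), g_base.var())     # variance is far lower with the baseline
```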
Continuous Time / Continuing Tasks
The theorem extends to continuing MDPs with discounted returns:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right], \qquad G_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$$
State Value Version
Can also express as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)\right]$$

where $d^{\pi_\theta}$ is the state visitation distribution and $Q^{\pi_\theta}$ is the action-value function.
Practical Implications
Algorithm Design
The theorem motivates:
- REINFORCE: Use Monte Carlo sample of return directly
- Actor-Critic: Use a learned value estimate ($V(s)$ or $Q(s, a)$) instead of the full return $R(\tau)$
- PPO: Efficient trust-region variant of policy gradients
- A2C: Parallel actor-critic with baseline
Gradient Variance
The theorem shows variance comes from:
- Return sampling: Monte Carlo returns have high variance
- Policy stochasticity: Exploration adds noise
Variance reduction techniques:
- Baselines: Subtract an estimate of the expected return, $b(s_t) \approx V(s_t)$
- Advantage estimates: Use $A(s, a) = Q(s, a) - V(s)$ instead of raw returns
- Function approximation: Smooth out noisy returns
Connections
- Foundation of: Policy Gradient Methods, Actor-Critic, PPO
- Related to: Log derivative trick, Gradient ascent
- Assumes: Policy is differentiable w.r.t. parameters
- Versus: Bellman equation (basis of value-based methods)
- Enables: Model-free learning (no dynamics needed)
Appears In
- Policy Gradient Methods — Core theoretical foundation
- REINFORCE — Direct application
- Actor-Critic — Extends with learned value
- Advantage Actor-Critic (A2C) — With baselines
- PPO — Trust-region variant
- Deep Reinforcement Learning — When using neural network policies