Policy Gradient Theorem

Definition

Policy Gradient Theorem

The Policy Gradient Theorem provides an analytic expression for the gradient of the performance objective with respect to the policy parameters $\theta$. It allows the agent to update its policy directly, without necessarily needing to learn a value function first.

Mathematical Formulation

The theorem states that for any differentiable policy $\pi_\theta$, the gradient of the performance $J(\theta)$ (e.g., the average reward per step or the value of the start state) is:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)$$

In expected value form (commonly used for stochastic gradient ascent):

$$\nabla J(\theta) = \mathbb{E}_\pi\left[ q_\pi(S_t, A_t)\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \right]$$

where:

  • $J(\theta)$ — Performance measure (e.g., $v_{\pi_\theta}(s_0)$, the value of the start state)
  • $\mu(s)$ — On-policy state distribution under $\pi$
  • $q_\pi(s, a)$ — True action-value function under policy $\pi$
  • $\nabla_\theta \ln \pi(a \mid s, \theta)$ — Score function (gradient of the log-policy)

Intuition

The Likelihood Ratio Trick

The theorem is powerful because the gradient does not depend on the gradient of the state distribution $\nabla \mu(s)$, which is typically unknown and depends on the environment’s dynamics. Instead, it only depends on the gradient of the policy, $\nabla_\theta \pi(a \mid s, \theta)$, and the value $q_\pi(s, a)$ of action $a$ in state $s$. We increase the probability of actions that lead to high reward ($q_\pi(s, a)$ positive/large) and decrease it otherwise.
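The likelihood-ratio identity above can be checked numerically. Below is a small sketch (not from the source) assuming a softmax policy over a hypothetical 3-armed bandit, where fixed arm values stand in for $q_\pi(s, a)$: the analytic gradient of $J(\theta) = \sum_a \pi(a)\,q(a)$ should match the Monte Carlo average of $q(a)\,\nabla_\theta \ln \pi(a)$ over actions sampled from the policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: fixed action values stand in for q_pi(s, a).
q = np.array([1.0, 2.0, 0.5])
theta = np.array([0.1, -0.2, 0.3])  # softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())  # shift for numerical stability
    return z / z.sum()

pi = softmax(theta)

# Analytic gradient of J(theta) = sum_a pi(a) q(a) for a softmax policy:
# dJ/dtheta_i = pi_i * (q_i - sum_a pi_a q_a)
analytic = pi * (q - pi @ q)

# Likelihood-ratio estimate: E_pi[ q(A) * grad_theta log pi(A) ],
# where grad_theta log pi(a) = onehot(a) - pi for a softmax.
n = 200_000
actions = rng.choice(3, size=n, p=pi)
scores = np.eye(3)[actions] - pi                 # score function per sample
mc = (q[actions][:, None] * scores).mean(axis=0)  # Monte Carlo average

print(analytic)
print(mc)  # close to the analytic gradient
```

The two printed vectors agree up to sampling noise, illustrating that sampling the score function weighted by action values recovers the true gradient without ever differentiating the (here trivial) state distribution.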

Key Properties

  • Foundation: It is the theoretical basis for all policy gradient algorithms.
  • Objective: Direct optimization of the policy $\pi_\theta$.
  • Action Selection: Naturally handles continuous action spaces (unlike max-based methods like Q-Learning).
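These properties are what REINFORCE-style algorithms build on. As a minimal illustration (a sketch under assumed settings, not from the source), the update $\theta \leftarrow \theta + \alpha\, r\, \nabla_\theta \ln \pi(a \mid \theta)$ applied to a softmax policy on a hypothetical noisy bandit shifts probability mass toward the highest-value arm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bandit: arm 1 has the highest mean reward.
true_means = np.array([0.2, 1.0, 0.5])
theta = np.zeros(3)   # softmax policy parameters
alpha = 0.05          # step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(5000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = true_means[a] + rng.normal(scale=0.1)  # sampled reward
    # REINFORCE update: theta += alpha * r * grad_theta log pi(a),
    # with grad_theta log pi(a) = onehot(a) - pi for a softmax.
    score = -pi
    score[a] += 1.0
    theta += alpha * r * score

pi = softmax(theta)
print(pi)  # most probability mass on arm 1
```

Note that the policy is updated purely from sampled actions and rewards, with no explicit value function, and the same code would work unchanged if the action distribution were continuous (e.g., a Gaussian policy) — the property Q-Learning's max over actions lacks.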

Connections

Appears In

  • future Week 5 lecture