Policy Gradient Theorem
Definition
The Policy Gradient Theorem provides an analytic expression for the gradient of the performance objective with respect to the policy parameters $\theta$. It allows the agent to update its policy directly, without necessarily learning a value function first.
Mathematical Formulation
The theorem states that for any differentiable policy $\pi_\theta$, the gradient of the performance $J(\theta)$ (e.g., the average reward per step or the value of the start state) is:

$$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi_\theta(a \mid s)$$

In expected value form (commonly used for stochastic gradient ascent):

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[ q_\pi(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$$
where:
- $J(\theta)$ — Performance measure (e.g., $v_{\pi_\theta}(s_0)$, the value of the start state)
- $\mu(s)$ — On-policy state distribution under $\pi_\theta$
- $q_\pi(s, a)$ — True action-value function under policy $\pi_\theta$
- $\nabla_\theta \log \pi_\theta(a \mid s)$ — Score function
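To make the score function term concrete, here is a minimal numerical sketch. The tabular softmax policy, the state/action sizes, and the preference matrix `theta` are all illustrative assumptions, not part of the theorem itself:

```python
import numpy as np

# Assumed toy setup: tabular softmax policy over n_states x n_actions,
# parameterized by a preference matrix theta.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))

def policy(s):
    """pi(.|s): softmax over the action preferences theta[s]."""
    prefs = theta[s] - theta[s].max()  # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def score(s, a):
    """Score function grad_theta log pi(a|s) for the softmax policy.

    Nonzero only in row s: one-hot(a) minus the action probabilities.
    """
    grad = np.zeros_like(theta)
    grad[s] = -policy(s)
    grad[s, a] += 1.0
    return grad

# Sanity check: the score has zero expectation under pi(.|s),
# i.e. sum_a pi(a|s) * grad log pi(a|s) = 0.
s = 2
expected_score = sum(policy(s)[a] * score(s, a) for a in range(n_actions))
assert np.allclose(expected_score, 0.0)
```

The zero-expectation property checked at the end is what allows a baseline to be subtracted from $q_\pi(s, a)$ without biasing the gradient.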
Intuition
The Likelihood Ratio Trick
The theorem is powerful because the gradient does not depend on the gradient of the state distribution $\mu(s)$, which is typically unknown and depends on the environment's dynamics. Instead, it only depends on the gradient of the policy, $\nabla_\theta \pi_\theta(a \mid s)$, and the value of action $a$ in state $s$. We increase the probability of actions that lead to high reward ($q_\pi(s, a)$ is positive/large) and decrease it otherwise.
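The likelihood ratio trick can be verified numerically on a toy problem. The Gaussian sampling distribution and the objective `f` below are made-up assumptions for illustration; the point is that the score-function estimator matches the analytic gradient:

```python
import numpy as np

# Toy objective (assumed): J(theta) = E_{x ~ N(theta, 1)}[f(x)],
# with f(x) = -(x - 3)^2, so analytically grad J = 2 * (3 - theta).
# The score function of N(theta, 1) is grad_theta log p(x) = (x - theta),
# giving the likelihood-ratio estimator: mean of (x - theta) * f(x).
rng = np.random.default_rng(1)
theta = 0.0
f = lambda x: -(x - 3.0) ** 2

x = rng.normal(loc=theta, scale=1.0, size=200_000)
grad_estimate = np.mean((x - theta) * f(x))  # score-function estimator
grad_true = 2.0 * (3.0 - theta)

print(grad_estimate, grad_true)  # close up to Monte Carlo noise
```

Note that `f` is only ever evaluated, never differentiated, which is exactly why the trick works when the environment dynamics (here standing in for `f`) are unknown.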
Key Properties
- Foundation: It is the theoretical basis for all policy gradient algorithms.
- Objective: Direct optimization of the parameterized policy $\pi_\theta$.
- Action Selection: Naturally handles continuous action spaces (unlike max-based methods like Q-Learning).
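To illustrate the continuous-action point, here is a sketch of the score function for a linear-Gaussian policy. The parameterization (`w`, fixed `sigma`) and the state vector are assumptions chosen for brevity:

```python
import numpy as np

# Assumed continuous-action policy: pi(a|s) = N(a; w . s, sigma^2).
# No max over actions is ever needed; we just sample and compute scores.
rng = np.random.default_rng(2)
w = rng.normal(size=4)  # policy parameters
sigma = 0.5             # fixed exploration noise

def sample_action(s):
    """Draw a continuous action from the Gaussian policy."""
    return rng.normal(loc=w @ s, scale=sigma)

def score_w(s, a):
    """Score function grad_w log N(a; w.s, sigma^2) = ((a - w.s)/sigma^2) s."""
    return (a - w @ s) / sigma**2 * s

s = rng.normal(size=4)
a = sample_action(s)
print(score_w(s, a))  # gradient direction that makes action a more likely
```

Weighting these score vectors by estimates of $q_\pi(s, a)$ gives a gradient ascent direction on $J(\theta)$, with no maximization over the (uncountable) action set, which is where max-based methods like Q-Learning struggle.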
Connections
- Derived for: REINFORCE (uses sample returns $G_t$ to estimate $q_\pi(s, a)$)
- Extended in: Actor-Critic (uses a learned Critic to estimate $q_\pi(s, a)$)
- Relies on: State Space and Reward Signal definitions
Appears In
- future Week 5 lecture