Actor-Critic
Definition
Actor-Critic Methods
Actor-Critic methods are a class of Reinforcement Learning algorithms that combine both policy-based and value-based approaches. They consist of two components:
- Actor: Responsible for selecting actions by maintaining a parameterized policy $\pi_\theta(a \mid s)$.
- Critic: Responsible for evaluating the actions taken by the actor by estimating a value function (usually the state-value function $V_w(s)$ or action-value function $Q_w(s, a)$).
Intuition
Evaluation Reduces Variance
In pure policy gradient methods like REINFORCE, the update uses the full episode return $G_t$, which has high variance because it depends on many random actions and transitions. Actor-Critic methods replace the return with a value estimate from the Critic. This “bootstrapping” significantly reduces variance, leading to faster and more stable learning, though at the cost of introducing some bias from the value estimate.
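The variance claim can be checked numerically. The sketch below is a toy setup, not taken from the lecture: a hypothetical 20-step chain where every reward is Gaussian with mean 1. It compares the Monte Carlo return with a one-step bootstrapped target. The exact next-state value stands in for a perfect critic, so no bias appears here; a learned critic would introduce some.

```python
import random
import statistics

def sample_targets(horizon=20, gamma=0.9, reward_std=1.0, rng=None):
    """Return (monte_carlo_return, one_step_td_target) for the start state."""
    rng = rng or random.Random()
    rewards = [rng.gauss(1.0, reward_std) for _ in range(horizon)]
    # Monte Carlo return: discounted sum of every sampled reward.
    mc_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    # Exact value of the next state (mean reward is 1.0 at every step).
    # Assumption: this plays the role of a perfect critic.
    v_next = sum(gamma ** k for k in range(horizon - 1))
    # One-step target: only the first reward is random.
    td_target = rewards[0] + gamma * v_next
    return mc_return, td_target

rng = random.Random(0)
samples = [sample_targets(rng=rng) for _ in range(5000)]
mc_values = [s[0] for s in samples]
td_values = [s[1] for s in samples]
var_mc = statistics.variance(mc_values)
var_td = statistics.variance(td_values)
print(f"variance of Monte Carlo return: {var_mc:.2f}")
print(f"variance of one-step TD target: {var_td:.2f}")
```

Both targets have the same expected value in this setup, but the Monte Carlo return accumulates noise from all 20 rewards while the bootstrapped target carries noise from only one.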
Mathematical Formulation
The Actor update typically follows the gradient of the performance objective, often using the Advantage function $A(s, a) = Q(s, a) - V(s)$ to indicate how much better an action was than the policy's average in that state:
Actor-Critic Update (Advantage Actor-Critic)
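Actor Update:
$$\theta \leftarrow \theta + \alpha_\theta \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Critic Update (TD Error):
$$\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$$
$$w \leftarrow w + \alpha_w \, \delta_t \, \nabla_w V_w(s_t)$$
where:
- $\delta_t$ is the Advantage (estimated by the TD error)
- $\theta$: Actor parameters
- $w$: Critic parameters
- $\alpha_\theta$, $\alpha_w$: Actor and Critic learning rates
- $\gamma$: discount factor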
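A minimal sketch of one such update step, assuming a tabular softmax Actor and a tabular state-value Critic. The function names, the table layout, the learning rates, and the single-state toy task are all illustrative assumptions rather than anything prescribed by the lecture.

```python
import math
import random

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=0.1, alpha_critic=0.2):
    """Apply a one-step actor-critic update in place and return the TD error.

    theta[s][a]: action preferences (tabular softmax Actor, hypothetical layout)
    v[s]:        state-value estimates (tabular Critic, hypothetical layout)
    """
    # TD error: delta = r + gamma * V(s') - V(s); terminal states have value 0.
    target = r + (0.0 if done else gamma * v[s_next])
    delta = target - v[s]
    # Actor: the gradient of log softmax w.r.t. theta[s][b] is 1[b == a] - pi(b | s).
    pi = softmax(theta[s])
    for b in range(len(pi)):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_actor * delta * grad_log
    # Critic: for a tabular V, the gradient w.r.t. v[s] is 1, so move v[s] along delta.
    v[s] += alpha_critic * delta
    return delta

# Toy task (assumption): one state, two actions, action 1 pays 1.0, episode ends.
rng = random.Random(0)
theta = [[0.0, 0.0]]
v = [0.0]
for _ in range(2000):
    pi = softmax(theta[0])
    a = 0 if rng.random() < pi[0] else 1
    r = 1.0 if a == 1 else 0.0
    actor_critic_step(theta, v, 0, a, r, 0, True)
final_pi = softmax(theta[0])
print("final policy:", final_pi)
```

The TD error is used twice: it scales the Actor's log-probability gradient (as the Advantage estimate) and it is the Critic's own learning signal. On the toy task the policy should shift toward the rewarded action.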
Variants and Key Properties
- A2C (Advantage Actor-Critic): Synchronous version where multiple workers collect experience in parallel and their gradients are combined into a single update of the global model.
- A3C (Asynchronous Advantage Actor-Critic): Asynchronous version where workers update global parameters independently.
- Relationship to Policy Gradient: Actor-Critic is an application of the Policy Gradient Theorem in which the Critic provides the $Q$-value or Advantage estimate.
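The difference between the synchronous and asynchronous variants can be sketched with plain lists of gradients. This is a deliberately simplified illustration: the function names are hypothetical, there is no real parallelism, and the asynchronous version ignores the fact that in practice each worker may compute its gradient from slightly stale parameters.

```python
def synchronous_update(params, worker_grads, lr=0.1):
    """A2C-style: wait for every worker, average the gradients, apply one step."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(params))]
    # Gradient ascent on the performance objective.
    return [p + lr * g for p, g in zip(params, avg)]

def asynchronous_update(params, worker_grads, lr=0.1):
    """A3C-style: apply each worker's gradient as soon as it arrives."""
    params = list(params)  # copy so the caller's list is not modified
    for grad in worker_grads:
        for i, g in enumerate(grad):
            params[i] += lr * g
    return params

grads = [[1.0, 0.0], [3.0, 2.0]]  # made-up gradients from two workers
sync_params = synchronous_update([1.0, 2.0], grads)
async_params = asynchronous_update([1.0, 2.0], grads)
print("synchronous: ", sync_params)
print("asynchronous:", async_params)
```

The synchronous version takes one averaged step per round, while the asynchronous version takes one step per worker.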
Connections
- Combines: Policy Gradient Theorem and Temporal Difference Learning
- Improves upon: REINFORCE (by reducing variance through bootstrapping)
- Foundation for: Soft Actor-Critic (SAC), PPO, DDPG
Appears In
- RL-L08 - Policy Gradient and Actor-Critic (mentioned)
- future weeks