Actor-Critic

Definition

Actor-Critic Methods

Actor-Critic methods are a class of Reinforcement Learning algorithms that combine both policy-based and value-based approaches. They consist of two components:

Actor: Responsible for selecting actions by maintaining a parameterized policy $π_{θ} (a ∣ s)$.

Critic: Responsible for evaluating the actions taken by the actor by estimating a value function (usually the state-value function $V_{w} (s)$ or action-value function $Q_{w} (s, a)$).

Intuition

Evaluation Reduces Variance

In pure policy gradient methods like REINFORCE, the update uses the full episode return $G_{t}$, which has high variance because it depends on every random action and transition until the end of the episode. Actor-Critic methods replace the return with a target built from the Critic's value estimate, such as $R_{t + 1} + γ V_{w} (S_{t + 1})$, which depends on only one random step. This “bootstrapping” significantly reduces variance, leading to faster and more stable learning, though at the cost of introducing some bias whenever the value estimate is inaccurate.
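The variance gap can be illustrated with a small simulation. This is a sketch under toy assumptions that are not from the notes: every step of a fixed-length chain gives an independent N(0, 1) reward, so the true value of every state is 0 and a perfect Critic outputs 0. The function name `sample_targets` and its parameters are made up for illustration.

```python
import numpy as np

def sample_targets(n_episodes=5000, horizon=20, gamma=1.0, seed=0):
    """Sample Monte Carlo returns and one-step TD targets for the start state.

    Toy assumption: rewards are i.i.d. N(0, 1), so V(s) = 0 for every state
    and the critic used here is exact (no bias).
    """
    rng = np.random.default_rng(seed)
    rewards = rng.normal(0.0, 1.0, size=(n_episodes, horizon))
    discounts = gamma ** np.arange(horizon)
    # REINFORCE-style target: G_0 = sum_k gamma^k * R_{k+1}
    mc_returns = rewards @ discounts
    # Actor-Critic-style target: R_1 + gamma * V(S_1), with V(S_1) = 0
    v_next = 0.0
    td_targets = rewards[:, 0] + gamma * v_next
    return mc_returns, td_targets

mc, td = sample_targets()
print("variance of Monte Carlo return:", mc.var())
print("variance of one-step TD target:", td.var())
```

With `gamma=1.0` the return sums `horizon` independent rewards, so its variance is about `horizon` times that of the one-step target. Because the Critic in this toy setting is exact, the example shows only the variance side of the trade-off; a learned Critic would also add bias.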

Mathematical Formulation

The Actor update typically follows the gradient of the performance objective, often using the Advantage function to indicate how much better an action was than average:

Actor-Critic Update (Advantage Actor-Critic)

Actor Update: $θ_{t + 1} \leftarrow θ_{t} + α_{θ} \nabla_{θ} \log π_{θ} (A_{t} ∣ S_{t}) \hat{A} (S_{t}, A_{t})$

Critic Update (TD Error): $δ_{t} = R_{t + 1} + γ V_{w} (S_{t + 1}) - V_{w} (S_{t})$

$w_{t + 1} \leftarrow w_{t} + α_{w} δ_{t} \nabla_{w} V_{w} (S_{t})$

where:

$\hat{A} (S_{t}, A_{t}) \approx δ_{t}$ is the Advantage (estimated by the TD error)

$θ$ — Actor parameters

$w$ — Critic parameters
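The two updates can be sketched for the simplest case: a tabular softmax Actor and a tabular Critic. This is a minimal illustration under assumptions not stated in the notes. The table shapes, the step sizes, the treatment of terminal states as having value 0, and the name `actor_critic_step` are all hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      alpha_theta=0.1, alpha_w=0.2, gamma=0.99):
    """Apply one in-place Actor and Critic update; return the TD error.

    theta: array [n_states, n_actions] of softmax preferences (Actor).
    w:     array [n_states] of state-value estimates (Critic).
    """
    # TD error: delta = R + gamma * V(S') - V(S); assume V(terminal) = 0.
    v_next = 0.0 if done else w[s_next]
    delta = r + gamma * v_next - w[s]
    # Critic: for a tabular V_w, the gradient w.r.t. w is one-hot at s.
    w[s] += alpha_w * delta
    # Actor: for a softmax policy, the gradient of log pi(a|s) w.r.t.
    # theta[s] is one_hot(a) - pi(.|s).
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log_pi
    return delta

# One transition: state 0, action 1, reward 1.0, next state 1.
theta = np.zeros((2, 2))
w = np.zeros(2)
delta = actor_critic_step(theta, w, s=0, a=1, r=1.0, s_next=1, done=False)
print(delta, w, theta[0])
```

Starting from all-zero tables, the TD error is 1.0, so the Critic raises its estimate for state 0 and the Actor shifts preference toward action 1 and away from action 0. With function approximation, the same two lines would use the gradients of the network outputs in place of the one-hot expressions.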

Variants and Key Properties

A2C (Advantage Actor-Critic): Synchronous version in which multiple workers collect experience in parallel and their contributions are combined into a single update of the shared model.
A3C (Asynchronous Advantage Actor-Critic): Asynchronous version where workers update global parameters independently.
Relationship to Policy Gradient: Actor-Critic is an application of the Policy Gradient Theorem in which the Critic supplies the $Q$-value or Advantage estimate that weights the policy gradient.
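
The relationship to the Policy Gradient Theorem can be spelled out in three steps. This sketch writes $J(θ)$ for the performance objective and assumes the Critic equals the true value function $V^{π}$, which is an idealization; a learned $V_{w}$ only approximates it, and that approximation is the source of the bias.

$\nabla_{θ} J(θ) \propto \mathbb{E}_{π} [ Q^{π} (S_{t}, A_{t}) \nabla_{θ} \log π_{θ} (A_{t} ∣ S_{t}) ]$

$\mathbb{E}_{π} [ V^{π} (S_{t}) \nabla_{θ} \log π_{θ} (A_{t} ∣ S_{t}) ] = 0$

$\mathbb{E} [ δ_{t} ∣ S_{t}, A_{t} ] = \mathbb{E} [ R_{t + 1} + γ V^{π} (S_{t + 1}) ∣ S_{t}, A_{t} ] - V^{π} (S_{t}) = Q^{π} (S_{t}, A_{t}) - V^{π} (S_{t})$

The first line is the Policy Gradient Theorem. The second shows that subtracting a state-dependent baseline leaves the gradient unchanged in expectation. The third shows that, under the stated assumption, the TD error is an unbiased sample of the Advantage, which is why it can stand in for $\hat{A} (S_{t}, A_{t})$ in the Actor update.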

Connections

Combines: Policy Gradient Theorem and Temporal Difference Learning
Improves upon: REINFORCE (by reducing variance through bootstrapping)
Foundation for: Soft Actor-Critic (SAC), PPO, DDPG

Appears In

RL-L09 - Policy Gradient Methods (mentioned)
future weeks
