TD3

Definition

TD3 (Twin Delayed DDPG)

TD3 is an off-policy, actor-critic algorithm for continuous control that fixes the systematic overestimation bias and brittle training of DDPG. It keeps DDPG’s deterministic actor and DPG update, but adds three tricks: (1) clipped double Q-learning — two critics, take the minimum as the target; (2) delayed policy updates — update the actor (and targets) less often than the critics; (3) target policy smoothing — add clipped noise to the target action so the critic cannot exploit sharp peaks.

Intuition

Why the min of two critics?

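In value-based RL with Function Approximation, the max operator (and the implicit max in DDPG’s greedy actor) combined with noisy Q-estimates produces a positive bias: $\mathbb{E}[\max_a \hat{Q}(s,a)] \ge \max_a \mathbb{E}[\hat{Q}(s,a)]$. The actor then chases these inflated values, and the error compounds through Bootstrapping. TD3’s answer is to train two separately initialised critics and form the bootstrap target from $\min(Q_{\theta'_1}, Q_{\theta'_2})$, evaluated at the next state and the target action. The min is a pessimistic estimate, so an upward error in one critic is suppressed unless the other critic shares it. This trades a small underestimation, which is far less harmful because the policy does not seek out undervalued actions, for removal of the dangerous overestimation.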
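The bias argument can be checked numerically. The sketch below is a toy setup, not part of TD3 itself: every action is assumed to have true value 0, and each critic's error is modelled as independent Gaussian noise (the action count, noise scale, and the name `bias_demo` are illustrative assumptions). It compares the value a single critic assigns to its own greedy action with the minimum of two critics at that same action.

```python
import numpy as np

def bias_demo(n_actions=10, n_trials=20_000, noise_std=1.0, seed=0):
    """Mean estimated value of the greedy action when every true Q-value is 0."""
    rng = np.random.default_rng(seed)
    # Two noisy estimates of the same true values (all zeros), independent by assumption.
    q1 = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    q2 = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    rows = np.arange(n_trials)
    greedy = q1.argmax(axis=1)           # action a greedy actor would pick under critic 1
    single = q1[rows, greedy]            # critic 1 evaluates its own greedy action
    clipped = np.minimum(single, q2[rows, greedy])  # clipped double-Q at the same action
    return single.mean(), clipped.mean()

single_bias, clipped_bias = bias_demo()
print(f"single-critic estimate:     {single_bias:+.3f} (true value is 0)")
print(f"clipped double-Q estimate:  {clipped_bias:+.3f} (true value is 0)")
```

Under these assumptions the single-critic estimate sits well above zero, while the clipped estimate lands slightly below it, which is the trade described above. Real critics share data and targets, so their errors are correlated and the correction is weaker than in this idealised case.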

Delayed updates: a moving actor chasing a still-noisy critic destabilises learning. Updating the actor only every $d$ critic steps lets the value estimate settle first (lower variance per actor update).

Target smoothing is a regulariser: similar actions should have similar values, so we fit the critic to a small neighbourhood around the target action rather than a single point, preventing overfitting to narrow Q-function spikes.
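To make the regularisation effect concrete, here is a small sketch with a hypothetical one-dimensional critic, `spiky_q`, that has a narrow spurious spike. It compares the value at the spike with the average value over clipped Gaussian perturbations of the action. The spike shape, the action range $[-1, 1]$, and the noise settings are assumptions chosen for illustration. TD3 draws one noise sample per transition, so the average here only approximates what the regression target looks like in expectation.

```python
import numpy as np

def spiky_q(a):
    """Hypothetical critic over a scalar action: smooth bump plus a narrow spike at a = 0.5."""
    smooth = 1.0 - a**2
    spike = 5.0 * np.exp(-((a - 0.5) / 0.01) ** 2)
    return smooth + spike

def smoothed_value(q, a, sigma=0.2, c=0.5, n_samples=100_000, seed=0):
    """Average q over clipped Gaussian perturbations of the action a."""
    rng = np.random.default_rng(seed)
    eps = np.clip(rng.normal(0.0, sigma, size=n_samples), -c, c)
    a_noisy = np.clip(a + eps, -1.0, 1.0)   # keep actions inside the assumed bounds
    return q(a_noisy).mean()

point_value = spiky_q(0.5)                  # value exactly on the spike
smoothed = smoothed_value(spiky_q, 0.5)     # value averaged over the neighbourhood
print(f"point value at spike: {point_value:.2f}")
print(f"smoothed value:       {smoothed:.2f}")
```

The spike dominates the point estimate but contributes little to the neighbourhood average, so a target built from smoothed actions gives the actor much less incentive to aim at it.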

Mathematical Formulation

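Target action with clipped smoothing noise:

$$\tilde{a} = \mu_{\phi'}(s') + \epsilon, \qquad \epsilon \sim \operatorname{clip}\big(\mathcal{N}(0, \sigma^2),\, -c,\, c\big)$$

Clipped double-Q target (shared by both critics):

$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a})$$

Critic loss (each critic $Q_{\theta_i}$, $i \in \{1, 2\}$, regressed to the same $y$):

$$L(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(Q_{\theta_i}(s,a) - y\big)^2\Big]$$

Delayed actor update (DPG, using only $Q_{\theta_1}$), applied every $d$ steps:

$$\nabla_\phi J(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\Big[\nabla_a Q_{\theta_1}(s,a)\big|_{a=\mu_\phi(s)}\, \nabla_\phi \mu_\phi(s)\Big]$$

Polyak (soft) target updates, also every $d$ steps:

$$\theta'_i \leftarrow \tau\,\theta_i + (1-\tau)\,\theta'_i \quad (i = 1, 2), \qquad \phi' \leftarrow \tau\,\phi + (1-\tau)\,\phi'$$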

where:

  • $\mu_\phi$: deterministic actor; $\mu_{\phi'}$: its target network
  • $Q_{\theta_1}, Q_{\theta_2}$: twin critics; $Q_{\theta'_1}, Q_{\theta'_2}$: their target networks
  • $\min_{i=1,2} Q_{\theta'_i}(s', \tilde{a})$: clipped double-Q, the pessimistic bootstrap that curbs overestimation
  • $\epsilon$: target policy smoothing noise, clipped to $[-c, c]$
  • $\sigma$: smoothing noise std; $c$: noise clip; $d$: policy/target update delay (typically $d = 2$)
  • $\gamma$: discount factor; $\tau$: Polyak averaging rate; $\mathcal{D}$: replay buffer
  • $y$: TD target; note it is shared by both critics even though only $Q_{\theta_1}$ drives the actor
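The target computation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: `td3_target` and the toy `actor_target`, `q1_target`, `q2_target` callables are hypothetical stand-ins for the target networks, and the default values of `gamma`, `sigma`, and `c` are assumed rather than taken from this note. Two common implementation details that the equations above omit are included and labelled: clipping the smoothed action to the action bounds, and masking the bootstrap with a `done` flag (with `done = 0` the function reduces to the equation for $y$).

```python
import numpy as np

def td3_target(r, s_next, done, actor_target, q1_target, q2_target, rng,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Smoothed target action followed by the clipped double-Q TD target."""
    a_next = actor_target(s_next)
    # Target policy smoothing: Gaussian noise clipped to [-c, c].
    eps = np.clip(rng.normal(0.0, sigma, size=a_next.shape), -c, c)
    # Assumption: actions are bounded, so the smoothed action is clipped to the bounds.
    a_tilde = np.clip(a_next + eps, a_low, a_high)
    # Clipped double-Q: element-wise minimum of the two target critics.
    q_min = np.minimum(q1_target(s_next, a_tilde), q2_target(s_next, a_tilde))
    # Assumption: `done` masks the bootstrap term on terminal transitions.
    return r + gamma * (1.0 - done) * q_min

# Toy stand-ins for the target networks (hypothetical, for illustration only).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
actor_target = lambda s: np.tanh(s @ W)                   # actions in (-1, 1)
q1_target = lambda s, a: s.sum(axis=1) + a.sum(axis=1)
q2_target = lambda s, a: q1_target(s, a) - 0.5            # always the smaller critic

s_next = rng.normal(size=(5, 3))   # batch of 5 next states, state dim 3
r = rng.normal(size=5)
done = np.zeros(5)
y = td3_target(r, s_next, done, actor_target, q1_target, q2_target, rng)
print(y.shape)
```

Both critics would then be regressed toward this single `y`, which is why the function returns one target rather than one per critic.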

Key Properties / Variants

  • Three improvements over DDPG: clipped double Q (bias), delayed actor/target updates (variance/stability), target policy smoothing (regularisation). DDPG = TD3 minus these three.
  • Off-policy: trains from a replay buffer $\mathcal{D}$; behaviour = $\mu_\phi(s)$ + exploration noise $\epsilon \sim \mathcal{N}(0, \sigma_{\text{explore}})$. Inherits DPG’s lack of Importance Sampling weights.
  • Two noises, different roles: exploration noise is added to actions taken in the environment; smoothing noise is added only to the target action when computing $y$.
  • Both critics regress to the same target (the min), but only one critic ($Q_{\theta_1}$) is used in the actor gradient; this avoids the actor exploiting the pessimistic min.
  • Deterministic, not stochastic: unlike Soft Actor-Critic (SAC), TD3 keeps a deterministic policy with no entropy term; exploration is purely injected noise.

Algorithm: TD3 (Twin Delayed DDPG)
──────────────────────────────────────────────────────────
Init critics Q_θ1, Q_θ2 and actor μ_φ with random params
Init targets θ'1 ← θ1, θ'2 ← θ2, φ' ← φ
Init replay buffer D
for t = 1 .. T:
  Select action with exploration noise:
    a = μ_φ(s) + ε,  ε ~ N(0, σ_explore)
  Execute a, observe r, s'; store (s,a,r,s') in D
  Sample mini-batch of N transitions (s,a,r,s') from D
 
  # ---- Target with smoothing + clipped double-Q ----
  ã ← μ_φ'(s') + clip(N(0,σ²), -c, +c)
  y ← r + γ · min( Q_θ'1(s',ã), Q_θ'2(s',ã) )
 
  # ---- Update BOTH critics toward the same y ----
  θ_i ← argmin_θ_i  (1/N) Σ (Q_θ_i(s,a) - y)²   for i = 1,2
 
  # ---- Delayed actor + target updates (every d steps) ----
  if t mod d == 0:
    update φ by deterministic policy gradient:
      ∇_φ J = (1/N) Σ ∇_a Q_θ1(s,a)|_{a=μ_φ(s)} ∇_φ μ_φ(s)
    Polyak update targets:
      θ'_i ← τ θ_i + (1-τ) θ'_i   for i = 1,2
      φ'   ← τ φ   + (1-τ) φ'
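
The update schedule in the pseudocode can be exercised with a deliberately small NumPy sketch. Everything specific here is an assumption made to keep the example self-contained: the actor and critics are linear so gradients can be written by hand, each critic takes a single gradient step instead of a full argmin, batches are random arrays rather than replay-buffer samples, and the hyperparameter values and the names `td3_update` and `random_batch` are illustrative. As in the pseudocode, there is no terminal-state handling.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, N = 3, 2, 64                     # state dim, action dim, batch size (illustrative)
gamma, tau, d = 0.99, 0.005, 2         # discount, Polyak rate, policy delay
sigma, c, lr = 0.2, 0.5, 1e-2          # smoothing std, noise clip, learning rate

def q(w, s, a):
    """Linear critic Q(s, a) = w . [s, a, 1]; also returns the feature matrix."""
    x = np.concatenate([s, a, np.ones((len(s), 1))], axis=1)
    return x @ w, x

def mu(K, s):
    """Linear deterministic actor mu(s) = s K."""
    return s @ K

params = {"w1": 0.1 * rng.normal(size=S + A + 1),
          "w2": 0.1 * rng.normal(size=S + A + 1),
          "K": 0.1 * rng.normal(size=(S, A))}
targets = {k: v.copy() for k, v in params.items()}

def td3_update(t, batch):
    """One TD3 step; returns True when the actor and targets were also updated."""
    s, a, r, s2 = batch
    # Target with smoothing + clipped double-Q.
    eps = np.clip(rng.normal(0.0, sigma, size=(len(s), A)), -c, c)
    a_tilde = mu(targets["K"], s2) + eps
    y = r + gamma * np.minimum(q(targets["w1"], s2, a_tilde)[0],
                               q(targets["w2"], s2, a_tilde)[0])
    # Both critics take one gradient step on the MSE toward the same y.
    for k in ("w1", "w2"):
        pred, x = q(params[k], s, a)
        params[k] -= lr * 2.0 * x.T @ (pred - y) / len(s)
    if t % d != 0:
        return False
    # Delayed actor step (gradient ascent on Q1). For a linear critic,
    # dQ1/da is the action slice of w1, and dmu/dK contributes the state.
    grad_a = params["w1"][S:S + A]
    params["K"] += lr * np.outer(s.mean(axis=0), grad_a)
    # Polyak update of all target parameters.
    for k in params:
        targets[k] = tau * params[k] + (1.0 - tau) * targets[k]
    return True

def random_batch():
    """Random transitions standing in for a replay-buffer sample."""
    return (rng.normal(size=(N, S)), rng.normal(size=(N, A)),
            rng.normal(size=N), rng.normal(size=(N, S)))

actor_steps = sum(td3_update(t, random_batch()) for t in range(1, 11))
print("critic updates: 10, actor/target updates:", actor_steps)
```

With `d = 2` the actor and the target parameters move on every second call while the critics move on every call, which is the "delayed" part of the algorithm's name.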

Connections

Appears In