Temporal Difference Learning

Definition

Temporal Difference (TD) Learning

TD learning combines ideas from Monte Carlo Methods and Dynamic Programming. Like MC, it learns from experience without a model. Like DP, it updates estimates based on other estimates (bootstrapping) without waiting for the end of an episode.

TD(0) — The Core Update

TD(0) Update Rule

where:

  • — learning rate (step size)
  • TD target (one-step bootstrap estimate of )
  • TD Error

The Key Idea

Instead of waiting for the actual return (like MC does), TD uses the estimate as a target. It updates immediately after one transition — no need to wait for the episode to end.

Comparison: TD vs MC vs DP

PropertyDPMCTD
Requires model?
Bootstraps?
Learns from experience?
Requires complete episodes?N/A
Online (step-by-step)?N/A

The Best of Both Worlds

TD = sample-based (like MC) + bootstrapping (like DP). It doesn’t need a model AND doesn’t need to wait for episode termination.

TD for Control

SARSA — On-Policy TD Control

SARSA Update

Name comes from the quintuple:

Q-Learning — Off-Policy TD Control

Q-Learning Update

Uses instead of following the actual next action — learns about the greedy (optimal) policy regardless of what action the behavior policy takes.

Expected SARSA

Expected SARSA Update

Takes the expectation over next actions under the policy instead of sampling a single . Lower variance than SARSA.

Backup Diagrams

TD(0):

  (S_t)
    |
   [A_t]    ← single sampled action
    |
  (S_{t+1}) ← single sampled next state, uses V(S_{t+1})

Samples one step, bootstraps from the estimate. Contrast with MC (samples to end) and DP (considers all branches).

Key Properties

  • Biased but consistent: TD targets are biased (use estimates), but converge to correct values
  • Lower variance than MC: Because it doesn’t use the full noisy return
  • Can learn online: Updates after every step, no need to wait for episode end
  • Works for continuing tasks: Unlike MC which needs episode termination
  • TD(0) converges to under appropriate step-size conditions (tabular case)

Connections

Appears In