Q-Learning
Definition
Q-Learning
Q-learning is an off-policy TD control algorithm. It directly approximates the optimal action-value function $q_*$, regardless of the policy being followed. The key insight: the update target uses $\max_a Q(S_{t+1}, a)$ — the value of the best action in the next state — not the action actually taken.
Update Rule
Q-Learning Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

where:
- $\alpha$ — step size (learning rate)
- $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ — TD target (using best next action)
- $\gamma \max_a Q(S_{t+1}, a)$ — bootstrapped estimate of future value under the greedy (optimal) policy
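Spelled out in code, a single update looks like this (a minimal sketch; the state names, Q-values, and hyperparameters are made up for illustration):

```python
# One Q-learning update, worked with toy numbers (hypothetical Q-table).
alpha, gamma = 0.1, 0.9
Q = {("s0", "left"): 1.0, ("s0", "right"): 0.5,
     ("s1", "left"): 0.0, ("s1", "right"): 2.0}

s, a, r, s_next = "s0", "right", 1.0, "s1"

# TD target uses the best action in s_next, not the action that will be taken there
best_next = max(Q[(s_next, b)] for b in ("left", "right"))  # 2.0
td_target = r + gamma * best_next                           # 1.0 + 0.9 * 2.0 = 2.8
Q[(s, a)] += alpha * (td_target - Q[(s, a)])                # 0.5 + 0.1 * (2.8 - 0.5) = 0.73
```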
Algorithm
Algorithm: Q-Learning (Off-Policy TD Control)
──────────────────────────────────────────────
Initialize Q(s,a) arbitrarily for all s,a
  (Q(terminal, ·) = 0)
Loop for each episode:
  Initialize S
  Loop for each step of episode:
    Choose A from S using policy derived from Q
      (e.g., ε-greedy w.r.t. Q)
    Take action A, observe R, S'
    Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]
    S ← S'
  until S is terminal

Why Off-Policy?
The Max Makes It Off-Policy
The behavior policy (used to select actions) is typically ε-greedy for exploration. But the update target uses $\max_a Q(S', a)$ — the greedy policy’s value. So we’re learning about the greedy (optimal) policy while following an exploratory policy.
Unlike SARSA, Q-learning doesn’t need Importance Sampling corrections because the max operation directly estimates $\max_a q_*(S', a)$ rather than sampling the next action from the behavior policy.
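Putting the pseudocode together, here is a minimal self-contained sketch on a toy MDP (a 5-state chain invented for illustration; the `step` function, reward scheme, and hyperparameters are all assumptions, not part of the algorithm itself):

```python
# Tabular Q-learning on a toy chain: states 0..4, actions 0 = left, 1 = right,
# reward +1 on reaching state 4 (terminal), 0 otherwise.
import random

N_STATES, ACTIONS = 5, (0, 1)
alpha, gamma, eps = 0.5, 0.9, 0.1

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
random.seed(0)

for _ in range(500):
    s, done = 0, False
    while not done:
        # behavior policy: ε-greedy w.r.t. Q, with random tie-breaking
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[(s, b)] for b in ACTIONS)
            a = random.choice([b for b in ACTIONS if Q[(s, b)] == best])
        s2, r, done = step(s, a)
        # target uses the max over next actions, not the action taken next
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# With γ = 0.9, the learned Q(s, right) approaches γ^(3-s), e.g. Q(0, right) ≈ 0.729
```

Because the target maximizes over next actions, the learned Q approaches $q_*$ even though the ε-greedy behavior policy keeps exploring.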
Q-Learning vs SARSA
| Property | Q-Learning | SARSA |
|---|---|---|
| Type | Off-policy | On-policy |
| Target | $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ | $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$ |
| Learns about | Optimal (greedy) policy | Current (ε-greedy) policy |
| Cliff Walking behavior | Finds optimal (risky) path | Finds safer path |
| Convergence | To $q_*$ (with conditions) | To $q_\pi$ for the current ε-greedy policy |
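In code, the two algorithms differ only in how the TD target is formed. A sketch with toy numbers (the Q-table and values are hypothetical):

```python
# Same transition, two different TD targets.
gamma, r = 0.9, 1.0
actions = ("left", "right")
Q = {("s2", "left"): 0.5, ("s2", "right"): 2.0}

a2 = "left"  # next action actually chosen by the ε-greedy behavior policy

target_q_learning = r + gamma * max(Q[("s2", b)] for b in actions)  # greedy: uses 2.0
target_sarsa = r + gamma * Q[("s2", a2)]                            # on-policy: uses 0.5
```

Here Q-learning's target is 2.8 while SARSA's is 1.45: SARSA's value estimates absorb the cost of exploratory actions, which is exactly why it finds the safer path on Cliff Walking.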
Convergence
Q-learning converges to $q_*$ with probability 1 under standard conditions:
- All state-action pairs visited infinitely often
- Step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (the Robbins–Monro conditions)
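As a quick numeric illustration (not a proof), the common schedule $\alpha_t = 1/t$ meets both conditions:

```python
# Partial sums for alpha_t = 1/t: the plain sum grows without bound (like ln n),
# while the sum of squares converges (to pi^2 / 6).
import math

n = 1_000_000
sum_a = sum(1.0 / t for t in range(1, n + 1))       # ~ ln(n), unbounded
sum_a2 = sum(1.0 / t ** 2 for t in range(1, n + 1)) # -> pi^2 / 6 ≈ 1.6449
```

Intuitively, the steps must stay large enough to reach any target ($\sum \alpha_t = \infty$) but shrink fast enough for the noise to average out ($\sum \alpha_t^2 < \infty$).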
With Function Approximation
Tabular Q-learning converges. With Function Approximation, Q-learning can diverge (the Deadly Triad). This motivated Deep Q-Network (DQN)'s stabilization techniques (Experience Replay + Target Network).
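A miniature sketch of those two stabilizers, using linear function approximation on synthetic transitions (everything here, from the features to the hyperparameters, is an invented stand-in, not the actual DQN architecture):

```python
# Two DQN stabilizers in miniature, with a linear model Q(s, a) = w[a] . phi(s).
import random
import numpy as np

rng = np.random.default_rng(0)
random.seed(0)
N_FEATURES, N_ACTIONS = 4, 2
alpha, gamma, SYNC = 0.05, 0.9, 50

w = np.zeros((N_ACTIONS, N_FEATURES))  # online weights, updated every step
w_target = w.copy()                    # frozen copy used to compute TD targets

# Experience replay: a buffer of past transitions (phi, a, r, phi2, done),
# sampled uniformly so consecutive updates are decorrelated.
replay = [(rng.normal(size=N_FEATURES), int(rng.integers(N_ACTIONS)),
           float(rng.normal()), rng.normal(size=N_FEATURES), False)
          for _ in range(200)]

for t in range(1000):
    phi, a, r, phi2, done = random.choice(replay)
    # Target network: bootstrap from the frozen weights, not the moving ones,
    # breaking the feedback loop that can destabilize Q-learning with FA.
    best_next = 0.0 if done else (w_target @ phi2).max()
    td_error = r + gamma * best_next - (w @ phi)[a]
    w[a] += alpha * td_error * phi     # semi-gradient Q-learning update
    if t % SYNC == 0:
        w_target = w.copy()            # periodic hard sync
```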
Connections
- Instance of: Temporal Difference Learning (off-policy control)
- Compared with: SARSA (on-policy), Expected SARSA
- Extended by: Double Q-learning, Deep Q-Network (DQN)
- Danger with FA: Deadly Triad