Deep Q-Network (DQN)
Definition
Deep Q-Network
DQN (Mnih et al., 2015) is a deep RL algorithm that combines Q-Learning with deep neural networks to handle high-dimensional state spaces (e.g., raw pixel input). It introduced two key stabilization techniques — Experience Replay and Target Network — to address the instability of combining Function Approximation with bootstrapping.
Architecture
- Input: State representation (e.g., 4 stacked $84 \times 84$ grayscale frames of Atari game pixels: $s \in \mathbb{R}^{84 \times 84 \times 4}$)
- Network: Convolutional neural network
- Output: $Q(s, a; \theta)$ for all actions simultaneously (one output per action)
- Action selection: $a = \arg\max_a Q(s, a; \theta)$ (with ε-greedy for exploration)
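The action-selection rule above can be sketched in plain Python; `q_values` stands in for the network's output layer (one Q-estimate per action), which is an assumption about how the forward pass is exposed:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon,
    otherwise the greedy action argmax_a Q(s, a; theta).

    q_values: list of Q-value estimates, one entry per action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the choice is purely greedy:
assert epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0) == 1
```

In the full agent, ε is typically annealed from 1.0 toward a small floor (e.g., 0.1) over the first frames of training.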
Loss Function
DQN Loss
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$
where:
- $\theta$ — current network weights (being trained)
- $\theta^-$ — target network weights (frozen copy, updated periodically)
- $\mathcal{D}$ — replay buffer of past transitions
- $(s, a, r, s')$ — sampled transition
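For a single sampled transition, the loss terms reduce to a few lines of arithmetic. A minimal sketch (scalar Q-values stand in for network outputs; terminal states drop the bootstrap term):

```python
GAMMA = 0.99  # discount factor

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Bootstrap target y = r + gamma * max_a' Q(s', a'; theta^-).

    next_q_values should come from the frozen target network;
    for terminal transitions the bootstrap term is dropped.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_error(q_sa, target):
    """One sample of the DQN loss: (y - Q(s, a; theta))^2."""
    return (target - q_sa) ** 2

y = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False)
assert abs(y - (1.0 + 0.99 * 2.0)) < 1e-9
```

In practice the squared error is averaged over a mini-batch and differentiated with respect to $\theta$ only; the target $y$ is treated as a constant.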
Key Innovation 1: Experience Replay
Experience Replay
Store transitions $(s, a, r, s')$ in a replay buffer $\mathcal{D}$. Train by sampling random mini-batches from $\mathcal{D}$ rather than using consecutive transitions.
Why it helps:
- Breaks temporal correlations: Consecutive samples are highly correlated → bad for SGD. Random sampling decorrelates.
- Data efficiency: Each transition can be reused many times.
- Smooths over data distribution: Averages over many past policies’ behavior.
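A replay buffer needs only a fixed-capacity FIFO store plus uniform sampling; a minimal stdlib sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; oldest transitions are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation
        # between the transitions in a mini-batch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=2)
buf.push(0, 1, 1.0, 1, False)
buf.push(1, 0, 0.0, 2, False)
buf.push(2, 1, 1.0, 3, True)  # evicts the oldest transition
assert len(buf) == 2
```

The Atari agents used a capacity of roughly one million transitions; `deque(maxlen=...)` handles the eviction automatically.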
Key Innovation 2: Target Network
Target Network
A separate copy of the Q-network with weights $\theta^-$, updated to match $\theta$ only every $C$ steps (or via soft update $\theta^- \leftarrow \tau\theta + (1 - \tau)\theta^-$).
Why it helps:
- Without it, the target changes with every weight update → moving target problem → instability
- Freezing the target network for $C$ steps provides a stable regression target
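Both update schemes are a one-liner once weights are viewed as flat vectors; a sketch with plain Python lists standing in for network parameters:

```python
def hard_update(target_weights, online_weights):
    """theta^- <- theta: full copy, performed every C steps."""
    return list(online_weights)

def soft_update(target_weights, online_weights, tau):
    """theta^- <- tau * theta + (1 - tau) * theta^- (Polyak averaging),
    performed every step with a small tau (e.g., 0.005)."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_weights, target_weights)]

theta = [1.0, 2.0]
theta_minus = [0.0, 0.0]
assert hard_update(theta_minus, theta) == [1.0, 2.0]
assert soft_update(theta_minus, theta, tau=0.5) == [0.5, 1.0]
```

The original DQN paper used the hard update; soft updates are more common in later actor-critic methods (e.g., DDPG) but work for DQN too.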
Algorithm
Algorithm: Deep Q-Network (DQN)
───────────────────────────────
Initialize replay buffer D (capacity N)
Initialize Q-network with random weights θ
Initialize target network with weights θ⁻ = θ
For episode = 1 to M:
    Initialize state S₁ (e.g., preprocess game frame)
    For t = 1 to T:
        With probability ε: select random action Aₜ
        Otherwise: Aₜ = argmax_a Q(Sₜ, a; θ)
        Execute Aₜ, observe Rₜ₊₁, Sₜ₊₁
        Store (Sₜ, Aₜ, Rₜ₊₁, Sₜ₊₁) in D
        Sample random minibatch of (sⱼ, aⱼ, rⱼ, s'ⱼ) from D
        Set yⱼ = rⱼ + γ max_{a'} Q(s'ⱼ, a'; θ⁻)   [or yⱼ = rⱼ if s'ⱼ terminal]
        Perform gradient descent step on (yⱼ - Q(sⱼ, aⱼ; θ))² w.r.t. θ
        Every C steps: θ⁻ ← θ

DQN Improvements
| Variant | Key Idea |
|---|---|
| Double DQN | Use online network to select the action, target network to evaluate it: $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big)$. Reduces overestimation bias. |
| Dueling DQN | Separate network streams for state value $V(s)$ and advantage $A(s, a)$: $Q(s, a) = V(s) + A(s, a) - \frac{1}{\lvert\mathcal{A}\rvert}\sum_{a'} A(s, a')$ |
| Prioritized Replay | Sample important transitions (high TD error) more frequently |
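The Double DQN change is only in how the target is built: the online network picks $a'$, the target network scores it. A scalar sketch contrasting the two targets:

```python
def dqn_target(reward, next_q_target, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates a'."""
    return reward + gamma * max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network selects a',
    the target network evaluates the selected action."""
    a_star = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[a_star]

# The online net prefers action 0, whose target-net value is lower,
# so the Double DQN target is smaller -- less overestimation.
online, target = [3.0, 1.0], [1.0, 2.0]
assert dqn_target(0.0, target) == 0.99 * 2.0
assert double_dqn_target(0.0, online, target) == 0.99 * 1.0
```

Decoupling selection from evaluation means a single network's upward noise on one action no longer inflates both the argmax and the value used in the target.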
Relation to the Deadly Triad
DQN has all three deadly triad elements (function approximation + bootstrapping + off-policy learning). It works in practice due to:
- Experience replay → stabilizes data distribution
- Target network → stabilizes bootstrap targets
- No theoretical convergence guarantee, but empirically very successful
Conservative Q-Learning (CQL)
Extension for offline RL (learning from fixed datasets without environment interaction). Adds a regularizer that pushes down Q-values for unseen actions to avoid overestimation on out-of-distribution actions.
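One common form of the CQL regularizer, per state, is $\log\sum_a \exp Q(s, a) - Q(s, a_{\text{data}})$: minimizing it pushes all Q-values down while propping up the action actually present in the dataset. A scalar sketch (the function name and scalar inputs are illustrative, not the paper's interface):

```python
import math

def cql_penalty(q_values, data_action):
    """Conservative penalty: logsumexp_a Q(s, a) - Q(s, a_data).

    Inflated Q-values on out-of-distribution actions increase the
    penalty, so the optimizer is pushed to keep them low.
    """
    lse = math.log(sum(math.exp(q) for q in q_values))
    return lse - q_values[data_action]

# A large Q on an unseen action (index 1) raises the penalty:
assert cql_penalty([1.0, 5.0], data_action=0) > cql_penalty([1.0, 1.0], data_action=0)
```

In the full method this penalty is weighted by a coefficient α and added to the ordinary TD loss, so the agent trades off conservatism against fitting the Bellman targets.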
Connections
- Extends: Q-Learning with deep neural networks
- Stabilized by: Experience Replay, Target Network
- Addresses: Deadly Triad (practically, not theoretically)
- Offline variant: Conservative Q-Learning (CQL)
- Alternatives: SARSA variants, policy gradient methods