Deep Q-Network (DQN)
Definition
Deep Q-Network
DQN (Mnih et al., 2015) is a deep RL algorithm that combines Q-Learning with deep neural networks to handle high-dimensional state spaces (e.g., raw pixel input). It introduced two key stabilization techniques — Experience Replay and Target Network — to address the instability of combining Function Approximation with bootstrapping.
Architecture
- Input: State representation (e.g., 4 stacked $84 \times 84$ grayscale frames of Atari game pixels: $s \in \mathbb{R}^{84 \times 84 \times 4}$)
- Network: Convolutional neural network
- Output: $Q(s, a; \theta)$ for all actions simultaneously (one output per action)
- Action selection: $a = \arg\max_a Q(s, a; \theta)$ (with ε-greedy for exploration)
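The action-selection rule above can be sketched in plain Python; `q_values` stands in for the network's output layer (one Q-estimate per action), which is an assumption about how the forward pass is exposed:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon,
    otherwise the greedy action argmax_a Q(s, a; theta).

    q_values: list of Q-value estimates, one entry per action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the choice is purely greedy:
assert epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0) == 1
```

In the full agent, ε is typically annealed from 1.0 toward a small floor (e.g., 0.1) over the first frames of training.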
Loss Function
DQN Loss
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$
where:
- $\theta$ — current network weights (being trained)
- $\theta^-$ — target network weights (frozen copy, updated periodically)
- $\mathcal{D}$ — replay buffer of past transitions
- $(s, a, r, s')$ — sampled transition
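For a single sampled transition, the loss terms reduce to a few lines of arithmetic. A minimal sketch (scalar Q-values stand in for network outputs; terminal states drop the bootstrap term):

```python
GAMMA = 0.99  # discount factor

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Bootstrap target y = r + gamma * max_a' Q(s', a'; theta^-).

    next_q_values should come from the frozen target network;
    for terminal transitions the bootstrap term is dropped.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_error(q_sa, target):
    """One sample of the DQN loss: (y - Q(s, a; theta))^2."""
    return (target - q_sa) ** 2

y = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False)
assert abs(y - (1.0 + 0.99 * 2.0)) < 1e-9
```

In practice the squared error is averaged over a mini-batch and differentiated with respect to $\theta$ only; the target $y$ is treated as a constant.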
Key Innovation 1: Experience Replay
Experience Replay
Store transitions $(s, a, r, s')$ in a replay buffer $\mathcal{D}$. Train by sampling random mini-batches from $\mathcal{D}$ rather than using consecutive transitions.
Why it helps:
- Breaks temporal correlations: Consecutive samples are highly correlated → bad for SGD. Random sampling decorrelates.
- Data efficiency: Each transition can be reused many times.
- Smooths over data distribution: Averages over many past policies’ behavior.
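A replay buffer needs only a fixed-capacity FIFO store plus uniform sampling; a minimal stdlib sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; oldest transitions are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation
        # between the transitions in a mini-batch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=2)
buf.push(0, 1, 1.0, 1, False)
buf.push(1, 0, 0.0, 2, False)
buf.push(2, 1, 1.0, 3, True)  # evicts the oldest transition
assert len(buf) == 2
```

The Atari agents used a capacity of roughly one million transitions; `deque(maxlen=...)` handles the eviction automatically.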
Key Innovation 2: Target Network
Target Network
A separate copy of the Q-network with weights $\theta^-$, updated to match $\theta$ only every $C$ steps (or via soft update $\theta^- \leftarrow \tau\theta + (1 - \tau)\theta^-$).
Why it helps:
- Without it, the target changes with every weight update → moving target problem → instability
- Freezing the target network for $C$ steps provides a stable regression target
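Both update schemes are a one-liner once weights are viewed as flat vectors; a sketch with plain Python lists standing in for network parameters:

```python
def hard_update(target_weights, online_weights):
    """theta^- <- theta: full copy, performed every C steps."""
    return list(online_weights)

def soft_update(target_weights, online_weights, tau):
    """theta^- <- tau * theta + (1 - tau) * theta^- (Polyak averaging),
    performed every step with a small tau (e.g., 0.005)."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_weights, target_weights)]

theta = [1.0, 2.0]
theta_minus = [0.0, 0.0]
assert hard_update(theta_minus, theta) == [1.0, 2.0]
assert soft_update(theta_minus, theta, tau=0.5) == [0.5, 1.0]
```

The original DQN paper used the hard update; soft updates are more common in later actor-critic methods (e.g., DDPG) but work for DQN too.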
Algorithm
Algorithm: Deep Q-Network (DQN)
───────────────────────────────
Initialize replay buffer D (capacity N)
Initialize Q-network with random weights θ
Initialize target network with weights θ⁻ = θ
For episode = 1 to M:
    Initialize state S₁ (e.g., preprocess game frame)
    For t = 1 to T:
        With probability ε: select random action Aₜ
        Otherwise: Aₜ = argmax_a Q(Sₜ, a; θ)
        Execute Aₜ, observe Rₜ₊₁, Sₜ₊₁
        Store (Sₜ, Aₜ, Rₜ₊₁, Sₜ₊₁) in D
        Sample random minibatch of (sⱼ, aⱼ, rⱼ, s'ⱼ) from D
        Set yⱼ = rⱼ + γ max_{a'} Q(s'ⱼ, a'; θ⁻)   [or yⱼ = rⱼ if s'ⱼ terminal]
        Perform gradient descent step on (yⱼ - Q(sⱼ, aⱼ; θ))² w.r.t. θ
        Every C steps: θ⁻ ← θ

DQN Improvements
| Variant | Key Idea |
|---|---|
| Double DQN | Use online network to select the action, target network to evaluate it: $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big)$. Reduces overestimation bias. |
| Dueling DQN | Separate network streams for state value $V(s)$ and advantage $A(s, a)$: $Q(s, a) = V(s) + A(s, a) - \frac{1}{\lvert\mathcal{A}\rvert}\sum_{a'} A(s, a')$ |
| Prioritized Replay | Sample important transitions (high TD error) more frequently |
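The Double DQN change is only in how the target is built: the online network picks $a'$, the target network scores it. A scalar sketch contrasting the two targets:

```python
def dqn_target(reward, next_q_target, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates a'."""
    return reward + gamma * max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network selects a',
    the target network evaluates the selected action."""
    a_star = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[a_star]

# The online net prefers action 0, whose target-net value is lower,
# so the Double DQN target is smaller -- less overestimation.
online, target = [3.0, 1.0], [1.0, 2.0]
assert dqn_target(0.0, target) == 0.99 * 2.0
assert double_dqn_target(0.0, online, target) == 0.99 * 1.0
```

Decoupling selection from evaluation means a single network's upward noise on one action no longer inflates both the argmax and the value used in the target.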
Relation to the Deadly Triad
DQN has all three deadly triad elements (function approximation + bootstrapping + off-policy learning). It works in practice due to:
- Experience replay → stabilizes data distribution
- Target network → stabilizes bootstrap targets
- No theoretical convergence guarantee, but empirically very successful
Conservative Q-Learning (CQL)
Extension for offline RL (learning from fixed datasets without environment interaction). Adds a regularizer that pushes down Q-values for unseen actions to avoid overestimation on out-of-distribution actions.
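One common form of the CQL regularizer, per state, is $\log\sum_a \exp Q(s, a) - Q(s, a_{\text{data}})$: minimizing it pushes all Q-values down while propping up the action actually present in the dataset. A scalar sketch (the function name and scalar inputs are illustrative, not the paper's interface):

```python
import math

def cql_penalty(q_values, data_action):
    """Conservative penalty: logsumexp_a Q(s, a) - Q(s, a_data).

    Inflated Q-values on out-of-distribution actions increase the
    penalty, so the optimizer is pushed to keep them low.
    """
    lse = math.log(sum(math.exp(q) for q in q_values))
    return lse - q_values[data_action]

# A large Q on an unseen action (index 1) raises the penalty:
assert cql_penalty([1.0, 5.0], data_action=0) > cql_penalty([1.0, 1.0], data_action=0)
```

In the full method this penalty is weighted by a coefficient α and added to the ordinary TD loss, so the agent trades off conservatism against fitting the Bellman targets.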
Connections
- Extends: Q-Learning with deep neural networks
- Stabilized by: Experience Replay, Target Network
- Addresses: Deadly Triad (practically, not theoretically)
- Offline variant: Conservative Q-Learning (CQL)
- Alternatives: SARSA variants, policy gradient methods