Deep Q-Network (DQN)

Definition

Deep Q-Network

DQN (Mnih et al., 2015) is a deep RL algorithm that combines Q-Learning with deep neural networks to handle high-dimensional state spaces (e.g., raw pixel input). It introduced two key stabilization techniques — Experience Replay and Target Network — to address the instability of combining Function Approximation with bootstrapping.

Architecture

  • Input: State representation s (e.g., 4 stacked frames of Atari game pixels: 84×84×4)
  • Network: Convolutional neural network
  • Output: Q(s, a; θ) for all actions simultaneously (one output per action)
  • Action selection: a = argmax_a Q(s, a; θ) (with ε-greedy for exploration)
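A minimal NumPy sketch of this architecture, with a small fully-connected layer standing in for the convolutional stack (the state and action sizes here are hypothetical, not the Atari ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a flattened state and 6 discrete actions.
STATE_DIM, N_ACTIONS, HIDDEN = 16, 6, 32

# One hidden layer stands in for the convolutional stack.
W1 = rng.normal(0, 0.1, (STATE_DIM, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, N_ACTIONS))

def q_values(state):
    """One forward pass returns Q(s, a; θ) for ALL actions at once."""
    h = np.maximum(0.0, state @ W1)   # ReLU hidden layer
    return h @ W2                     # shape (N_ACTIONS,)

def select_action(state, epsilon):
    """ε-greedy: random action w.p. ε, else argmax_a Q(s, a; θ)."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(state)))

s = rng.normal(size=STATE_DIM)
a = select_action(s, epsilon=0.1)
```

Outputting all action-values in a single forward pass is what makes the argmax (and the max in the TD target) cheap: one network evaluation per state instead of one per state-action pair.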

Loss Function

DQN Loss

L(θ) = E_{(s, a, r, s') ~ D} [ (r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ))² ]

where:

  • θ — current network weights (being trained)
  • θ⁻ — target network weights (frozen copy, updated periodically)
  • D — replay buffer of past transitions
  • (s, a, r, s') — sampled transition
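The loss can be sketched numerically as follows; this is an illustrative NumPy version assuming the Q-values for s and s' have already been computed by the online and target networks:

```python
import numpy as np

GAMMA = 0.99

def dqn_targets(r, q_next_target, done):
    """y = r + γ max_a' Q(s', a'; θ⁻); the bootstrap term is dropped
    at terminal transitions. q_next_target holds the Q-values of s'
    under the *frozen* target network θ⁻."""
    return r + GAMMA * np.max(q_next_target, axis=1) * (1.0 - done)

# Toy batch of 3 transitions with 2 actions.
r = np.array([1.0, 0.0, 2.0])
q_next = np.array([[0.5, 1.5], [2.0, 0.0], [0.0, 0.0]])
done = np.array([0.0, 0.0, 1.0])     # last transition is terminal
q_taken = np.array([1.0, 1.0, 1.0])  # Q(s, a; θ) of the actions taken

y = dqn_targets(r, q_next, done)     # [1 + 0.99*1.5, 0 + 0.99*2.0, 2.0]
loss = np.mean((y - q_taken) ** 2)   # mean squared TD error over the batch
```

Note that gradients flow only through Q(s, a; θ): the target y is treated as a constant, which is exactly what the frozen copy θ⁻ enforces.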

Key Innovation 1: Experience Replay

Experience Replay

Store transitions (s, a, r, s') in a replay buffer D. Train by sampling random mini-batches from D rather than using consecutive transitions.

Why it helps:

  1. Breaks temporal correlations: Consecutive samples are highly correlated → bad for SGD. Random sampling decorrelates.
  2. Data efficiency: Each transition can be reused many times.
  3. Smooths over data distribution: Averages over many past policies’ behavior.
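A replay buffer needs only a fixed-capacity store plus uniform sampling; a minimal stdlib sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random minibatch: decorrelates consecutive transitions
        # and lets each transition be reused across many updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # the 50 oldest entries get evicted
    buf.push((t, 0, 0.0, t + 1, False))
batch = buf.sample(32)
```

The `deque(maxlen=...)` gives the sliding-window behavior of the original DQN buffer for free: once full, pushing a new transition silently drops the oldest one.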

Key Innovation 2: Target Network

Target Network

A separate copy of the Q-network with weights θ⁻, updated to match θ only every C steps (or via soft update θ⁻ ← τθ + (1 − τ)θ⁻).

Why it helps:

  • Without it, the target changes with every weight update → moving target problem → instability
  • Freezing the target network for C steps provides a stable regression target
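Both update schemes are one-liners; a NumPy sketch treating the weights as flat arrays:

```python
import numpy as np

def hard_update(theta, theta_bar, step, C=1000):
    """Hard update: copy θ into θ⁻ every C steps, else keep θ⁻ frozen."""
    return theta.copy() if step % C == 0 else theta_bar

def soft_update(theta, theta_bar, tau=0.005):
    """Soft (Polyak) update: θ⁻ ← τθ + (1 − τ)θ⁻.
    Small τ makes the target drift slowly toward the online weights."""
    return tau * theta + (1.0 - tau) * theta_bar

theta = np.ones(3)       # online weights
theta_bar = np.zeros(3)  # target weights

frozen = hard_update(theta, theta_bar, step=500)   # not a multiple of C
soft = soft_update(theta, theta_bar)               # tiny step toward θ
```

The original DQN used the hard update; the soft variant (popularized by DDPG) trades the periodic jump for a continuously but slowly moving target.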

Algorithm

Algorithm: Deep Q-Network (DQN)
───────────────────────────────
Initialize replay buffer D (capacity N)
Initialize Q-network with random weights θ
Initialize target network with weights θ⁻ = θ
 
For episode = 1 to M:
  Initialize state S₁ (e.g., preprocess game frame)
  For t = 1 to T:
    With probability ε: select random action Aₜ
    Otherwise: Aₜ = argmax_a Q(Sₜ, a; θ)
    
    Execute Aₜ, observe Rₜ₊₁, Sₜ₊₁
    Store (Sₜ, Aₜ, Rₜ₊₁, Sₜ₊₁) in D
    
    Sample random minibatch of (sⱼ, aⱼ, rⱼ, s'ⱼ) from D
    Set yⱼ = rⱼ + γ max_{a'} Q(s'ⱼ, a'; θ⁻)  [or yⱼ = rⱼ if terminal]
    
    Perform gradient descent on (yⱼ - Q(sⱼ, aⱼ; θ))² w.r.t. θ
    
    Every C steps: θ⁻ ← θ
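The full loop above can be exercised end to end on a toy problem. The sketch below keeps every DQN ingredient (ε-greedy, replay, bootstrapped targets from θ⁻, periodic hard update) but replaces the network with a tabular Q for clarity; the 2-state MDP is invented for illustration:

```python
import random
import numpy as np

rng = random.Random(0)

# Hypothetical toy MDP: 2 states, 2 actions; action 1 in state 0 pays +1,
# and the state simply alternates deterministically.
def step_env(s, a):
    r = 1.0 if (s == 0 and a == 1) else 0.0
    return r, (s + 1) % 2

GAMMA, EPS, ALPHA, C = 0.9, 0.2, 0.1, 50
Q = np.zeros((2, 2))    # online "network" θ (tabular stand-in)
Q_bar = Q.copy()        # target network θ⁻ = θ
D = []                  # replay buffer (unbounded here for brevity)

s = 0
for t in range(1, 5001):
    # ε-greedy action selection
    a = rng.randrange(2) if rng.random() < EPS else int(np.argmax(Q[s]))
    r, s2 = step_env(s, a)
    D.append((s, a, r, s2))
    if len(D) >= 32:
        for (sj, aj, rj, sj2) in rng.sample(D, 32):   # random minibatch
            y = rj + GAMMA * np.max(Q_bar[sj2])       # bootstrap from θ⁻
            Q[sj, aj] += ALPHA * (y - Q[sj, aj])      # SGD step on (y − Q)²
    if t % C == 0:
        Q_bar = Q.copy()                              # hard update θ⁻ ← θ
    s = s2

greedy_action = int(np.argmax(Q[0]))  # should learn the rewarding action
```

There are no terminal states in this toy MDP, so the `yⱼ = rⱼ` terminal branch from the pseudocode is omitted.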

DQN Improvements

Variant            | Key Idea
-------------------|---------
Double DQN         | Use the online network to select the action and the target network to evaluate it: y = r + γ Q(s', argmax_{a'} Q(s', a'; θ); θ⁻). Reduces overestimation bias.
Dueling DQN        | Separate network streams for value V(s) and advantage A(s, a): Q(s, a) = V(s) + A(s, a) − (1/|𝒜|) Σ_{a'} A(s, a')
Prioritized Replay | Sample important transitions (high TD error) more frequently
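The Double DQN target is a two-line change to the vanilla target; a NumPy sketch assuming both networks' Q-values for s' are given:

```python
import numpy as np

GAMMA = 0.99

def double_dqn_target(r, q_next_online, q_next_target, done):
    """y = r + γ Q(s', argmax_a' Q(s', a'; θ); θ⁻):
    the ONLINE net picks the action, the TARGET net scores it."""
    a_star = np.argmax(q_next_online, axis=1)           # selection: θ
    q_eval = q_next_target[np.arange(len(r)), a_star]   # evaluation: θ⁻
    return r + GAMMA * q_eval * (1.0 - done)

r = np.array([0.0, 1.0])
q_online = np.array([[2.0, 1.0], [0.0, 3.0]])  # θ picks actions 0 and 1
q_target = np.array([[0.5, 9.0], [9.0, 0.5]])  # θ⁻ evaluates those picks
done = np.zeros(2)

y = double_dqn_target(r, q_online, q_target, done)
```

In this contrived batch the vanilla target would bootstrap from max Q⁻ = 9.0 in both rows; decoupling selection from evaluation bootstraps from 0.5 instead, which is exactly the overestimation-damping effect.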

Relation to the Deadly Triad

DQN has all three deadly triad elements (FA + bootstrapping + off-policy). It works in practice due to:

  • Experience replay → stabilizes data distribution
  • Target network → stabilizes bootstrap targets
  • No theoretical convergence guarantee, but empirically very successful

Conservative Q-Learning (CQL)

Extension for offline RL (learning from fixed datasets without environment interaction). Adds a regularizer that pushes down Q-values for unseen actions to avoid overestimation on out-of-distribution actions.
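One common form of that regularizer (the CQL(H) variant) is logsumexp over all actions minus the Q-value of the dataset action; a NumPy sketch on a toy batch:

```python
import numpy as np

def cql_regularizer(q_values, a_data):
    """Per-state penalty: logsumexp_a Q(s, a) − Q(s, a_data).
    Minimizing it pushes Q down on all (especially unseen) actions
    while pushing Q up on the action actually taken in the dataset."""
    lse = np.log(np.sum(np.exp(q_values), axis=1))      # soft max over actions
    q_data = q_values[np.arange(len(a_data)), a_data]   # Q of dataset actions
    return np.mean(lse - q_data)

# Toy batch: 2 states, 2 actions; dataset took action 0 then action 1.
q = np.array([[1.0, 5.0], [2.0, 2.0]])
penalty = cql_regularizer(q, np.array([0, 1]))
```

In practice this term is added (with a weight α) to the standard Bellman loss, so the Q-function stays pessimistic on out-of-distribution actions instead of overestimating them.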

Connections

Appears In