RL Lecture 8: Deep RL (Value-Based Methods)
1. Why Deep RL?
Traditional Reinforcement Learning (tabular or linear function approximation) faces significant hurdles in complex environments:
- Curse of Dimensionality: Tabular methods (e.g., standard Q-learning) cannot scale to high-dimensional state spaces (like pixels).
- Manual Feature Engineering: Linear function approximation requires expert-designed features, which are often task-specific and brittle.
- Limited Representational Power: Linear models cannot capture the non-linear relationships often required for complex tasks (e.g., visual perception).
Transition to Deep RL: Deep Reinforcement Learning replaces manual feature engineering with Representation Learning. By using deep neural networks (especially CNNs), agents can learn features directly from raw input (pixels) end-to-end.
2. Deep Q-Network (DQN)
Introduced by Mnih et al. (2015), DQN was the first algorithm to achieve human-level performance across a diverse range of tasks (Atari 2600 games) using the same architecture and hyperparameters.
2.1 Pre-processing
To handle raw pixels efficiently, DQN applies several task-agnostic steps:
- Downscaling & Grayscale: Images are downscaled to 84×84 resolution and converted to grayscale to save memory.
- Frame Stacking: The agent receives a stack of the 4 most recent frames as input. This provides a “short memory,” allowing the agent to infer velocity and direction (e.g., whether a ball is moving up or down).
- Reward Clipping: Rewards are clipped to $[-1, 1]$ to stabilize the gradients across games with very different score scales.
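The pre-processing steps above can be sketched in a few lines. This is a simplified stand-in (channel-mean grayscale and nearest-neighbor downscaling instead of the exact pipeline in Mnih et al.), but it shows the shapes involved:

```python
from collections import deque
import numpy as np

def preprocess(frame, out_size=84):
    """Crude grayscale + downscale of an RGB frame (H, W, 3) -> (84, 84)."""
    gray = frame.mean(axis=2)                       # simple luminance approximation
    h, w = gray.shape
    ys = np.linspace(0, h - 1, out_size).astype(int)
    xs = np.linspace(0, w - 1, out_size).astype(int)
    return gray[np.ix_(ys, xs)].astype(np.float32) / 255.0

# Frame stacking: keep the 4 most recent pre-processed frames
frames = deque(maxlen=4)
for _ in range(4):
    frames.append(preprocess(np.zeros((210, 160, 3), dtype=np.uint8)))
state = np.stack(frames, axis=0)                    # network input, shape (4, 84, 84)

# Reward clipping: a raw game score of +37 becomes +1
clipped_reward = float(np.clip(37.0, -1.0, 1.0))
```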
2.2 Architecture
The DQN architecture is a deep Convolutional Neural Network (CNN):
- Input: an 84×84×4 tensor (the 4 pre-processed frames).
- Conv Layers: Three convolutional layers with ReLU activations to extract spatial features.
- Fully Connected Layers: One or more FC layers to map features to action values.
- Output: A separate output unit for each possible action. This layout is more efficient than taking the action as an input, since all action values are computed in a single forward pass.

Figure 1: DQN Convolutional Architecture
The network takes the 4 most recent frames as input. It consists of three convolutional layers (extracting features like ball position and motion) followed by fully connected layers that output a Q-value for each action. Note the output layout: all actions are computed in parallel.
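Using the layer specification from Mnih et al. (2015) — 8×8 filters with stride 4, then 4×4 with stride 2, then 3×3 with stride 1, 64 filters in the last conv layer — we can verify the feature-map sizes with a short calculation:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

size = 84                                          # input: 4 stacked 84x84 frames
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:    # the three DQN conv layers
    size = conv_out(size, kernel, stride)

flat_features = 64 * size * size                   # 64 filters in the final conv layer
print(size, flat_features)                         # 7 3136 -> fed to the FC layers
```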
2.3 Key Techniques for Stability
Training deep neural networks with Q-learning is notoriously unstable. DQN solves this using two main techniques:
Experience Replay
- Mechanism: Transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in a large replay buffer $\mathcal{D}$ (a circular queue).
- Training: At each step, a small random minibatch is sampled from the buffer for the update.
- Why it helps:
- Breaks Correlations: Successive transitions in an episode are highly correlated. Randomized sampling makes the data look more like the i.i.d. data used in supervised learning.
- Data Efficiency: Experiences are reused multiple times for learning.
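A minimal replay buffer is just a bounded queue plus uniform random sampling. A sketch (class and parameter names are illustrative, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular buffer of transitions with uniform minibatch sampling."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation of consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push(t, t % 4, 0.0, t + 1, False)   # toy transitions
batch = buf.sample(32)                      # minibatch for one update
```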
Target Network
- Mechanism: DQN maintains two networks: the Online Network $Q(s, a; \theta)$ and the Target Network $Q(s, a; \theta^-)$.
- Update: The target network weights $\theta^-$ are kept fixed and only synchronized with the online network every $C$ steps ($\theta^- \leftarrow \theta$).
- Bellman Update: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$
- Why it helps: In standard Q-learning, the target changes with every weight update, leading to feedback loops and oscillations. A fixed target provides a stable “ground truth” to move towards.
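The target-network mechanics can be sketched with a hypothetical linear Q-function (the weight matrices here stand in for network parameters; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma = 8, 4, 0.99

theta = rng.normal(size=(n_features, n_actions))   # online network weights
theta_target = theta.copy()                        # target network: frozen snapshot

def q_values(phi, weights):
    """Hypothetical linear Q-function: one value per action."""
    return phi @ weights

phi_next = rng.normal(size=n_features)
r = 1.0
# The Bellman target uses the *target* weights, which stay fixed between syncs,
# so the regression target does not shift with every online update
y = r + gamma * q_values(phi_next, theta_target).max()

# ... many gradient updates to theta happen here ...
theta_target = theta.copy()                        # periodic sync: every C steps
```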
3. The DQN Algorithm
The loss function for DQN is the mean squared error (MSE) of the Bellman residual:

$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$
Algorithm: DQN with Experience Replay
- Initialize replay memory $\mathcal{D}$ to capacity $N$
- Initialize action-value function $Q$ with random weights $\theta$
- Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
- For episode $= 1$ to $M$ do:
- Initialize state $s_1$ and pre-process $\phi_1 = \phi(s_1)$
- For $t = 1$ to $T$ do:
- With probability $\varepsilon$ select a random action $a_t$, else $a_t = \arg\max_a Q(\phi_t, a; \theta)$
- Execute $a_t$, observe reward $r_t$ and image $x_{t+1}$
- Pre-process $\phi_{t+1} = \phi(s_{t+1})$
- Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $\mathcal{D}$
- Sample minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $\mathcal{D}$
- Set $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$ (or $y_j = r_j$ if terminal)
- Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to $\theta$
- Every $C$ steps, set $\theta^- \leftarrow \theta$
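One inner-loop update of the algorithm can be sketched end to end with a linear Q-function on fake data (all names and the linear parameterization are illustrative; a real DQN would use the CNN above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 8, 4
gamma, alpha = 0.99, 0.01
theta = rng.normal(size=(n_features, n_actions)) * 0.1   # online weights
theta_target = theta.copy()                              # target weights

def q(phi, w):
    return phi @ w                                       # linear Q(s, .)

# A fake minibatch of transitions (phi, a, r, phi_next, done)
batch = [(rng.normal(size=n_features), int(rng.integers(n_actions)),
          float(rng.normal()), rng.normal(size=n_features),
          bool(rng.random() < 0.1)) for _ in range(32)]

grad = np.zeros_like(theta)
for phi, a, r, phi_next, done in batch:
    y = r if done else r + gamma * q(phi_next, theta_target).max()
    td_error = y - q(phi, theta)[a]
    grad[:, a] += td_error * phi            # semi-gradient of the squared residual
theta += alpha * grad / len(batch)          # one gradient descent step on theta
```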
4. DQN Extensions
Several influential improvements have been proposed:
- Double DQN: Mitigates overestimation bias by decoupling action selection from evaluation.
- Prioritized Experience Replay: Samples transitions with higher TD error more frequently.
- Dueling DQN: Splits the network into a state-value stream $V(s)$ and an advantage stream $A(s, a)$, recombined as $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$.
- Rainbow DQN: Combines all of the above (plus multi-step returns, distributional RL, and noisy layers) for drastically better performance.

Figure 2: The "Rainbow" of Improvements
Learning curves show that combining individual “tricks” (Double DQN, Dueling, etc.) leads to significantly faster and more stable convergence compared to vanilla DQN.
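The key idea behind Double DQN — select the action with the online network, evaluate it with the target network — fits in a few lines with hypothetical Q-values:

```python
import numpy as np

q_online = np.array([1.0, 3.0, 2.5])   # hypothetical Q(s', .; theta)
q_target = np.array([1.2, 2.0, 4.0])   # hypothetical Q(s', .; theta^-)
r, gamma = 0.0, 0.99

# Vanilla DQN: select AND evaluate with the target network (prone to overestimation)
y_dqn = r + gamma * q_target.max()                 # 0.99 * 4.0

# Double DQN: select with the online network, evaluate with the target network
a_star = int(q_online.argmax())                    # online network prefers action 1
y_double = r + gamma * q_target[a_star]            # 0.99 * 2.0
```

Because the two networks rarely overestimate the same action, the decoupled target is systematically lower, mitigating the max-operator bias.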
5. Atari Results & Significance
- Human-Level Performance: DQN outperformed humans on 22/49 games.
- Significance: Proved that a single architecture could learn diverse skills (reflexes in Breakout, precision in Space Invaders, resource management in Seaquest) solely from pixels and score.
- Failure Cases: Struggles on games requiring long-term planning/exploration (e.g., Montezuma’s Revenge).
6. Offline Reinforcement Learning
Offline RL is the task of learning a policy from a fixed dataset $\mathcal{D}$ collected by some behavior policy $\pi_\beta$, without further interaction.
6.1 The Challenge: Distribution Shift
Standard Q-learning fails because the agent evaluates actions that are Out-of-Distribution (OOD).
Visualization of the Problem
Imagine an MDP with three actions:
- $a_1$ (Well-seen): real $Q = 10$, estimated $Q = 10.5$
- $a_2$ (Rarely seen): real $Q = 8$, estimated $Q = 5$
- $a_3$ (Unseen/OOD): real $Q = 8$, estimated $Q = 11$ (error due to function approximation)
Standard Q-learning will compute $\max_a Q(s, a)$ and pick action $a_3$, even though it is sub-optimal. This overestimation error propagates through the Bellman backup, causing the value function to “explode.”
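The failure mode above in concrete numbers (the values are the illustrative ones from the example, not from any benchmark):

```python
true_q = {"a1": 10.0, "a2": 8.0, "a3": 8.0}
est_q  = {"a1": 10.5, "a2": 5.0, "a3": 11.0}   # a3 never seen: pure extrapolation

# Greedy action selection trusts the estimates, so it picks the OOD action a3
greedy = max(est_q, key=est_q.get)
regret = true_q["a1"] - true_q[greedy]         # value lost vs. the true best action
```

The agent forfeits value not because its estimates are uniformly bad, but because the max operator actively seeks out the action whose estimate is most inflated.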
7. Conservative Q-Learning (CQL)
Proposed by Kumar et al. (2020), CQL addresses this by learning a conservative (pessimistic) $Q$-function.
7.1 Key Idea: Expected Pessimism
Instead of point-wise guarantees, CQL ensures that the expected $Q$-value under the learned policy $\pi$ is a lower bound on its true value: $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[\hat{Q}(s, a)] \le \mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q^\pi(s, a)]$.
7.2 The CQL Loss Function
CQL adds a regularizer that pushes down $Q$-values for actions sampled from a distribution $\mu$ (e.g., the current policy) while pushing up $Q$-values for actions actually in the dataset:

$\min_Q \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu}\left[Q(s, a)\right] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[Q(s, a)\right] \right) + \frac{1}{2}\, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[ \left( Q(s, a) - \mathcal{B}^\pi \hat{Q}(s, a) \right)^2 \right]$

- Practical Implementation: For discrete actions, the first term is implemented using a `logsumexp` over all actions.
- Result: Provably guarantees under-estimation, which prevents the policy from “tripping” over OOD action errors. Experiments show CQL performs significantly better than other offline methods on small, noisy datasets.
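A sketch of the discrete-action CQL penalty (the Q-values here are made-up; $\alpha = 1$ for simplicity):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

q_s = np.array([2.0, 0.5, 3.0])   # hypothetical Q(s, .) for 3 discrete actions
a_data = 0                        # action actually taken in the dataset
alpha = 1.0

# Push down a soft maximum over ALL actions (including OOD ones),
# push up the Q-value of the in-distribution (dataset) action
cql_penalty = alpha * (logsumexp(q_s) - q_s[a_data])
```

The penalty is largest when some action not supported by the data carries a high estimated Q-value, which is exactly the situation that destabilizes naive offline Q-learning.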

Figure: CQL empirically shows lower (conservative) values compared to standard Q-learning or ensembles, which tend to diverge.
References: Mnih et al. (2015); Kumar et al. (2020); Sutton & Barto (2018), Ch. 16.5.