RL Lecture 8: Deep RL (Value-Based Methods)

1. Why Deep RL?

Traditional Reinforcement Learning (tabular or linear function approximation) faces significant hurdles in complex environments:

  • Curse of Dimensionality: Tabular methods (e.g., standard Q-learning) cannot scale to high-dimensional state spaces (like pixels).
  • Manual Feature Engineering: Linear function approximation requires expert-designed features, which are often task-specific and brittle.
  • Limited Representational Power: Linear models cannot capture the non-linear relationships often required for complex tasks (e.g., visual perception).

Transition to Deep RL: Deep Reinforcement Learning replaces manual feature engineering with Representation Learning. By using deep neural networks (especially CNNs), agents can learn features directly from raw input (pixels) end-to-end.


2. Deep Q-Network (DQN)

Introduced by Mnih et al. (2015), DQN was the first algorithm to achieve human-level performance across a diverse range of tasks (Atari 2600 games) using the same architecture and hyperparameters.

2.1 Pre-processing

To handle raw pixels efficiently, DQN applies several task-agnostic steps:

  1. Downscaling & Grayscale: Images are downscaled to 84×84 resolution and converted to grayscale to save memory and computation.
  2. Frame Stacking: The agent receives a stack of the 4 most recent frames as input. This provides a “short memory” that allows the agent to infer velocity and direction (e.g., whether a ball is moving up or down).
  3. Reward Clipping: Rewards are clipped to [-1, 1] to stabilize the magnitude of gradients across different games.
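A minimal numpy sketch of these three steps (assuming 210×160 RGB Atari frames; nearest-neighbour index sampling stands in for the paper's proper image resizing):

```python
from collections import deque
import numpy as np

def preprocess(frame):
    """Grayscale + nearest-neighbour downscale of a (210, 160, 3) frame to 84x84."""
    gray = frame.mean(axis=2)                          # crude luminance
    rows = np.linspace(0, frame.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, frame.shape[1] - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)]                    # shape (84, 84)

class FrameStack:
    """Keeps the 4 most recent pre-processed frames as the agent's state."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        f = preprocess(frame)
        while len(self.frames) < self.frames.maxlen:
            self.frames.append(f)                      # pad at episode start
        self.frames.append(f)
        return np.stack(self.frames, axis=0)           # shape (4, 84, 84)

def clip_reward(r):
    return float(np.clip(r, -1.0, 1.0))                # rewards clipped to [-1, 1]
```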

2.2 Architecture

The DQN architecture is a deep Convolutional Neural Network (CNN):

  • Input: an 84×84×4 tensor (the stack of 4 pre-processed frames).
  • Conv Layers: Three convolutional layers with ReLU activations to extract spatial features.
  • Fully Connected Layers: One or more FC layers to map features to action values.
  • Output: A separate output for each possible action. This layout is more efficient than taking an action as input, as it computes all action values in a single forward pass.


Figure 1: DQN Convolutional Architecture

The network takes the 4 most recent frames as input. It consists of three convolutional layers (extracting features like ball positions and motion) followed by fully connected layers that output a Q-value for each action. Note the output layout: all actions are computed in parallel.
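The layer sizes can be checked with a little arithmetic (filter sizes and strides below follow Mnih et al. 2015: 32 8×8 stride-4 filters, then 64 4×4 stride-2, then 64 3×3 stride-1):

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

h = conv_out(84, 8, 4)   # conv1: 84 -> 20
h = conv_out(h, 4, 2)    # conv2: 20 -> 9
h = conv_out(h, 3, 1)    # conv3: 9 -> 7
flat = 64 * h * h        # 64 feature maps of 7x7 feed the fully connected layer
```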

2.3 Key Techniques for Stability

Training deep neural networks with Q-learning is notoriously unstable. DQN solves this using two main techniques:

Experience Replay

  • Mechanism: Transitions (s, a, r, s′) are stored in a large buffer D of capacity N (a circular queue: the oldest transitions are overwritten).
  • Training: At each step, a small random minibatch is sampled from the buffer for the update.
  • Why it helps:
    1. Breaks Correlations: Successive transitions in an episode are highly correlated. Randomized sampling makes the data look more like the i.i.d. data used in supervised learning.
    2. Data Efficiency: Experiences are reused multiple times for learning.
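The mechanism above fits in a few lines of stdlib Python (a minimal sketch; real implementations store frames more compactly):

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular buffer of transitions with uniform random minibatch sampling."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions overwritten

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations between samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```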

Target Network

  • Mechanism: DQN maintains two networks: the Online Network Q(s, a; θ) and the Target Network Q(s, a; θ⁻).
  • Update: The target network weights are kept fixed and only synchronized with the online network every C steps (θ⁻ ← θ).
  • Bellman Update: y = r + γ max_a′ Q(s′, a′; θ⁻)
  • Why it helps: In standard Q-learning, the target changes with every weight update, leading to feedback loops and oscillations. A fixed target provides a stable “ground truth” to move towards.
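A sketch of the fixed target and the periodic hard sync, using a tabular Q as a stand-in for the network (γ and C here are hypothetical values):

```python
import numpy as np

gamma, C = 0.99, 100
Q = np.zeros((5, 2))          # online "network" Q(s, a; theta), tabular stand-in
Q_target = Q.copy()           # target "network" Q(s, a; theta^-)

def bellman_target(r, s2, done):
    """y = r + gamma * max_a' Q(s', a'; theta^-): bootstraps off the frozen copy."""
    return r if done else r + gamma * Q_target[s2].max()

def maybe_sync(step):
    """theta^- <- theta every C steps; between syncs the target stays fixed."""
    global Q_target
    if step % C == 0:
        Q_target = Q.copy()
```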

3. The DQN Algorithm

The loss function for DQN is the mean squared error (MSE) of the Bellman residual:

L(θ) = E_(s,a,r,s′)∼D [ ( r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ) )² ]

Algorithm: DQN with Experience Replay

  1. Initialize replay memory D to capacity N
  2. Initialize action-value function Q with random weights θ
  3. Initialize target action-value function Q̂ with weights θ⁻ = θ
  4. For episode = 1 to M do:
    1. Initialize state s_1 and pre-process φ_1 = φ(s_1)
    2. For t = 1 to T do:
      1. With probability ε select random action a_t, else a_t = argmax_a Q(φ_t, a; θ)
      2. Execute a_t, observe reward r_t and image x_{t+1}
      3. Pre-process φ_{t+1} = φ(s_{t+1})
      4. Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
      5. Sample minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
      6. Set y_j = r_j + γ max_a′ Q̂(φ_{j+1}, a′; θ⁻) (if not terminal, else y_j = r_j)
      7. Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to θ
      8. Every C steps, set θ⁻ ← θ
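The loop above can be exercised end-to-end on a toy deterministic chain MDP (a sketch: tabular Q stands in for the network, replay is omitted for brevity, and all hyperparameters are hypothetical):

```python
import random
import numpy as np

# Toy 4-state chain: action 1 moves right (reward 1 at the end), action 0 stays.
n_states, n_actions = 4, 2
def step_env(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else s
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

gamma, lr, eps, C = 0.9, 0.5, 0.1, 20
Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()
rng = random.Random(0)

step = 0
for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection
        a = rng.randrange(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step_env(s, a)
        y = r if done else r + gamma * Q_target[s2].max()   # fixed target
        Q[s, a] += lr * (y - Q[s, a])                       # "gradient step" on MSE
        step += 1
        if step % C == 0:
            Q_target = Q.copy()                             # periodic hard sync
        s = s2
```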

4. DQN Extensions

Several incremental improvements have been proposed:

  • Double DQN: Mitigates overestimation bias by decoupling action selection from evaluation.
  • Prioritized Experience Replay: Samples transitions with higher TD error more frequently.
  • Dueling DQN: Splits the network into a state-value stream V(s) and an advantage stream A(s, a), recombined as Q(s, a) = V(s) + A(s, a) − mean_a′ A(s, a′).
  • Rainbow DQN: Combines all of the above (plus multi-step returns, distributional RL, and noisy layers) for drastically better performance.
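Two of these extensions are essentially one-liners on top of the target computation (numpy sketch; the action values below are made up for illustration):

```python
import numpy as np

gamma, r = 0.99, 0.0
q_online = np.array([1.0, 2.0])    # online net's values at s' (hypothetical)
q_target = np.array([5.0, 0.5])    # target net's values at s' (noisy, inflated)

# Vanilla DQN: select AND evaluate with the target net -> trusts the noisy 5.0.
y_vanilla = r + gamma * q_target.max()

# Double DQN: select with the online net, evaluate with the target net.
a_star = int(q_online.argmax())            # online net picks action 1
y_double = r + gamma * q_target[a_star]    # evaluated as 0.5, not 5.0

# Dueling DQN: recombine value and advantage streams into Q-values.
V, A = 3.0, np.array([0.5, -0.5])
q_dueling = V + (A - A.mean())             # mean-zero A for identifiability
```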


Figure 2: The "Rainbow" of Improvements

Learning curves show that combining individual “tricks” (Double DQN, Dueling, etc.) leads to significantly faster and more stable convergence compared to vanilla DQN.


5. Atari Results & Significance

  • Human-Level Performance: DQN outperformed humans on 22/49 games.
  • Significance: Proved that a single architecture could learn diverse skills (reflexes in Breakout, precision in Space Invaders, resource management in Seaquest) solely from pixels and score.
  • Failure Cases: Struggles on games requiring long-term planning/exploration (e.g., Montezuma’s Revenge).

6. Offline Reinforcement Learning

Offline RL is the task of learning a policy from a fixed dataset D collected by some behavior policy π_β, without further interaction with the environment.

6.1 The Challenge: Distribution Shift

Standard Q-learning fails because the agent evaluates actions that are Out-of-Distribution (OOD).

Visualization of the Problem

Imagine an MDP with three actions:

  1. a_1 (Well-seen): Real Q = 10, Estimated Q = 10.5
  2. a_2 (Rarely seen): Real Q = 8, Estimated Q = 5
  3. a_3 (Unseen/OOD): Real Q = 8, Estimated Q = 11 (error due to function approximation)

Standard Q-learning will compute max_a Q(s, a) and pick action a_3, even though it is sub-optimal. This overestimation error propagates through the Bellman backup, causing the value function to “explode.”
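Plugging the numbers from the example above into a greedy argmax makes the failure concrete:

```python
import numpy as np

real_q = np.array([10.0, 8.0, 8.0])   # true values of a_1, a_2, a_3
est_q  = np.array([10.5, 5.0, 11.0])  # learned estimates; a_3 is OOD and inflated

chosen = int(est_q.argmax())          # Q-learning greedily trusts the estimates
# chosen is 2 (the OOD action a_3): estimated at 11, but truly worth only 8.
regret = real_q.max() - real_q[chosen]
# Worse, the backup bootstraps off max est_q = 11, which exceeds the true
# optimum of 10, so the overestimation feeds back into earlier states.
```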


7. Conservative Q-Learning (CQL)

Proposed by Kumar et al. (2020), CQL addresses this by learning a conservative (pessimistic) Q-function.

7.1 Key Idea: Expected Pessimism

Instead of point-wise guarantees, CQL ensures that the expected Q-value under the learned policy is a lower bound on its true value.

7.2 The CQL Loss Function

CQL adds a regularizer that pushes down Q-values for actions the learned policy favors but the behavior policy π_β rarely takes, while pushing up Q-values for actions actually present in the dataset:

L_CQL(θ) = α ( E_s∼D [ log Σ_a exp Q(s, a; θ) ] − E_(s,a)∼D [ Q(s, a; θ) ] ) + L_Bellman(θ)

  • Practical Implementation: For discrete actions, the first term is implemented using a logsumexp over all actions.
  • Result: CQL provably yields an under-estimate of the policy's value, which prevents the policy from “tripping” over OOD action errors. Experiments show CQL performs significantly better than other offline methods on small, noisy datasets.
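For discrete actions the regularizer can be sketched directly (numpy; α and the Q-values below are hypothetical, reusing the OOD example):

```python
import numpy as np

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())   # numerically stable log-sum-exp

def cql_penalty(q_s, a_data):
    """Soft-max over ALL actions is pushed down; the in-data action is pushed up."""
    return logsumexp(q_s) - q_s[a_data]

alpha = 1.0
q_s = np.array([10.5, 5.0, 11.0])            # estimates; the OOD action is inflated
penalty = alpha * cql_penalty(q_s, a_data=0)
# penalty is always >= 0 (logsumexp upper-bounds every entry), and its gradient
# lowers the large OOD value q_s[2] while raising the in-data value q_s[0].
```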


Figure: CQL empirically shows lower (conservative) values compared to standard Q-learning or ensembles, which tend to diverge.


References: Mnih et al. (2015); Kumar et al. (2020); Sutton & Barto (2018), Ch. 16.5.