RL Lecture 8: Deep RL (Value-Based Methods)
1. Why Deep RL?
Traditional Reinforcement Learning (tabular or linear function approximation) faces significant hurdles in complex environments:
- Curse of Dimensionality: Tabular methods (e.g., standard Q-learning) cannot scale to high-dimensional state spaces (like pixels).
- Manual Feature Engineering: Linear function approximation requires expert-designed features, which are often task-specific and brittle.
- Limited Representational Power: Linear models cannot capture the non-linear relationships often required for complex tasks (e.g., visual perception).
Transition to Deep RL: Deep Reinforcement Learning replaces manual feature engineering with Representation Learning. By using deep neural networks (especially CNNs), agents can learn features directly from raw input (pixels) end-to-end.
2. Deep Q-Network (DQN)
Introduced by Mnih et al. (2015), DQN was the first algorithm to achieve human-level performance across a diverse range of tasks (Atari 2600 games) using the same architecture and hyperparameters.
2.1 Pre-processing
To handle raw pixels efficiently, DQN applies several task-agnostic steps:
- Downscaling & Grayscale: Images are downscaled to 84×84 resolution and converted to grayscale to save memory.
- Frame Stacking: The agent receives a stack of the 4 most recent frames as input. This provides a “short memory,” allowing the agent to infer velocity and direction (e.g., whether a ball is moving up or down).
- Reward Clipping: Rewards are clipped to $[-1, 1]$ to stabilize the gradients across games with very different score scales.
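The pre-processing steps above can be sketched in a few lines. This is a simplified stand-in (channel-mean grayscale and nearest-neighbor downscaling instead of the exact pipeline in Mnih et al.), but it shows the shapes involved:

```python
from collections import deque
import numpy as np

def preprocess(frame, out_size=84):
    """Crude grayscale + downscale of an RGB frame (H, W, 3) -> (84, 84)."""
    gray = frame.mean(axis=2)                       # simple luminance approximation
    h, w = gray.shape
    ys = np.linspace(0, h - 1, out_size).astype(int)
    xs = np.linspace(0, w - 1, out_size).astype(int)
    return gray[np.ix_(ys, xs)].astype(np.float32) / 255.0

# Frame stacking: keep the 4 most recent pre-processed frames
frames = deque(maxlen=4)
for _ in range(4):
    frames.append(preprocess(np.zeros((210, 160, 3), dtype=np.uint8)))
state = np.stack(frames, axis=0)                    # network input, shape (4, 84, 84)

# Reward clipping: a raw game score of +37 becomes +1
clipped_reward = float(np.clip(37.0, -1.0, 1.0))
```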
2.2 Architecture
The DQN architecture is a deep Convolutional Neural Network (CNN):
- Input: an 84×84×4 tensor (the 4 pre-processed frames).
- Conv Layers: Three convolutional layers with ReLU activations to extract spatial features.
- Fully Connected Layers: One or more FC layers to map features to action values.
- Output: A separate output unit for each possible action. This layout is more efficient than taking the action as an input, since all action values are computed in a single forward pass.

Figure 1: DQN Convolutional Architecture
The network takes the 4 most recent frames as input. It consists of three convolutional layers (extracting features like ball position and motion) followed by fully connected layers that output a Q-value for each action. Note the output layout: all actions are computed in parallel.
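Using the layer specification from Mnih et al. (2015) — 8×8 filters with stride 4, then 4×4 with stride 2, then 3×3 with stride 1, 64 filters in the last conv layer — we can verify the feature-map sizes with a short calculation:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

size = 84                                          # input: 4 stacked 84x84 frames
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:    # the three DQN conv layers
    size = conv_out(size, kernel, stride)

flat_features = 64 * size * size                   # 64 filters in the final conv layer
print(size, flat_features)                         # 7 3136 -> fed to the FC layers
```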
2.3 Key Techniques for Stability
Training deep neural networks with Q-learning is notoriously unstable. DQN solves this using two main techniques:
Experience Replay
- Mechanism: Transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in a large replay buffer $\mathcal{D}$ (a circular queue).
- Training: At each step, a small random minibatch is sampled from the buffer for the update.
- Why it helps:
- Breaks Correlations: Successive transitions in an episode are highly correlated. Randomized sampling makes the data look more like the i.i.d. data used in supervised learning.
- Data Efficiency: Experiences are reused multiple times for learning.
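A minimal replay buffer is just a bounded queue plus uniform random sampling. A sketch (class and parameter names are illustrative, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular buffer of transitions with uniform minibatch sampling."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation of consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push(t, t % 4, 0.0, t + 1, False)   # toy transitions
batch = buf.sample(32)                      # minibatch for one update
```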
Target Network
- Mechanism: DQN maintains two networks: the Online Network $Q(s, a; \theta)$ and the Target Network $Q(s, a; \theta^-)$.
- Update: The target network weights $\theta^-$ are kept fixed and only synchronized with the online network every $C$ steps ($\theta^- \leftarrow \theta$).
- Bellman Update: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$
- Why it helps: In standard Q-learning, the target changes with every weight update, leading to feedback loops and oscillations. A fixed target provides a stable “ground truth” to move towards.
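The target-network mechanics can be sketched with a hypothetical linear Q-function (the weight matrices here stand in for network parameters; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma = 8, 4, 0.99

theta = rng.normal(size=(n_features, n_actions))   # online network weights
theta_target = theta.copy()                        # target network: frozen snapshot

def q_values(phi, weights):
    """Hypothetical linear Q-function: one value per action."""
    return phi @ weights

phi_next = rng.normal(size=n_features)
r = 1.0
# The Bellman target uses the *target* weights, which stay fixed between syncs,
# so the regression target does not shift with every online update
y = r + gamma * q_values(phi_next, theta_target).max()

# ... many gradient updates to theta happen here ...
theta_target = theta.copy()                        # periodic sync: every C steps
```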
3. The DQN Algorithm
The loss function for DQN is the mean squared error (MSE) of the Bellman residual:

$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$
Algorithm: DQN with Experience Replay
- Initialize replay memory $\mathcal{D}$ to capacity $N$
- Initialize action-value function $Q$ with random weights $\theta$
- Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
- For episode $= 1$ to $M$ do:
- Initialize state $s_1$ and pre-process $\phi_1 = \phi(s_1)$
- For $t = 1$ to $T$ do:
- With probability $\varepsilon$ select a random action $a_t$, else $a_t = \arg\max_a Q(\phi_t, a; \theta)$
- Execute $a_t$, observe reward $r_t$ and image $x_{t+1}$
- Pre-process $\phi_{t+1} = \phi(s_{t+1})$
- Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $\mathcal{D}$
- Sample minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $\mathcal{D}$
- Set $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$ (or $y_j = r_j$ if terminal)
- Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to $\theta$
- Every $C$ steps, set $\theta^- \leftarrow \theta$
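One inner-loop update of the algorithm can be sketched end to end with a linear Q-function on fake data (all names and the linear parameterization are illustrative; a real DQN would use the CNN above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 8, 4
gamma, alpha = 0.99, 0.01
theta = rng.normal(size=(n_features, n_actions)) * 0.1   # online weights
theta_target = theta.copy()                              # target weights

def q(phi, w):
    return phi @ w                                       # linear Q(s, .)

# A fake minibatch of transitions (phi, a, r, phi_next, done)
batch = [(rng.normal(size=n_features), int(rng.integers(n_actions)),
          float(rng.normal()), rng.normal(size=n_features),
          bool(rng.random() < 0.1)) for _ in range(32)]

grad = np.zeros_like(theta)
for phi, a, r, phi_next, done in batch:
    y = r if done else r + gamma * q(phi_next, theta_target).max()
    td_error = y - q(phi, theta)[a]
    grad[:, a] += td_error * phi            # semi-gradient of the squared residual
theta += alpha * grad / len(batch)          # one gradient descent step on theta
```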
4. DQN Extensions
Several influential improvements have been proposed:
- Double DQN: Mitigates overestimation bias by decoupling action selection from evaluation.
- Prioritized Experience Replay: Samples transitions with higher TD error more frequently.
- Dueling DQN: Splits the network into a state-value stream $V(s)$ and an advantage stream $A(s, a)$, recombined as $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$.
- Rainbow DQN: Combines all of the above (plus multi-step returns, distributional RL, and noisy layers) for drastically better performance.

Figure 2: The "Rainbow" of Improvements
Learning curves show that combining individual “tricks” (Double DQN, Dueling, etc.) leads to significantly faster and more stable convergence compared to vanilla DQN.
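The key idea behind Double DQN — select the action with the online network, evaluate it with the target network — fits in a few lines with hypothetical Q-values:

```python
import numpy as np

q_online = np.array([1.0, 3.0, 2.5])   # hypothetical Q(s', .; theta)
q_target = np.array([1.2, 2.0, 4.0])   # hypothetical Q(s', .; theta^-)
r, gamma = 0.0, 0.99

# Vanilla DQN: select AND evaluate with the target network (prone to overestimation)
y_dqn = r + gamma * q_target.max()                 # 0.99 * 4.0

# Double DQN: select with the online network, evaluate with the target network
a_star = int(q_online.argmax())                    # online network prefers action 1
y_double = r + gamma * q_target[a_star]            # 0.99 * 2.0
```

Because the two networks rarely overestimate the same action, the decoupled target is systematically lower, mitigating the max-operator bias.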
5. Atari Results & Significance
- Human-Level Performance: DQN outperformed humans on 22/49 games.
- Significance: Proved that a single architecture could learn diverse skills (reflexes in Breakout, precision in Space Invaders, resource management in Seaquest) solely from pixels and score.
- Failure Cases: Struggles on games requiring long-term planning/exploration (e.g., Montezuma’s Revenge).
6. Offline Reinforcement Learning
Offline RL is the task of learning a policy from a fixed dataset $\mathcal{D}$ collected by some behavior policy $\pi_\beta$, without further interaction.
6.1 The Challenge: Distribution Shift
Standard Q-learning fails because the agent evaluates actions that are Out-of-Distribution (OOD).
Visualization of the Problem
Imagine an MDP with three actions:
- $a_1$ (Well-seen): real $Q = 10$, estimated $Q = 10.5$
- $a_2$ (Rarely seen): real $Q = 8$, estimated $Q = 5$
- $a_3$ (Unseen/OOD): real $Q = 8$, estimated $Q = 11$ (error due to function approximation)
Standard Q-learning will compute $\max_a Q(s, a)$ and pick action $a_3$, even though it is sub-optimal. This overestimation error propagates through the Bellman backup, causing the value function to “explode.”
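The failure mode above in concrete numbers (the values are the illustrative ones from the example, not from any benchmark):

```python
true_q = {"a1": 10.0, "a2": 8.0, "a3": 8.0}
est_q  = {"a1": 10.5, "a2": 5.0, "a3": 11.0}   # a3 never seen: pure extrapolation

# Greedy action selection trusts the estimates, so it picks the OOD action a3
greedy = max(est_q, key=est_q.get)
regret = true_q["a1"] - true_q[greedy]         # value lost vs. the true best action
```

The agent forfeits value not because its estimates are uniformly bad, but because the max operator actively seeks out the action whose estimate is most inflated.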
7. Conservative Q-Learning (CQL)
Proposed by Kumar et al. (2020), CQL addresses this by learning a conservative (pessimistic) $Q$-function.
7.1 Key Idea: Expected Pessimism
Instead of point-wise guarantees, CQL ensures that the expected $Q$-value under the learned policy $\pi$ is a lower bound on its true value: $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[\hat{Q}(s, a)] \le \mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q^\pi(s, a)]$.
7.2 The CQL Loss Function
CQL adds a regularizer that pushes down $Q$-values for actions sampled from a distribution $\mu$ (e.g., the current policy) while pushing up $Q$-values for actions actually in the dataset:

$\min_Q \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu}\left[Q(s, a)\right] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[Q(s, a)\right] \right) + \frac{1}{2}\, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[ \left( Q(s, a) - \mathcal{B}^\pi \hat{Q}(s, a) \right)^2 \right]$

- Practical Implementation: For discrete actions, the first term is implemented using a `logsumexp` over all actions.
- Result: Provably guarantees under-estimation, which prevents the policy from “tripping” over OOD action errors. Experiments show CQL performs significantly better than other offline methods on small, noisy datasets.
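A sketch of the discrete-action CQL penalty (the Q-values here are made-up; $\alpha = 1$ for simplicity):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

q_s = np.array([2.0, 0.5, 3.0])   # hypothetical Q(s, .) for 3 discrete actions
a_data = 0                        # action actually taken in the dataset
alpha = 1.0

# Push down a soft maximum over ALL actions (including OOD ones),
# push up the Q-value of the in-distribution (dataset) action
cql_penalty = alpha * (logsumexp(q_s) - q_s[a_data])
```

The penalty is largest when some action not supported by the data carries a high estimated Q-value, which is exactly the situation that destabilizes naive offline Q-learning.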

Figure: CQL empirically shows lower (conservative) values compared to standard Q-learning or ensembles, which tend to diverge.
References: Mnih et al. (2015); Kumar et al. (2020); Sutton & Barto (2018), Ch. 16.5.