Conservative Q-Learning (CQL)

CQL (Kumar et al., 2020) is an offline RL algorithm that learns a conservative (pessimistic) Q-function. It penalizes Q-values for actions not seen in the dataset, preventing overestimation on out-of-distribution actions.

The Offline RL Problem

Why Standard Q-Learning Fails Offline

In offline RL, we learn from a fixed dataset without further environment interaction. Standard Q-learning and DQN can overestimate Q-values for actions absent from the dataset: the $\max_{a} Q(s', a)$ in the bootstrapped target may select an action never observed in the data, whose Q-value is unreliable. The resulting extrapolation error compounds through bootstrapping.
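The failure mode can be illustrated with a small sketch. The Q-values, reward, and discount below are made-up numbers for a single hypothetical next state, not results from any experiment; the point is only that the greedy max can land on an action the dataset never covers.

```python
# Hypothetical Q-estimates for one next state s' with four discrete actions.
# Actions 0 and 1 appear in the dataset; actions 2 and 3 never do, so their
# estimates are unconstrained by data (all numbers are illustrative).
q_next = [1.0, 1.2, 5.0, -3.0]
in_dataset = [True, True, False, False]

reward, gamma = 0.5, 0.99

# Standard Q-learning target: max over ALL actions, seen or not.
greedy_action = max(range(len(q_next)), key=lambda a: q_next[a])
td_target = reward + gamma * q_next[greedy_action]

# For comparison only: the target if the max were restricted to actions
# that have data support.
supported = [q for q, seen in zip(q_next, in_dataset) if seen]
td_target_supported = reward + gamma * max(supported)

print("greedy action:", greedy_action, "seen in data:", in_dataset[greedy_action])
print("unrestricted target:", td_target)
print("supported-only target:", td_target_supported)
```

Here the unrestricted target is driven entirely by an unseen action's estimate, and that inflated target is then regressed into Q for the preceding state, which is how the error propagates.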

CQL Key Idea

Add a regularizer to the usual TD loss that pushes down Q-values across all actions while pushing up Q-values for the actions that appear in the dataset:

CQL Objective (simplified)

$\min_{Q} \; \alpha \left( \mathbb{E}_{s \sim D} \left[ \log \sum_{a} \exp Q(s, a) \right] - \mathbb{E}_{(s, a) \sim D} \left[ Q(s, a) \right] \right) + \text{standard TD loss}$

First term: pushes down Q-values for all actions (via logsumexp, a soft maximum that puts the most weight on high-Q actions)

Second term: pushes up Q-values for dataset actions

Net effect: Q-values for unseen actions are conservatively low
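The regularizer in the objective above can be sketched for a small discrete-action case. This is a minimal illustration, not the reference implementation: the function names `logsumexp` and `cql_penalty`, the dictionary-based Q-table, and all numbers are assumptions made for the example, and the TD loss is left out.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_table, batch, alpha=1.0):
    """Hypothetical helper: the CQL regularizer on a batch of (state, action) pairs.

    q_table maps a state to a list of Q-values, one per discrete action.
    """
    # First term: soft maximum over all actions at each dataset state (pushed down).
    push_down = sum(logsumexp(q_table[s]) for s, _ in batch) / len(batch)
    # Second term: Q-value of the action actually taken in the dataset (pushed up).
    push_up = sum(q_table[s][a] for s, a in batch) / len(batch)
    return alpha * (push_down - push_up)


# Illustrative Q-values; in both states the dataset action is action 0.
q_table = {
    "s0": [1.0, 1.0, 1.0],  # all actions valued equally
    "s1": [1.0, 5.0, 1.0],  # action 1 (not in the batch) has an inflated estimate
}
flat = cql_penalty(q_table, [("s0", 0)])
inflated = cql_penalty(q_table, [("s1", 0)])
print("penalty, flat Q-values:", flat)
print("penalty, inflated unseen action:", inflated)
```

Because logsumexp is at least as large as the largest Q-value, the penalty is never negative, and it grows when an action outside the batch carries a high estimate. Minimizing it therefore lowers those estimates relative to the dataset actions.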

Connections

Addresses: Offline Reinforcement Learning distribution shift
Extends: Deep Q-Network (DQN)
Alternatives: BCQ, BEAR, IQL (other offline RL methods)

Appears In

RL-L08 - Deep RL Value-Based
