Conservative Q-Learning (CQL)

CQL (Kumar et al., 2020) is an offline RL algorithm that learns a conservative (pessimistic) Q-function. It penalizes Q-values for actions not seen in the dataset, preventing overestimation on out-of-distribution actions.

The Offline RL Problem

Why Standard Q-Learning Fails Offline

In offline RL, we learn from a fixed dataset without further environment interaction. Standard Q-Learning / DQN can overestimate Q-values for actions not in the dataset, because the max in the Bellman target, max_a' Q(s', a'), might select an action we’ve never seen, whose Q-value is unreliable. This extrapolation error compounds through bootstrapping: each inflated target is used to fit other Q-values, and with no new data the error is never corrected.
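
The upward bias of the max can be seen in a toy setting. The sketch below is illustrative only and assumes every true Q-value is 0, with each estimate corrupted by independent zero-mean Gaussian noise; the function name and parameters are hypothetical, not part of CQL or any library.

```python
import random


def max_of_noisy_estimates(num_actions, noise_std, trials, seed=0):
    """Average of max_a Q_hat(s, a) when every true Q-value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value (0.0) plus zero-mean Gaussian noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # The greedy backup keeps only the largest (most optimistic) estimate.
        total += max(estimates)
    return total / trials


bias_few = max_of_noisy_estimates(num_actions=2, noise_std=1.0, trials=5000)
bias_many = max_of_noisy_estimates(num_actions=20, noise_std=1.0, trials=5000)
print(f"2 actions:  average max = {bias_few:.3f}")
print(f"20 actions: average max = {bias_many:.3f}")
```

Both averages come out above the true value of 0, and the gap widens with more candidate actions. Online, new experience would eventually correct such estimates; offline, nothing does.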

CQL Key Idea

Add a regularizer to the standard Bellman error that pushes down Q-values across all actions while pushing up Q-values for actions in the dataset:

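CQL Objective (simplified)

$$\min_Q \; \alpha \, \mathbb{E}_{s \sim \mathcal{D}} \left[ \log \sum_{a} \exp Q(s, a) \; - \; \mathbb{E}_{a \sim \mathcal{D}} \left[ Q(s, a) \right] \right] \; + \; \frac{1}{2} \, \mathbb{E}_{(s, a, s') \sim \mathcal{D}} \left[ \left( Q(s, a) - \hat{\mathcal{B}} \bar{Q}(s, a) \right)^2 \right]$$

Here \mathcal{D} is the dataset, \alpha > 0 weights the regularizer, and \hat{\mathcal{B}} \bar{Q} is the usual Bellman target computed from a target network. Within the regularizer (the bracketed expression):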

  • First term: pushes down Q-values for all actions (via logsumexp, which emphasizes high Q actions)
  • Second term: pushes up Q-values for dataset actions
  • Net effect: Q-values for unseen actions are conservatively low
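
The two regularizer terms can be sketched for a single state with a small discrete action set. This is a minimal illustration, not a reference implementation: the helper names (`logsumexp`, `cql_penalty`, `cql_penalty_grad`) and the example Q-values are assumptions made here, and the Bellman error term is left out.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values, dataset_action):
    """logsumexp over all actions minus the Q-value of the dataset action."""
    return logsumexp(q_values) - q_values[dataset_action]


def cql_penalty_grad(q_values, dataset_action):
    """Gradient of the penalty w.r.t. each Q-value: softmax(Q) minus one-hot."""
    m = max(q_values)
    exps = [math.exp(v - m) for v in q_values]
    z = sum(exps)
    grad = [e / z for e in exps]   # first term: pushes every action down
    grad[dataset_action] -= 1.0    # second term: pushes the dataset action up
    return grad


# Hypothetical state: action 0 is in the dataset, action 1 is unseen and inflated.
q = [1.0, 5.0, 2.0]
penalty = cql_penalty(q, dataset_action=0)
grad = cql_penalty_grad(q, dataset_action=0)

# One gradient-descent step on the penalty alone.
lr = 0.5
q_new = [qi - lr * gi for qi, gi in zip(q, grad)]
print("penalty: ", penalty)
print("gradient:", grad)
print("updated: ", q_new)
```

Because the gradient of logsumexp is the softmax of the Q-values, the action with the highest Q-value receives most of the downward push, while the dataset action is the only one that moves up.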
