Offline Reinforcement Learning
Offline RL
Offline RL (also called batch RL) learns a policy from a fixed dataset of previously collected transitions, without any further interaction with the environment.
Key challenge: distribution shift — the learned policy may select actions not well-represented in the dataset, leading to unreliable value estimates.
Solutions: Conservative Q-Learning (CQL), BCQ, BEAR, IQL — all constrain the learned policy to stay close to the dataset distribution.