Off-Policy Learning

Definition

Off-policy learning is a reinforcement learning paradigm where the target policy (the policy being learned) is different from the behavior policy (the policy used to generate data/interact with the environment).

Key Components

  • Target Policy (π): The policy we want to evaluate or optimize (often the greedy policy).
  • Behavior Policy (b): The policy used to explore and collect experience (often ε-greedy).
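The split between the two policies can be made concrete with a minimal sketch. The state, Q-values, and ε here are hypothetical, chosen only for illustration: the target policy is greedy over Q, while the behavior policy is ε-greedy over the same Q.

```python
import random

# Hypothetical Q-values for a single state with three actions.
Q = [1.0, 2.5, 0.5]
EPSILON = 0.1  # exploration rate for the behavior policy (assumed value)

def target_policy(q_values):
    """Greedy target policy: always pick the highest-valued action."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def behavior_policy(q_values, epsilon=EPSILON):
    """Epsilon-greedy behavior policy: explore with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return target_policy(q_values)

print(target_policy(Q))  # -> 1 (the greedy action)
```

Experience is generated by `behavior_policy`, but learning aims at the value of `target_policy`; because the two differ, any data collected remains usable even after Q changes.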

Intuition

Learning by Watching

Off-policy learning is like learning to drive by watching a movie of someone else driving. You can evaluate how good their choices were (target policy) even though you aren’t the one making them (behavior policy). This allows you to “re-watch” old experiences and learn from them even after your driving style has changed.

Comparison: On vs Off

| Feature | On-Policy Learning | Off-Policy Learning |
| --- | --- | --- |
| Data Source | Current policy | Any policy (old self, human, random) |
| Variance | Typically lower | Typically higher (requires importance sampling) |
| Sample Efficiency | Less sample efficient | More sample efficient (supports experience replay) |
| Stability | Generally more stable | Can be unstable with function approximation (the "Deadly Triad") |

Mechanisms

To learn off-policy, one must account for the difference in distributions:

  1. Importance Sampling: Weighting returns by the ratio π(a|s) / b(a|s) to correct for the difference in how often the behavior policy takes each action relative to the target policy.
  2. Max Operator: Algorithms like Q-Learning avoid importance sampling by directly updating towards the maximum possible value, effectively learning the greedy policy regardless of the behavior.

Connections

Appears In