Off-Policy Learning
Definition
Off-policy learning is a reinforcement learning paradigm in which the target policy (the policy being learned) differs from the behavior policy (the policy that interacts with the environment and generates the data).
Key Components
- Target Policy ($\pi$): The policy we want to evaluate or optimize (often the greedy policy).
- Behavior Policy ($b$): The policy used to explore and collect experience (often $\epsilon$-greedy).
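The two policies can be made concrete with a small sketch. It assumes a single state with a short list of estimated action values; the names `pi`, `b`, `greedy_probs`, `epsilon_greedy_probs`, and `sample_action` are illustrative choices, not part of any library.

```python
import random


def greedy_probs(q_values):
    """Target policy: put all probability on the highest-valued action."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Behavior policy: mostly greedy, but every action keeps at least epsilon / n."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    return [epsilon / n + (1.0 - epsilon if a == best else 0.0) for a in range(n)]


def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


q = [0.2, 1.5, -0.3]                      # hypothetical action values for one state
pi = greedy_probs(q)                      # target policy probabilities
b = epsilon_greedy_probs(q, epsilon=0.3)  # behavior policy probabilities
action = sample_action(b)                 # experience is collected with b, not pi
```

Because the behavior policy gives every action a nonzero probability, it can still visit actions that the greedy target policy would never choose.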
Intuition
Learning by Watching
Off-policy learning is like learning to drive by watching a movie of someone else driving. The driver on screen plays the role of the behavior policy: they make the choices and generate the experience. The driving style you are trying to learn is the target policy, and you can improve it from their experience even though you are not the one at the wheel. This also allows you to “re-watch” old experiences and keep learning from them even after your own driving style has changed.
Comparison: On vs Off
| Feature | On-Policy Learning | Off-Policy Learning |
|---|---|---|
| Data Source | Current policy | Any policy (old self, human, random) |
| Variance | Typically lower | Typically higher (especially when Importance Sampling is used) |
| Efficiency | Less sample efficient | More sample efficient (supports Experience Replay) |
| Stability | Generally more stable | Can be unstable with function approximation (Deadly Triad) |
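The sample-efficiency row comes down to reusing stored transitions. The following is a minimal sketch of an experience replay buffer, assuming transitions are stored as `(state, action, reward, next_state, done)` tuples; the `ReplayBuffer` class and its method names are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Transitions may come from any earlier version of the behavior policy.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(batch_size=2)
```

An on-policy method could not reuse such a batch directly, because the stored transitions were not generated by the current policy.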
Mechanisms
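To learn off-policy, one must account for the mismatch between the data the behavior policy generates and the data the target policy would have generated:
- Importance Sampling: Weighting returns by the ratio of target to behavior action probabilities, $\rho = \pi(a \mid s) / b(a \mid s)$ (multiplied over the steps of a trajectory), to correct for how often the behavior policy takes each action compared with the target policy.
- Max Operator: Algorithms like Q-Learning avoid importance sampling by updating directly towards the maximum estimated action value in the next state, effectively learning the greedy policy regardless of which behavior policy collected the data.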
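Both mechanisms can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: policies are plain functions returning action probabilities, the Q-table is a dictionary keyed by `(state, action)`, and all function and variable names are hypothetical.

```python
from collections import defaultdict


def importance_ratio(trajectory, pi, b):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs of a trajectory."""
    rho = 1.0
    for state, action in trajectory:
        rho *= pi(state, action) / b(state, action)
    return rho


def q_learning_update(Q, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: bootstrap from the max over next actions."""
    if done:
        best_next = 0.0
    else:
        best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


def pi(state, action):
    # Hypothetical target policy: always picks action 0.
    return 1.0 if action == 0 else 0.0


def b(state, action):
    # Hypothetical behavior policy: uniform over two actions.
    return 0.5


rho_match = importance_ratio([(0, 0), (1, 0)], pi, b)     # 2 * 2 = 4.0
rho_mismatch = importance_ratio([(0, 0), (1, 1)], pi, b)  # 0.0: pi never takes action 1

Q = defaultdict(float)
Q[(1, 1)] = 2.0
new_value = q_learning_update(Q, state=0, action=0, reward=1.0, next_state=1,
                              done=False, n_actions=2, alpha=0.5, gamma=0.9)
```

The importance ratio needs the behavior policy's probabilities and grows or vanishes multiplicatively with trajectory length, which is where the extra variance comes from. The Q-learning update never asks which action the behavior policy took next, so no ratio appears.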
Connections
- Primary Example: Q-Learning
- Risk: Deadly Triad (Function Approximation + Bootstrapping + Off-Policy)
- Enables: Experience Replay
- Method: Importance Sampling