Exploration vs Exploitation
Definition
Exploration vs Exploitation Dilemma
The fundamental trade-off in RL between:
- Exploitation: Choosing the action with the highest estimated value (using current knowledge to maximize immediate reward)
- Exploration: Choosing non-greedy actions to gather more information about their values (potentially discovering better long-term strategies)
The Restaurant Analogy
You know a good restaurant nearby (exploitation). But there might be an even better one you haven’t tried (exploration). Every night you eat at the known-good place, you miss the chance to discover something better. But every night you try somewhere random, you risk a bad meal. The optimal strategy is somewhere in between.
Why It Matters
- If you only exploit: You may converge to a suboptimal policy because you never discovered better actions
- If you only explore: You waste time on clearly bad actions and never capitalize on what you’ve learned
- The optimal balance depends on the time horizon: more exploration early, more exploitation later
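The "explore early, exploit later" idea is often implemented as a decaying exploration rate. A minimal sketch, assuming an exponential decay schedule (the function name and default constants here are illustrative, not from the text):

```python
import math

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay=0.001):
    # Exploration rate shrinks from eps_start toward eps_end as training
    # progresses; early steps explore almost always, late steps mostly exploit.
    return eps_end + (eps_start - eps_end) * math.exp(-decay * step)

print(epsilon_at(0))      # start: fully exploratory
print(epsilon_at(5000))   # late: close to the eps_end floor
```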
Methods for Balancing
| Method | Type | Key Idea |
|---|---|---|
| Epsilon-Greedy Policy | Action selection | Random action with probability ε, greedy action otherwise |
| Upper Confidence Bound | Action selection | Bonus for uncertainty: prefer under-explored actions |
| Optimistic Initial Values | Initialization | High initial values drive early exploration |
| Gradient Bandit | Policy gradient | Softmax over learned preferences |
| Exploring Starts | Episode init | Random starting state-action pairs |
| Boltzmann/Softmax | Action selection | Temperature-controlled randomness |
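Two of the action-selection rules above can be sketched in a few lines, assuming we track per-arm value estimates `q` and visit counts `n` (these names, and the exploration constant `c`, are our own choices):

```python
import math
import random

def epsilon_greedy(q, epsilon, rng=random):
    # With probability epsilon, explore with a uniformly random arm;
    # otherwise exploit the arm with the highest estimated value.
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def ucb(q, n, t, c=2.0):
    # Upper Confidence Bound: add an uncertainty bonus that is large for
    # under-explored arms, so they get tried without being chosen at random.
    def score(a):
        if n[a] == 0:
            return float("inf")  # untried arms are selected first
        return q[a] + c * math.sqrt(math.log(t) / n[a])
    return max(range(len(q)), key=score)

q = [0.2, 0.5, 0.1]   # value estimates
n = [10, 10, 1]       # visit counts
print(epsilon_greedy(q, epsilon=0.0))  # pure exploitation -> arm 1
print(ucb(q, n, t=21))                 # bonus favors under-explored arm 2
```

Note how UCB picks arm 2 despite its low value estimate: its single visit makes the uncertainty bonus dominate, which is exactly the "prefer under-explored actions" idea from the table.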
In Different RL Contexts
- Multi-Armed Bandit: the trade-off in its purest form (single state, no transitions)
- Monte Carlo Methods: ε-greedy or exploring starts for exploration
- Temporal Difference Learning: ε-greedy during training (SARSA, Q-Learning)
- Deep RL: ε-greedy with decay, entropy regularization, curiosity-driven exploration
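In the TD setting, ε-greedy sits inside the learning loop. A minimal sketch of tabular Q-Learning with ε-greedy exploration, on a tiny two-state MDP invented here for illustration (the environment and all constants are assumptions):

```python
import random

random.seed(0)
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def env_step(s, a):
    # Toy dynamics: action 1 in state 0 reaches a rewarding terminal state;
    # anything else stays in state 0 with no reward.
    if s == 0 and a == 1:
        return 1, 1.0, True
    return 0, 0.0, False

for _ in range(200):
    s, done = 0, False
    while not done:
        if random.random() < epsilon:
            a = random.randrange(n_actions)                    # explore
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])   # exploit
        s2, r, done = env_step(s, a)
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])                  # Q-Learning update
        s = s2

print(Q[0])  # the estimate for the rewarding action dominates
```

Without the exploratory branch the greedy tie-break would pick action 0 forever and never discover the reward, illustrating the "if you only exploit" failure mode above.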
Connections
- Core to: Multi-Armed Bandit, Monte Carlo Control
- Methods: Epsilon-Greedy Policy, Upper Confidence Bound, Exploring Starts
- Affects: convergence speed, optimality of learned policy