Maximum Entropy RL
A framework that augments the standard RL objective with an entropy bonus, encouraging the agent to maximize expected return while acting as randomly as possible; the agent simultaneously maximizes reward and policy entropy.
Objective
Maximum Entropy Objective

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]$$
where:
- $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s_t)]$ — entropy of the policy at state $s_t$
- $\alpha$ — temperature parameter controlling the exploration–exploitation tradeoff
- $\rho_\pi$ — state-action distribution induced by $\pi$
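As a minimal sketch of the objective's per-step term, the snippet below computes the entropy of an illustrative discrete policy and the entropy-augmented reward (all numbers are made up for illustration):

```python
import numpy as np

# Illustrative discrete policy: probabilities over 3 actions in some state s.
pi = np.array([0.7, 0.2, 0.1])

# Policy entropy: H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s)
entropy = -np.sum(pi * np.log(pi))

# Entropy-augmented one-step term of the objective: r(s, a) + alpha * H(pi(.|s)).
reward = 1.0   # illustrative reward r(s, a)
alpha = 0.1    # temperature
augmented = reward + alpha * entropy
print(entropy, augmented)
```

A near-deterministic policy would contribute almost no bonus, while a uniform policy over 3 actions would earn the maximum bonus $\alpha \log 3$.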
Intuition
Why Add Entropy?
Standard RL finds a single optimal action per state. Maximum Entropy RL says: “Among all policies that achieve high reward, prefer the one that is most random.” This has several benefits:
- Better exploration: the agent is incentivized to try diverse actions
- Robustness: the policy doesn’t collapse to a single brittle action
- Multi-modality: can capture multiple near-optimal strategies
- Composability: entropy-regularized policies combine well across tasks
Effect of Temperature
| $\alpha$ | Behavior |
|---|---|
| $\alpha = 0$ | Standard (greedy) RL — exploit only |
| $\alpha$ small | Slight exploration bonus |
| $\alpha$ large | Highly stochastic — explore aggressively |
| $\alpha \to \infty$ | Uniform random policy |
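The table's spectrum can be seen directly in a Boltzmann (softmax) policy over illustrative Q-values, where the temperature $\alpha$ interpolates between greedy and uniform behavior:

```python
import numpy as np

def softmax_policy(q, alpha):
    """Boltzmann policy pi(a) proportional to exp(Q(a) / alpha)."""
    z = q / alpha
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 0.8, 0.1])   # illustrative action values
for alpha in (0.01, 0.5, 10.0):
    print(alpha, softmax_policy(q, alpha).round(3))
```

With $\alpha = 0.01$ virtually all probability mass lands on the best action; with $\alpha = 10$ the policy is close to uniform.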
Soft Bellman Equation
The entropy bonus modifies the Bellman equations:
Soft Value Functions

$$Q_{\text{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\big[ V_{\text{soft}}(s_{t+1}) \big]$$

$$V_{\text{soft}}(s_t) = \alpha \log \sum_{a} \exp\!\big( Q_{\text{soft}}(s_t, a) / \alpha \big)$$

The optimal policy is then $\pi^*(a_t \mid s_t) = \exp\big( (Q_{\text{soft}}(s_t, a_t) - V_{\text{soft}}(s_t)) / \alpha \big)$.
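A sketch of one soft Bellman backup on a tiny tabular example (Q-values, reward, and transition are all assumed for illustration): the soft value is a temperature-scaled log-sum-exp of the soft Q-values, which smoothly upper-bounds the max.

```python
import numpy as np

alpha, gamma = 0.5, 0.9
# Illustrative soft Q-table: rows are states s0, s1; columns are 2 actions.
Q = np.array([[1.0, 0.0],
              [0.2, 0.4]])

# Soft value: V_soft(s) = alpha * log sum_a exp(Q_soft(s, a) / alpha)
V = alpha * np.log(np.sum(np.exp(Q / alpha), axis=1))

# Soft Bellman backup for a pair (s, a) with reward r and
# a deterministic transition to next state s' = s1:
r, s_next = 0.5, 1
q_new = r + gamma * V[s_next]
print(V, q_new)
```

As $\alpha \to 0$ the log-sum-exp collapses to $\max_a Q(s, a)$ and the backup reduces to the standard Bellman optimality equation.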
Key Properties
- Provides a principled way to trade off exploration and exploitation via the temperature $\alpha$
- Leads to stochastic optimal policies (unlike standard RL which yields deterministic ones)
- Foundation for Soft Actor-Critic (SAC)
- Can be interpreted as KL-regularized RL (keeping policy close to a uniform prior)
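The KL interpretation in the last bullet follows from a one-line identity, assuming a finite action set $\mathcal{A}$ with uniform distribution $\mathcal{U}$:

$$\mathcal{H}\big(\pi(\cdot \mid s)\big) = \log |\mathcal{A}| - D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\big\|\, \mathcal{U}\big)$$

Since $\log |\mathcal{A}|$ is a constant, maximizing entropy is equivalent (up to that constant) to minimizing KL divergence from a uniform prior.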
Connections
- Implemented by Soft Actor-Critic (SAC)
- Related to Exploration vs Exploitation — entropy provides intrinsic exploration
- Builds on Policy Gradient Methods and Actor-Critic
- Temperature plays a similar role to exploration parameters in Epsilon-Greedy Policy