Maximum Entropy RL
A framework that augments the standard RL objective with an entropy bonus, encouraging the agent to maximize expected return while acting as randomly as possible; the agent simultaneously maximizes reward and policy entropy.
Objective
Maximum Entropy Objective

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]$$
where:
- $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s_t)]$ — entropy of the policy at state $s_t$
- $\alpha$ — temperature parameter controlling the exploration–exploitation tradeoff
- $\rho_\pi$ — state-action distribution induced by $\pi$
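As a minimal sketch of the objective's per-step term, the snippet below computes the entropy of an illustrative discrete policy and the entropy-augmented reward (all numbers are made up for illustration):

```python
import numpy as np

# Illustrative discrete policy: probabilities over 3 actions in some state s.
pi = np.array([0.7, 0.2, 0.1])

# Policy entropy: H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s)
entropy = -np.sum(pi * np.log(pi))

# Entropy-augmented one-step term of the objective: r(s, a) + alpha * H(pi(.|s)).
reward = 1.0   # illustrative reward r(s, a)
alpha = 0.1    # temperature
augmented = reward + alpha * entropy
print(entropy, augmented)
```

A near-deterministic policy would contribute almost no bonus, while a uniform policy over 3 actions would earn the maximum bonus $\alpha \log 3$.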
Intuition
Why Add Entropy?
Standard RL finds a single optimal action per state. Maximum Entropy RL says: “Among all policies that achieve high reward, prefer the one that is most random.” This has several benefits:
- Better exploration: the agent is incentivized to try diverse actions
- Robustness: the policy doesn’t collapse to a single brittle action
- Multi-modality: can capture multiple near-optimal strategies
- Composability: entropy-regularized policies combine well across tasks
Effect of Temperature
| $\alpha$ | Behavior |
|---|---|
| $\alpha = 0$ | Standard (greedy) RL — exploit only |
| $\alpha$ small | Slight exploration bonus |
| $\alpha$ large | Highly stochastic — explore aggressively |
| $\alpha \to \infty$ | Uniform random policy |
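The table's spectrum can be seen directly in a Boltzmann (softmax) policy over illustrative Q-values, where the temperature $\alpha$ interpolates between greedy and uniform behavior:

```python
import numpy as np

def softmax_policy(q, alpha):
    """Boltzmann policy pi(a) proportional to exp(Q(a) / alpha)."""
    z = q / alpha
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 0.8, 0.1])   # illustrative action values
for alpha in (0.01, 0.5, 10.0):
    print(alpha, softmax_policy(q, alpha).round(3))
```

With $\alpha = 0.01$ virtually all probability mass lands on the best action; with $\alpha = 10$ the policy is close to uniform.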
Soft Bellman Equation
The entropy bonus modifies the Bellman equations:
Soft Value Functions

$$Q_{\text{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\big[ V_{\text{soft}}(s_{t+1}) \big]$$

$$V_{\text{soft}}(s_t) = \alpha \log \sum_{a} \exp\!\big( Q_{\text{soft}}(s_t, a) / \alpha \big)$$

The optimal policy is then $\pi^*(a_t \mid s_t) = \exp\big( (Q_{\text{soft}}(s_t, a_t) - V_{\text{soft}}(s_t)) / \alpha \big)$.
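A sketch of one soft Bellman backup on a tiny tabular example (Q-values, reward, and transition are all assumed for illustration): the soft value is a temperature-scaled log-sum-exp of the soft Q-values, which smoothly upper-bounds the max.

```python
import numpy as np

alpha, gamma = 0.5, 0.9
# Illustrative soft Q-table: rows are states s0, s1; columns are 2 actions.
Q = np.array([[1.0, 0.0],
              [0.2, 0.4]])

# Soft value: V_soft(s) = alpha * log sum_a exp(Q_soft(s, a) / alpha)
V = alpha * np.log(np.sum(np.exp(Q / alpha), axis=1))

# Soft Bellman backup for a pair (s, a) with reward r and
# a deterministic transition to next state s' = s1:
r, s_next = 0.5, 1
q_new = r + gamma * V[s_next]
print(V, q_new)
```

As $\alpha \to 0$ the log-sum-exp collapses to $\max_a Q(s, a)$ and the backup reduces to the standard Bellman optimality equation.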
Key Properties
- Provides a principled way to trade off exploration and exploitation via the temperature $\alpha$
- Leads to stochastic optimal policies (unlike standard RL which yields deterministic ones)
- Foundation for Soft Actor-Critic (SAC)
- Can be interpreted as KL-regularized RL (keeping policy close to a uniform prior)
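The KL interpretation in the last bullet follows from a one-line identity, assuming a finite action set $\mathcal{A}$ with uniform distribution $\mathcal{U}$:

$$\mathcal{H}\big(\pi(\cdot \mid s)\big) = \log |\mathcal{A}| - D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\big\|\, \mathcal{U}\big)$$

Since $\log |\mathcal{A}|$ is a constant, maximizing entropy is equivalent (up to that constant) to minimizing KL divergence from a uniform prior.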
Connections
- Implemented by Soft Actor-Critic (SAC)
- Related to Exploration vs Exploitation — entropy provides intrinsic exploration
- Builds on Policy Gradient Methods and Actor-Critic
- Temperature plays a similar role to exploration parameters in Epsilon-Greedy Policy