Entropy

Definition

(Shannon) Entropy

The entropy of a discrete probability distribution $p$ measures its uncertainty: the expected amount of “surprise” (in nats if using $\ln$, or bits if using $\log_{2}$) when sampling from it. For a policy $π (\cdot ∣ s)$ over actions, the entropy is $H (π (\cdot ∣ s)) = - \sum_{a} π (a ∣ s) \log π (a ∣ s) = E_{a \sim π} [- \log π (a ∣ s)]$, with the convention $0 \log 0 = 0$. It is maximized by the uniform distribution (maximum uncertainty / exploration) and reaches its minimum of 0 for a deterministic distribution (a point mass: full certainty / greedy).
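As a small numerical check of the definition, the sketch below computes entropy in nats and in bits for a uniform, a peaked, and a deterministic distribution. The function name `shannon_entropy` and the example distributions are illustrative choices, not part of the notes.

```python
import math

def shannon_entropy(probs, base=math.e):
    """Entropy of a discrete distribution; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0.0)

# Hypothetical 4-action distributions used only for illustration.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
point_mass = [1.0, 0.0, 0.0, 0.0]

h_uniform_nats = shannon_entropy(uniform)          # equals ln 4
h_uniform_bits = shannon_entropy(uniform, base=2)  # equals 2 bits
h_peaked_nats = shannon_entropy(peaked)            # small but positive
h_point_mass = shannon_entropy(point_mass)         # zero
```

Switching the base only rescales the value: bits are nats divided by $\ln 2$.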

Intuition

Entropy answers: “how spread out is this distribution?”

A uniform policy over $|A|$ actions has the largest entropy, $\log |A|$: every action is equally likely, so you are maximally uncertain and maximally exploratory.
A deterministic / peaked policy (one action has probability $\approx 1$) has entropy $\approx 0$: no surprise, but also no exploration.

In RL we exploit this directly: adding an entropy term to the objective discourages the policy from collapsing too early onto a single action. This keeps the policy stochastic, preserving exploration and preventing premature convergence to a suboptimal deterministic policy. The information-theoretic reading is that $- \log p (x)$ is the self-information (“surprisal”) of outcome $x$; entropy is its expectation.

Mathematical Formulation

Entropy of a policy. For state $s$, $H (π_{θ} (\cdot ∣ s)) = - \sum_{a \in A} π_{θ} (a ∣ s) \log π_{θ} (a ∣ s) .$

where:

$π_{θ} (a ∣ s)$ — probability the policy assigns to action $a$ in state $s$
the sum runs over all actions; for continuous actions it becomes an integral (differential entropy)
for discrete distributions entropy is bounded: $0 \leq H \leq \log |A|$

Entropy regularization (entropy bonus). Policy-gradient methods add an entropy term to encourage exploration. For REINFORCE / Actor-Critic the per-step objective gradient becomes $\nabla_{θ} J (θ) \propto E [\nabla_{θ} \log π_{θ} (a ∣ s) (G_{t} - b (s)) + β \nabla_{θ} H (π_{θ} (\cdot ∣ s))] .$

where:

$G_{t} - b (s)$ — return minus Baseline (the Advantage signal driving the policy update)
$β$ — entropy coefficient (regularization strength); larger $β \Rightarrow$ more exploration
$\nabla_{θ} H$ — pushes $π_{θ}$ toward higher entropy (more uniform)
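The sketch below evaluates a single-sample estimate of this gradient. It assumes a tabular softmax policy whose parameters are the logits themselves. For that case the score function is $\nabla_{z} \log π (a ∣ s) = e_{a} - π$ (with $e_{a}$ the one-hot vector for $a$), and the entropy gradient is $\partial H / \partial z_{k} = - π_{k} (\log π_{k} + H)$. The function names and the numbers are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities of a softmax policy."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def entropy_grad_logits(logits):
    """dH/dz_k = -p_k * (log p_k + H) for a softmax over the logits z."""
    logp = log_softmax(logits)
    p = [math.exp(lp) for lp in logp]
    h = -sum(pk * lpk for pk, lpk in zip(p, logp))
    return [-pk * (lpk + h) for pk, lpk in zip(p, logp)]

def pg_grad_logits(logits, action, advantage, beta):
    """One-sample estimate: advantage * grad log pi(a|s) + beta * grad H."""
    p = [math.exp(lp) for lp in log_softmax(logits)]
    # Score function of a softmax policy: one-hot(action) - p.
    score = [(1.0 if k == action else 0.0) - pk for k, pk in enumerate(p)]
    h_grad = entropy_grad_logits(logits)
    return [advantage * s + beta * g for s, g in zip(score, h_grad)]

# Hypothetical values: 3 actions, sampled action 1, advantage G_t - b(s) = 1.5.
logits = [2.0, 0.5, -1.0]
grad = pg_grad_logits(logits, action=1, advantage=1.5, beta=0.01)
```

Both terms sum to zero across actions, because shifting all logits by a constant leaves the softmax policy unchanged. With a neural-network policy an autodiff framework would normally produce these gradients instead of the closed forms used here.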

Maximum-entropy objective. Soft Actor-Critic (SAC) augments the reward with an entropy term at every step, yielding the Maximum Entropy RL objective $J (π) = \sum_{t} E_{(s_{t}, a_{t}) \sim π} [r (s_{t}, a_{t}) + α H (π (\cdot ∣ s_{t}))] .$

where:

$r (s_{t}, a_{t})$ — environment reward
$α$ — temperature, trading off reward vs. entropy ( $α \to 0$ recovers standard RL)
$H (π (\cdot ∣ s_{t}))$ — policy entropy, here treated as an intrinsic reward for acting stochastically
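To make the objective concrete, the sketch below evaluates the bracketed sum for one sampled trajectory, which is a Monte Carlo sample of the expectation rather than the expectation itself. The trajectory, the per-state action distributions, and the name `max_ent_return` are made up for illustration. The sum is undiscounted, as in the formula above.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def max_ent_return(rewards, action_probs, alpha):
    """Sum over steps of r_t + alpha * H(pi(.|s_t)) along one trajectory."""
    return sum(r + alpha * entropy(p) for r, p in zip(rewards, action_probs))

# Hypothetical 3-step trajectory with a 2-action policy at each visited state.
rewards = [1.0, 0.0, 2.0]
action_probs = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]

standard = max_ent_return(rewards, action_probs, alpha=0.0)  # plain return
soft = max_ent_return(rewards, action_probs, alpha=0.2)      # adds entropy reward
```

With `alpha=0.0` the entropy terms vanish and the value is the ordinary return. With `alpha > 0` the steps where the policy is stochastic contribute extra reward, while the deterministic final step contributes none.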

Key Properties / Variants

Bounds: $0 \leq H (π) \leq \log |A|$ (discrete). Maximum at uniform $π$, minimum at a deterministic $π$.
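Concavity: $H$ is a concave function of the distribution (the action probabilities), so an entropy bonus is a concave regularizer in $π$, which tends to behave well under gradient ascent. It is not necessarily concave in the parameters $θ$.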
Self-information: $H = E [- \log p (x)]$; the term inside the expectation, $- \log p (x)$, is the surprisal of a single outcome.
Relation to cross-entropy / KL: $D_{KL} (p ∥ q) = \underbrace{E_{p} [- \log q]}_{\text{cross-entropy}} - \underbrace{E_{p} [- \log p]}_{H (p)}$, i.e. cross-entropy $=$ entropy $+$ KL divergence. Minimizing KL with fixed $p$ is the same as minimizing cross-entropy.
Temperature link: in a Softmax Policy $π (a ∣ s) \propto \exp (f_{θ} (s, a) / τ)$, raising $τ$ raises entropy (toward uniform); lowering $τ \to 0$ drives entropy to $0$ (toward argmax).
Differential entropy: for a continuous policy (e.g. a Gaussian Policy) entropy depends on the variance; a Gaussian’s entropy is $\frac{1}{2} \log (2 π e σ^{2})$ per dimension (here $π$ is the constant, not the policy). Unlike the discrete case it can be negative.
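Three of the properties above are easy to check numerically: the cross-entropy / KL identity, the temperature link, and the sign of a Gaussian's differential entropy. The sketch below does so with made-up distributions and preferences; all helper names are illustrative.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def softmax_with_temperature(prefs, tau):
    """pi(a) proportional to exp(f(a) / tau), shifted by max for stability."""
    m = max(prefs)
    exps = [math.exp((f - m) / tau) for f in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_entropy(sigma):
    """Differential entropy (nats) of a 1-D Gaussian with std sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

# Cross-entropy = entropy + KL (hypothetical p and q with full support).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
identity_gap = cross_entropy(p, q) - entropy(p) - kl_divergence(p, q)

# Entropy rises with temperature for fixed, non-identical preferences.
prefs = [2.0, 1.0, 0.0]
entropies = [entropy(softmax_with_temperature(prefs, tau)) for tau in (0.1, 1.0, 10.0)]

# A narrow Gaussian has negative differential entropy; a wide one is positive.
narrow = gaussian_entropy(0.1)
wide = gaussian_entropy(1.0)
```

`identity_gap` should be zero up to floating-point rounding, and `entropies` should increase toward, but stay below, $\log 3$.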

Computing an entropy bonus for a softmax policy:

Function: entropy_bonus(logits, beta)
─────────────────────────────────────
  p   ← softmax(logits)                 # action probabilities π(a|s)
  logp ← log_softmax(logits)            # numerically stable log π(a|s)
  H   ← -Σ_a  p[a] * logp[a]            # Shannon entropy of the policy
  return beta * H                       # add to objective (gradient ASCENT on H)
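The pseudocode above can be sketched as runnable Python using only the standard library. This is a plain-float illustration. In an actual training loop the same steps would be written with the autodiff framework's own softmax and log-softmax operations so that gradients flow through `H`.

```python
import math

def log_softmax(logits):
    """Numerically stable log pi(a|s): subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def entropy_bonus(logits, beta):
    logp = log_softmax(logits)                       # log pi(a|s)
    p = [math.exp(lp) for lp in logp]                # pi(a|s)
    h = -sum(pa * lpa for pa, lpa in zip(p, logp))   # Shannon entropy
    return beta * h                                  # term ADDED to the objective

# Illustrative logits: a uniform policy and an extremely peaked one.
bonus_uniform = entropy_bonus([0.0, 0.0, 0.0, 0.0], beta=0.01)    # 0.01 * ln 4
bonus_peaked = entropy_bonus([1000.0, 0.0, 0.0, 0.0], beta=0.01)  # ~0, no overflow
```

Computing `logp` through the shifted log-sum-exp avoids overflow for large logits, and multiplying `p` by `logp` (rather than taking `log(p)` after the fact) avoids `log(0)` when a probability underflows.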

Sign and Coefficient

Entropy is added to the objective for gradient ascent; equivalently, for gradient descent the entropy term is subtracted from the loss (loss $= - J - β H$). Get the sign wrong and you penalize exploration, collapsing the policy. The coefficient ($β$ or temperature $α$) must be tuned or annealed: too high keeps the policy near-uniform so it never exploits; too low gives no exploration benefit. In SAC, $α$ is often learned automatically to hit a target entropy.
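The effect of the sign can be seen in a toy experiment that optimizes only the entropy term, with no reward signal. It assumes a softmax over three logits and uses the closed-form gradient $\partial H / \partial z_{k} = - π_{k} (\log π_{k} + H)$. The step size, coefficient, and starting logits are arbitrary illustrative values.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def entropy_of_logits(logits):
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)

def entropy_grad_logits(logits):
    """dH/dz_k = -p_k * (log p_k + H) for a softmax over the logits z."""
    logp = log_softmax(logits)
    h = -sum(math.exp(lp) * lp for lp in logp)
    return [-math.exp(lp) * (lp + h) for lp in logp]

def run(logits, beta, lr, steps, sign):
    """sign=+1 ascends on beta*H (correct); sign=-1 is the sign error."""
    z = list(logits)
    for _ in range(steps):
        g = entropy_grad_logits(z)
        z = [zk + sign * lr * beta * gk for zk, gk in zip(z, g)]
    return entropy_of_logits(z)

start = [1.0, 0.0, -1.0]
h0 = entropy_of_logits(start)
h_correct = run(start, beta=0.1, lr=1.0, steps=200, sign=+1)  # moves toward ln 3
h_wrong = run(start, beta=0.1, lr=1.0, steps=200, sign=-1)    # policy sharpens
```

Ascending on $β H$ is the same update as descending on the loss $- β H$. Flipping the sign turns the bonus into a penalty, and the entropy falls below its starting value instead of rising.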

Connections

Regularizes / explores in: Softmax Policy, REINFORCE, Actor-Critic, PPO, A3C
Core of: Maximum Entropy RL, Soft Actor-Critic (SAC)
Continuous-action entropy: Gaussian Policy
Alternative to exploring via: Epsilon-Greedy, Optimistic Initial Values
Information-theoretic relatives: cross-entropy, KL divergence

Appears In

Softmax Policy — uses policy entropy as its built-in exploration mechanism
RL-L11 - SAC, Decision Transformer & Diffuser
RL-L09 - Policy Gradient Methods
RL-L10 - Advanced Policy Search
