Entropy
Definition
(Shannon) Entropy
The entropy of a discrete probability distribution measures its uncertainty: the expected amount of “surprise” (in nats if using $\ln$, or bits if using $\log_2$) when sampling from it. For a policy $\pi(\cdot \mid s)$ over actions, the entropy is $H(\pi(\cdot \mid s)) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$. It is maximized by the uniform distribution (maximum uncertainty / exploration) and minimized ($= 0$) by a deterministic distribution (a point mass: full certainty / greedy).
Intuition
Entropy answers: “how spread out is this distribution?”
- A uniform policy over $|\mathcal{A}|$ actions has the largest entropy, $\log |\mathcal{A}|$: every action is equally likely, so you are maximally uncertain and maximally exploratory.
- A deterministic / peaked policy (one action has probability $\approx 1$) has entropy $\approx 0$: no surprise, but also no exploration.
In RL we exploit this directly: adding an entropy term to the objective discourages the policy from collapsing too early onto a single action. This keeps the policy stochastic, preserving exploration and preventing premature convergence to a suboptimal deterministic policy. The information-theoretic reading is that $-\log p(x)$ is the self-information (“surprisal”) of outcome $x$; entropy is its expectation.
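The contrast between spread-out and peaked distributions can be checked numerically. The following is a minimal sketch using only the standard library; the helper name `entropy` and the three four-action example distributions are illustrative assumptions, not part of the original note.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy of a discrete distribution; terms with p == 0 contribute 0."""
    return sum(-p * math.log(p, base) for p in probs if p > 0.0)

# Hypothetical four-action policies, chosen only for illustration.
uniform = [0.25, 0.25, 0.25, 0.25]      # every action equally likely
peaked = [0.97, 0.01, 0.01, 0.01]       # nearly greedy
deterministic = [1.0, 0.0, 0.0, 0.0]    # point mass on one action

for name, dist in [("uniform", uniform), ("peaked", peaked), ("deterministic", deterministic)]:
    nats = entropy(dist)
    bits = entropy(dist, base=2)
    print(f"{name:>13}: {nats:.4f} nats, {bits:.4f} bits")
```

The uniform case should come out at the upper bound (the log of the number of actions), the point mass at zero, and the peaked policy in between.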
Mathematical Formulation
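Entropy of a policy. For state $s$,
$$H(\pi(\cdot \mid s)) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)$$
where:
- $\pi(a \mid s)$: probability the policy assigns to action $a$ in state $s$
- the sum runs over all actions; for continuous actions it becomes an integral (differential entropy)
- $H \ge 0$ for discrete distributions, with the convention $0 \log 0 = 0$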
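As a quick worked instance of the definition, take a hypothetical two-action policy with probabilities $0.9$ and $0.1$ (numbers chosen only for illustration) and use the natural logarithm:

```latex
\begin{aligned}
H(\pi(\cdot \mid s)) &= -\left(0.9 \ln 0.9 + 0.1 \ln 0.1\right) \\
&\approx -\left(0.9 \times (-0.1054) + 0.1 \times (-2.3026)\right) \\
&\approx 0.0948 + 0.2303 \approx 0.325 \text{ nats}
\end{aligned}
```

This is roughly half of the two-action maximum $\ln 2 \approx 0.693$ nats, which the uniform policy $(0.5, 0.5)$ would attain.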
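Entropy regularization (entropy bonus). Policy-gradient methods add an entropy term to encourage exploration. For REINFORCE / Actor-Critic the per-step objective gradient becomes
$$\nabla_\theta J(\theta) \approx \big(G_t - b(s_t)\big) \nabla_\theta \log \pi_\theta(a_t \mid s_t) + \beta \nabla_\theta H(\pi_\theta(\cdot \mid s_t))$$
where:
- $G_t - b(s_t)$: return minus Baseline (the Advantage signal driving the policy update)
- $\beta$: entropy coefficient (regularization strength); larger $\beta$ $\Rightarrow$ more exploration
- $\beta \nabla_\theta H(\pi_\theta(\cdot \mid s_t))$: pushes $\pi_\theta$ toward higher entropy (more uniform)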
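To make the two gradient terms concrete, here is a hedged sketch for a softmax policy parameterized directly by its logits. That parameterization, the function names, and the example numbers are assumptions for illustration. It relies on two standard softmax identities: the score is `1[k == a] - p_k`, and the entropy gradient is `-p_k * (log p_k + H)`.

```python
import math

def softmax(logits):
    m = max(logits)                                  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return sum(-p * math.log(p) for p in probs if p > 0.0)

def entropy_grad_wrt_logits(logits):
    """dH/dz_k = -p_k * (log p_k + H) for a softmax over logits z."""
    p = softmax(logits)
    h = entropy(p)
    return [-pk * (math.log(pk) + h) if pk > 0.0 else 0.0 for pk in p]

def regularized_policy_gradient(logits, action, advantage, beta):
    """Per-step gradient w.r.t. logits: advantage * grad log pi(a|s) + beta * grad H."""
    p = softmax(logits)
    # Softmax score function: d log pi(a) / d z_k = 1[k == a] - p_k.
    score = [(1.0 if k == action else 0.0) - pk for k, pk in enumerate(p)]
    h_grad = entropy_grad_wrt_logits(logits)
    return [advantage * s + beta * g for s, g in zip(score, h_grad)]

# Illustrative values only: three actions, action 0 sampled, positive advantage.
logits = [2.0, 0.5, -1.0]
grad = regularized_policy_gradient(logits, action=0, advantage=1.5, beta=0.01)
print(grad)
```

In this sketch the entropy term always has a non-positive component on the most probable action, so it pulls that logit down while the advantage term pushes the sampled action up; the entropy coefficient `beta` sets the balance between the two.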
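Maximum-entropy objective. Soft Actor-Critic (SAC) augments the reward with an entropy term at every step, yielding the Maximum Entropy RL objective
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \big]$$
where:
- $r(s_t, a_t)$: environment reward
- $\alpha$: temperature, trading off reward vs. entropy ($\alpha \to 0$ recovers standard RL)
- $H(\pi(\cdot \mid s_t))$: policy entropy, here treated as an intrinsic reward for acting stochastically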
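The objective is an expectation over trajectories; a single sampled trajectory gives a Monte Carlo estimate of it. The sketch below computes that estimate. The function name, the optional discount `gamma` (not part of the objective as written above), and the example numbers are assumptions for illustration.

```python
def max_entropy_return(rewards, entropies, alpha, gamma=1.0):
    """Sum over t of gamma**t * (r_t + alpha * H_t) for one sampled trajectory."""
    if len(rewards) != len(entropies):
        raise ValueError("rewards and entropies must have the same length")
    return sum(
        (gamma ** t) * (r + alpha * h)
        for t, (r, h) in enumerate(zip(rewards, entropies))
    )

# Illustrative trajectory: per-step rewards and per-state policy entropies (nats).
rewards = [1.0, 0.0, 2.0]
entropies = [0.69, 0.50, 0.10]

plain = max_entropy_return(rewards, entropies, alpha=0.0)   # standard return
soft = max_entropy_return(rewards, entropies, alpha=0.2)    # entropy-augmented return
print(plain, soft)
```

With `alpha=0.0` the entropy term vanishes and the value reduces to the ordinary return, matching the statement that a vanishing temperature recovers standard RL.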
Key Properties / Variants
- Bounds: $0 \le H(\pi(\cdot \mid s)) \le \log |\mathcal{A}|$ (discrete). Maximum at uniform $\pi$, minimum at a deterministic $\pi$.
- Concavity: $H$ is a concave function of the distribution, so an entropy bonus is a concave regularizer (well-behaved for gradient ascent).
- Self-information: $H(p) = \mathbb{E}_{x \sim p}[-\log p(x)]$; the term inside the expectation, $-\log p(x)$, is the surprisal of a single outcome.
- Relation to cross-entropy / KL: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, i.e. cross-entropy $=$ entropy $+$ KL divergence. Minimizing KL over $q$ with $p$ fixed is the same as minimizing cross-entropy.
- Temperature link: in a Softmax Policy $\pi(a \mid s) \propto \exp(z_a / \tau)$ with logits $z_a$, raising $\tau$ raises entropy (toward uniform); lowering $\tau$ drives entropy to $0$ (toward argmax).
- Differential entropy: for a continuous policy (e.g. a Gaussian Policy) entropy depends on the variance; a Gaussian’s entropy is $\tfrac{1}{2} \log(2 \pi e \sigma^2)$ per dimension. Unlike the discrete case it can be negative.
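Several of the properties above can be spot-checked numerically. This is a small sketch, not a proof: the helper names and example distributions are assumptions, and each check covers only the specific numbers shown.

```python
import math

def softmax(logits, tau=1.0):
    scaled = [z / tau for z in logits]
    m = max(scaled)                                  # stabilize the exponentials
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return sum(-pk * math.log(pk) for pk in p if pk > 0.0)

def cross_entropy(p, q):
    return sum(-pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)

def kl(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# Cross-entropy = entropy + KL: the gap should be zero up to rounding.
identity_gap = cross_entropy(p, q) - (entropy(p) + kl(p, q))

# Concavity: the entropy of a mixture is at least the mixture of the entropies.
mix = [0.5 * pk + 0.5 * qk for pk, qk in zip(p, q)]
concavity_margin = entropy(mix) - (0.5 * entropy(p) + 0.5 * entropy(q))

# Temperature: higher tau gives a flatter softmax and higher entropy.
logits = [2.0, 1.0, 0.0]
entropy_by_tau = {tau: entropy(softmax(logits, tau)) for tau in (0.1, 1.0, 10.0)}

# Differential entropy goes negative for a narrow Gaussian.
narrow = gaussian_entropy(0.1)

print(identity_gap, concavity_margin, entropy_by_tau, narrow)
```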
Computing an entropy bonus for a softmax policy:
Function: entropy_bonus(logits, beta)
─────────────────────────────────────
p ← softmax(logits) # action probabilities π(a|s)
logp ← log_softmax(logits) # numerically stable log π(a|s)
H ← -Σ_a p[a] * logp[a] # Shannon entropy of the policy
return beta * H # add to objective (gradient ASCENT on H)
Sign and Coefficient
Entropy is added to the objective for gradient ascent (or, equivalently, subtracted from a loss for gradient descent). Get the sign wrong and you penalize exploration, collapsing the policy. The coefficient ($\beta$ or temperature $\alpha$) must be tuned/annealed: too high keeps the policy near-uniform and it never exploits; too low gives no exploration benefit. In SAC, $\alpha$ is often learned automatically to hit a target entropy.
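The `entropy_bonus` pseudocode and the sign convention can be sketched together in plain Python. The `policy_loss` wrapper, its arguments, and the example logits are hypothetical additions for illustration; a real implementation would normally use an autodiff framework rather than hand-written lists.

```python
import math

def log_softmax(logits):
    """Numerically stable log pi(a|s) via the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def entropy_bonus(logits, beta):
    """beta * H(pi), to be ADDED to an objective that is maximized."""
    logp = log_softmax(logits)
    h = sum(-math.exp(lp) * lp for lp in logp)       # H = -sum_a p[a] * log p[a]
    return beta * h

def policy_loss(logits, action, advantage, beta):
    """Loss to MINIMIZE: the negated objective, so the bonus enters with a minus sign."""
    logp = log_softmax(logits)
    surrogate = advantage * logp[action]             # policy-gradient surrogate term
    return -(surrogate + entropy_bonus(logits, beta))

# With zero advantage only the entropy term matters: the flatter policy has lower loss.
flat_loss = policy_loss([0.0, 0.0, 0.0], action=0, advantage=0.0, beta=0.1)
peaked_loss = policy_loss([5.0, 0.0, 0.0], action=0, advantage=0.0, beta=0.1)
print(flat_loss, peaked_loss)
```

If the bonus were added to the loss instead of subtracted, the ordering of the two printed values would flip and descent would favor the peaked policy, which is the collapse described above.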
Connections
- Regularizes / explores in: Softmax Policy, REINFORCE, Actor-Critic, PPO, A3C
- Core of: Maximum Entropy RL, Soft Actor-Critic (SAC)
- Continuous-action entropy: Gaussian Policy
- Alternative to exploration via: Epsilon-Greedy, Optimistic Initial Values
- Information-theoretic relatives: cross-entropy, KL divergence
Appears In
- Softmax Policy — uses policy entropy as its built-in exploration mechanism
- RL-L11 - SAC, Decision Transformer & Diffuser
- RL-L09 - Policy Gradient Methods
- RL-L10 - Advanced Policy Search