Bellman Equation
Definition
The Bellman equation expresses the value of a state (or state-action pair) as the immediate reward plus the discounted value of the successor state. It captures the recursive structure of the Value Function: the value of a state depends on the values of its possible successor states.
Bellman Equation for $v_\pi$ (State-Value)

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$$
where:
- $v_\pi(s)$ — value of state $s$ under Policy $\pi$
- $\pi(a \mid s)$ — probability of taking action $a$ in state $s$
- $p(s', r \mid s, a)$ — MDP dynamics: probability of next state $s'$ and reward $r$ given $s$ and $a$
- $r$ — immediate reward
- $\gamma$ — Discount Factor
- $v_\pi(s')$ — value of successor state $s'$
Reading the Equation
“The value of a state = weighted average over all actions I might take (weighted by my policy) of: [the immediate reward I get + discounted value of where I end up].”
It’s an expectation over the next step, then recursion handles the rest. This is the key insight — you don’t need to look all the way to the end. Just one step ahead + the value of where you land.
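That one-step expectation can be sketched directly in code. This is an illustrative sketch, not a library API: `policy`, `dynamics`, and `V` are assumed to be plain lookup tables for a tiny made-up MDP.

```python
# One-step Bellman backup for v_pi(s), written as nested sums over the
# policy pi(a|s) and the dynamics p(s',r|s,a). All names are illustrative.

def bellman_backup(s, policy, dynamics, V, gamma=0.9):
    """v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) * [r + gamma * v_pi(s')]."""
    value = 0.0
    for a, pi_a in policy[s].items():                  # weight by pi(a|s)
        for (s_next, r), p in dynamics[(s, a)].items():  # weight by p(s',r|s,a)
            value += pi_a * p * (r + gamma * V[s_next])
    return value

# Tiny example: from 'A', 'stay' yields reward 0 and stays in 'A';
# 'go' yields reward 1 and moves to 'B'. Numbers are made up.
policy = {'A': {'stay': 0.5, 'go': 0.5}}
dynamics = {('A', 'stay'): {('A', 0.0): 1.0},
            ('A', 'go'):   {('B', 1.0): 1.0}}
V = {'A': 0.0, 'B': 2.0}

v_A = bellman_backup('A', policy, dynamics, V)  # 0.5*(0 + 0.9*0) + 0.5*(1 + 0.9*2) = 1.4
```

Note that only one step of lookahead is needed: the stored values `V[s_next]` stand in for everything beyond the next transition.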
Alternative compact form:

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right]$$
Bellman Equation for $q_\pi$ (Action-Value)

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right]$$
or equivalently:

$$q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right]$$

The relationship between $v_\pi$ and $q_\pi$:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$
Bellman Optimality Equations
Bellman Optimality Equation for $v_*$

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$$

The optimal value of a state is achieved by always picking the best action.
Bellman Optimality Equation for $q_*$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$

The optimal action-value: immediate reward + the best you can do from the next state.
Optimality = Max Instead of Average
Compare the regular Bellman equation (average over policy) vs optimality equation (max over actions). Regular: “What do I expect under my current policy?” Optimal: “What’s the best I could ever do?”
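The average-vs-max distinction fits in two lines of code. A minimal sketch with made-up action-values `q` and policy probabilities `pi` for a single state:

```python
# Expected backup (Bellman) vs max backup (Bellman optimality) for one state.
# q(s, .) and pi(.|s) are illustrative numbers, not from any real task.

q = {'left': 1.0, 'right': 3.0}     # action-values q_pi(s, a)
pi = {'left': 0.5, 'right': 0.5}    # policy probabilities pi(a|s)

v_pi = sum(pi[a] * q[a] for a in q)  # weighted average under the policy -> 2.0
v_star = max(q.values())             # best action -> 3.0
```

The first line is exactly $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$; the second swaps the sum for a max.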
Backup Diagrams
Bellman equation for $v_\pi$:

```
        (s)             ← state node (open circle)
      /  |  \
    a₁   a₂   a₃        ← action nodes (solid dots), weighted by π(a|s)
   /|    |    |\
  s' s'  s'  s' s'      ← next states, weighted by p(s',r|s,a)
```
White circles = state nodes, black dots = action nodes. Each branch represents one possible action and one possible next-state transition.
Bellman optimality for $v_*$:

```
        (s)
      /  |  \
    a₁   a₂   a₃        ← MAX over actions (arc across branches)
   /|    |    |\
  s' s'  s'  s' s'      ← weighted by p(s',r|s,a)
```
The “max” replaces the weighted average over actions.
Solving Bellman Equations
| Method | Approach | Requires Model? |
|---|---|---|
| Dynamic Programming | Iterative solution of Bellman equations | Yes |
| Monte Carlo Methods | Sample-based estimation of expectations | No |
| Temporal Difference Learning | Bootstrapped sample-based estimation | No |
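The model-free rows of the table can be illustrated by TD(0), which treats a single observed transition as a sample of the Bellman expectation. A sketch with an illustrative update function and made-up values:

```python
# TD(0) sketch: nudge V(s) toward the bootstrapped one-step Bellman target
# using a single sampled transition (s, r, s'). No model p(s',r|s,a) needed.
# Function name and values are illustrative.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    target = r + gamma * V[s_next]      # sampled Bellman target
    V[s] += alpha * (target - V[s])     # move V(s) a step toward it
    return V

V = {'A': 0.0, 'B': 2.0}
V = td0_update(V, 'A', 1.0, 'B')        # target = 1 + 0.9*2 = 2.8; V['A'] -> 0.28
```

Averaging many such sampled targets approximates the expectation that Dynamic Programming computes exactly from the model.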
The Bellman equation for a fixed policy $\pi$ is a system of linear equations (one per state) — solvable directly for small state spaces, iteratively for large ones.
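For a small state space, the direct solve is one call to a linear solver. A sketch assuming the policy-induced transition matrix $P_\pi$ and reward vector $r_\pi$ are known (the 2-state numbers are made up):

```python
import numpy as np

# For a fixed policy, v_pi satisfies v = r_pi + gamma * P_pi @ v,
# i.e. the linear system (I - gamma * P_pi) v = r_pi.

gamma = 0.9
P_pi = np.array([[0.5, 0.5],    # row s: probabilities of next state s' under pi
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])     # expected immediate reward in each state

v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

Here state 1 is absorbing with zero reward, so its value is 0, and state 0 works out to $1 / (1 - 0.45) \approx 1.82$. This direct solve is $O(|S|^3)$, which is why large state spaces fall back to iterative methods.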
Key Properties
- Linearity: The Bellman equation for $v_\pi$ is linear in $v_\pi$ (it’s $v_\pi = r_\pi + \gamma P_\pi v_\pi$ in matrix form)
- Contraction: The Bellman operator is a $\gamma$-contraction mapping → unique fixed point, so iterative methods converge
- Foundation: Every RL algorithm is essentially approximating or solving some form of Bellman equation
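The contraction property can be checked numerically: applying the Bellman operator repeatedly from any starting point converges to the same fixed point the direct solve gives. A sketch with the same illustrative 2-state MDP values:

```python
import numpy as np

# The Bellman operator T(v) = r_pi + gamma * P_pi @ v is a gamma-contraction
# in the max norm, so iterating it converges to the unique fixed point
# regardless of the starting v. Numbers are illustrative.

gamma = 0.9
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])

v = np.zeros(2)                     # arbitrary initial guess
for _ in range(200):
    v = r_pi + gamma * P_pi @ v     # one application of the Bellman operator
```

Each application shrinks the max-norm error by at least a factor of $\gamma$, so after $n$ sweeps the error is at most $\gamma^n$ times the initial error — the convergence argument behind iterative policy evaluation.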
Common Exam Mistake
Don’t mix up the Bellman equation (for a given policy $\pi$) with the Bellman optimality equation (for the optimal policy $\pi_*$). The first uses $\sum_a \pi(a \mid s)$ (an average over actions), the second uses $\max_a$.
Connections
- Defined on: Markov Decision Process
- Solved by: Dynamic Programming, Policy Iteration, Value Iteration
- Approximated by: Temporal Difference Learning, Monte Carlo Methods
- Extended by: Bellman Error, Bellman Optimality Equation