Bellman Equation

Definition

The Bellman equation expresses the value of a state (or state-action pair) as the immediate reward plus the discounted value of the successor state. It captures the recursive structure of the Value Function: the value of a state depends on the values of its possible successor states.

Bellman Equation for $V^\pi$ (State-Value)

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, V^\pi(s')\bigr]$$

where:

  • $V^\pi(s)$ — value of state $s$ under Policy $\pi$
  • $\pi(a \mid s)$ — probability of taking action $a$ in state $s$
  • $p(s', r \mid s, a)$ — MDP dynamics
  • $r$ — immediate reward
  • $\gamma$ — Discount Factor
  • $V^\pi(s')$ — value of successor state $s'$

Reading the Equation

“The value of a state = weighted average over all actions I might take (weighted by my policy) of: [the immediate reward I get + discounted value of where I end up].”

It’s an expectation over the next step, then recursion handles the rest. This is the key insight — you don’t need to look all the way to the end. Just one step ahead + the value of where you land.

Alternative compact form:

$$V^\pi(s) = \mathbb{E}_\pi\bigl[R_{t+1} + \gamma\, V^\pi(S_{t+1}) \mid S_t = s\bigr]$$
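This one-step backup is easy to write down concretely. Below is a minimal sketch for a hypothetical two-state, two-action MDP (all transition probabilities, rewards, and policy weights are made-up numbers for illustration):

```python
import numpy as np

gamma = 0.9  # Discount Factor

# Hypothetical MDP: P[a][s, s'] = transition probability, R[a][s] = expected reward.
P = {0: np.array([[0.8, 0.2], [0.1, 0.9]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.5, 2.0])}
pi = np.array([[0.6, 0.4],   # pi[s, a] = probability of action a in state s
               [0.3, 0.7]])

def bellman_backup(V):
    """One application of the Bellman expectation operator:
    (T_pi V)(s) = sum_a pi(a|s) * (R(s, a) + gamma * sum_s' P(s'|s, a) V(s'))."""
    return sum(pi[:, a] * (R[a] + gamma * P[a] @ V) for a in P)

# One step ahead from V = 0: just the expected immediate reward under pi.
V_new = bellman_backup(np.zeros(2))  # [0.8, 1.4]
```

Note how the function only looks one step ahead: the recursion is hidden in whatever estimate of `V` you pass in.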

Bellman Equation for $Q^\pi$ (Action-Value)

$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\Bigr]$$

or equivalently:

$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, V^\pi(s')\bigr]$$

The relationship between $V^\pi$ and $Q^\pi$:

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$$
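The identity $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$ is just a policy-weighted average, which a couple of lines verify (the policy and Q-values here are made-up numbers):

```python
import numpy as np

pi = np.array([[0.6, 0.4],   # pi[s, a], illustrative policy
               [0.3, 0.7]])
Q = np.array([[1.0, 2.0],    # Q[s, a], illustrative action-values
              [0.5, 3.0]])

# V(s) = sum_a pi(a|s) * Q(s, a): average the action-values under the policy.
V = (pi * Q).sum(axis=1)
# V[0] = 0.6*1.0 + 0.4*2.0 = 1.4
# V[1] = 0.3*0.5 + 0.7*3.0 = 2.25
```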

Bellman Optimality Equations

Bellman Optimality Equation for $V^*$

$$V^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, V^*(s')\bigr]$$

The optimal value of a state is achieved by always picking the best action.

Bellman Optimality Equation for $Q^*$

$$Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} Q^*(s', a')\bigr]$$

The optimal action-value: immediate reward + the best you can do from the next state.

Optimality = Max Instead of Average

Compare the regular Bellman equation (average over policy) vs optimality equation (max over actions). Regular: “What do I expect under my current policy?” Optimal: “What’s the best I could ever do?”
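The difference is visible even in a single state. With two actions and made-up action-values:

```python
import numpy as np

pi_s = np.array([0.6, 0.4])  # pi(a|s) for one state, illustrative
q_s = np.array([1.0, 2.0])   # Q(s, a) for the same state, illustrative

v_pi = pi_s @ q_s    # Bellman expectation: average under the policy -> 1.4
v_star = q_s.max()   # Bellman optimality: best single action -> 2.0

# The max backup can never be smaller than the policy-weighted average.
assert v_star >= v_pi
```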

Backup Diagrams

Bellman equation for $V^\pi$:

         (s)          ← state node (open circle)
        / | \
       a₁ a₂ a₃      ← action nodes (solid dots), weighted by π(a|s)
      /|  |  |\
    s' s' s' s' s'    ← next states, weighted by p(s',r|s,a)

White circles = state nodes, black dots = action nodes. Each branch represents one possible action and one possible next-state transition.

Bellman optimality for $V^*$:

         (s)
        / | \
       a₁ a₂ a₃      ← MAX over actions (arc across branches)
      /|  |  |\
    s' s' s' s' s'    ← weighted by p(s',r|s,a)

The “max” replaces the weighted average over actions.

Solving Bellman Equations

| Method | Approach | Requires Model? |
| --- | --- | --- |
| Dynamic Programming | Iterative solution of Bellman equations | Yes |
| Monte Carlo Methods | Sample-based estimation of expectations | No |
| Temporal Difference Learning | Bootstrapped sample-based estimation | No |

The Bellman equation is a system of linear equations (for $V^\pi$) — solvable directly for small state spaces, iteratively for large ones.
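For instance, when the model is known, the linear system $(I - \gamma P^\pi)\, v^\pi = r^\pi$ can be solved in closed form. A sketch for a hypothetical two-state MDP (all numbers made up):

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a][s, s'] = transition probability, R[a][s] = expected reward.
P = {0: np.array([[0.8, 0.2], [0.1, 0.9]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.5, 2.0])}
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])  # pi[s, a]

# Collapse to policy-induced dynamics P_pi[s, s'] and rewards r_pi[s].
P_pi = sum(pi[:, [a]] * P[a] for a in P)
r_pi = sum(pi[:, a] * R[a] for a in P)

# v = r_pi + gamma * P_pi @ v  <=>  (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The solution satisfies its own Bellman equation exactly.
assert np.allclose(v, r_pi + gamma * P_pi @ v)
```

Direct solution is $O(|S|^3)$, which is why large state spaces fall back to iterative methods.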

Key Properties

  • Linearity: The Bellman equation for $V^\pi$ is linear in $V^\pi$ ($v^\pi = r^\pi + \gamma P^\pi v^\pi$ in matrix form)
  • Contraction: The Bellman operator is a $\gamma$-contraction mapping → unique fixed point, iterative methods converge
  • Foundation: Every RL algorithm is essentially approximating or solving some form of Bellman equation
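The contraction property is what makes plain fixed-point iteration work: each application of the operator shrinks the distance to the fixed point by a factor of $\gamma$. A sketch of value iteration on a hypothetical two-state MDP (numbers made up):

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a][s, s'] = transition probability, R[a][s] = expected reward.
P = {0: np.array([[0.8, 0.2], [0.1, 0.9]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.5, 2.0])}

def bellman_optimality(V):
    """(T* V)(s) = max_a (R(s, a) + gamma * sum_s' P(s'|s, a) V(s'))."""
    return np.max([R[a] + gamma * P[a] @ V for a in P], axis=0)

V = np.zeros(2)
for _ in range(1000):
    V_next = bellman_optimality(V)
    if np.max(np.abs(V_next - V)) < 1e-10:  # gap shrinks by ~gamma per sweep
        break
    V = V_next

# V is (numerically) the unique fixed point of the optimality operator.
assert np.allclose(V, bellman_optimality(V), atol=1e-8)
```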

Common Exam Mistake

Don’t mix up the Bellman equation (for a given policy $\pi$) with the Bellman optimality equation (for the optimal policy $\pi^*$). The first averages over actions with $\pi(a \mid s)$; the second takes $\max_a$.

Connections

Appears In