Bellman Equation

Definition

The Bellman equation expresses the value of a state (or state-action pair) as the immediate reward plus the discounted value of the successor state. It captures the recursive structure of the Value Function: the value of a state depends on the values of its possible successor states.

Bellman Equation for $V^\pi$ (State-Value)

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, V^\pi(s')\bigr]$$

where:

  • $V^\pi(s)$ — value of state $s$ under Policy $\pi$
  • $\pi(a \mid s)$ — probability of taking action $a$ in state $s$
  • $p(s', r \mid s, a)$ — MDP dynamics
  • $r$ — immediate reward
  • $\gamma$ — Discount Factor
  • $V^\pi(s')$ — value of successor state $s'$

Reading the Equation

“The value of a state = weighted average over all actions I might take (weighted by my policy) of: [the immediate reward I get + discounted value of where I end up].”

It’s an expectation over the next step, then recursion handles the rest. This is the key insight — you don’t need to look all the way to the end. Just one step ahead + the value of where you land.

Alternative compact form:

$$V^\pi(s) = \mathbb{E}_\pi\bigl[R_{t+1} + \gamma\, V^\pi(S_{t+1}) \mid S_t = s\bigr]$$
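This one-step backup is easy to write down concretely. Below is a minimal sketch for a hypothetical two-state, two-action MDP (all transition probabilities, rewards, and policy weights are made-up numbers for illustration):

```python
import numpy as np

gamma = 0.9  # Discount Factor

# Hypothetical MDP: P[a][s, s'] = transition probability, R[a][s] = expected reward.
P = {0: np.array([[0.8, 0.2], [0.1, 0.9]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.5, 2.0])}
pi = np.array([[0.6, 0.4],   # pi[s, a] = probability of action a in state s
               [0.3, 0.7]])

def bellman_backup(V):
    """One application of the Bellman expectation operator:
    (T_pi V)(s) = sum_a pi(a|s) * (R(s, a) + gamma * sum_s' P(s'|s, a) V(s'))."""
    return sum(pi[:, a] * (R[a] + gamma * P[a] @ V) for a in P)

# One step ahead from V = 0: just the expected immediate reward under pi.
V_new = bellman_backup(np.zeros(2))  # [0.8, 1.4]
```

Note how the function only looks one step ahead: the recursion is hidden in whatever estimate of `V` you pass in.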

Bellman Equation for $Q^\pi$ (Action-Value)

$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\Bigr]$$

or equivalently:

$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, V^\pi(s')\bigr]$$

The relationship between $V^\pi$ and $Q^\pi$:

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$$
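The identity $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$ is just a policy-weighted average, which a couple of lines verify (the policy and Q-values here are made-up numbers):

```python
import numpy as np

pi = np.array([[0.6, 0.4],   # pi[s, a], illustrative policy
               [0.3, 0.7]])
Q = np.array([[1.0, 2.0],    # Q[s, a], illustrative action-values
              [0.5, 3.0]])

# V(s) = sum_a pi(a|s) * Q(s, a): average the action-values under the policy.
V = (pi * Q).sum(axis=1)
# V[0] = 0.6*1.0 + 0.4*2.0 = 1.4
# V[1] = 0.3*0.5 + 0.7*3.0 = 2.25
```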

Bellman Optimality Equations

Bellman Optimality Equation for $V^*$

$$V^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, V^*(s')\bigr]$$

The optimal value of a state is achieved by always picking the best action.

Bellman Optimality Equation for $Q^*$

$$Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} Q^*(s', a')\bigr]$$

The optimal action-value: immediate reward + the best you can do from the next state.

Optimality = Max Instead of Average

Compare the regular Bellman equation (average over policy) vs optimality equation (max over actions). Regular: “What do I expect under my current policy?” Optimal: “What’s the best I could ever do?”
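The difference is visible even in a single state. With two actions and made-up action-values:

```python
import numpy as np

pi_s = np.array([0.6, 0.4])  # pi(a|s) for one state, illustrative
q_s = np.array([1.0, 2.0])   # Q(s, a) for the same state, illustrative

v_pi = pi_s @ q_s    # Bellman expectation: average under the policy -> 1.4
v_star = q_s.max()   # Bellman optimality: best single action -> 2.0

# The max backup can never be smaller than the policy-weighted average.
assert v_star >= v_pi
```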

Backup Diagrams

Bellman equation for $V^\pi$:

         (s)          ← state node (open circle)
        / | \
       a₁ a₂ a₃      ← action nodes (solid dots), weighted by π(a|s)
      /|  |  |\
    s' s' s' s' s'    ← next states, weighted by p(s',r|s,a)

White circles = state nodes, black dots = action nodes. Each branch represents one possible action and one possible next-state transition.

Bellman optimality for $V^*$:

         (s)
        / | \
       a₁ a₂ a₃      ← MAX over actions (arc across branches)
      /|  |  |\
    s' s' s' s' s'    ← weighted by p(s',r|s,a)

The “max” replaces the weighted average over actions.

Solving Bellman Equations

| Method | Approach | Requires Model? |
| --- | --- | --- |
| Dynamic Programming | Iterative solution of Bellman equations | Yes |
| Monte Carlo Methods | Sample-based estimation of expectations | No |
| Temporal Difference Learning | Bootstrapped sample-based estimation | No |

The Bellman equation is a system of linear equations (for $V^\pi$) — solvable directly for small state spaces, iteratively for large ones.
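For instance, when the model is known, the linear system $(I - \gamma P^\pi)\, v^\pi = r^\pi$ can be solved in closed form. A sketch for a hypothetical two-state MDP (all numbers made up):

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a][s, s'] = transition probability, R[a][s] = expected reward.
P = {0: np.array([[0.8, 0.2], [0.1, 0.9]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.5, 2.0])}
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])  # pi[s, a]

# Collapse to policy-induced dynamics P_pi[s, s'] and rewards r_pi[s].
P_pi = sum(pi[:, [a]] * P[a] for a in P)
r_pi = sum(pi[:, a] * R[a] for a in P)

# v = r_pi + gamma * P_pi @ v  <=>  (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The solution satisfies its own Bellman equation exactly.
assert np.allclose(v, r_pi + gamma * P_pi @ v)
```

Direct solution is $O(|S|^3)$, which is why large state spaces fall back to iterative methods.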

Key Properties

  • Linearity: The Bellman equation for $V^\pi$ is linear in $V^\pi$ ($v^\pi = r^\pi + \gamma P^\pi v^\pi$ in matrix form)
  • Contraction: The Bellman operator is a $\gamma$-contraction mapping → unique fixed point, iterative methods converge
  • Foundation: Every RL algorithm is essentially approximating or solving some form of Bellman equation
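The contraction property is what makes plain fixed-point iteration work: each application of the operator shrinks the distance to the fixed point by a factor of $\gamma$. A sketch of value iteration on a hypothetical two-state MDP (numbers made up):

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a][s, s'] = transition probability, R[a][s] = expected reward.
P = {0: np.array([[0.8, 0.2], [0.1, 0.9]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.5, 2.0])}

def bellman_optimality(V):
    """(T* V)(s) = max_a (R(s, a) + gamma * sum_s' P(s'|s, a) V(s'))."""
    return np.max([R[a] + gamma * P[a] @ V for a in P], axis=0)

V = np.zeros(2)
for _ in range(1000):
    V_next = bellman_optimality(V)
    if np.max(np.abs(V_next - V)) < 1e-10:  # gap shrinks by ~gamma per sweep
        break
    V = V_next

# V is (numerically) the unique fixed point of the optimality operator.
assert np.allclose(V, bellman_optimality(V), atol=1e-8)
```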

Common Exam Mistake

Don’t mix up the Bellman equation (for a given policy $\pi$) with the Bellman optimality equation (for the optimal policy $\pi^*$). The first averages over actions with $\pi(a \mid s)$; the second takes $\max_a$.

Connections

Appears In