Value Iteration

Value iteration combines one sweep of policy evaluation with policy improvement into a single update using the Bellman Optimality Equation:

Algorithm: Value Iteration
──────────────────────────
Initialize V(s) arbitrarily (V(terminal) = 0)
 
Loop until convergence (Δ < θ, for a small threshold θ > 0):
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← max_a Σ_{s',r} p(s',r|s,a) [r + γV(s')]
    Δ ← max(Δ, |v - V(s)|)
 
Output: π(s) = argmax_a Σ_{s',r} p(s',r|s,a) [r + γV(s')]
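
As a concrete sketch (not from the source), here is the loop above in Python with NumPy. The tabular model format p[s][a], a list of (prob, next_state, reward, done) tuples, is an assumption made for this example; adapt it to however your MDP's dynamics are stored.

Sketch: Value Iteration in Python
─────────────────────────────────
import numpy as np

def value_iteration(p, n_states, n_actions, gamma=0.9, theta=1e-8):
    # p[s][a]: list of (prob, next_state, reward, done) tuples (assumed format)
    V = np.zeros(n_states)  # arbitrary init; `done` keeps terminal values at 0
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Bellman optimality backup: V(s) ← max_a Σ p(s',r|s,a)[r + γV(s')]
            V[s] = max(
                sum(prob * (r + gamma * V[s2] * (not done))
                    for prob, s2, r, done in p[s][a])
                for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # stop once no state moved by more than θ
            break

    # Output: greedy policy extracted from the converged V in one sweep
    def q(s, a):
        return sum(prob * (r + gamma * V[s2] * (not done))
                   for prob, s2, r, done in p[s][a])
    pi = [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]
    return V, pi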

Truncated Policy Iteration

Value iteration is policy iteration with the evaluation step truncated to a single sweep. It converges because each sweep applies the Bellman optimality operator T*, which is a γ-contraction in the max norm:

  ‖T*V₁ − T*V₂‖∞ ≤ γ ‖V₁ − V₂‖∞

so by the Banach fixed-point theorem repeated sweeps converge to its unique fixed point, V*.

  • Converges to V* from any initialization
  • Often cheaper than policy iteration in total compute, since it never runs policy evaluation all the way to convergence between improvements
  • The greedy policy can be extracted at the end from V* with a single improvement step
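
A quick numeric check of the contraction claim above, on a hypothetical 2-state, 2-action MDP (the numbers are made up for illustration): two runs started from very different initial values close their max-norm gap by a factor of at most γ per sweep, so they converge to the same V*.

Sketch: Contraction check
─────────────────────────
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a, s, s'] transition probs, R[a, s] expected rewards
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def bellman_opt(V):
    # T*V: for each state, max over actions of expected one-step return
    return np.max(R + gamma * (P @ V), axis=0)

V1, V2 = np.zeros(2), np.array([100.0, -50.0])  # two very different inits
for sweep in range(5):
    gap_before = np.max(np.abs(V1 - V2))
    V1, V2 = bellman_opt(V1), bellman_opt(V2)
    gap_after = np.max(np.abs(V1 - V2))
    print(f"sweep {sweep}: gap {gap_before:.4f} -> {gap_after:.4f} "
          f"(ratio {gap_after / gap_before:.3f} <= {gamma})")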
