Policy Iteration

Policy Iteration

Policy iteration finds the optimal policy by alternating two steps until convergence:

  1. Policy Evaluation: Compute for the current policy (iteratively until convergence)
  2. Policy Improvement: Make the policy greedy w.r.t. the value function:
Algorithm: Policy Iteration
───────────────────────────
Initialize V(s) and π(s) arbitrarily for all s
 
1. Policy Evaluation:
   Loop until convergence:
     For each s ∈ S:
       V(s) ← Σ_{s',r} p(s',r|s,π(s)) [r + γV(s')]
 
2. Policy Improvement:
   policy_stable ← true
   For each s ∈ S:
     old_action ← π(s)
     π(s) ← argmax_a Σ_{s',r} p(s',r|s,a) [r + γV(s')]
     If old_action ≠ π(s): policy_stable ← false
   
   If policy_stable: stop (found π*)
   Else: go to step 1

Policy Iteration vs Value Iteration

Policy iteration does full evaluation (many sweeps until converges) then one improvement step. Value iteration does one sweep of evaluation combined with improvement. Policy iteration typically needs fewer total iterations but each iteration is more expensive.

Appears In