Dynamic Programming
Definition
Dynamic Programming
Dynamic Programming refers to a collection of algorithms that compute optimal policies given a perfect model of the environment (i.e., the MDP dynamics $p(s', r \mid s, a)$). DP uses the Bellman Equation as an update rule to iteratively improve value estimates.
Core Idea
DP turns the Bellman equation into an assignment (update rule). Instead of solving a system of equations, it repeatedly applies the Bellman equation as an update until convergence. “Sweep” through all states, update each one, repeat.
Key Algorithms
Policy Evaluation (Prediction)
Compute $v_\pi$ for a given policy $\pi$ by iterative application of the Bellman expectation equation:
Iterative Policy Evaluation Update
$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right]$$
Repeat until $\max_s |v_{k+1}(s) - v_k(s)| < \theta$ (convergence threshold).
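The update can be sketched in plain Python on a tiny hypothetical MDP. The 2-state transition table `P`, discount `gamma`, and threshold `theta` below are illustrative assumptions, not from the text:

```python
# Hypothetical 2-state, 2-action MDP for illustration only.
# P[s][a] = list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, theta = 0.9, 1e-8

def policy_evaluation(pi, P, gamma, theta):
    """Sweep over all states, applying the Bellman expectation equation
    as an in-place update, until the largest change in any state's
    value falls below theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Evaluate the uniform-random policy.
pi = {s: {0: 0.5, 1: 0.5} for s in P}
V = policy_evaluation(pi, P, gamma, theta)
```

Updating `V[s]` in place (rather than from a frozen copy of the previous sweep) is the "in-place" variant of iterative policy evaluation; it converges to the same fixed point, often faster.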
Policy Iteration
Alternates between evaluation and improvement:
- Policy Evaluation: Compute $v_\pi$ (iteratively until convergence)
- Policy Improvement: $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$ (greedy w.r.t. current value function)
- Repeat until policy is stable ($\pi' = \pi$)
Guaranteed to converge to the optimal policy in a finite number of iterations (for finite MDPs).
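The evaluate–improve loop can be sketched as follows, again on a hypothetical 2-state MDP (the table `P`, `gamma`, and `theta` are illustrative assumptions):

```python
# Hypothetical 2-state, 2-action MDP for illustration only.
# P[s][a] = list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, theta = 0.9, 1e-8

def policy_iteration(P, gamma, theta):
    def q(V, s, a):  # one-step lookahead value of action a in state s
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    pi = {s: 0 for s in P}  # start from an arbitrary deterministic policy
    while True:
        # 1. Policy evaluation: Bellman expectation update to convergence.
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v_new = q(V, s, pi[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to V.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q(V, s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:  # the greedy policy equals the current policy: done
            return pi, V

pi_star, V_star = policy_iteration(P, gamma, theta)
```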
Value Iteration
Combines evaluation and improvement into a single update:
Value Iteration Update
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right]$$
Equivalent to: one sweep of policy evaluation + greedy policy improvement. Converges to $v_*$.
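A sketch of the combined update, using the same hypothetical 2-state MDP as an assumed example (`P`, `gamma`, `theta` are illustrative, not from the text):

```python
# Hypothetical 2-state, 2-action MDP for illustration only.
# P[s][a] = list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, theta = 0.9, 1e-8

def value_iteration(P, gamma, theta):
    def q(V, s, a):  # one-step lookahead value of action a in state s
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality equation as an update: max over actions
            # replaces the separate evaluation and improvement steps.
            v_new = max(q(V, s, a) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a greedy policy from the (near-)optimal values.
    pi = {s: max(P[s], key=lambda a: q(V, s, a)) for s in P}
    return V, pi

V_star, pi_star = value_iteration(P, gamma, theta)
```

Note the only structural difference from policy evaluation: the expectation over the policy's actions is replaced by a `max` over actions.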
Generalized Policy Iteration (GPI)
GPI
Any interaction between policy evaluation and policy improvement, regardless of granularity. Value iteration, policy iteration, and most RL algorithms are instances of GPI.
Evaluation ←→ Improvement
     ↓               ↓
  v ≈ v_π      π ≈ greedy(v)
        ↘        ↙
        v* and π*
Limitations
- Requires full model: Must know $p(s', r \mid s, a)$ for all transitions
- Curse of dimensionality: Sweeps over all states — infeasible for large/continuous state spaces
- Full-width backups: Each update considers all possible next states
Why DP Matters Despite Limitations
DP provides the theoretical foundation for all of RL. MC and TD methods are essentially doing DP-like updates but with samples instead of expectations. Understanding DP is key to understanding everything else.
Connections
- Solves: Bellman Equation, Bellman Optimality Equation
- Requires: Markov Decision Process model
- Generalized by: Generalized Policy Iteration
- Sample-based alternatives: Monte Carlo Methods, Temporal Difference Learning
- With approximation: Function Approximation, Semi-Gradient Methods