Dynamic Programming
Definition
Dynamic Programming
Dynamic Programming refers to a collection of algorithms that compute optimal policies given a perfect model of the environment (i.e., the MDP dynamics $p(s', r \mid s, a)$). DP uses the Bellman Equation as an update rule to iteratively improve value estimates.
Core Idea
DP turns the Bellman equation into an assignment (update rule). Instead of solving a system of equations, it repeatedly applies the Bellman equation as an update until convergence. “Sweep” through all states, update each one, repeat.
Key Algorithms
Policy Evaluation (Prediction)
Compute $v_\pi$ for a given policy $\pi$ by iterative application of the Bellman expectation equation:
Iterative Policy Evaluation Update
$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right]$$
Repeat until $\max_s |v_{k+1}(s) - v_k(s)| < \theta$ (convergence threshold).
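The update can be sketched in plain Python on a tiny hypothetical MDP. The 2-state transition table `P`, discount `gamma`, and threshold `theta` below are illustrative assumptions, not from the text:

```python
# Hypothetical 2-state, 2-action MDP for illustration only.
# P[s][a] = list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, theta = 0.9, 1e-8

def policy_evaluation(pi, P, gamma, theta):
    """Sweep over all states, applying the Bellman expectation equation
    as an in-place update, until the largest change in any state's
    value falls below theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Evaluate the uniform-random policy.
pi = {s: {0: 0.5, 1: 0.5} for s in P}
V = policy_evaluation(pi, P, gamma, theta)
```

Updating `V[s]` in place (rather than from a frozen copy of the previous sweep) is the "in-place" variant of iterative policy evaluation; it converges to the same fixed point, often faster.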
Policy Iteration
Alternates between evaluation and improvement:
- Policy Evaluation: Compute $v_\pi$ (iteratively until convergence)
- Policy Improvement: $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$ (greedy w.r.t. current value function)
- Repeat until policy is stable ($\pi' = \pi$)
Guaranteed to converge to the optimal policy in a finite number of iterations (for finite MDPs).
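The evaluate–improve loop can be sketched as follows, again on a hypothetical 2-state MDP (the table `P`, `gamma`, and `theta` are illustrative assumptions):

```python
# Hypothetical 2-state, 2-action MDP for illustration only.
# P[s][a] = list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, theta = 0.9, 1e-8

def policy_iteration(P, gamma, theta):
    def q(V, s, a):  # one-step lookahead value of action a in state s
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    pi = {s: 0 for s in P}  # start from an arbitrary deterministic policy
    while True:
        # 1. Policy evaluation: Bellman expectation update to convergence.
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v_new = q(V, s, pi[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to V.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q(V, s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:  # the greedy policy equals the current policy: done
            return pi, V

pi_star, V_star = policy_iteration(P, gamma, theta)
```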
Value Iteration
Combines evaluation and improvement into a single update:
Value Iteration Update
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right]$$
Equivalent to: one sweep of policy evaluation + greedy policy improvement. Converges to $v_*$.
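A sketch of the combined update, using the same hypothetical 2-state MDP as an assumed example (`P`, `gamma`, `theta` are illustrative, not from the text):

```python
# Hypothetical 2-state, 2-action MDP for illustration only.
# P[s][a] = list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, theta = 0.9, 1e-8

def value_iteration(P, gamma, theta):
    def q(V, s, a):  # one-step lookahead value of action a in state s
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality equation as an update: max over actions
            # replaces the separate evaluation and improvement steps.
            v_new = max(q(V, s, a) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a greedy policy from the (near-)optimal values.
    pi = {s: max(P[s], key=lambda a: q(V, s, a)) for s in P}
    return V, pi

V_star, pi_star = value_iteration(P, gamma, theta)
```

Note the only structural difference from policy evaluation: the expectation over the policy's actions is replaced by a `max` over actions.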
Generalized Policy Iteration (GPI)
GPI
Any interaction between policy evaluation and policy improvement, regardless of granularity. Value iteration, policy iteration, and most RL algorithms are instances of GPI.
Evaluation ←→ Improvement
     ↓               ↓
  v ≈ v_π      π ≈ greedy(v)
        ↘        ↙
        v* and π*
Limitations
- Requires full model: Must know $p(s', r \mid s, a)$ for all transitions
- Curse of dimensionality: Sweeps over all states — infeasible for large/continuous state spaces
- Full-width backups: Each update considers all possible next states
Why DP Matters Despite Limitations
DP provides the theoretical foundation for all of RL. MC and TD methods are essentially doing DP-like updates but with samples instead of expectations. Understanding DP is key to understanding everything else.
Connections
- Solves: Bellman Equation, Bellman Optimality Equation
- Requires: Markov Decision Process model
- Generalized by: Generalized Policy Iteration
- Sample-based alternatives: Monte Carlo Methods, Temporal Difference Learning
- With approximation: Function Approximation, Semi-Gradient Methods