RL-HW01: Homework 1 — Dynamic Programming
Exam Relevance
Questions 2a-2f are classic exam-style problems on Bellman equations, policy iteration, and value iteration. Especially 2f (linear system form of Bellman equations) appears frequently.
Question 1: Policy Iteration vs Value Iteration (2.0p)
Q1: In the lab you implemented value iteration and policy iteration. (a) For which algorithm do you expect a single iteration to run faster? (b) Which algorithm do you expect to take fewer total iterations?
Solution
(a) Single iteration faster: Value Iteration
- In Value Iteration, each iteration does one sweep through all states with a max operation:
  $$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_k(s') \bigr]$$
- In Policy Iteration, each iteration requires:
- Policy evaluation: multiple sweeps until $v_\pi$ converges (many inner iterations!)
- Policy improvement: One sweep with argmax
- So a single PI iteration is much more expensive than a single VI iteration.
(b) Fewer total iterations: Policy Iteration
- Policy iteration converges in fewer outer iterations because each iteration fully evaluates the current policy before improving it.
- Value iteration makes small incremental progress each sweep.
- For finite MDPs, policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of possible deterministic policies), but in practice it converges much faster.
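As a sanity check, both effects can be observed on a tiny MDP. This is a sketch only: the 2-state, 2-action transition and reward numbers below are made up for illustration, not taken from the lab.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                  # R[s, a] expected immediate reward
              [0.0, 2.0]])
gamma, theta, nS = 0.9, 1e-8, 2

# Value iteration: one cheap max-sweep per iteration.
v, vi_iters = np.zeros(nS), 0
while True:
    v_new = (R + gamma * P @ v).max(axis=1)   # sweep with max over actions
    vi_iters += 1
    if np.abs(v_new - v).max() < theta:
        break
    v = v_new

# Policy iteration: each outer iteration runs a full iterative policy
# evaluation (many inner sweeps) plus one argmax improvement sweep.
pi = np.zeros(nS, dtype=int)
pi_outer, pi_inner = 0, 0
while True:
    v = np.zeros(nS)
    while True:                                # policy evaluation (inner loop)
        v_new = R[np.arange(nS), pi] + gamma * P[np.arange(nS), pi] @ v
        pi_inner += 1
        if np.abs(v_new - v).max() < theta:
            break
        v = v_new
    pi_new = (R + gamma * P @ v).argmax(axis=1)  # policy improvement
    pi_outer += 1
    if np.all(pi_new == pi):
        break
    pi = pi_new

# (a) a single PI iteration hides many inner evaluation sweeps, so it is slower;
# (b) PI needs far fewer outer iterations than VI.
print(vi_iters, pi_outer, pi_inner)
```

On these numbers VI needs on the order of a hundred sweeps to reach the 1e-8 tolerance, while PI stabilizes after a handful of outer iterations, each paying for many inner sweeps.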
Question 2a: Value in terms of Q (2.0p)
Q: Write the value $v_\pi(s)$ of a state under policy $\pi$ in terms of $\pi$ and $q_\pi$. Give both the stochastic and the deterministic case.
Solution
Stochastic Policy
$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$
Deterministic Policy
$$v_\pi(s) = q_\pi(s, \pi(s))$$
where $\pi(s)$ is the single action the deterministic policy selects in state $s$.
Question 2b: Q-Value Iteration (1.0p)
Q: Rewrite the Value Iteration update (Eq. 4.10) in terms of Q-values.
Solution
The standard Value Iteration update:
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_k(s') \bigr]$$
Q-Value Iteration Update
$$q_{k+1}(s, a) = \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \max_{a'} q_k(s', a') \Bigr]$$
Why This Works
Instead of iterating over $v$, we iterate over $q$ directly. The $\max_{a'}$ in the target corresponds to acting optimally from the next state: the same Bellman optimality structure, just applied to action values.
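The update can be sketched in a few lines of NumPy (the function name and the tabular `P[s, a, s']` / `R[s, a]` model arrays are my own convention, not from the homework):

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular Q-value iteration: P[s, a, s'] transitions, R[s, a] rewards.

    Using the expected reward R[s, a] instead of summing over r is an
    equivalent form, since the transition probabilities sum to 1.
    """
    nS, nA = R.shape
    q = np.zeros((nS, nA))
    while True:
        # q_{k+1}(s,a) = R(s,a) + gamma * sum_{s'} p(s'|s,a) max_{a'} q_k(s',a')
        q_new = R + gamma * P @ q.max(axis=1)
        if np.abs(q_new - q).max() < theta:
            return q_new
        q = q_new
```

Here `q.max(axis=1)` plays the role of the optimal next-state value, so after convergence the greedy state values can be read off as `q.max(axis=1)`.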
Question 2c: Policy Evaluation for Q (1.0p)
Q: Rewrite the policy evaluation update (Eq. 4.4) to compute $q_\pi$ instead of $v_\pi$. The answer should not contain $v$.
Solution
Policy Evaluation for Q
$$q_{k+1}(s, a) = \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_k(s', a') \Bigr]$$
The inner sum replaces what would be $v_\pi(s')$, using the relationship $v_\pi(s') = \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')$.
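A minimal NumPy sketch of this backup (helper name and array conventions are mine; `pi[s, a]` holds the probabilities $\pi(a \mid s)$):

```python
import numpy as np

def q_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for q_pi. pi[s, a] = pi(a|s)."""
    nS, nA = R.shape
    q = np.zeros((nS, nA))
    while True:
        # sum_{a'} pi(a'|s') q(s', a') replaces v_pi(s') in the backup
        v_like = (pi * q).sum(axis=1)
        q_new = R + gamma * P @ v_like
        if np.abs(q_new - q).max() < theta:
            return q_new
        q = q_new
```

For a deterministic policy, `pi` is a one-hot matrix and `q[s, pi(s)]` recovers $v_\pi(s)$.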
Question 2d: Policy Improvement for Q (1.0p)
Q: Rewrite the Policy Improvement step (p.80) in terms of $q_\pi$ instead of $v_\pi$.
Solution
Policy Improvement with Q
$$\pi'(s) = \arg\max_a q_\pi(s, a)$$
Why Q is Easier
With $v$: $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr]$, which requires knowing the model $p(s', r \mid s, a)$. With $q$: just take the argmax over actions, no model needed! This is why Q-values are preferred for model-free control.
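The contrast can be made concrete in a few lines (helper names are my own; a sketch assuming tabular `P[s, a, s']` and `R[s, a]` arrays):

```python
import numpy as np

def improve_from_v(P, R, v, gamma=0.9):
    # Needs the model P[s, a, s'] and R[s, a] for the one-step lookahead.
    return (R + gamma * P @ v).argmax(axis=1)

def improve_from_q(q):
    # No model: the greedy policy is just an argmax over the action axis.
    return q.argmax(axis=1)
```

`improve_from_q` touches nothing but the Q-table, which is exactly what makes it usable when the model is unknown.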
Question 2e: Policy Evaluation Differences (1.0p)
Q: The policy evaluation step on p.80 is different from the separate algorithm on p.75. What’s the difference and why?
Solution
- Page 75 (standalone policy evaluation): Iterates until convergence ($\Delta < \theta$ for a small threshold $\theta$). Runs many sweeps to get an accurate $v_\pi$.
- Page 80 (within policy iteration): May stop after a single sweep (or a few). This is because full convergence is unnecessary — even an approximate evaluation leads to policy improvement.
This is the GPI Idea
Generalized Policy Iteration: you don’t need perfect evaluation before improving. Any amount of evaluation + improvement progress drives you toward optimality. Value iteration is the extreme: one sweep of evaluation per improvement step.
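This spectrum can be sketched as "modified policy iteration" with $k$ evaluation sweeps per improvement step ($k = 1$ behaves like value iteration, large $k$ approaches full policy iteration). The function below and its policy-stability stopping rule are my own simplification, not the book's pseudocode:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, k=3, max_iters=200):
    """GPI sketch: k truncated evaluation sweeps, then greedy improvement."""
    nS, nA = R.shape
    v, pi = np.zeros(nS), np.zeros(nS, dtype=int)
    for _ in range(max_iters):
        for _ in range(k):   # truncated policy evaluation (only k sweeps)
            v = R[np.arange(nS), pi] + gamma * P[np.arange(nS), pi] @ v
        pi_new = (R + gamma * P @ v).argmax(axis=1)   # greedy improvement
        if np.all(pi_new == pi):
            break
        pi = pi_new
    return pi, v
```

Even with only a few sweeps of evaluation per step, the alternation still reaches the optimal policy on small problems, which is the GPI point.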
Question 2f: Bellman as Linear System (2.0p)
Q: For an MDP with two states, write the Bellman equations as a linear system . What are and ?
Solution
The Bellman Equation for a given policy (deterministic for simplicity), writing $r_i = r(s_i, \pi(s_i))$ and $p_{ij} = p(s_j \mid s_i, \pi(s_i))$:
$$v_\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, v_\pi(s')$$
For two states, expanding:
$$v_\pi(s_1) = r_1 + \gamma \bigl[ p_{11} v_\pi(s_1) + p_{12} v_\pi(s_2) \bigr]$$
$$v_\pi(s_2) = r_2 + \gamma \bigl[ p_{21} v_\pi(s_1) + p_{22} v_\pi(s_2) \bigr]$$
Rearranging (moving all $v_\pi$ terms to the left):
$$(1 - \gamma p_{11})\, v_\pi(s_1) - \gamma p_{12}\, v_\pi(s_2) = r_1$$
$$-\gamma p_{21}\, v_\pi(s_1) + (1 - \gamma p_{22})\, v_\pi(s_2) = r_2$$
Linear System
$$\begin{pmatrix} 1 - \gamma p_{11} & -\gamma p_{12} \\ -\gamma p_{21} & 1 - \gamma p_{22} \end{pmatrix} \begin{pmatrix} v_\pi(s_1) \\ v_\pi(s_2) \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}$$
In compact form: $A = I - \gamma P_\pi$ and $b = r_\pi$.
This Generalizes
For $n$ states: $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$. This is the closed-form solution to the Bellman equation. Only practical for small state spaces (matrix inversion is $O(n^3)$). See also LSTD, which exploits this structure with function approximation.
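A direct solve for the two-state case can be sketched as follows (the policy-induced transition and reward numbers are hypothetical):

```python
import numpy as np

# Hypothetical policy-induced dynamics for a 2-state MDP.
P_pi = np.array([[0.8, 0.2],
                 [0.0, 1.0]])      # P_pi[s, s'] under the policy's actions
r_pi = np.array([1.0, 2.0])        # r_pi[s] expected reward under the policy
gamma = 0.9

A = np.eye(2) - gamma * P_pi       # A = I - gamma * P_pi
b = r_pi                           # b = r_pi
v_pi = np.linalg.solve(A, b)       # exact v_pi in one linear solve
```

`np.linalg.solve` is preferable to forming the inverse explicitly; both are $O(n^3)$, but the solve is more numerically stable.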