RL-HW01: Homework 1 — Dynamic Programming

Exam Relevance

Questions 2a-2f are classic exam-style problems on Bellman equations, policy iteration, and value iteration. Question 2f (the linear-system form of the Bellman equations) appears especially frequently.

Question 1: Policy Iteration vs Value Iteration (2.0p)

Q1: In the lab you implemented value iteration and policy iteration. (a) For which algorithm do you expect a single iteration to run faster? (b) Which algorithm do you expect to take fewer total iterations?

Solution

(a) Single iteration faster: Value Iteration

  • In Value Iteration, each iteration does one sweep through all states with a max over actions:
    $V_{k+1}(s) = \max_a \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma V_k(s')\big]$
  • In Policy Iteration, each iteration requires:
    1. Policy evaluation: multiple sweeps until $V^\pi$ converges (many inner iterations!)
    2. Policy improvement: One sweep with argmax
  • So a single PI iteration is much more expensive than a single VI iteration.

(b) Fewer total iterations: Policy Iteration

  • Policy iteration converges in fewer outer iterations because each iteration fully evaluates the current policy before improving it.
  • Value iteration makes small incremental progress each sweep.
  • For finite MDPs, policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of distinct deterministic policies), but in practice it converges much faster.
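To make the trade-off concrete, here is a minimal sketch comparing the two algorithms on an invented two-state, two-action MDP (the transition table, rewards, and discount below are hypothetical, not from the lab). Policy iteration needs only a handful of outer iterations, while value iteration needs many sweeps:

```python
# Toy MDP (illustrative): P[s][a] = (next_state, reward), deterministic transitions.
P = {0: {0: (0, 0.0), 1: (1, 5.0)},
     1: {0: (0, 1.0), 1: (1, 2.0)}}
GAMMA, THETA = 0.9, 1e-8
STATES, ACTIONS = [0, 1], [0, 1]

def backup(V, s, a):
    """One-step lookahead value of taking action a in state s."""
    s2, r = P[s][a]
    return r + GAMMA * V[s2]

def value_iteration():
    V, sweeps = [0.0, 0.0], 0
    while True:
        sweeps += 1
        newV = [max(backup(V, s, a) for a in ACTIONS) for s in STATES]
        if max(abs(newV[s] - V[s]) for s in STATES) < THETA:
            return newV, sweeps
        V = newV

def policy_iteration():
    pi, outer = [0, 0], 0
    while True:
        outer += 1
        # Policy evaluation: sweep until convergence for the CURRENT policy.
        V = [0.0, 0.0]
        while True:
            newV = [backup(V, s, pi[s]) for s in STATES]
            if max(abs(newV[s] - V[s]) for s in STATES) < THETA:
                break
            V = newV
        # Policy improvement: greedy with respect to the evaluated V.
        new_pi = [max(ACTIONS, key=lambda a: backup(V, s, a)) for s in STATES]
        if new_pi == pi:
            return pi, outer
        pi = new_pi

V_star, vi_sweeps = value_iteration()
pi_star, pi_outer = policy_iteration()
print(pi_star, pi_outer, vi_sweeps)  # PI finishes in far fewer outer iterations
```

On this toy problem policy iteration stabilizes after about three outer iterations, while value iteration needs on the order of a hundred sweeps to shrink the residual below the threshold, even though each individual PI iteration is more expensive.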

Question 2a: Value in terms of Q (2.0p)

Q: Write the value $V^\pi(s)$ of a state under policy $\pi$ in terms of $Q^\pi$ and $\pi$. Give both the stochastic and deterministic cases.

Solution

Stochastic Policy

$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$

Deterministic Policy

$V^\pi(s) = Q^\pi(s, \pi(s))$

where $\pi(s)$ is the single action the deterministic policy selects in state $s$.


Question 2b: Q-Value Iteration (1.0p)

Q: Rewrite the Value Iteration update (Eq. 4.10) in terms of Q-values.

Solution

The standard Value Iteration update:

$V_{k+1}(s) = \max_a \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma V_k(s')\big]$

Q-Value Iteration Update

$Q_{k+1}(s,a) = \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma \max_{a'} Q_k(s',a')\big]$

Why This Works

Instead of iterating over $V$, we iterate over $Q$ directly. The $\max_{a'}$ in the target corresponds to acting optimally from the next state: the same Bellman optimality structure, just applied to action values.
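A minimal sketch of this update on an invented two-state, two-action MDP (deterministic transitions; the table and discount are hypothetical, chosen for illustration):

```python
# Toy MDP (illustrative): step[s][a] = (next_state, reward).
step = {0: {0: (0, 0.0), 1: (1, 5.0)},
        1: {0: (0, 1.0), 1: (1, 2.0)}}
GAMMA, THETA = 0.9, 1e-8

def q_value_iteration():
    Q = [[0.0, 0.0], [0.0, 0.0]]
    while True:
        newQ = [[0.0, 0.0], [0.0, 0.0]]
        delta = 0.0
        for s in (0, 1):
            for a in (0, 1):
                s2, r = step[s][a]
                # Target uses max over next actions: Bellman optimality for Q.
                newQ[s][a] = r + GAMMA * max(Q[s2])
                delta = max(delta, abs(newQ[s][a] - Q[s][a]))
        Q = newQ
        if delta < THETA:
            return Q

Q = q_value_iteration()
# Reading off the greedy policy needs no model lookups, only the Q-table.
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in (0, 1)]
print(greedy)
```

Note that at convergence $\max_a Q(s,a)$ reproduces the optimal state values, so Q-value iteration and value iteration agree on the same fixed point.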


Question 2c: Policy Evaluation for Q (1.0p)

Q: Rewrite the policy evaluation update (Eq. 4.4) to compute $Q^\pi$ instead of $V^\pi$. The answer should not contain $V$.

Solution

Policy Evaluation for Q

$Q_{k+1}(s,a) = \sum_{s',r} p(s',r \mid s,a)\,\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q_k(s',a')\Big]$

The inner sum $\sum_{a'} \pi(a' \mid s')\, Q_k(s',a')$ replaces what would be $V_k(s')$, using the relationship $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$.
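A minimal sketch of evaluating $Q^\pi$ for a uniform-random policy on an invented two-state MDP (all numbers are hypothetical, chosen so the fixed point can be checked by hand):

```python
# Evaluate Q^pi for a uniform-random policy on a toy MDP (illustrative numbers).
step = {0: {0: (0, 0.0), 1: (1, 5.0)},
        1: {0: (0, 1.0), 1: (1, 2.0)}}   # step[s][a] = (next_state, reward)
GAMMA, THETA = 0.9, 1e-10
pi = [[0.5, 0.5], [0.5, 0.5]]            # pi[s][a]: uniform random policy

Q = [[0.0, 0.0], [0.0, 0.0]]
while True:
    delta = 0.0
    newQ = [[0.0, 0.0], [0.0, 0.0]]
    for s in (0, 1):
        for a in (0, 1):
            s2, r = step[s][a]
            # Inner sum over a' plays the role of V(s') = sum_a' pi(a'|s') Q(s',a').
            v_next = sum(pi[s2][a2] * Q[s2][a2] for a2 in (0, 1))
            newQ[s][a] = r + GAMMA * v_next
            delta = max(delta, abs(newQ[s][a] - Q[s][a]))
    Q = newQ
    if delta < THETA:
        break

V0 = sum(pi[0][a] * Q[0][a] for a in (0, 1))   # recover V^pi(0) from Q^pi
print(V0)
```

For this particular MDP the fixed point can be solved by hand ($V^\pi(s_0) = 20.5$, $V^\pi(s_1) = 19.5$ under the uniform policy), which gives a quick check that the Q-form update converges to the same answer as ordinary policy evaluation.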


Question 2d: Policy Improvement for Q (1.0p)

Q: Rewrite the Policy Improvement step (p.80) in terms of $Q$ instead of $V$.

Solution

Policy Improvement with Q

$\pi'(s) = \arg\max_a Q(s,a)$

Why Q is Easier

With $V$: $\pi'(s) = \arg\max_a \sum_{s',r} p(s',r \mid s,a)\,[r + \gamma V(s')]$, which requires knowing the model $p(s',r \mid s,a)$. With $Q$: just take the argmax; no model needed! This is why Q-values are preferred for model-free control.
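The model-free improvement step is a one-liner. A minimal sketch (the Q-table values below are illustrative numbers, e.g. a $Q^\pi$ produced by some evaluation step, not from the lab):

```python
def greedy_policy(Q):
    """Policy improvement from a Q-table: argmax per state, no model required."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]

# Illustrative Q^pi table for a 2-state, 2-action MDP.
Q = [[18.45, 22.55],
     [19.45, 19.55]]
print(greedy_policy(Q))  # -> [1, 1]
```

Contrast this with the $V$-based improvement, which would need the transition probabilities and rewards to compute the one-step lookahead before the argmax.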


Question 2e: Policy Evaluation Differences (1.0p)

Q: The policy evaluation step on p.80 is different from the separate algorithm on p.75. What’s the difference and why?

Solution

  • Page 75 (standalone policy evaluation): Iterates until convergence ($\Delta < \theta$ for a small threshold $\theta$). Runs many sweeps to get an accurate $V^\pi$.
  • Page 80 (within policy iteration): May stop after a single sweep (or a few). This is because full convergence is unnecessary — even an approximate evaluation leads to policy improvement.

This is the GPI Idea

Generalized Policy Iteration: you don’t need perfect evaluation before improving. Any amount of evaluation + improvement progress drives you toward optimality. Value iteration is the extreme: one sweep of evaluation per improvement step.
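The GPI spectrum can be sketched directly: policy iteration with evaluation truncated to $k$ sweeps per improvement step still reaches the optimal policy. A minimal sketch on an invented two-state MDP (transition table and discount are hypothetical):

```python
# GPI sketch (toy MDP invented for illustration): truncate policy evaluation
# to k sweeps per improvement step; the greedy policy still becomes optimal.
step = {0: {0: (0, 0.0), 1: (1, 5.0)},
        1: {0: (0, 1.0), 1: (1, 2.0)}}   # step[s][a] = (next_state, reward)
GAMMA = 0.9

def backup(V, s, a):
    s2, r = step[s][a]
    return r + GAMMA * V[s2]

def gpi(k_sweeps, outer_iters=300):
    pi, V = [0, 0], [0.0, 0.0]
    for _ in range(outer_iters):
        for _ in range(k_sweeps):                        # truncated evaluation
            V = [backup(V, s, pi[s]) for s in (0, 1)]
        # Greedy improvement with respect to the (approximate) V.
        pi = [max((0, 1), key=lambda a: backup(V, s, a)) for s in (0, 1)]
    return pi, V

print(gpi(1))   # k=1: one evaluation sweep per improvement, the VI extreme
```

With $k = 1$ this is essentially value iteration; larger $k$ moves toward full policy iteration, yet every setting converges to the same optimal policy.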


Question 2f: Bellman as Linear System (2.0p)

Q: For an MDP with two states, write the Bellman equations as a linear system $A v = b$. What are $A$ and $b$?

Solution

The Bellman equation for a given policy $\pi$ (deterministic for simplicity):

$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, V^\pi(s')$

For two states, expanding (with $v_i = V^\pi(s_i)$, $r_i = r(s_i, \pi(s_i))$, $p_{ij} = p(s_j \mid s_i, \pi(s_i))$):

$v_1 = r_1 + \gamma (p_{11} v_1 + p_{12} v_2)$
$v_2 = r_2 + \gamma (p_{21} v_1 + p_{22} v_2)$

Rearranging (moving all $v$ terms to the left-hand side):

$(1 - \gamma p_{11})\, v_1 - \gamma p_{12}\, v_2 = r_1$
$-\gamma p_{21}\, v_1 + (1 - \gamma p_{22})\, v_2 = r_2$

Linear System

$\begin{pmatrix} 1 - \gamma p_{11} & -\gamma p_{12} \\ -\gamma p_{21} & 1 - \gamma p_{22} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}$

In compact form: $A = I - \gamma P^\pi$ and $b = r^\pi$, so $A v = b$.

This Generalizes

For $n$ states: $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$. This is the closed-form solution to the Bellman equation. It is only practical for small state spaces (matrix inversion is $O(n^3)$). See also LSTD, which exploits this structure with function approximation.
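A minimal sketch of solving the two-state system directly. The policy here is hypothetical (it moves $s_0 \to s_1$ with reward 5 and $s_1 \to s_0$ with reward 1, $\gamma = 0.9$); the 2x2 solve is written out explicitly, though in general one would call a linear solver such as `numpy.linalg.solve`:

```python
# Solve (I - gamma * P_pi) v = r_pi for a hypothetical 2-state policy:
# the policy moves s0 -> s1 (reward 5) and s1 -> s0 (reward 1).
gamma = 0.9
P = [[0.0, 1.0],
     [1.0, 0.0]]          # P_pi[i][j] = p(s_j | s_i, pi(s_i))
r = [5.0, 1.0]            # r_pi[i] = expected reward in s_i under pi

# A = I - gamma * P_pi
A = [[1 - gamma * P[0][0],     -gamma * P[0][1]],
     [   -gamma * P[1][0],  1 - gamma * P[1][1]]]

# Explicit 2x2 solve via Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
v = [(r[0] * A[1][1] - r[1] * A[0][1]) / det,
     (A[0][0] * r[1] - A[1][0] * r[0]) / det]
print(v)
```

The solution satisfies the Bellman equation exactly (up to floating point), with no iterative sweeps at all, which is the whole appeal of the linear-system view for small MDPs.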