RL-HW01: Homework 1 — Dynamic Programming

Exam Relevance

Questions 2a-2f are classic exam-style problems on Bellman equations, policy iteration, and value iteration. Question 2f (the linear-system form of the Bellman equations) appears especially frequently.

Question 1: Policy Iteration vs Value Iteration (2.0p)

Q1: In the lab you implemented value iteration and policy iteration. (a) For which algorithm do you expect a single iteration to run faster? (b) Which algorithm do you expect to take fewer total iterations?

Solution

(a) Single iteration faster: Value Iteration

  • In Value Iteration, each iteration does one sweep through all states with a max over actions:
    $V_{k+1}(s) = \max_a \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma V_k(s')\big]$
  • In Policy Iteration, each iteration requires:
    1. Policy evaluation: multiple sweeps until $V^\pi$ converges (many inner iterations!)
    2. Policy improvement: One sweep with argmax
  • So a single PI iteration is much more expensive than a single VI iteration.

(b) Fewer total iterations: Policy Iteration

  • Policy iteration converges in fewer outer iterations because each iteration fully evaluates the current policy before improving it.
  • Value iteration makes small incremental progress each sweep.
  • For finite MDPs, policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of distinct deterministic policies), but in practice it converges much faster.
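To make the trade-off concrete, here is a minimal sketch comparing the two algorithms on an invented two-state, two-action MDP (the transition table, rewards, and discount below are hypothetical, not from the lab). Policy iteration needs only a handful of outer iterations, while value iteration needs many sweeps:

```python
# Toy MDP (illustrative): P[s][a] = (next_state, reward), deterministic transitions.
P = {0: {0: (0, 0.0), 1: (1, 5.0)},
     1: {0: (0, 1.0), 1: (1, 2.0)}}
GAMMA, THETA = 0.9, 1e-8
STATES, ACTIONS = [0, 1], [0, 1]

def backup(V, s, a):
    """One-step lookahead value of taking action a in state s."""
    s2, r = P[s][a]
    return r + GAMMA * V[s2]

def value_iteration():
    V, sweeps = [0.0, 0.0], 0
    while True:
        sweeps += 1
        newV = [max(backup(V, s, a) for a in ACTIONS) for s in STATES]
        if max(abs(newV[s] - V[s]) for s in STATES) < THETA:
            return newV, sweeps
        V = newV

def policy_iteration():
    pi, outer = [0, 0], 0
    while True:
        outer += 1
        # Policy evaluation: sweep until convergence for the CURRENT policy.
        V = [0.0, 0.0]
        while True:
            newV = [backup(V, s, pi[s]) for s in STATES]
            if max(abs(newV[s] - V[s]) for s in STATES) < THETA:
                break
            V = newV
        # Policy improvement: greedy with respect to the evaluated V.
        new_pi = [max(ACTIONS, key=lambda a: backup(V, s, a)) for s in STATES]
        if new_pi == pi:
            return pi, outer
        pi = new_pi

V_star, vi_sweeps = value_iteration()
pi_star, pi_outer = policy_iteration()
print(pi_star, pi_outer, vi_sweeps)  # PI finishes in far fewer outer iterations
```

On this toy problem policy iteration stabilizes after about three outer iterations, while value iteration needs on the order of a hundred sweeps to shrink the residual below the threshold, even though each individual PI iteration is more expensive.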

Question 2a: Value in terms of Q (2.0p)

Q: Write the value $V^\pi(s)$ of a state under policy $\pi$ in terms of $Q^\pi$ and $\pi$. Give both the stochastic and deterministic cases.

Solution

Stochastic Policy

$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$

Deterministic Policy

$V^\pi(s) = Q^\pi(s, \pi(s))$

where $\pi(s)$ is the single action the deterministic policy selects in state $s$.


Question 2b: Q-Value Iteration (1.0p)

Q: Rewrite the Value Iteration update (Eq. 4.10) in terms of Q-values.

Solution

The standard Value Iteration update:

$V_{k+1}(s) = \max_a \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma V_k(s')\big]$

Q-Value Iteration Update

$Q_{k+1}(s,a) = \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma \max_{a'} Q_k(s',a')\big]$

Why This Works

Instead of iterating over $V$, we iterate over $Q$ directly. The $\max_{a'}$ in the target corresponds to acting optimally from the next state: the same Bellman optimality structure, just applied to action values.
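A minimal sketch of this update on an invented two-state, two-action MDP (deterministic transitions; the table and discount are hypothetical, chosen for illustration):

```python
# Toy MDP (illustrative): step[s][a] = (next_state, reward).
step = {0: {0: (0, 0.0), 1: (1, 5.0)},
        1: {0: (0, 1.0), 1: (1, 2.0)}}
GAMMA, THETA = 0.9, 1e-8

def q_value_iteration():
    Q = [[0.0, 0.0], [0.0, 0.0]]
    while True:
        newQ = [[0.0, 0.0], [0.0, 0.0]]
        delta = 0.0
        for s in (0, 1):
            for a in (0, 1):
                s2, r = step[s][a]
                # Target uses max over next actions: Bellman optimality for Q.
                newQ[s][a] = r + GAMMA * max(Q[s2])
                delta = max(delta, abs(newQ[s][a] - Q[s][a]))
        Q = newQ
        if delta < THETA:
            return Q

Q = q_value_iteration()
# Reading off the greedy policy needs no model lookups, only the Q-table.
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in (0, 1)]
print(greedy)
```

Note that at convergence $\max_a Q(s,a)$ reproduces the optimal state values, so Q-value iteration and value iteration agree on the same fixed point.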


Question 2c: Policy Evaluation for Q (1.0p)

Q: Rewrite the policy evaluation update (Eq. 4.4) to compute $Q^\pi$ instead of $V^\pi$. The answer should not contain $V$.

Solution

Policy Evaluation for Q

$Q_{k+1}(s,a) = \sum_{s',r} p(s',r \mid s,a)\,\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q_k(s',a')\Big]$

The inner sum $\sum_{a'} \pi(a' \mid s')\, Q_k(s',a')$ replaces what would be $V_k(s')$, using the relationship $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$.
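A minimal sketch of evaluating $Q^\pi$ for a uniform-random policy on an invented two-state MDP (all numbers are hypothetical, chosen so the fixed point can be checked by hand):

```python
# Evaluate Q^pi for a uniform-random policy on a toy MDP (illustrative numbers).
step = {0: {0: (0, 0.0), 1: (1, 5.0)},
        1: {0: (0, 1.0), 1: (1, 2.0)}}   # step[s][a] = (next_state, reward)
GAMMA, THETA = 0.9, 1e-10
pi = [[0.5, 0.5], [0.5, 0.5]]            # pi[s][a]: uniform random policy

Q = [[0.0, 0.0], [0.0, 0.0]]
while True:
    delta = 0.0
    newQ = [[0.0, 0.0], [0.0, 0.0]]
    for s in (0, 1):
        for a in (0, 1):
            s2, r = step[s][a]
            # Inner sum over a' plays the role of V(s') = sum_a' pi(a'|s') Q(s',a').
            v_next = sum(pi[s2][a2] * Q[s2][a2] for a2 in (0, 1))
            newQ[s][a] = r + GAMMA * v_next
            delta = max(delta, abs(newQ[s][a] - Q[s][a]))
    Q = newQ
    if delta < THETA:
        break

V0 = sum(pi[0][a] * Q[0][a] for a in (0, 1))   # recover V^pi(0) from Q^pi
print(V0)
```

For this particular MDP the fixed point can be solved by hand ($V^\pi(s_0) = 20.5$, $V^\pi(s_1) = 19.5$ under the uniform policy), which gives a quick check that the Q-form update converges to the same answer as ordinary policy evaluation.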


Question 2d: Policy Improvement for Q (1.0p)

Q: Rewrite the Policy Improvement step (p.80) in terms of $Q$ instead of $V$.

Solution

Policy Improvement with Q

$\pi'(s) = \arg\max_a Q(s,a)$

Why Q is Easier

With $V$: $\pi'(s) = \arg\max_a \sum_{s',r} p(s',r \mid s,a)\,[r + \gamma V(s')]$, which requires knowing the model $p(s',r \mid s,a)$. With $Q$: just take the argmax; no model needed! This is why Q-values are preferred for model-free control.
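The model-free improvement step is a one-liner. A minimal sketch (the Q-table values below are illustrative numbers, e.g. a $Q^\pi$ produced by some evaluation step, not from the lab):

```python
def greedy_policy(Q):
    """Policy improvement from a Q-table: argmax per state, no model required."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]

# Illustrative Q^pi table for a 2-state, 2-action MDP.
Q = [[18.45, 22.55],
     [19.45, 19.55]]
print(greedy_policy(Q))  # -> [1, 1]
```

Contrast this with the $V$-based improvement, which would need the transition probabilities and rewards to compute the one-step lookahead before the argmax.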


Question 2e: Policy Evaluation Differences (1.0p)

Q: The policy evaluation step on p.80 is different from the separate algorithm on p.75. What’s the difference and why?

Solution

  • Page 75 (standalone policy evaluation): Iterates until convergence ($\Delta < \theta$ for a small threshold $\theta$). Runs many sweeps to get an accurate $V^\pi$.
  • Page 80 (within policy iteration): May stop after a single sweep (or a few). This is because full convergence is unnecessary — even an approximate evaluation leads to policy improvement.

This is the GPI Idea

Generalized Policy Iteration: you don’t need perfect evaluation before improving. Any amount of evaluation + improvement progress drives you toward optimality. Value iteration is the extreme: one sweep of evaluation per improvement step.
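The GPI spectrum can be sketched directly: policy iteration with evaluation truncated to $k$ sweeps per improvement step still reaches the optimal policy. A minimal sketch on an invented two-state MDP (transition table and discount are hypothetical):

```python
# GPI sketch (toy MDP invented for illustration): truncate policy evaluation
# to k sweeps per improvement step; the greedy policy still becomes optimal.
step = {0: {0: (0, 0.0), 1: (1, 5.0)},
        1: {0: (0, 1.0), 1: (1, 2.0)}}   # step[s][a] = (next_state, reward)
GAMMA = 0.9

def backup(V, s, a):
    s2, r = step[s][a]
    return r + GAMMA * V[s2]

def gpi(k_sweeps, outer_iters=300):
    pi, V = [0, 0], [0.0, 0.0]
    for _ in range(outer_iters):
        for _ in range(k_sweeps):                        # truncated evaluation
            V = [backup(V, s, pi[s]) for s in (0, 1)]
        # Greedy improvement with respect to the (approximate) V.
        pi = [max((0, 1), key=lambda a: backup(V, s, a)) for s in (0, 1)]
    return pi, V

print(gpi(1))   # k=1: one evaluation sweep per improvement, the VI extreme
```

With $k = 1$ this is essentially value iteration; larger $k$ moves toward full policy iteration, yet every setting converges to the same optimal policy.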


Question 2f: Bellman as Linear System (2.0p)

Q: For an MDP with two states, write the Bellman equations as a linear system $A v = b$. What are $A$ and $b$?

Solution

The Bellman equation for a given policy $\pi$ (deterministic for simplicity):

$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, V^\pi(s')$

For two states, expanding (with $v_i = V^\pi(s_i)$, $r_i = r(s_i, \pi(s_i))$, $p_{ij} = p(s_j \mid s_i, \pi(s_i))$):

$v_1 = r_1 + \gamma (p_{11} v_1 + p_{12} v_2)$
$v_2 = r_2 + \gamma (p_{21} v_1 + p_{22} v_2)$

Rearranging (moving all $v$ terms to the left-hand side):

$(1 - \gamma p_{11})\, v_1 - \gamma p_{12}\, v_2 = r_1$
$-\gamma p_{21}\, v_1 + (1 - \gamma p_{22})\, v_2 = r_2$

Linear System

$\begin{pmatrix} 1 - \gamma p_{11} & -\gamma p_{12} \\ -\gamma p_{21} & 1 - \gamma p_{22} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}$

In compact form: $A = I - \gamma P^\pi$ and $b = r^\pi$, so $A v = b$.

This Generalizes

For $n$ states: $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$. This is the closed-form solution to the Bellman equation. It is only practical for small state spaces (matrix inversion is $O(n^3)$). See also LSTD, which exploits this structure with function approximation.
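A minimal sketch of solving the two-state system directly. The policy here is hypothetical (it moves $s_0 \to s_1$ with reward 5 and $s_1 \to s_0$ with reward 1, $\gamma = 0.9$); the 2x2 solve is written out explicitly, though in general one would call a linear solver such as `numpy.linalg.solve`:

```python
# Solve (I - gamma * P_pi) v = r_pi for a hypothetical 2-state policy:
# the policy moves s0 -> s1 (reward 5) and s1 -> s0 (reward 1).
gamma = 0.9
P = [[0.0, 1.0],
     [1.0, 0.0]]          # P_pi[i][j] = p(s_j | s_i, pi(s_i))
r = [5.0, 1.0]            # r_pi[i] = expected reward in s_i under pi

# A = I - gamma * P_pi
A = [[1 - gamma * P[0][0],     -gamma * P[0][1]],
     [   -gamma * P[1][0],  1 - gamma * P[1][1]]]

# Explicit 2x2 solve via Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
v = [(r[0] * A[1][1] - r[1] * A[0][1]) / det,
     (A[0][0] * r[1] - A[1][0] * r[0]) / det]
print(v)
```

The solution satisfies the Bellman equation exactly (up to floating point), with no iterative sweeps at all, which is the whole appeal of the linear-system view for small MDPs.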