RL-CA01: Coding Assignment 1 — Dynamic Programming

Overview

Implementation of Policy Iteration and Value Iteration algorithms on a gridworld environment.

Files:

  • DP_lab.ipynb — Main notebook
  • gridworld.py — Gridworld environment
  • dp_autograde.py — Autograding tests

What You Implement

  1. Policy Evaluation: Iteratively compute the state-value function V^π for a given policy π
  2. Policy Improvement: Compute the greedy policy w.r.t. the current value function
  3. Policy Iteration: Alternate evaluation + improvement until the policy is stable
  4. Value Iteration: A single update combining both steps in one sweep
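As a rough sketch of how these pieces fit together — on a hypothetical two-state MDP rather than the assignment's gridworld, using the Gym-style `env.P[s][a] = [(prob, next_state, reward, done)]` transition format — policy iteration might look like:

```python
import numpy as np

# Hypothetical 2-state MDP: action 1 in state 0 reaches terminal state 1
# with reward 1; everything else loops with reward 0.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
nS, nA, gamma, theta = 2, 2, 0.9, 1e-8

def policy_eval(pi):
    """Iterative policy evaluation for a stochastic policy pi[s, a]."""
    V = np.zeros(nS)
    while True:
        delta = 0.0
        for s in range(nS):
            v = sum(pi[s, a] * sum(p * (r + gamma * V[ns])
                                   for p, ns, r, done in P[s][a])
                    for a in range(nA))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def greedy(V):
    """Greedy policy improvement w.r.t. the value estimate V."""
    pi = np.zeros((nS, nA))
    for s in range(nS):
        q = [sum(p * (r + gamma * V[ns]) for p, ns, r, done in P[s][a])
             for a in range(nA)]
        pi[s, int(np.argmax(q))] = 1.0
    return pi

pi = np.full((nS, nA), 1.0 / nA)  # start from the uniform random policy
while True:
    V = policy_eval(pi)
    new_pi = greedy(V)
    if np.array_equal(new_pi, pi):  # policy stable -> optimal
        break
    pi = new_pi
```

Here the outer loop terminates when improvement no longer changes the policy, which is the stopping criterion step 3 refers to.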

Key Implementation Details

Policy Evaluation

# Core loop: sweep all states until the largest update falls below theta
delta = float('inf')
while delta > theta:
    delta = 0.0  # reset each sweep, otherwise the loop never terminates
    for s in states:
        v = V[s]
        V[s] = sum(pi[s, a] * sum(p * (r + gamma * V[s_next])
                                  for p, s_next, r, done in env.P[s][a])
                   for a in actions)
        delta = max(delta, abs(v - V[s]))
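Worth noting: for a fixed policy the Bellman expectation equation is linear in V, so on small state spaces the same fixed point can be obtained with a direct linear solve instead of iterating. A sketch with a made-up two-state transition matrix `P_pi` and reward vector `r_pi` (both hypothetical, not from the assignment):

```python
import numpy as np

# Policy evaluation as a linear solve: V = r_pi + gamma * P_pi @ V
# => (I - gamma * P_pi) V = r_pi
gamma = 0.9
P_pi = np.array([[0.5, 0.5],   # from state 0: stay or reach state 1
                 [0.0, 1.0]])  # state 1 is absorbing
r_pi = np.array([0.5, 0.0])    # expected one-step reward per state
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

The iterative sweep above converges to exactly this solution; the direct solve is just impractical once the state space is large.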

Value Iteration

# Core: combine evaluation + improvement in one step
V[s] = max(sum(p * (r + gamma * V[s_next])
               for p, s_next, r, done in env.P[s][a])
           for a in actions)
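A complete value-iteration loop built around this update might look as follows — a self-contained sketch on a made-up three-state chain (not the assignment's gridworld), with a `P` dict mimicking the Gym-style `env.P` format:

```python
import numpy as np

# Hypothetical 3-state chain: action 1 moves right, reaching terminal
# state 2 yields reward 1; action 0 stays put with reward 0.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
nS, nA, gamma, theta = 3, 2, 0.9, 1e-8

V = np.zeros(nS)
while True:
    delta = 0.0
    for s in range(nS):
        v = max(sum(p * (r + gamma * V[ns]) for p, ns, r, done in P[s][a])
                for a in range(nA))
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < theta:
        break

# Once V has converged, read off the greedy (optimal) policy.
policy = [max(range(nA),
              key=lambda a: sum(p * (r + gamma * V[ns])
                                for p, ns, r, done in P[s][a]))
          for s in range(nS)]
```

Unlike policy iteration, there is no explicit policy during the loop; the greedy policy is extracted once at the end.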

Key Takeaways

  • Policy iteration converges in fewer outer iterations but each is expensive (full evaluation)
  • Value iteration is simpler (one sweep per iteration) and often faster in total wall-clock time
  • Both require a full model of the environment (transition probabilities and rewards) — a limitation addressed by Monte Carlo Methods and Temporal Difference Learning

See RL-HW01 - Homework 1 for theoretical questions about these algorithms.