RL-CA02: Coding Assignment 2 — Monte Carlo Methods

Overview

Implementation of Monte Carlo prediction and control methods on a Blackjack environment.

Files:

  • MC_lab.ipynb — Main notebook
  • blackjack.py — Blackjack environment
  • mc_autograde.py — Autograding tests

What You Implement

  1. On-policy MC prediction: First-visit MC to estimate the state-value function V_π(s) and the action-value function Q_π(s, a)
  2. On-policy MC control: With ε-greedy policy improvement
  3. Off-policy MC prediction: Using ordinary importance sampling

Key Implementation Details

First-Visit MC Prediction

from collections import defaultdict

V = defaultdict(float)  # value estimates
N = defaultdict(int)    # first-visit counts
gamma = 1.0             # discount factor (Blackjack episodes are typically undiscounted)

# For each episode:
# 1. Generate an episode following pi, as a list of (s, a, r) tuples
# 2. Walk backwards through the episode, accumulating the return G
# 3. On each state's first visit, update V(s) with an incremental average
G = 0
for t in reversed(range(len(episode))):
    s, a, r = episode[t]
    G = gamma * G + r
    if s not in [x[0] for x in episode[:t]]:  # update only on first visit
        N[s] += 1
        V[s] += (G - V[s]) / N[s]  # incremental average
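MC control (item 2 above) extends this prediction loop by estimating Q instead of V and acting ε-greedily with respect to the current estimates. A minimal runnable sketch, where `epsilon_greedy`, `mc_control`, and the episode generator are illustrative assumptions, not the assignment's actual API:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

def mc_control(generate_episode, n_actions, episodes=1000, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with epsilon-greedy improvement."""
    Q = defaultdict(float)
    N = defaultdict(int)
    for _ in range(episodes):
        policy = lambda s: epsilon_greedy(Q, s, n_actions, epsilon)
        episode = generate_episode(policy)  # list of (s, a, r) tuples
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:  # first visit
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```

Because the policy is re-derived from Q each episode, improvement and evaluation are interleaved, which is exactly the generalized policy iteration pattern from lecture.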

Off-policy with Ordinary IS

Key: the importance sampling ratio for an episode generated by behavior policy b, evaluated under target policy π:

    ρ_{t:T−1} = ∏_{k=t}^{T−1} π(A_k | S_k) / b(A_k | S_k)

Incremental update (from HW2 Q1a): for ordinary IS, keep a running average of the weighted returns ρ·G:

    N(s) ← N(s) + 1
    V(s) ← V(s) + (ρ·G − V(s)) / N(s)
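Putting the ratio and the running average together, a minimal sketch of off-policy first-visit MC prediction with ordinary IS follows. The function names and the `(s, a, r)` episode format are illustrative, not the assignment's Blackjack interfaces:

```python
from collections import defaultdict

def mc_ordinary_is(episodes, pi, b, gamma=1.0):
    """Ordinary importance-sampling prediction.

    episodes: list of episodes, each a list of (s, a, r) generated under b.
    pi(a, s), b(a, s): action probabilities under target and behavior policy.
    """
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        G, rho = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            rho *= pi(a, s) / b(a, s)  # extends the product back to step t
            if s not in [x[0] for x in episode[:t]]:  # first-visit check
                N[s] += 1
                V[s] += (rho * G - V[s]) / N[s]  # average of weighted returns
    return V
```

Walking backwards lets the ratio be built incrementally: after multiplying in step t's factor, `rho` covers exactly k = t..T−1, matching the formula above.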

Key Takeaways

  • MC is model-free — doesn’t need transition probabilities
  • First-visit MC: simpler, unbiased
  • Ordinary IS: unbiased but high variance (visible in the plots)
  • Weighted IS: biased but much lower variance (smoother convergence)
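The trade-off in the last two bullets shows up directly in the update rule: weighted IS normalizes by the cumulative sum of ratios instead of the visit count. A hedged sketch of that single update step, with illustrative names:

```python
from collections import defaultdict

def weighted_is_update(V, C, s, G, rho):
    """One weighted-IS update: V(s) <- V(s) + (rho / C(s)) * (G - V(s)),
    where C(s) accumulates the importance-sampling ratios seen so far."""
    if rho == 0.0:
        return  # zero-weight returns leave the weighted average unchanged
    C[s] += rho
    V[s] += (rho / C[s]) * (G - V[s])
```

Dividing by C(s) rather than N(s) keeps every estimate inside the range of observed returns, which is why the weighted-IS curves converge so much more smoothly than ordinary IS.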

See RL-HW02 - Homework 2 for theoretical questions.