Monte Carlo Methods

Definition

Monte Carlo Methods

Monte Carlo (MC) methods learn value functions and optimal policies from complete episodes of experience. They estimate values by averaging actual returns — no model of the environment required, and no bootstrapping.

Key Idea

The Core Insight

To estimate : play many episodes, and every time you visit state , record the return you got from there. The average of those returns converges to the true value. That’s it. No equations to solve, no model needed — just sample and average.

MC Prediction

First-Visit MC

Averages returns only from the first visit to each state within an episode.

  • Unbiased estimator of
  • Converges by the Law of Large Numbers

Every-Visit MC

Averages returns from every visit to each state within an episode.

  • Also converges, but individual samples within an episode are correlated
  • Slightly biased but asymptotically unbiased

MC Control

MC control learns (not ) because action values allow model-free policy improvement.

With Exploring Starts

  • Every state-action pair has non-zero probability of being the episode start
  • Guarantees all pairs are visited → can learn optimal policy
  • Unrealistic in practice

On-Policy (ε-Greedy)

  • Uses Epsilon-Greedy Policy to ensure exploration
  • Converges to the best ε-soft policy (not truly optimal, but close for small ε)

Off-Policy with Importance Sampling

  • Generate data with behavior policy , learn about target policy
  • Correct distribution mismatch using importance sampling ratios
  • Ordinary IS: unbiased but high variance
  • Weighted IS: biased (for finite samples) but much lower variance

Comparison with Other Methods

PropertyMCTDDP
Model-free
Bootstraps
Complete episodes neededN/A
Unbiased value estimatesN/A
VarianceHighLowN/A
Works for non-Markov

Key Properties

  • No bootstrapping → unbiased but higher variance than TD
  • Episode-based → can’t be used for continuing tasks (no termination = no return)
  • State estimates independent → updating doesn’t affect
  • Handles non-Markov environments (doesn’t rely on Markov property)

Connections

Appears In