RL-L03: Monte Carlo Methods
Overview
Monte Carlo (MC) methods learn value functions and optimal policies from experience in the form of sample episodes.
Key Characteristics
- Model-free: Unlike Dynamic Programming, MC does not require knowledge of the MDP dynamics $p(s', r \mid s, a)$.
- Averages Returns: Estimates are based on averaging sample returns for each state-action pair.
- Episodic Tasks: Defined only for episodic tasks, since the return $G_t$ can only be computed once an episode terminates.
- No Bootstrapping: MC methods do not update estimates based on other estimates; they use actual sampled returns.
MC vs. DP Comparison
| Feature | Dynamic Programming | Monte Carlo Methods |
|---|---|---|
| Model | Needs the full model $p(s', r \mid s, a)$ | Model-free (sample experience) |
| Bootstrapping | Yes (Updates based on next state values) | No (Updates based on returns) |
| Width | Full (Expectation over all transitions) | Single (Sample trajectory) |
| Depth | 1-step lookahead | Full (Until end of episode) |
1. Monte Carlo Prediction
The goal of prediction is to estimate the state-value function $v_\pi$ for a fixed policy $\pi$.
The Return
For a trajectory $S_t, A_t, R_{t+1}, S_{t+1}, \dots, S_T$, the return is
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T.$$
By the law of large numbers, the average of sampled returns converges to the expected value:
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s].$$
First-Visit vs. Every-Visit MC
- First-Visit MC: Averages returns only from the first time a state $s$ is visited in each episode.
  - Each return is an i.i.d. estimate of $v_\pi(s)$.
  - The standard error of the estimate falls as $1/\sqrt{n}$, where $n$ is the number of first visits.
- Every-Visit MC: Averages returns from all visits to $s$ in each episode.
  - The estimates are not independent, but they still converge quadratically to $v_\pi(s)$.
First-Visit MC Prediction Algorithm
Algorithm: First-visit MC prediction
Input: a policy $\pi$ to be evaluated
Initialize: $V(s)$ arbitrarily and $Returns(s) \leftarrow$ empty list, for all $s \in \mathcal{S}$
Loop forever (for each episode):
- Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop backwards for $t = T-1, T-2, \dots, 0$:
  - $G \leftarrow \gamma G + R_{t+1}$
  - Unless $S_t$ appears in $S_0, S_1, \dots, S_{t-1}$:
    - Append $G$ to $Returns(S_t)$
    - $V(S_t) \leftarrow \text{average}(Returns(S_t))$
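The algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not from the lecture: episodes are assumed to arrive as lists of `(state, reward)` pairs, where the reward is the one received on leaving that state, and the toy episodes at the bottom are made up.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns.

    Each episode is a list of (state, reward) pairs; reward is the
    reward received after leaving that state.
    """
    returns = defaultdict(list)  # Returns(s)
    V = {}
    for episode in episodes:
        G = 0.0
        states = [s for s, _ in episode]
        # Loop backwards over the episode, accumulating the return
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:  # first-visit check
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V

# Toy usage: two episodes through states 'a' -> 'b' -> terminal
episodes = [[('a', 0.0), ('b', 1.0)], [('a', 0.0), ('b', 0.0)]]
V = first_visit_mc_prediction(episodes)
```

With undiscounted returns, both states average the two episode outcomes (1 and 0), giving $V(\text{a}) = V(\text{b}) = 0.5$.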
2. Blackjack Example
Blackjack is a classic episodic MDP used to illustrate MC prediction.
- Objective: Maximize the card sum without exceeding 21.
- State Space:
- Current sum (12-21)
- Dealer’s showing card (Ace-10)
- Usable Ace (Yes/No)
- Total: 200 states.
- Rewards: +1 for win, -1 for loss, 0 for draw.
- Action: Hit or Stick.
- Policy Evaluation: Average returns over thousands of simulated games (episodes).
- Observation: States with usable aces are less frequent and thus have higher variance in the value function estimate.
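A rough simulation of this evaluation can be sketched as follows. This is a simplified, hypothetical simulator (it ignores natural-blackjack bonuses, fixes the dealer to hit below 17, and auto-hits player sums below 12); the function names and episode count are illustrative assumptions.

```python
import random
from collections import defaultdict

def draw(rng):
    return min(rng.randint(1, 13), 10)  # J/Q/K count as 10, ace as 1

def hand_value(cards):
    """Return (best total, usable-ace flag), counting one ace as 11 if safe."""
    total = sum(cards)
    usable = 1 in cards and total + 10 <= 21
    return (total + 10 if usable else total), usable

def play_episode(policy, rng):
    """Play one simplified game; return (visited states, final reward)."""
    player = [draw(rng), draw(rng)]
    dealer = [draw(rng), draw(rng)]
    show = dealer[0]
    states = []
    while True:
        total, usable = hand_value(player)
        if total > 21:
            return states, -1.0          # player busts
        if total < 12:                   # trivial states: always hit
            player.append(draw(rng))
            continue
        states.append((total, show, usable))
        if policy(total, show, usable) == 'stick':
            break
        player.append(draw(rng))
    while hand_value(dealer)[0] < 17:    # dealer hits below 17
        dealer.append(draw(rng))
    d, p = hand_value(dealer)[0], hand_value(player)[0]
    if d > 21 or p > d:
        return states, 1.0
    return states, (0.0 if p == d else -1.0)

def evaluate(policy, n_episodes=20000, seed=0):
    """First-visit MC evaluation (gamma = 1; reward arrives only at the end)."""
    rng = random.Random(seed)
    returns = defaultdict(list)
    for _ in range(n_episodes):
        states, G = play_episode(policy, rng)
        for s in set(states):
            returns[s].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}

# The classic policy from the lecture: stick only on 20 or 21
V = evaluate(lambda total, show, usable: 'stick' if total >= 20 else 'hit')
```

As expected, sticking on a sum of 21 yields values close to +1, while most other states are far less favorable.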
3. Monte Carlo Control
Control aims to approximate optimal policies using Generalized Policy Iteration (GPI).
Action Values ($q_\pi$)
Without a model, state values alone are insufficient for control (we cannot do a one-step lookahead without the transition probabilities). We must instead estimate the action-value function $q_\pi(s, a)$.
The Exploration-Exploitation Dilemma
Many state-action pairs might never be visited if $\pi$ is deterministic. Possible solutions:
- Exploring Starts: Assume every episode starts at a random state-action pair, with every pair having non-zero probability.
- On-Policy: Use soft policies (e.g., $\varepsilon$-greedy).
- Off-Policy: Use a separate behavior policy $b$ to explore.
Algorithm: Monte Carlo ES (Exploring Starts)
This algorithm alternates between evaluation and improvement episode-by-episode.
Algorithm: Monte Carlo ES
Initialize: $\pi(s)$ arbitrarily, $Q(s, a)$ arbitrarily, and $Returns(s, a) \leftarrow$ empty list, for all $s, a$
Loop forever (for each episode):
- Choose $S_0, A_0$ such that all state-action pairs have probability $> 0$ (exploring starts)
- Generate an episode from $S_0, A_0$ following $\pi$: $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop backwards for $t = T-1, T-2, \dots, 0$:
  - $G \leftarrow \gamma G + R_{t+1}$
  - Unless the pair $(S_t, A_t)$ appeared earlier in the episode:
    - Append $G$ to $Returns(S_t, A_t)$
    - $Q(S_t, A_t) \leftarrow \text{average}(Returns(S_t, A_t))$
    - $\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)$
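The structure of MC ES can be sketched on a toy MDP. The MDP below (one decision state, two actions, immediate termination) and all names are illustrative assumptions, chosen only so the alternation of evaluation and greedy improvement is visible.

```python
import random
from collections import defaultdict

# Toy episodic MDP (hypothetical): one decision state 0.
# Action 1 yields reward 1, action 0 yields 0; both terminate.
def step(state, action):
    return None, float(action)   # (next_state, reward); None = terminal

def mc_es(n_episodes=100, gamma=1.0):
    states, actions = [0], [0, 1]
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: random.choice(actions) for s in states}
    for _ in range(n_episodes):
        # Exploring start: random state-action pair
        s, a = random.choice(states), random.choice(actions)
        episode = []
        while s is not None:
            s_next, r = step(s, a)
            episode.append((s, a, r))
            s = s_next
            a = pi[s] if s is not None else None
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                pi[s] = max(actions, key=lambda act: Q[(s, act)])
    return pi, Q

random.seed(0)
pi, Q = mc_es()
```

After a handful of exploring starts, the greedy policy settles on the rewarding action.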
4. On-Policy MC Control ($\varepsilon$-greedy)
Avoids the need for exploring starts by using a soft policy (e.g., $\varepsilon$-greedy).
$\varepsilon$-greedy Improvement
For any $\varepsilon$-soft policy $\pi$, the $\varepsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement: $v_{\pi'}(s) \geq v_\pi(s)$ for all $s$.
Proof Idea (policy improvement theorem): show that $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for every state; the policy improvement theorem then gives $v_{\pi'} \geq v_\pi$.
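A reconstruction of the standard derivation of this key step: the $\varepsilon$-greedy policy puts weight $\varepsilon/|\mathcal{A}(s)|$ on every action and the rest on the greedy action, and the inequality holds because the max is at least any weighted average with non-negative weights summing to 1.

$$
\begin{aligned}
q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s, a)
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \max_a q_\pi(s, a) \\
&\geq \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \sum_a \frac{\pi(a \mid s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1 - \varepsilon}\, q_\pi(s, a) \\
&= \sum_a \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s).
\end{aligned}
$$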
5. Off-Policy Prediction and Control
Learn about a target policy $\pi$ while following a different behavior policy $b$ ($b \neq \pi$).
Coverage Assumption
The behavior policy must be able to take any action that the target policy might take: $\pi(a \mid s) > 0 \implies b(a \mid s) > 0$.
Importance Sampling (IS) Ratio
To transform expectations under $b$ into expectations under $\pi$, we weight each return by the relative probability of its trajectory occurring under $\pi$ vs. $b$:
$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
Note: the transition dynamics $p(S_{k+1} \mid S_k, A_k)$ appear in both numerator and denominator and cancel out!
Types of Importance Sampling
- Ordinary IS: Simple average of the scaled returns $\rho_{t:T-1} G_t$.
  - Unbiased, but can have infinite variance.
- Weighted IS: Weighted average of the scaled returns, normalized by $\sum \rho_{t:T-1}$.
  - Biased (the bias vanishes as $n \to \infty$), but has finite variance.
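The difference between the two estimators can be seen on a one-step example. The setup below is a made-up illustration: the target policy always picks action 0 (reward 1), the behavior policy is uniform, so the true value is exactly 1.

```python
import random

def sample_actions(n, seed=0):
    """Behavior policy b: pick action 0 or 1 uniformly at random."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n)]

def is_estimates(actions):
    pi = {0: 1.0, 1: 0.0}   # target: deterministic action 0
    b = {0: 0.5, 1: 0.5}    # behavior: uniform
    rhos = [pi[a] / b[a] for a in actions]           # importance ratios
    G = [1.0 if a == 0 else 0.0 for a in actions]    # one-step returns
    ordinary = sum(r * g for r, g in zip(rhos, G)) / len(actions)
    weighted = sum(r * g for r, g in zip(rhos, G)) / sum(rhos)
    return ordinary, weighted

ordinary, weighted = is_estimates(sample_actions(1000))
```

Here weighted IS returns exactly 1.0 (every surviving sample has return 1), while ordinary IS fluctuates around 1.0 with sampling error; in longer-horizon problems, the products of ratios make this gap in variance dramatic.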
Algorithm: Off-policy MC Control
Algorithm: Off-policy MC Control
Initialize: $Q(s, a)$ arbitrarily, $C(s, a) \leftarrow 0$, and $\pi(s) \leftarrow \arg\max_a Q(s, a)$, for all $s, a$
Loop forever (for each episode):
- Select any soft behavior policy $b$; generate an episode following $b$: $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$; $W \leftarrow 1$
- Loop backwards for $t = T-1, T-2, \dots, 0$:
  - $G \leftarrow \gamma G + R_{t+1}$
  - $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$
  - $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)} \left[ G - Q(S_t, A_t) \right]$
  - $\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)$
  - If $A_t \neq \pi(S_t)$ then exit inner loop
  - $W \leftarrow W \cdot \frac{1}{b(A_t \mid S_t)}$
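The loop structure can be sketched in Python on a toy one-state MDP. The MDP, its rewards, and all names are illustrative assumptions; the behavior policy is uniform random (soft, so coverage holds), and the target policy is greedy in $Q$.

```python
import random
from collections import defaultdict

# Toy episodic MDP (hypothetical): one decision state 0.
# Action 1 yields reward 1, action 0 yields 0; both terminate.
def step(state, action):
    return None, float(action)

def off_policy_mc_control(n_episodes=200, gamma=1.0, seed=0):
    rng = random.Random(seed)
    actions = [0, 1]
    Q = defaultdict(float)
    C = defaultdict(float)               # cumulative weights
    pi = {0: 0}                          # greedy target policy (initial guess)
    for _ in range(n_episodes):
        episode, s = [], 0
        while s is not None:             # behavior b: uniform random
            a = rng.choice(actions)
            s_next, r = step(s, a)
            episode.append((s, a, r))
            s = s_next
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            pi[s] = max(actions, key=lambda act: Q[(s, act)])
            if a != pi[s]:
                break                    # remaining prefix is off-target
            W *= 1.0 / 0.5               # 1 / b(a|s) for the uniform behavior
    return pi, Q

pi, Q = off_policy_mc_control()
```

Even though every episode is generated by the random behavior policy, the greedy target policy converges to the rewarding action.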
6. Incremental Implementation
Weighted IS can be implemented incrementally, so that individual returns need not be stored. Given a sequence of returns $G_1, G_2, \dots$ with corresponding weights $W_n$, maintain the cumulative weight $C_n = \sum_{k=1}^{n} W_k$ (with $C_0 = 0$) and update
$$V_{n+1} = V_n + \frac{W_n}{C_n} \left[ G_n - V_n \right], \qquad C_{n+1} = C_n + W_{n+1}.$$
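A quick sanity check of this recursion (with made-up returns and weights) shows it reproduces the batch weighted average exactly:

```python
def weighted_is_batch(returns, weights):
    """Batch weighted-IS estimate: sum(W*G) / sum(W)."""
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)

def weighted_is_incremental(returns, weights):
    """Same estimate via the incremental update V += (W/C)(G - V)."""
    V, C = 0.0, 0.0
    for G, W in zip(returns, weights):
        C += W
        V += (W / C) * (G - V)
    return V

Gs = [1.0, -1.0, 1.0, 0.0]
Ws = [2.0, 0.5, 1.0, 4.0]
batch = weighted_is_batch(Gs, Ws)
incremental = weighted_is_incremental(Gs, Ws)
```

Both compute $(2 - 0.5 + 1 + 0) / 7.5 = 1/3$, but the incremental form needs only $V$ and $C$, not the full history.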
7. Diagrams
Backup Diagram: MC Prediction
(S_t) <-- Root (state to update)
|
[A_t, R_{t+1}]
|
(S_{t+1})
|
[A_{t+1}, R_{t+2}]
|
...
|
((T)) <-- Terminal (end of episode)
Contrast with DP: MC looks at a single, full trajectory.
Summary Key Points
- MC learns from experience, avoiding the need for environment models.
- Goal: Average returns to estimate expectations.
- GPI applies: alternate evaluation (averaging returns) and improvement (greedy or $\varepsilon$-greedy).
- Off-policy requires Importance Sampling to account for different behavior.
- Variance is the main challenge in MC methods, especially with off-policy importance sampling.