Consider an MDP with a single state $s_0$ that has some probability of transitioning back to itself with a reward of 0, and otherwise terminates with a reward of 3. Your agent has interacted with the environment and observed the following two sequences of rewards: $[0,0,3]$ and $[0,0,0,3]$. Use $\gamma = 0.8$.
(a) Estimate the value of s0 using first-visit MC.
(b) Estimate the value of s0 using every-visit MC.
Concepts Tested: [[First-Visit MC]], [[Monte Carlo Methods]]
Solution:
First, let's calculate the returns $G_t$ for each visit to $s_0$.
The return is defined as $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
(a) First-visit MC:
We only take the return from the first time $s_0$ is visited in each episode.
Episode 1: $G = 0.8^2 \cdot 3 = 1.92$
Episode 2: $G = 0.8^3 \cdot 3 = 1.536 \approx 1.54$
$$V(s_0) \approx \frac{1.92 + 1.54}{2} = 1.73$$
(b) Every-visit MC:
We average the returns from every visit to s0 across both episodes.
Total visits $= 3 + 4 = 7$
Sum of returns $= (1.92 + 2.4 + 3.0) + (1.54 + 1.92 + 2.4 + 3.0) = 16.18$
$$V(s_0) \approx \frac{16.18}{7} \approx 2.31$$
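Both estimates can be reproduced with a short script. This is a minimal sketch (variable and function names are my own); since every reward in an episode is emitted from $s_0$, each time step counts as a visit:

```python
# First-visit vs. every-visit MC estimates of V(s0) for the self-loop MDP.
gamma = 0.8
episodes = [[0, 0, 3], [0, 0, 0, 3]]

def returns(rewards, gamma):
    """Compute G_t for every time step t of one episode (backwards pass)."""
    gs, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        gs.append(g)
    return gs[::-1]  # gs[t] = G_t

# First-visit: only G_0 of each episode; every-visit: all G_t of each episode.
first_visit = [returns(ep, gamma)[0] for ep in episodes]
every_visit = [g for ep in episodes for g in returns(ep, gamma)]

v_first = sum(first_visit) / len(first_visit)   # (1.92 + 1.536) / 2 = 1.728
v_every = sum(every_visit) / len(every_visit)   # 16.176 / 7 ≈ 2.311
```

Note that the hand calculation rounds $1.536$ to $1.54$, which is why it reports $1.73$ rather than the exact $1.728$.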
3.2 Bias of $v_\pi$ Monte Carlo estimators
Exercise 1: Importance Sampling
Comment on the bias of weighted importance sampling compared to ordinary importance sampling. Why might we nevertheless use weighted importance sampling?
Concepts Tested: [[Importance Sampling]]
Solution:
Ordinary importance sampling is unbiased, while weighted importance sampling is biased (though the bias converges to zero as the number of samples increases). However, weighted importance sampling is preferred because it significantly reduces variance. In ordinary importance sampling, the variance can be unbounded if the importance ratios are large (e.g., a rare action in the behavior policy is common in the target policy).
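The variance contrast can be seen in a small simulation. The following sketch uses an assumed one-step problem (the policies and rewards are illustrative, not from the exercise): the behavior policy $b$ is uniform over two actions, while the target policy $\pi$ almost always takes action 0:

```python
import random

random.seed(0)
# One-step problem: action 0 yields reward 1, action 1 yields reward 0.
pi = [0.99, 0.01]   # target policy (assumed for illustration)
b  = [0.5, 0.5]     # behavior policy: soft/uniform, so every action is covered
reward = [1.0, 0.0]

n = 10_000
num = 0.0           # sum of rho * G (shared numerator of both estimators)
den_weighted = 0.0  # sum of rho (denominator of weighted IS)
for _ in range(n):
    a = random.randrange(2)   # act according to b
    rho = pi[a] / b[a]        # importance ratio pi(a|s) / b(a|s)
    num += rho * reward[a]
    den_weighted += rho

ordinary = num / n             # unbiased, but higher variance
weighted = num / den_weighted  # biased for finite n, but lower variance
```

A useful property visible here: the weighted estimate is a convex combination of observed returns, so it always lies within their range $[0, 1]$, whereas a single ordinary-IS sample can be as large as $\rho \cdot G = 1.98$.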
Exercise 2: Unbiasedness of Single Episode MC
Consider one episode following $\pi$: $(S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T)$, where $S_0 = s$. Determine and provide intuition on the biasedness of the following estimator for $v_\pi(s)$:
$$\sum_{i=1}^{T} \gamma^{i-1} R_i$$
Concepts Tested: [[Monte Carlo Methods]]
Solution:
This estimator is unbiased.
By definition: $v_\pi(s) = \mathbb{E}_\pi[G_0 \mid S_0 = s]$.
The sum $\sum_{i=1}^{T} \gamma^{i-1} R_i$ is exactly the return $G_0$ for an episode starting in state $s$ at $t=0$. Since the episode follows policy $\pi$, its expectation is exactly the value function $v_\pi(s)$.
Exercise 3: Every-visit MC Bias
Determine the biasedness of:
$$\frac{1}{|J|} \sum_{j \in J} \sum_{i=1}^{T-j} \gamma^{i-1} R_{j+i}$$
where $J$ contains all indices $j$ such that $S_j = s$.
Concepts Tested: [[Monte Carlo Methods]]
Solution:
This is the every-visit MC estimator. It is biased.
Intuition: For any visit after the first, the corresponding return Gj is a sample from the distribution of returns conditioned on the fact that the state s has already been visited earlier in the trajectory. This conditioning restricts the sample space and induces a bias relative to the true value function vπ(s), which is the unconditional expectation of returns from state s.
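The bias can be observed empirically. As an illustrative sketch (the loop probability 0.5 is an assumption, chosen to match the self-loop environment of the first exercise), the simulation below averages single-episode first-visit and every-visit estimates over many independent episodes. The true value is $v(s_0) = 3(1-0.5)/(1 - 0.5 \cdot 0.8) = 2.5$; the first-visit average approaches it, while the every-visit average settles noticeably above it, because later visits are closer to the terminal reward and thus have larger returns:

```python
import random

random.seed(1)
gamma, p_loop, r_term = 0.8, 0.5, 3.0

def run_episode():
    """One episode of the self-loop MDP: reward 0 per loop, 3 on termination."""
    rewards = []
    while random.random() < p_loop:
        rewards.append(0.0)
    rewards.append(r_term)
    return rewards

def episode_returns(rewards):
    gs, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        gs.append(g)
    return gs[::-1]  # gs[t] = G_t

n = 200_000
fv_sum = ev_sum = 0.0
for _ in range(n):
    gs = episode_returns(run_episode())
    fv_sum += gs[0]              # first-visit estimate: G_0 only
    ev_sum += sum(gs) / len(gs)  # every-visit estimate: mean over all visits
fv, ev = fv_sum / n, ev_sum / n  # fv ≈ 2.5 (unbiased), ev ≈ 2.74 (biased high)
```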
Exercise 4: Latest-visit MC Bias
Determine the biasedness of:
$$\sum_{i=1}^{T-t_s} \gamma^{i-1} R_{t_s+i}$$
where $t_s$ is the latest time step such that $S_{t_s} = s$. How does this compare to the first-visit MC estimator?
Solution:
This estimator is biased for the same reasons as every-visit MC. If ts is not the first visit, it is a conditional sample. It only becomes unbiased in the specific case where s is visited exactly once in the episode (making it identical to a first-visit sample).
3.3 * Exam question: Monte Carlo for control
Exam-Style Question
The following questions refer to the pseudo-code for Off-policy MC Control (Figure 1).
Part of the algorithm is covered by a black square (the inner loop range). What is the missing information?
What is stored in C(St,At)?
Why is the inner loop stopped when $A_t \neq \pi(S_t)$?
Concepts Tested: [[Monte Carlo Control]], [[Importance Sampling]]
ASCII Reproduction of Figure 1: Off-policy MC Control
```text
Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ∈ R (arbitrarily)
    C(s, a) ← 0
    π(s) ← argmax_a Q(s, a)   (with ties broken consistently)

Loop forever (for each episode):
    b ← any soft policy
    Generate an episode using b: S0, A0, R1, ..., S_{T-1}, A_{T-1}, R_T
    G ← 0
    W ← 1
    Loop for each step of episode, t = T-1, T-2, ..., 0:   <-- [BLACK SQUARE]
        G ← γG + R_{t+1}
        C(S_t, A_t) ← C(S_t, A_t) + W
        Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
        π(S_t) ← argmax_a Q(S_t, a)   (with ties broken consistently)
        If A_t ≠ π(S_t) then exit inner loop
        W ← W · 1 / b(A_t | S_t)
```
Solution:
The missing range is t=T−1,T−2,…,0 (working backwards from the end of the episode).
C(St,At) stores the cumulative importance weights of all visits to that state-action pair across all episodes. It acts as the denominator for the weighted average.
The inner loop stops because we are evaluating and improving a greedy policy $\pi$. If an action $A_t$ taken by the behavior policy $b$ is not the action that the greedy target policy would have taken, then $\pi(A_t \mid S_t) = 0$. Since the importance sampling weight involves the ratio $\pi(A_t \mid S_t) / b(A_t \mid S_t)$, all subsequent weights for the rest of the episode (earlier time steps, since the loop runs backwards) would become 0, so no further updates would change $Q$.
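The pseudo-code can be exercised on a minimal example. In this sketch the environment is an assumed one-state, two-action bandit (action 0 pays 1, action 1 pays 0, then the episode ends), chosen only to keep the loop structure visible; the learned greedy policy should select action 0:

```python
import random

random.seed(0)
ACTIONS = [0, 1]
REWARD = {0: 1.0, 1: 0.0}   # one state; the episode ends after one action
gamma = 1.0

Q = {a: 0.0 for a in ACTIONS}
C = {a: 0.0 for a in ACTIONS}                # cumulative importance weights
greedy = max(ACTIONS, key=lambda a: Q[a])    # pi(s), ties broken consistently

for _ in range(1000):
    a = random.choice(ACTIONS)               # b: a soft (uniform) policy
    episode = [(a, REWARD[a])]               # single-step episode
    G, W = 0.0, 1.0
    for a_t, r in reversed(episode):         # t = T-1, ..., 0
        G = gamma * G + r
        C[a_t] += W
        Q[a_t] += (W / C[a_t]) * (G - Q[a_t])
        greedy = max(ACTIONS, key=lambda a: Q[a])
        if a_t != greedy:
            break                            # pi(a_t|s) = 0: later ratios vanish
        W *= 1.0 / 0.5                       # W * 1 / b(a_t | s)
```

After training, `Q[0]` equals the true value 1.0 and the greedy policy picks action 0, matching the weighted-IS update's role as a running weighted average of returns.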
4 Temporal Difference Learning
4.1 Temporal Difference Learning (application)
Exercise 1: TD, SARSA, Q-Learning Trace
Consider an undiscounted MDP with states $A, B$ and terminal state $T$ ($V(T)=0$).
Observed episode: $A \xrightarrow{a=1,\ r=-3} B \xrightarrow{a=1,\ r=4} A \xrightarrow{a=2,\ r=-4} A \xrightarrow{a=1,\ r=-3} T$
Parameters: $\gamma = 1$, $\alpha = 0.1$, all initial values $0$.
Calculate final estimates for:
(a) TD(0)
(b) SARSA
(c) Q-learning
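The three traces can be checked with a short script. This is a sketch for verifying hand calculations (data structures and names are my own); it replays the four observed transitions through each update rule:

```python
# Transitions (s, a, r, s') from the observed episode; T is terminal.
alpha, gamma = 0.1, 1.0
steps = [('A', 1, -3, 'B'), ('B', 1, 4, 'A'),
         ('A', 2, -4, 'A'), ('A', 1, -3, 'T')]

# (a) TD(0): state-value updates V(s) += alpha * (r + gamma*V(s') - V(s))
V = {'A': 0.0, 'B': 0.0, 'T': 0.0}
for s, a, r, s2 in steps:
    V[s] += alpha * (r + gamma * V[s2] - V[s])

# (b) SARSA: bootstraps from the action actually taken in s'
Q_sarsa = {('A', 1): 0.0, ('A', 2): 0.0, ('B', 1): 0.0}
for i, (s, a, r, s2) in enumerate(steps):
    q_next = 0.0 if s2 == 'T' else Q_sarsa[(s2, steps[i + 1][1])]
    Q_sarsa[(s, a)] += alpha * (r + gamma * q_next - Q_sarsa[(s, a)])

# (c) Q-learning: bootstraps from the max over actions in s'
Q_ql = {('A', 1): 0.0, ('A', 2): 0.0, ('B', 1): 0.0}
def max_q(q, s):
    vals = [v for (st, _), v in q.items() if st == s]
    return max(vals) if vals else 0.0
for s, a, r, s2 in steps:
    q_next = 0.0 if s2 == 'T' else max_q(Q_ql, s2)
    Q_ql[(s, a)] += alpha * (r + gamma * q_next - Q_ql[(s, a)])
```

Working through the updates gives $V(A) = -0.93$, $V(B) = 0.37$ for TD(0); $Q(A,1) = -0.57$, $Q(A,2) = -0.43$, $Q(B,1) = 0.4$ for SARSA; and $Q(A,1) = -0.57$, $Q(A,2) = -0.4$, $Q(B,1) = 0.4$ for Q-learning (the two differ only at step 3, where Q-learning maxes over the actions in $A$ instead of using the sampled action).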
Show that the average return $V_M(S) = \frac{1}{M} \sum_{n=1}^{M} G_n(S)$ can be written in the incremental update form:
$$V_M(S) = V_{M-1}(S) + \alpha_M \left[ G_M(S) - V_{M-1}(S) \right]$$
Identify the learning rate $\alpha_M$.
Concepts Tested: [[Monte Carlo Methods]]
Solution:
$$V_M(S) = \frac{1}{M} \sum_{n=1}^{M} G_n(S) = \frac{1}{M} \left[ G_M(S) + \sum_{n=1}^{M-1} G_n(S) \right]$$
Since $V_{M-1}(S) = \frac{1}{M-1} \sum_{n=1}^{M-1} G_n(S)$, we have $\sum_{n=1}^{M-1} G_n(S) = (M-1) V_{M-1}(S)$. Hence
$$V_M(S) = \frac{1}{M} \left[ G_M(S) + (M-1) V_{M-1}(S) \right] = \frac{1}{M} G_M(S) + \left(1 - \frac{1}{M}\right) V_{M-1}(S) = V_{M-1}(S) + \frac{1}{M} \left[ G_M(S) - V_{M-1}(S) \right]$$
The learning rate is $\alpha_M = \frac{1}{M}$.
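The identity is easy to sanity-check numerically. A minimal sketch (the returns are random placeholder values): applying the incremental rule with $\alpha_M = 1/M$ reproduces the batch mean exactly.

```python
import random

random.seed(42)
returns = [random.uniform(-5, 5) for _ in range(100)]  # illustrative G_n values

v = 0.0
for m, g in enumerate(returns, start=1):
    v += (1.0 / m) * (g - v)   # V_M = V_{M-1} + (1/M)(G_M - V_{M-1})

batch_mean = sum(returns) / len(returns)
# v and batch_mean agree (up to floating-point rounding)
```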
Exercise 2: Expected TD Error
Consider the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$.
(a) What is $\mathbb{E}[\delta_t \mid S_t = s]$ if $\delta_t$ uses the true state-value function $v_\pi$?
(b) What is $\mathbb{E}[\delta_t \mid S_t = s, A_t = a]$ if $\delta_t$ uses the true state-value function $v_\pi$?
Concepts Tested: [[TD Error]]
Solution:
(a) Given $S_t = s$:
$$\mathbb{E}[\delta_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] - v_\pi(s)$$
By the Bellman equation, $\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = v_\pi(s)$, so
$$\mathbb{E}[\delta_t \mid S_t = s] = v_\pi(s) - v_\pi(s) = 0$$
(b) Given $S_t = s, A_t = a$:
$$\mathbb{E}[\delta_t \mid S_t = s, A_t = a] = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] - v_\pi(s)$$
The first term is the definition of the action-value function $q_\pi(s, a)$:
$$\mathbb{E}[\delta_t \mid S_t = s, A_t = a] = q_\pi(s, a) - v_\pi(s)$$
This quantity is known as the advantage function $A(s, a)$.
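The identity in (b) can be verified on a concrete example. In this sketch the one-step MDP and the target policy are illustrative assumptions: from state $s$, each action yields a stochastic reward and terminates, so $v(\text{terminal}) = 0$ and the expectations can be computed exactly from the probabilities:

```python
gamma = 0.9
# P(reward | s, a): action -> list of (prob, reward); the episode then ends.
R = {0: [(0.5, 2.0), (0.5, 0.0)], 1: [(1.0, -1.0)]}
pi = {0: 0.3, 1: 0.7}   # target policy at s (assumed for illustration)

q = {a: sum(p * r for p, r in R[a]) for a in R}   # q_pi(s,a) = E[R | s,a]
v_s = sum(pi[a] * q[a] for a in pi)               # v_pi(s) = sum_a pi(a) q(s,a)

for a in R:
    # E[delta | s, a] = E[R + gamma * v(terminal) - v(s) | s, a]
    e_delta = sum(p * (r + gamma * 0.0 - v_s) for p, r in R[a])
    assert abs(e_delta - (q[a] - v_s)) < 1e-12    # equals the advantage
```

Here $q(s,0) = 1.0$, $q(s,1) = -1.0$, and $v_\pi(s) = -0.4$, so the expected TD errors are $1.4$ and $-0.6$: exactly the advantages.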