Consider an MDP with a single state $s_0$ that has some probability of transitioning back to itself with a reward of 0, and otherwise terminates with a reward of 3. Your agent has interacted with the environment and observed the following two sequences of rewards: $[0,0,3]$ and $[0,0,0,3]$. Use $\gamma = 0.8$.
(a) Estimate the value of s0 using first-visit MC.
(b) Estimate the value of s0 using every-visit MC.
Concepts Tested: [[First-Visit MC]], [[Monte Carlo Methods]]
Solution:
First, let's calculate the returns $G_t$ for each visit to $s_0$.
The return is defined as $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
(a) First-visit MC:
We only take the return from the first time $s_0$ is visited in each episode.
Episode 1: $G = 0.8^2 \cdot 3 = 1.92$
Episode 2: $G = 0.8^3 \cdot 3 = 1.536 \approx 1.54$
$$V(s_0) \approx \frac{1.92 + 1.54}{2} = 1.73$$
(b) Every-visit MC:
We average the returns from every visit to s0 across both episodes.
Total visits $= 3 + 4 = 7$
Sum of returns $= (1.92 + 2.4 + 3.0) + (1.54 + 1.92 + 2.4 + 3.0) = 16.18$
$$V(s_0) \approx \frac{16.18}{7} \approx 2.31$$
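Both estimates can be reproduced with a short script. This is a minimal sketch (variable and function names are my own); since every reward in an episode is emitted from $s_0$, each time step counts as a visit:

```python
# First-visit vs. every-visit MC estimates of V(s0) for the self-loop MDP.
gamma = 0.8
episodes = [[0, 0, 3], [0, 0, 0, 3]]

def returns(rewards, gamma):
    """Compute G_t for every time step t of one episode (backwards pass)."""
    gs, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        gs.append(g)
    return gs[::-1]  # gs[t] = G_t

# First-visit: only G_0 of each episode; every-visit: all G_t of each episode.
first_visit = [returns(ep, gamma)[0] for ep in episodes]
every_visit = [g for ep in episodes for g in returns(ep, gamma)]

v_first = sum(first_visit) / len(first_visit)   # (1.92 + 1.536) / 2 = 1.728
v_every = sum(every_visit) / len(every_visit)   # 16.176 / 7 ≈ 2.311
```

Note that the hand calculation rounds $1.536$ to $1.54$, which is why it reports $1.73$ rather than the exact $1.728$.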
3.2 Bias of $v_\pi$ Monte Carlo estimators
Exercise 1: Importance Sampling
Comment on the bias of weighted importance sampling compared to ordinary importance sampling. Why might we nevertheless use weighted importance sampling?
Concepts Tested: [[Importance Sampling]]
Solution:
Ordinary importance sampling is unbiased, while weighted importance sampling is biased (though the bias converges to zero as the number of samples increases). However, weighted importance sampling is preferred because it significantly reduces variance. In ordinary importance sampling, the variance can be unbounded if the importance ratios are large (e.g., a rare action in the behavior policy is common in the target policy).
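The variance contrast can be seen in a small simulation. The following sketch uses an assumed one-step problem (the policies and rewards are illustrative, not from the exercise): the behavior policy $b$ is uniform over two actions, while the target policy $\pi$ almost always takes action 0:

```python
import random

random.seed(0)
# One-step problem: action 0 yields reward 1, action 1 yields reward 0.
pi = [0.99, 0.01]   # target policy (assumed for illustration)
b  = [0.5, 0.5]     # behavior policy: soft/uniform, so every action is covered
reward = [1.0, 0.0]

n = 10_000
num = 0.0           # sum of rho * G (shared numerator of both estimators)
den_weighted = 0.0  # sum of rho (denominator of weighted IS)
for _ in range(n):
    a = random.randrange(2)   # act according to b
    rho = pi[a] / b[a]        # importance ratio pi(a|s) / b(a|s)
    num += rho * reward[a]
    den_weighted += rho

ordinary = num / n             # unbiased, but higher variance
weighted = num / den_weighted  # biased for finite n, but lower variance
```

A useful property visible here: the weighted estimate is a convex combination of observed returns, so it always lies within their range $[0, 1]$, whereas a single ordinary-IS sample can be as large as $\rho \cdot G = 1.98$.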
Exercise 2: Unbiasedness of Single Episode MC
Consider one episode following $\pi$: $(S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T)$, where $S_0 = s$. Determine and provide intuition on the biasedness of the following estimator for $v_\pi(s)$:
$$\sum_{i=1}^{T} \gamma^{i-1} R_i$$
Concepts Tested: [[Monte Carlo Methods]]
Solution:
This estimator is unbiased.
By definition: $v_\pi(s) = \mathbb{E}_\pi[G_0 \mid S_0 = s]$.
The sum $\sum_{i=1}^{T} \gamma^{i-1} R_i$ is exactly the return $G_0$ for an episode starting in state $s$ at $t=0$. Since the episode follows policy $\pi$, its expectation is exactly the value function $v_\pi(s)$.
Exercise 3: Every-visit MC Bias
Determine the biasedness of:
$$\frac{1}{|J|} \sum_{j \in J} \sum_{i=1}^{T-j} \gamma^{i-1} R_{j+i}$$
where $J$ contains all indices $j$ such that $S_j = s$.
Concepts Tested: [[Monte Carlo Methods]]
Solution:
This is the every-visit MC estimator. It is biased.
Intuition: For any visit after the first, the corresponding return Gj is a sample from the distribution of returns conditioned on the fact that the state s has already been visited earlier in the trajectory. This conditioning restricts the sample space and induces a bias relative to the true value function vπ(s), which is the unconditional expectation of returns from state s.
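The bias can be observed empirically. As an illustrative sketch (the loop probability 0.5 is an assumption, chosen to match the self-loop environment of the first exercise), the simulation below averages single-episode first-visit and every-visit estimates over many independent episodes. The true value is $v(s_0) = 3(1-0.5)/(1 - 0.5 \cdot 0.8) = 2.5$; the first-visit average approaches it, while the every-visit average settles noticeably above it, because later visits are closer to the terminal reward and thus have larger returns:

```python
import random

random.seed(1)
gamma, p_loop, r_term = 0.8, 0.5, 3.0

def run_episode():
    """One episode of the self-loop MDP: reward 0 per loop, 3 on termination."""
    rewards = []
    while random.random() < p_loop:
        rewards.append(0.0)
    rewards.append(r_term)
    return rewards

def episode_returns(rewards):
    gs, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        gs.append(g)
    return gs[::-1]  # gs[t] = G_t

n = 200_000
fv_sum = ev_sum = 0.0
for _ in range(n):
    gs = episode_returns(run_episode())
    fv_sum += gs[0]              # first-visit estimate: G_0 only
    ev_sum += sum(gs) / len(gs)  # every-visit estimate: mean over all visits
fv, ev = fv_sum / n, ev_sum / n  # fv ≈ 2.5 (unbiased), ev ≈ 2.74 (biased high)
```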
Exercise 4: Latest-visit MC Bias
Determine the biasedness of:
$$\sum_{i=1}^{T-t_s} \gamma^{i-1} R_{t_s+i}$$
where $t_s$ is the latest time step such that $S_{t_s} = s$. How does this compare to the first-visit MC estimator?
Solution:
This estimator is biased for the same reasons as every-visit MC. If ts is not the first visit, it is a conditional sample. It only becomes unbiased in the specific case where s is visited exactly once in the episode (making it identical to a first-visit sample).
3.3 * Exam question: Monte Carlo for control
Exam-Style Question
The following questions refer to the pseudo-code for Off-policy MC Control (Figure 1).
Part of the algorithm is covered by a black square (the inner loop range). What is the missing information?
What is stored in C(St,At)?
Why is the inner loop stopped when $A_t \neq \pi(S_t)$?
Concepts Tested: [[Monte Carlo Control]], [[Importance Sampling]]
ASCII Reproduction of Figure 1: Off-policy MC Control
```text
Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ∈ R (arbitrarily)
    C(s, a) ← 0
    π(s) ← argmax_a Q(s, a)   (with ties broken consistently)

Loop forever (for each episode):
    b ← any soft policy
    Generate an episode using b: S0, A0, R1, ..., S_{T-1}, A_{T-1}, R_T
    G ← 0
    W ← 1
    Loop for each step of episode, t = T-1, T-2, ..., 0:   <-- [BLACK SQUARE]
        G ← γG + R_{t+1}
        C(S_t, A_t) ← C(S_t, A_t) + W
        Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
        π(S_t) ← argmax_a Q(S_t, a)   (with ties broken consistently)
        If A_t ≠ π(S_t) then exit inner loop
        W ← W · 1 / b(A_t | S_t)
```
Solution:
The missing range is t=T−1,T−2,…,0 (working backwards from the end of the episode).
C(St,At) stores the cumulative importance weights of all visits to that state-action pair across all episodes. It acts as the denominator for the weighted average.
The inner loop stops because we are evaluating and improving a greedy policy $\pi$. If an action $A_t$ taken by the behavior policy $b$ is not the action that the greedy target policy would have taken, then $\pi(A_t \mid S_t) = 0$. Since the importance sampling weight involves the ratio $\pi(A_t \mid S_t) / b(A_t \mid S_t)$, all subsequent weights for the rest of the episode (earlier time steps, since the loop runs backwards) would become 0, so no further updates would change $Q$.
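The pseudo-code can be exercised on a minimal example. In this sketch the environment is an assumed one-state, two-action bandit (action 0 pays 1, action 1 pays 0, then the episode ends), chosen only to keep the loop structure visible; the learned greedy policy should select action 0:

```python
import random

random.seed(0)
ACTIONS = [0, 1]
REWARD = {0: 1.0, 1: 0.0}   # one state; the episode ends after one action
gamma = 1.0

Q = {a: 0.0 for a in ACTIONS}
C = {a: 0.0 for a in ACTIONS}                # cumulative importance weights
greedy = max(ACTIONS, key=lambda a: Q[a])    # pi(s), ties broken consistently

for _ in range(1000):
    a = random.choice(ACTIONS)               # b: a soft (uniform) policy
    episode = [(a, REWARD[a])]               # single-step episode
    G, W = 0.0, 1.0
    for a_t, r in reversed(episode):         # t = T-1, ..., 0
        G = gamma * G + r
        C[a_t] += W
        Q[a_t] += (W / C[a_t]) * (G - Q[a_t])
        greedy = max(ACTIONS, key=lambda a: Q[a])
        if a_t != greedy:
            break                            # pi(a_t|s) = 0: later ratios vanish
        W *= 1.0 / 0.5                       # W * 1 / b(a_t | s)
```

After training, `Q[0]` equals the true value 1.0 and the greedy policy picks action 0, matching the weighted-IS update's role as a running weighted average of returns.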
4 Temporal Difference Learning
4.1 Temporal Difference Learning (application)
Exercise 1: TD, SARSA, Q-Learning Trace
Consider an undiscounted MDP with states $A, B$ and terminal state $T$ ($V(T)=0$).
Observed episode: $A \xrightarrow{a=1,\ r=-3} B \xrightarrow{a=1,\ r=4} A \xrightarrow{a=2,\ r=-4} A \xrightarrow{a=1,\ r=-3} T$
Parameters: $\gamma = 1$, $\alpha = 0.1$, all initial values $0$.
Calculate final estimates for:
(a) TD(0)
(b) SARSA
(c) Q-learning
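The three traces can be checked with a short script. This is a sketch for verifying hand calculations (data structures and names are my own); it replays the four observed transitions through each update rule:

```python
# Transitions (s, a, r, s') from the observed episode; T is terminal.
alpha, gamma = 0.1, 1.0
steps = [('A', 1, -3, 'B'), ('B', 1, 4, 'A'),
         ('A', 2, -4, 'A'), ('A', 1, -3, 'T')]

# (a) TD(0): state-value updates V(s) += alpha * (r + gamma*V(s') - V(s))
V = {'A': 0.0, 'B': 0.0, 'T': 0.0}
for s, a, r, s2 in steps:
    V[s] += alpha * (r + gamma * V[s2] - V[s])

# (b) SARSA: bootstraps from the action actually taken in s'
Q_sarsa = {('A', 1): 0.0, ('A', 2): 0.0, ('B', 1): 0.0}
for i, (s, a, r, s2) in enumerate(steps):
    q_next = 0.0 if s2 == 'T' else Q_sarsa[(s2, steps[i + 1][1])]
    Q_sarsa[(s, a)] += alpha * (r + gamma * q_next - Q_sarsa[(s, a)])

# (c) Q-learning: bootstraps from the max over actions in s'
Q_ql = {('A', 1): 0.0, ('A', 2): 0.0, ('B', 1): 0.0}
def max_q(q, s):
    vals = [v for (st, _), v in q.items() if st == s]
    return max(vals) if vals else 0.0
for s, a, r, s2 in steps:
    q_next = 0.0 if s2 == 'T' else max_q(Q_ql, s2)
    Q_ql[(s, a)] += alpha * (r + gamma * q_next - Q_ql[(s, a)])
```

Working through the updates gives $V(A) = -0.93$, $V(B) = 0.37$ for TD(0); $Q(A,1) = -0.57$, $Q(A,2) = -0.43$, $Q(B,1) = 0.4$ for SARSA; and $Q(A,1) = -0.57$, $Q(A,2) = -0.4$, $Q(B,1) = 0.4$ for Q-learning (the two differ only at step 3, where Q-learning maxes over the actions in $A$ instead of using the sampled action).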
Show that the average return $V_M(S) = \frac{1}{M} \sum_{n=1}^{M} G_n(S)$ can be written in the incremental update form:
$$V_M(S) = V_{M-1}(S) + \alpha_M \left[ G_M(S) - V_{M-1}(S) \right]$$
Identify the learning rate $\alpha_M$.
Concepts Tested: [[Monte Carlo Methods]]
Solution:
$$V_M(S) = \frac{1}{M} \sum_{n=1}^{M} G_n(S) = \frac{1}{M} \left[ G_M(S) + \sum_{n=1}^{M-1} G_n(S) \right]$$
Since $V_{M-1}(S) = \frac{1}{M-1} \sum_{n=1}^{M-1} G_n(S)$, we have $\sum_{n=1}^{M-1} G_n(S) = (M-1) V_{M-1}(S)$. Hence
$$V_M(S) = \frac{1}{M} \left[ G_M(S) + (M-1) V_{M-1}(S) \right] = \frac{1}{M} G_M(S) + \left(1 - \frac{1}{M}\right) V_{M-1}(S) = V_{M-1}(S) + \frac{1}{M} \left[ G_M(S) - V_{M-1}(S) \right]$$
The learning rate is $\alpha_M = \frac{1}{M}$.
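The identity is easy to sanity-check numerically. A minimal sketch (the returns are random placeholder values): applying the incremental rule with $\alpha_M = 1/M$ reproduces the batch mean exactly.

```python
import random

random.seed(42)
returns = [random.uniform(-5, 5) for _ in range(100)]  # illustrative G_n values

v = 0.0
for m, g in enumerate(returns, start=1):
    v += (1.0 / m) * (g - v)   # V_M = V_{M-1} + (1/M)(G_M - V_{M-1})

batch_mean = sum(returns) / len(returns)
# v and batch_mean agree (up to floating-point rounding)
```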
Exercise 2: Expected TD Error
Consider the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$.
(a) What is $\mathbb{E}[\delta_t \mid S_t = s]$ if $\delta_t$ uses the true state-value function $v_\pi$?
(b) What is $\mathbb{E}[\delta_t \mid S_t = s, A_t = a]$ if $\delta_t$ uses the true state-value function $v_\pi$?
Concepts Tested: [[TD Error]]
Solution:
(a) Given $S_t = s$:
$$\mathbb{E}[\delta_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] - v_\pi(s)$$
By the Bellman equation, $\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = v_\pi(s)$, so
$$\mathbb{E}[\delta_t \mid S_t = s] = v_\pi(s) - v_\pi(s) = 0$$
(b) Given $S_t = s, A_t = a$:
$$\mathbb{E}[\delta_t \mid S_t = s, A_t = a] = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] - v_\pi(s)$$
The first term is the definition of the action-value function $q_\pi(s, a)$:
$$\mathbb{E}[\delta_t \mid S_t = s, A_t = a] = q_\pi(s, a) - v_\pi(s)$$
This quantity is known as the advantage function $A(s, a)$.
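The identity in (b) can be verified on a concrete example. In this sketch the one-step MDP and the target policy are illustrative assumptions: from state $s$, each action yields a stochastic reward and terminates, so $v(\text{terminal}) = 0$ and the expectations can be computed exactly from the probabilities:

```python
gamma = 0.9
# P(reward | s, a): action -> list of (prob, reward); the episode then ends.
R = {0: [(0.5, 2.0), (0.5, 0.0)], 1: [(1.0, -1.0)]}
pi = {0: 0.3, 1: 0.7}   # target policy at s (assumed for illustration)

q = {a: sum(p * r for p, r in R[a]) for a in R}   # q_pi(s,a) = E[R | s,a]
v_s = sum(pi[a] * q[a] for a in pi)               # v_pi(s) = sum_a pi(a) q(s,a)

for a in R:
    # E[delta | s, a] = E[R + gamma * v(terminal) - v(s) | s, a]
    e_delta = sum(p * (r + gamma * 0.0 - v_s) for p, r in R[a])
    assert abs(e_delta - (q[a] - v_s)) < 1e-12    # equals the advantage
```

Here $q(s,0) = 1.0$, $q(s,1) = -1.0$, and $v_\pi(s) = -0.4$, so the expected TD errors are $1.4$ and $-0.6$: exactly the advantages.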