RL-ES03: Exercise Set Week 3 — Advanced TD & Approximation


Chapter 5: From Tabular Learning to Approximation

5.1 Off-Policy TD

Setup: MDP with states and actions as given in the exercise. Behavior policy $b$: uniform (0.5/0.5). Target policy $\pi$: as specified. Undiscounted ($\gamma = 1$).


Q5.1.1: Calculate $V^b$ and $V^\pi$

Solution

Using value iteration (start from terminal, work backwards):

  • , , (same for both policies)

For $\pi$:


Q5.1.2: One Pass of SARSA

Data: the two transitions below.

Initial Q-table (layout lost in extraction; entries include −1, 0.5, −1, +1):

Key Insight

Only one Q-value changes — all other Q-values already equal their target values.

First transition:

  • Target:
  • Update:

Second transition:

  • Target:
  • Update:

On-policy SARSA moves Q toward $Q^b$

The update pushes $Q(s,a)$ from 0.5 toward 0 (which is $Q^b(s,a)$). Repeated passes would converge to $Q^b$.
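As a concrete illustration of one such update (all numeric values below are assumptions, not the exercise's actual data), a single tabular SARSA step looks like:

```python
# One tabular SARSA update; step size, reward, and Q-values are hypothetical.
alpha, gamma = 0.5, 1.0           # assumed step size; undiscounted task
q_sa = 0.5                        # current estimate Q(s, a)
r = 0.0                           # assumed observed reward
q_next = 0.0                      # Q(s', a') for the action b actually took
q_sa = q_sa + alpha * (r + gamma * q_next - q_sa)
print(q_sa)  # 0.25: moved from 0.5 toward the on-policy value 0
```

Each pass halves the distance to the on-policy target, matching the "repeated passes converge" observation above.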


Q5.1.3: SARSA with Importance Weights

First transition: IS ratio

  • Update:

Second transition: IS ratio

  • Update:

Off-policy IS-SARSA moves Q toward $Q^\pi$

Now the update pushes $Q$ toward $Q^\pi(s,a)$. The importance weights correct the distribution mismatch between $b$ and $\pi$.
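A minimal sketch of one common per-decision form of this correction, with a hypothetical ratio $\rho = \pi(a' \mid s') / b(a' \mid s')$ (all numbers are assumptions):

```python
# Importance-weighted SARSA update; all numbers are hypothetical.
alpha, gamma = 0.5, 1.0
q_sa = 0.5
r, q_next = 0.0, 1.0              # assumed reward and next-state Q-value
rho = 1.0 / 0.5                   # pi(a'|s') / b(a'|s'): deterministic target, uniform b
q_sa = q_sa + alpha * rho * (r + gamma * q_next - q_sa)
print(q_sa)  # 1.0: the ratio up-weights transitions the target policy favors
```

Transitions the target policy would never take get $\rho = 0$ and contribute nothing, which is exactly the sample-wasting behavior discussed in Q5.1.5.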


Q5.1.4: Why is Q-Learning Off-Policy?

Answer

In Q-Learning, the target policy (greedy: $\pi(s) = \arg\max_a Q(s,a)$) differs from the behavior policy (e.g., ε-greedy). The update target uses $\max_{a'} Q(s', a')$ regardless of which action was actually taken — learning about the optimal policy while following an exploratory one.
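A sketch of the corresponding target computation (the `Q_next` entries and all numbers are hypothetical):

```python
# Q-learning bootstraps with a max over ALL next actions, independent of
# the action the behavior policy will actually sample. Numbers hypothetical.
alpha, gamma = 0.5, 1.0
Q_next = {"left": 0.0, "right": 1.0}       # Q(s', .) for both actions
q_sa = 0.5
r = 0.0
target = r + gamma * max(Q_next.values())  # greedy target, not the sampled a'
q_sa = q_sa + alpha * (target - q_sa)
print(q_sa)  # 0.75
```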


Q5.1.5: Q-Learning vs IS-SARSA (Greedy Target)

Both converge to the same $Q^*$, but IS-SARSA wastes samples when $\pi$ and $b$ disagree (the ratio is 0 for off-greedy actions) and has variance issues when the ratio is large. Q-learning is preferred — it implicitly handles the off-policy correction through the max operator.


Q5.1.6: Why Not Off-Policy V-Learning?

Off-policy V-learning (TD(0) with IS) is possible but less useful because:

  • In prediction: we usually want $V^b$ (evaluate the current behavior), so off-policy correction isn’t needed
  • In control: off-policy learning is important, but V-functions require a model for policy improvement ($\pi'(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\,[r + \gamma V(s')]$). Q-functions don’t need the model.

Q5.1.7: Q-Learning for V Functions?

No. Q-learning works by taking a $\max$ over actions in the target. For $V$, we’d need $\max_a \big(r + \gamma V(s')\big)$ — but we only observe the reward and next state for the action actually taken. We don’t have data for all actions from each state. In Q-learning, each $Q(s,a)$ is stored separately, so this isn’t an issue.


5.2 Function Approximation and State Distribution

Q5.2.1-3: Dependence on Parameters

  1. $\mu(s)$ depends on the policy, which depends on the value function approximator’s parameters $w$. Changing $w$ changes the policy, which changes which states are visited, which changes $\mu$.

  2. In supervised learning, the data distribution is fixed and independent of model parameters. In RL, the data distribution changes as the agent learns.

  3. This means the weighting in the objective ($\overline{\mathrm{VE}}(w) = \sum_s \mu(s)\,[v_\pi(s) - \hat v(s, w)]^2$) is itself non-stationary — the states we care most about change as we learn.


Chapter 6: On-Policy TD with Approximation

6.1 On-Policy Distributions and LSTD

Setup: 2-state MDP with linear features. Initial distribution, transitions (from each state, either to the other state or to terminal), and rewards (each occurring on a single transition) as specified in the exercise.


Q6.1.1: On-Policy Distribution

Solution

Solve the expected-visit equations $\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s)\, p(s \mid \bar s)$, where $h$ is the initial distribution:

Solution:

Normalize: $\mu(s) = \eta(s) \big/ \sum_{s'} \eta(s')$
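The same computation on a hypothetical episodic 2-state chain (the start distribution, transition matrix, and all values below are assumptions, not the exercise's data):

```python
import numpy as np

# Expected visits eta solve eta = h + P^T eta; mu is eta normalized.
h = np.array([1.0, 0.0])           # assumed start distribution: always s1
P = np.array([[0.0, 0.5],          # P[i, j] = Pr(s_j | s_i); missing mass terminates
              [0.0, 0.0]])
eta = np.linalg.solve(np.eye(2) - P.T, h)
mu = eta / eta.sum()               # on-policy distribution
print(eta)  # [1.  0.5]
print(mu)   # [0.66666667 0.33333333]
```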


Q6.1.2: Transition Frequencies

  • From : each transition occurs with frequency
  • From : each transition occurs with frequency

Q6.1.3: LSTD Solution

LSTD Computation

Weight each transition by its frequency:

Computing:

Solution:
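The general recipe, shown on hypothetical feature/reward data (not the exercise's numbers); frequency weighting as above corresponds to weighting each term in the sums:

```python
import numpy as np

# LSTD: accumulate A and b over transitions, then solve w = A^{-1} b.
# A terminal "next state" uses a zero feature vector. Data is hypothetical.
gamma = 1.0
transitions = [  # (phi(s), reward, phi(s'))
    (np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0])),
    (np.array([0.0, 1.0]), 0.0, np.array([0.0, 0.0])),
]
A = sum(np.outer(phi, phi - gamma * phi_n) for phi, r, phi_n in transitions)
b = sum(phi * r for phi, r, phi_n in transitions)
w = np.linalg.solve(A, b)
print(w)  # [1. 0.]
```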


6.2 Basis Functions

Q6.2.1: Tabular as Special Case of Linear FA

Answer

Use one-hot feature vectors. For state $s_i$ in an $n$-state MDP: $\phi(s_i) = e_i$ (standard basis vector, 1 at position $i$, 0 elsewhere).

Then: $\hat v(s_i, w) = w^\top \phi(s_i) = w_i$

Each state has its own independent weight — exactly tabular RL.
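A quick check of this equivalence (the state count and weight values are arbitrary):

```python
import numpy as np

# One-hot features reduce linear FA to a lookup table: v_hat(s_i, w) = w_i.
n = 4
phi = np.eye(n)                    # row i is phi(s_i) = e_i
w = np.array([0.3, -1.2, 0.0, 2.5])
v_hat = phi @ w
print(v_hat)  # identical to w: one independent weight per state
```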

Q6.2.2: Linear vs Non-Linear FA Advantages

Linear:

  • Easier gradient ($\nabla_w \hat v(s, w) = \phi(s)$)
  • LSTD: closed-form TD fixed point
  • Strong convergence guarantees

Non-Linear:

  • More expressive (better performance with enough data)
  • Automatic feature learning (no manual design)
  • Flexible architectures (CNNs, Transformers, etc.)

6.3 Semi-Gradient TD and the TD Fixed Point

Setup: 4-state MDP (travel costs). Linear approximation with features as specified in the exercise.


Q6.3.1: Semi-Gradient Update

Given , transition , learning rate :

Solution

  • TD target: $r + \gamma\, \hat v(s', w)$
  • TD error: $\delta = r + \gamma\, \hat v(s', w) - \hat v(s, w)$
  • Update: $w \leftarrow w + \alpha\, \delta\, \nabla_w \hat v(s, w) = w + \alpha\, \delta\, \phi(s)$

Note: The solution in the answer key arrives at a slightly different number via a different interpretation of the features. Check the feature computation carefully against the specific MDP rewards.
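The update as code, on hypothetical features and reward (deliberately not the answer key's numbers):

```python
import numpy as np

# One semi-gradient TD(0) step with linear FA; all numbers hypothetical.
w = np.array([0.0, 0.0])
phi_s = np.array([1.0, 0.0])       # phi(s), assumed
phi_s2 = np.array([0.0, 1.0])      # phi(s'), assumed
r, gamma, alpha = -1.0, 1.0, 0.5
delta = r + gamma * (w @ phi_s2) - (w @ phi_s)   # TD error
w = w + alpha * delta * phi_s      # grad of v_hat(s, w) w.r.t. w is phi(s)
print(w)  # [-0.5  0. ]
```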


Q6.3.2: LSTD vs Semi-Gradient TD Relationship

Key Result

LSTD finds the TD Fixed Point directly. Semi-gradient TD, if it converges, converges to the same TD fixed point. They target the same solution — LSTD computes it in closed form, semi-gradient TD converges to it iteratively.


Q6.3.3: LSTD on Given Trajectories

Trajectories:

Full LSTD Computation

Form the LSTD statistics from all observed transitions as sums of outer products, $A = \sum_t \phi(s_t)\big(\phi(s_t) - \gamma\, \phi(s_{t+1})\big)^\top$ and $b = \sum_t \phi(s_t)\, r_{t+1}$ (with $\phi = 0$ at terminal states), then solve $w = A^{-1} b$.

Per the answer key, this yields the four approximate state values (numeric results lost in extraction).


Q6.3.4: Quality of the Solution

Where It Fails

The “top route”: the features capture the true values well.

The “bottom route”: the features struggle. The two true values cannot both be represented, because they share feature weights — one state’s value is tied to a weight that also affects the other state.

The TD fixed point therefore makes a trade-off, weighted by the on-policy distribution $\mu$.


Q6.3.5: “Never Forgetting” (LSTD)

The TD fixed point is a function of all data ever seen (via the accumulated $A$ matrix and $b$ vector). LSTD uses all past transitions equally — it “never forgets.”

Advantage: More sample-efficient — no data is thrown away. Disadvantage: If the MDP or policy changes (non-stationarity), old data becomes misleading; we then want to gradually forget old experience in order to adapt.
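One standard remedy (an addition here, not part of the exercise) is to decay the accumulated statistics with a factor $\beta < 1$ so old transitions fade; the data and $\beta$ below are hypothetical:

```python
import numpy as np

# LSTD with exponential forgetting: the running A, b statistics decay by
# beta each step, so stale transitions lose influence. Data hypothetical.
beta, gamma = 0.9, 1.0
transitions = [  # (phi(s), reward, phi(s'))
    (np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0])),
    (np.array([0.0, 1.0]), 0.0, np.array([0.0, 0.0])),
]
A, b = np.zeros((2, 2)), np.zeros(2)
for phi, r, phi_n in transitions:
    A = beta * A + np.outer(phi, phi - gamma * phi_n)  # older data fades
    b = beta * b + r * phi
w = np.linalg.solve(A, b)
print(w)  # [1. 0.]
```

With $\beta = 1$ this reduces to plain LSTD; smaller $\beta$ trades sample efficiency for adaptivity.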


Q6.3.6: Neural Network Semi-Gradient Update

# For a transition (s, r, s') with step size alpha (PyTorch-style sketch):
val = NN_w(s)                      # forward pass: v_hat(s, w)
with torch.no_grad():
    val_prime = NN_w(s_prime)      # forward pass for the target; no grad
td_error = r + gamma * val_prime - val.detach()
NN_w.zero_grad()
val.backward()                     # fills p.grad with d v_hat(s, w) / d p
with torch.no_grad():
    for p in NN_w.parameters():    # semi-gradient update on every parameter
        p += alpha * td_error * p.grad

Semi-Gradient: No Gradient Through Target

val_prime is treated as a constant (no .backward() through it). This is what makes it “semi-gradient.” The gradient only flows through the prediction $\hat v(s, w)$, not through the target $r + \gamma\, \hat v(s', w)$.


6.4 Preparatory Question: Off-Policy Approximation

Baird's Counterexample

A notebook exercise on Canvas demonstrates the Deadly Triad — semi-gradient TD with linear FA diverges under off-policy updates. See RL-L07 - Off-Policy RL with Approximation and Off-Policy Divergence for theory.