RL-ES03: Exercise Set Week 3 — Advanced TD & Approximation


Chapter 5: From Tabular Learning to Approximation

5.1 Off-Policy TD

Setup: MDP with states and actions as given in the exercise. Behavior policy $b$: uniform (0.5/0.5). Target policy $\pi$: as specified. Undiscounted ($\gamma = 1$).


Q5.1.1: Calculate $V^b$ and $V^\pi$

Solution

Using value iteration (start from terminal, work backwards):

  • , , (same for both policies)

For $\pi$:


Q5.1.2: One Pass of SARSA

Data: the two transitions below.

Initial Q-table (layout lost in extraction; entries include −1, 0.5, −1, +1):

Key Insight

Only one Q-value changes — all other Q-values already equal their target values.

First transition:

  • Target:
  • Update:

Second transition:

  • Target:
  • Update:

On-policy SARSA moves Q toward $Q^b$

The update pushes $Q(s,a)$ from 0.5 toward 0 (which is $Q^b(s,a)$). Repeated passes would converge to $Q^b$.
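As a concrete illustration of one such update (all numeric values below are assumptions, not the exercise's actual data), a single tabular SARSA step looks like:

```python
# One tabular SARSA update; step size, reward, and Q-values are hypothetical.
alpha, gamma = 0.5, 1.0           # assumed step size; undiscounted task
q_sa = 0.5                        # current estimate Q(s, a)
r = 0.0                           # assumed observed reward
q_next = 0.0                      # Q(s', a') for the action b actually took
q_sa = q_sa + alpha * (r + gamma * q_next - q_sa)
print(q_sa)  # 0.25: moved from 0.5 toward the on-policy value 0
```

Each pass halves the distance to the on-policy target, matching the "repeated passes converge" observation above.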


Q5.1.3: SARSA with Importance Weights

First transition: IS ratio

  • Update:

Second transition: IS ratio

  • Update:

Off-policy IS-SARSA moves Q toward $Q^\pi$

Now the update pushes $Q$ toward $Q^\pi(s,a)$. The importance weights correct the distribution mismatch between $b$ and $\pi$.
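A minimal sketch of one common per-decision form of this correction, with a hypothetical ratio $\rho = \pi(a' \mid s') / b(a' \mid s')$ (all numbers are assumptions):

```python
# Importance-weighted SARSA update; all numbers are hypothetical.
alpha, gamma = 0.5, 1.0
q_sa = 0.5
r, q_next = 0.0, 1.0              # assumed reward and next-state Q-value
rho = 1.0 / 0.5                   # pi(a'|s') / b(a'|s'): deterministic target, uniform b
q_sa = q_sa + alpha * rho * (r + gamma * q_next - q_sa)
print(q_sa)  # 1.0: the ratio up-weights transitions the target policy favors
```

Transitions the target policy would never take get $\rho = 0$ and contribute nothing, which is exactly the sample-wasting behavior discussed in Q5.1.5.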


Q5.1.4: Why is Q-Learning Off-Policy?

Answer

In Q-Learning, the target policy (greedy: $\pi(s) = \arg\max_a Q(s,a)$) differs from the behavior policy (e.g., ε-greedy). The update target uses $\max_{a'} Q(s', a')$ regardless of which action was actually taken — learning about the optimal policy while following an exploratory one.
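A sketch of the corresponding target computation (the `Q_next` entries and all numbers are hypothetical):

```python
# Q-learning bootstraps with a max over ALL next actions, independent of
# the action the behavior policy will actually sample. Numbers hypothetical.
alpha, gamma = 0.5, 1.0
Q_next = {"left": 0.0, "right": 1.0}       # Q(s', .) for both actions
q_sa = 0.5
r = 0.0
target = r + gamma * max(Q_next.values())  # greedy target, not the sampled a'
q_sa = q_sa + alpha * (target - q_sa)
print(q_sa)  # 0.75
```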


Q5.1.5: Q-Learning vs IS-SARSA (Greedy Target)

Both converge to the same $Q^*$, but IS-SARSA wastes samples when $\pi$ and $b$ disagree (the ratio is 0 for off-greedy actions) and has variance issues when the ratio is large. Q-learning is preferred — it implicitly handles the off-policy correction through the max operator.


Q5.1.6: Why Not Off-Policy V-Learning?

Off-policy V-learning (TD(0) with IS) is possible but less useful because:

  • In prediction: we usually want $V^b$ (evaluate the current behavior), so off-policy correction isn’t needed
  • In control: off-policy learning is important, but V-functions require a model for policy improvement ($\pi'(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\,[r + \gamma V(s')]$). Q-functions don’t need the model.

Q5.1.7: Q-Learning for V Functions?

No. Q-learning works by taking a $\max$ over actions in the target. For $V$, we’d need $\max_a \big(r + \gamma V(s')\big)$ — but we only observe the reward and next state for the action actually taken. We don’t have data for all actions from each state. In Q-learning, each $Q(s,a)$ is stored separately, so this isn’t an issue.


5.2 Function Approximation and State Distribution

Q5.2.1-3: Dependence on Parameters

  1. $\mu(s)$ depends on the policy, which depends on the value function approximator’s parameters $w$. Changing $w$ changes the policy, which changes which states are visited, which changes $\mu$.

  2. In supervised learning, the data distribution is fixed and independent of model parameters. In RL, the data distribution changes as the agent learns.

  3. This means the weighting in the objective ($\overline{\mathrm{VE}}(w) = \sum_s \mu(s)\,[v_\pi(s) - \hat v(s, w)]^2$) is itself non-stationary — the states we care most about change as we learn.


Chapter 6: On-Policy TD with Approximation

6.1 On-Policy Distributions and LSTD

Setup: 2-state MDP with linear features. Initial distribution, transitions (from each state, either to the other state or to terminal), and rewards (each occurring on a single transition) as specified in the exercise.


Q6.1.1: On-Policy Distribution

Solution

Solve the expected-visit equations $\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s)\, p(s \mid \bar s)$, where $h$ is the initial distribution:

Solution:

Normalize: $\mu(s) = \eta(s) \big/ \sum_{s'} \eta(s')$
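The same computation on a hypothetical episodic 2-state chain (the start distribution, transition matrix, and all values below are assumptions, not the exercise's data):

```python
import numpy as np

# Expected visits eta solve eta = h + P^T eta; mu is eta normalized.
h = np.array([1.0, 0.0])           # assumed start distribution: always s1
P = np.array([[0.0, 0.5],          # P[i, j] = Pr(s_j | s_i); missing mass terminates
              [0.0, 0.0]])
eta = np.linalg.solve(np.eye(2) - P.T, h)
mu = eta / eta.sum()               # on-policy distribution
print(eta)  # [1.  0.5]
print(mu)   # [0.66666667 0.33333333]
```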


Q6.1.2: Transition Frequencies

  • From : each transition occurs with frequency
  • From : each transition occurs with frequency

Q6.1.3: LSTD Solution

LSTD Computation

Weight each transition by its frequency:

Computing:

Solution:
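The general recipe, shown on hypothetical feature/reward data (not the exercise's numbers); frequency weighting as above corresponds to weighting each term in the sums:

```python
import numpy as np

# LSTD: accumulate A and b over transitions, then solve w = A^{-1} b.
# A terminal "next state" uses a zero feature vector. Data is hypothetical.
gamma = 1.0
transitions = [  # (phi(s), reward, phi(s'))
    (np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0])),
    (np.array([0.0, 1.0]), 0.0, np.array([0.0, 0.0])),
]
A = sum(np.outer(phi, phi - gamma * phi_n) for phi, r, phi_n in transitions)
b = sum(phi * r for phi, r, phi_n in transitions)
w = np.linalg.solve(A, b)
print(w)  # [1. 0.]
```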


6.2 Basis Functions

Q6.2.1: Tabular as Special Case of Linear FA

Answer

Use one-hot feature vectors. For state $s_i$ in an $n$-state MDP: $\phi(s_i) = e_i$ (standard basis vector, 1 at position $i$, 0 elsewhere).

Then: $\hat v(s_i, w) = w^\top \phi(s_i) = w_i$

Each state has its own independent weight — exactly tabular RL.
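A quick check of this equivalence (the state count and weight values are arbitrary):

```python
import numpy as np

# One-hot features reduce linear FA to a lookup table: v_hat(s_i, w) = w_i.
n = 4
phi = np.eye(n)                    # row i is phi(s_i) = e_i
w = np.array([0.3, -1.2, 0.0, 2.5])
v_hat = phi @ w
print(v_hat)  # identical to w: one independent weight per state
```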

Q6.2.2: Linear vs Non-Linear FA Advantages

Linear:

  • Easier gradient ($\nabla_w \hat v(s, w) = \phi(s)$)
  • LSTD: closed-form TD fixed point
  • Strong convergence guarantees

Non-Linear:

  • More expressive (better performance with enough data)
  • Automatic feature learning (no manual design)
  • Flexible architectures (CNNs, Transformers, etc.)

6.3 Semi-Gradient TD and the TD Fixed Point

Setup: 4-state MDP (travel costs). Linear approximation with features as specified in the exercise.


Q6.3.1: Semi-Gradient Update

Given , transition , learning rate :

Solution

  • TD target: $r + \gamma\, \hat v(s', w)$
  • TD error: $\delta = r + \gamma\, \hat v(s', w) - \hat v(s, w)$
  • Update: $w \leftarrow w + \alpha\, \delta\, \nabla_w \hat v(s, w) = w + \alpha\, \delta\, \phi(s)$

Note: The solution in the answer key arrives at a slightly different number via a different interpretation of the features. Check the feature computation carefully against the specific MDP rewards.
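The update as code, on hypothetical features and reward (deliberately not the answer key's numbers):

```python
import numpy as np

# One semi-gradient TD(0) step with linear FA; all numbers hypothetical.
w = np.array([0.0, 0.0])
phi_s = np.array([1.0, 0.0])       # phi(s), assumed
phi_s2 = np.array([0.0, 1.0])      # phi(s'), assumed
r, gamma, alpha = -1.0, 1.0, 0.5
delta = r + gamma * (w @ phi_s2) - (w @ phi_s)   # TD error
w = w + alpha * delta * phi_s      # grad of v_hat(s, w) w.r.t. w is phi(s)
print(w)  # [-0.5  0. ]
```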


Q6.3.2: LSTD vs Semi-Gradient TD Relationship

Key Result

LSTD finds the TD Fixed Point directly. Semi-gradient TD, if it converges, converges to the same TD fixed point. They target the same solution — LSTD computes it in closed form, semi-gradient TD converges to it iteratively.


Q6.3.3: LSTD on Given Trajectories

Trajectories:

Full LSTD Computation

Form the LSTD statistics from all observed transitions as sums of outer products, $A = \sum_t \phi(s_t)\big(\phi(s_t) - \gamma\, \phi(s_{t+1})\big)^\top$ and $b = \sum_t \phi(s_t)\, r_{t+1}$ (with $\phi = 0$ at terminal states), then solve $w = A^{-1} b$.

Per the answer key, this yields the four approximate state values (numeric results lost in extraction).


Q6.3.4: Quality of the Solution

Where It Fails

The “top route”: the features capture the true values well.

The “bottom route”: the features struggle. The two true values cannot both be represented, because they share feature weights — one state’s value is tied to a weight that also affects the other state.

The TD fixed point therefore makes a trade-off, weighted by the on-policy distribution $\mu$.


Q6.3.5: “Never Forgetting” (LSTD)

The TD fixed point is a function of all data ever seen (via the accumulated $A$ matrix and $b$ vector). LSTD uses all past transitions equally — it “never forgets.”

Advantage: More sample-efficient — no data is thrown away. Disadvantage: If the MDP or policy changes (non-stationarity), old data becomes misleading; we then want to gradually forget old experience in order to adapt.
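One standard remedy (an addition here, not part of the exercise) is to decay the accumulated statistics with a factor $\beta < 1$ so old transitions fade; the data and $\beta$ below are hypothetical:

```python
import numpy as np

# LSTD with exponential forgetting: the running A, b statistics decay by
# beta each step, so stale transitions lose influence. Data hypothetical.
beta, gamma = 0.9, 1.0
transitions = [  # (phi(s), reward, phi(s'))
    (np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0])),
    (np.array([0.0, 1.0]), 0.0, np.array([0.0, 0.0])),
]
A, b = np.zeros((2, 2)), np.zeros(2)
for phi, r, phi_n in transitions:
    A = beta * A + np.outer(phi, phi - gamma * phi_n)  # older data fades
    b = beta * b + r * phi
w = np.linalg.solve(A, b)
print(w)  # [1. 0.]
```

With $\beta = 1$ this reduces to plain LSTD; smaller $\beta$ trades sample efficiency for adaptivity.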


Q6.3.6: Neural Network Semi-Gradient Update

# For a transition (s, r, s') with step size alpha (PyTorch-style sketch):
val = NN_w(s)                      # forward pass: v_hat(s, w)
with torch.no_grad():
    val_prime = NN_w(s_prime)      # forward pass for the target; no grad
td_error = r + gamma * val_prime - val.detach()
NN_w.zero_grad()
val.backward()                     # fills p.grad with d v_hat(s, w) / d p
with torch.no_grad():
    for p in NN_w.parameters():    # semi-gradient update on every parameter
        p += alpha * td_error * p.grad

Semi-Gradient: No Gradient Through Target

val_prime is treated as a constant (no .backward() through it). This is what makes it “semi-gradient.” The gradient only flows through the prediction $\hat v(s, w)$, not through the target $r + \gamma\, \hat v(s', w)$.


6.4 Preparatory Question: Off-Policy Approximation

Baird's Counterexample

A notebook exercise on Canvas demonstrates the Deadly Triad — semi-gradient TD with linear FA diverges under off-policy updates. See RL-L07 - Off-Policy RL with Approximation and Off-Policy Divergence for theory.