RL Lecture 5: From Tabular Learning to Approximation
0. The Big Picture: Where we are
So far, we have covered model-free methods that learn directly from experience:
- Monte Carlo (MC): Uses the full episode return $G_t$ as the target.
- Temporal Difference (TD): Uses the one-step bootstrapping target $R_{t+1} + \gamma V(S_{t+1})$.
- Q-learning/SARSA: Control methods for finding optimal policies.
The Taxonomy
- Model-based (DP): Requires a model of the environment, $p(s', r \mid s, a)$. Uses “Full Backup” (lookahead tree).
- Model-free (RL): Only requires data.
- Monte Carlo: Sample episodes (Horizontal chain).
- Temporal Difference: Sample steps (1-step arrow).
Why no Importance Sampling in Q-learning?
In off-policy Monte Carlo, we need importance weights $\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ to correct the return sampled under the behavior policy $b$ so that it estimates values for the target policy $\pi$.
In Q-learning, we do not need importance weights because the target is computed using the target policy (by taking the max over actions), rather than using a sampled return from the behavior policy.
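The contrast can be sketched numerically; the toy Q-values, return, and importance ratio below are invented for illustration:

```python
import numpy as np

# Toy values (all made up) contrasting the off-policy MC correction
# with the Q-learning target.

q = np.array([[1.0, 2.0],    # Q(s0, ·)
              [0.5, 3.0]])   # Q(s1, ·)
gamma, alpha = 0.9, 0.1

# Off-policy MC: the sampled return G came from behavior policy b,
# so it must be reweighted by rho = prod_k pi(A_k|S_k) / b(A_k|S_k).
G, rho = 4.0, 1.5
q_mc = q[0, 1] + alpha * rho * (G - q[0, 1])   # importance-weighted update

# Q-learning: the target bootstraps with max_a Q(S', a) -- the greedy
# (target-policy) action -- so no importance ratio appears.
s, a, r, s_next = 0, 1, 1.0, 1
target = r + gamma * q[s_next].max()        # 1.0 + 0.9 * 3.0 = 3.7
q[s, a] += alpha * (target - q[s, a])       # 2.0 + 0.1 * 1.7 = 2.17
```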
The RL Methodology Space (Slide 15)
The 2D Space of RL
Reinforcement learning methods can be mapped along two axes:
- Width of backup (Horizontal): Does the method back up from sampled transitions (MC/TD) or perform exhaustive backups over all possible next states (DP)?
- Depth of backup (Vertical): Does the method bootstrap after one step (TD/DP) or use full-depth returns (MC)?
VFA (Function Approximation) sits atop this space, allowing us to apply these concepts to larger, continuous domains where neither a full table nor exhaustive backups are possible.
1. Why Function Approximation?
Tabular methods, where each state has a dedicated entry in a lookup table, fail in most real-world applications due to:
- Curse of Dimensionality: The number of states grows exponentially with the state variables (e.g., roughly $10^{20}$ states in Backgammon, $10^{170}$ in Go).
- Generalization: In large/continuous spaces, we almost never see the exact same state twice. We need a way to generalize from limited experience to “similar” unseen states.
Key Shift
Instead of a table, we use a parameterized functional form $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$, where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector with significantly fewer parameters than states ($d \ll |\mathcal{S}|$).
2. Value Function Approximation (VFA) Setup
The approximate value function is represented as a differentiable function $\hat{v}(s, \mathbf{w})$ of the weight vector $\mathbf{w} \in \mathbb{R}^d$.
- Linear Methods: $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s)$
- $\mathbf{x}(s)$ is a feature vector representing state $s$.
- Non-linear Methods: e.g., Neural Networks, where $\mathbf{w}$ represents the connection weights.
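A minimal sketch of a linear value function, assuming a hypothetical polynomial feature map `x(s)`:

```python
import numpy as np

# Minimal sketch of a linear value function v̂(s, w) = wᵀ x(s).
# The feature map x(s) here is an illustrative choice, not prescribed
# by the lecture.

def x(s):
    """Feature vector for state s (bias, position, position squared)."""
    return np.array([1.0, s, s**2])

def v_hat(s, w):
    return w @ x(s)

# For linear methods the gradient is just the feature vector:
# ∇_w v̂(s, w) = x(s), independent of w.
def grad_v_hat(s, w):
    return x(s)

w = np.array([0.5, 0.1, -0.01])
print(v_hat(3.0, w))   # 0.5 + 0.3 - 0.09 = 0.71
```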
Tabular as a Special Case
Tabular learning is a special case of linear function approximation where:
- $\mathbf{x}(s)$ is a one-hot (indicator) vector: $x_i(s) = 1$ if $s = s_i$, else $x_i(s) = 0$.
- Updating one state does not affect others (zero generalization).
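The one-hot special case can be checked directly; the table size, step size, and target below are illustrative:

```python
import numpy as np

# Tabular learning as linear VFA with one-hot features:
# x(s) is an indicator vector, so wᵀx(s) = w[s], and updating one
# state's weight leaves every other state's value untouched.

n_states = 5

def x(s):
    e = np.zeros(n_states)
    e[s] = 1.0
    return e

w = np.zeros(n_states)
alpha, target = 0.5, 2.0   # illustrative step size and target

s = 2
w += alpha * (target - w @ x(s)) * x(s)   # gradient is just x(s)

print(w)   # only entry 2 changed: [0. 0. 1. 0. 0.]
```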
3. The Prediction Objective: Mean Squared Value Error ($\overline{VE}$)
In approximation, we cannot match the true value $v_\pi(s)$ exactly for all states. We must decide which states matter more using a state distribution $\mu(s) \ge 0$, where $\sum_s \mu(s) = 1$.
The Mean Squared Value Error ($\overline{VE}$) is defined as:
$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2$$
- $\mu(s)$ (On-policy distribution): Usually the fraction of time spent in state $s$ under policy $\pi$.
- In continuing tasks: The stationary distribution.
- In episodic tasks: Depends on the start state distribution and transition probability.
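The objective can be evaluated directly on a toy problem (all values below are made up for illustration):

```python
import numpy as np

# Computing VE(w) = Σ_s μ(s) [v_π(s) − v̂(s,w)]² for a toy 3-state
# problem. The distribution and values are invented.

mu    = np.array([0.5, 0.3, 0.2])   # on-policy state distribution, sums to 1
v_pi  = np.array([1.0, 2.0, 3.0])   # true values
v_hat = np.array([1.1, 1.8, 3.0])   # approximate values

ve = np.sum(mu * (v_pi - v_hat) ** 2)
print(ve)   # 0.5*0.01 + 0.3*0.04 + 0.2*0 = 0.017
```

Note how the error in the frequently visited state ($\mu = 0.5$) counts for more than the same error would in a rare state.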
4. Stochastic Gradient Descent (SGD)
To minimize $\overline{VE}$, we adjust the weights in the direction of the negative gradient.
Ideal Gradient Update
If the true value $v_\pi(S_t)$ were known:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$
where $\nabla \hat{v}(S_t, \mathbf{w}_t)$ is the vector of partial derivatives with respect to $\mathbf{w}$: $\left( \frac{\partial \hat{v}}{\partial w_1}, \dots, \frac{\partial \hat{v}}{\partial w_d} \right)^\top$.
General SGD with Targets
Since $v_\pi(S_t)$ is unknown, we substitute a target $U_t$:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$
- Monte Carlo Target: $U_t = G_t$ (unbiased; converges to a local optimum).
- TD Target: $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$ (biased, because it bootstraps).
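One gradient-MC step can be written out concretely; the features, step size, and sampled return below are illustrative:

```python
import numpy as np

# One gradient-MC SGD step: w ← w + α [G_t − v̂(S_t,w)] ∇v̂(S_t,w).
# Linear features assumed, so ∇v̂ = x(s). All numbers are made up.

def x(s):
    return np.array([1.0, float(s)])  # bias + raw state (illustrative)

alpha = 0.1
w = np.zeros(2)

S_t, G_t = 2, 5.0                 # sampled state and full episode return
v = w @ x(S_t)                    # current prediction (0 here)
w += alpha * (G_t - v) * x(S_t)   # unbiased SGD step toward G_t

print(w)   # [0.5 1. ]
```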
5. Semi-Gradient TD(0)
When using bootstrapping targets like $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$ in TD, the target depends on the current weights $\mathbf{w}_t$. A true gradient would differentiate both the prediction AND the target.
Semi-gradient methods ignore the dependence of the target on $\mathbf{w}_t$. They only take the gradient of the prediction $\hat{v}(S_t, \mathbf{w}_t)$.
Why “Semi”?
- True Gradient: would include the term $\gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)$ arising from differentiating the target.
- Semi-Gradient: Treats the target $U_t$ as a constant during the update.
Pseudocode: Semi-gradient TD(0)
```
Algorithm: Semi-gradient TD(0) for estimating v̂ ≈ v_π

Input: a policy π; a differentiable function v̂(s, w) with v̂(terminal, ·) = 0
Parameters: step size α > 0
Initialize: weights w ∈ ℝᵈ arbitrarily (e.g., w = 0)

Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A ~ π(·|S)
        Take action A, observe R, S'
        w ← w + α [R + γ v̂(S', w) − v̂(S, w)] ∇v̂(S, w)
        S ← S'
    until S is terminal
```
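The pseudocode can be sketched as runnable Python on a small 5-state random walk (a scaled-down version of the 1000-state walk discussed below); the episode count, step size, and one-hot features are illustrative choices:

```python
import numpy as np

# Semi-gradient TD(0) on a 5-state random walk: terminal on both ends,
# reward +1 for exiting right, 0 otherwise. Linear v̂ with one-hot
# features, so this reduces to tabular TD(0).

rng = np.random.default_rng(0)
n, gamma, alpha = 5, 1.0, 0.1

def x(s):
    e = np.zeros(n)
    e[s] = 1.0
    return e

w = np.zeros(n)
for _ in range(2000):
    s = n // 2                        # start in the middle
    while True:
        s2 = s + rng.choice([-1, 1])  # uniformly random policy
        r = 1.0 if s2 == n else 0.0
        done = s2 < 0 or s2 == n
        v2 = 0.0 if done else w @ x(s2)   # v̂(terminal, ·) = 0
        # Semi-gradient step: the target is treated as a constant.
        w += alpha * (r + gamma * v2 - w @ x(s)) * x(s)
        if done:
            break
        s = s2

print(np.round(w, 2))  # approaches the true values [1/6, 2/6, 3/6, 4/6, 5/6]
```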
Convergence Properties
- Gradient MC: Guaranteed to converge to a local optimum (the global optimum in the linear case).
- Semi-Gradient TD: Converges to the TD fixed point $\mathbf{w}_{TD}$, which is near the global optimum.
- Error Bound (linear case): $\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$.
- As $\gamma \to 1$, the factor $\frac{1}{1-\gamma}$ grows without bound and the guarantee becomes loose.
6. Visualizing Approximation (1000-state Random Walk)
Based on the lecture slides and Chapter 9 figures:
State Aggregation (Slide/Book Fig 9.1)
This is a form of function approximation where states are grouped.
- States 1-1000: Divided into 10 groups of 100.
- Staircase effect: The learned value function is constant within each group.
- Distribution bias: Because states in the center (near 500) are visited more often ($\mu(s)$ is higher), the approximation is more accurate there.
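The aggregation feature map can be sketched as follows; the weight values are illustrative, not learned:

```python
import numpy as np

# State aggregation on the 1000-state walk: states 1..1000 mapped to
# 10 groups of 100. The feature vector is one-hot over the group index,
# so v̂ is a 10-step "staircase" in the state variable.

n_groups, group_size = 10, 100

def x(s):                 # s in 1..1000
    e = np.zeros(n_groups)
    e[(s - 1) // group_size] = 1.0
    return e

w = np.linspace(-0.9, 0.9, n_groups)   # stand-in for learned weights

# Every state in a group shares the same approximate value:
assert (w @ x(450)) == (w @ x(401)) == w[4]
print(w @ x(500), w @ x(501))          # staircase jump: bin 5 vs bin 6
```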
Learning Targets Comparison (Slide/Book Fig 9.2)
- Monte Carlo: Asymptotic error is lower (can reach global optimum).
- n-step TD: Faster initial learning but potentially higher asymptotic error.
- Linearity: In the linear case, TD(0) convergence is stable on-policy but can diverge off-policy (the “Deadly Triad”).
7. Worked Example Summary
Trajectory: 500 → 501 → 502 → … With initial $\mathbf{w} = \mathbf{0}$, $\gamma = 1$, and linear features over the 10 aggregation bins:
- State 500 (Bin 5): $\hat{v}(500, \mathbf{w}) = w_5 = 0$.
- Observed Reward $R = 0$ (the walk pays out only at the terminals), Next State 501 (Bin 6).
- Target: $R + \gamma \hat{v}(501, \mathbf{w}) = 0 + 0 = 0$.
- Error: $0 - 0 = 0$. Weight $w_5$ stays $0$.
- Later in the episode, when reaching a terminal state with $R = +1$ (or $-1$ on the left): the weights for the bins leading to the end will change based on the propagated rewards (bootstrapping for TD, or the full return for MC).
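The update above can be checked numerically; the step size here is an arbitrary illustrative choice:

```python
import numpy as np

# Reproducing the worked update: w = 0, aggregated features,
# transition 500 → 501 with reward 0, gamma = 1, illustrative alpha.

alpha, gamma = 0.1, 1.0
w = np.zeros(10)

def bin_of(s):                    # states 1..1000 → bins 0..9
    return (s - 1) // 100

s, r, s2 = 500, 0.0, 501
v, v2 = w[bin_of(s)], w[bin_of(s2)]
delta = r + gamma * v2 - v        # target − prediction = 0 + 0 − 0
w[bin_of(s)] += alpha * delta     # the weight for bin 5 stays 0.0

# Only once a nonzero terminal reward enters a target does learning
# start propagating value back through the bins.
print(delta, w[bin_of(500)])      # 0.0 0.0
```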
Summary Table: Tabular vs. Function Approx
| Feature | Tabular | Function Approximation |
|---|---|---|
| Resolution | Single state level | Grouped/Functional level |
| Generalization | None | High (through shared $\mathbf{w}$) |
| State Space | Small/Discrete | Large/Continuous |
| Memory | $\mathcal{O}(\lvert\mathcal{S}\rvert)$ | $\mathcal{O}(d)$ with $d \ll \lvert\mathcal{S}\rvert$ |
| Update Impact | Local to state | Global (affects many states) |