RL Lecture 5: From Tabular Learning to Approximation

0. The Big Picture: Where we are

So far, we have covered model-free methods that learn directly from experience:

  • Monte Carlo (MC): Uses full episode returns $G_t$ as the update target.
  • Temporal Difference (TD): Uses one-step bootstrapping targets $R_{t+1} + \gamma V(S_{t+1})$.
  • Q-learning/SARSA: Control methods for finding optimal policies.

The Taxonomy

  • Model-based (DP): Requires a model of the environment, $p(s', r \mid s, a)$. Uses "Full Backup" (lookahead tree over all successor states).
  • Model-free (RL): Only requires data.
    • Monte Carlo: Sample episodes (Horizontal chain).
    • Temporal Difference: Sample steps (1-step arrow).

Why no Importance Sampling in Q-learning?

In off-policy Monte Carlo, we need importance weights $\rho_t = \prod_k \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ to correct the return sampled under the behavior policy $b$ so that it estimates values under the target policy $\pi$. In Q-learning, we do not need importance weights because the target $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ is computed directly under the target (greedy) policy by taking the max over actions, rather than from a return sampled under the behavior policy.
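
The point above can be seen directly in the update rule. A minimal sketch (the 5-state/2-action table is a hypothetical example): the target maxes over actions in the next state, so it is already evaluated under the greedy target policy, no matter which exploratory action the behavior policy actually took.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Target uses the GREEDY action in s_next, regardless of which
    # (possibly exploratory) action the behavior policy chose there.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((5, 2))                       # 5 states, 2 actions
Q = q_learning_update(Q, s=2, a=1, r=1.0, s_next=3)
print(Q[2, 1])                             # 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```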

The RL Methodology Space (Slide 15)

The 2D Space of RL

Reinforcement learning methods can be mapped along two axes:

  1. Sampling vs. Exhaustive (Horizontal): Does the method use samples (MC/TD) or exhaustive backups over all possible next states (DP)?
  2. Bootstrapping vs. Full Returns (Vertical): Does the method bootstrap after one step (TD) or use full-depth returns (MC)?

Value function approximation (VFA) sits atop this space, allowing us to apply these concepts to larger, continuous domains where neither a full table nor exhaustive backups are possible.


1. Why Function Approximation?

Tabular methods, where each state has a dedicated entry in a lookup table, fail in most real-world applications due to:

  • Curse of Dimensionality: The number of states grows exponentially with the number of state variables (e.g., roughly $10^{20}$ states in Backgammon, $10^{170}$ in Go).
  • Generalization: In large/continuous spaces, we almost never see the exact same state twice. We need a way to generalize from limited experience to “similar” unseen states.

Key Shift

Instead of a table, we use a parameterized functional form $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$, where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector with significantly fewer parameters than states ($d \ll |\mathcal{S}|$).


2. Value Function Approximation (VFA) Setup

The approximate value function $\hat{v}(s, \mathbf{w})$ is represented as a differentiable function of a weight vector $\mathbf{w} \in \mathbb{R}^d$.

  • Linear Methods: $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s)$
    • $\mathbf{x}(s)$ is a feature vector representing state $s$.
  • Non-linear Methods: e.g., neural networks, where $\mathbf{w}$ represents the connection weights.

Tabular as a Special Case

Tabular learning is a special case of linear function approximation where:

  • $\mathbf{x}(s)$ is a one-hot (indicator) vector: $x_i(s) = 1$ if $s = s_i$, else $x_i(s) = 0$.
  • Updating one state does not affect others (zero generalization).
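
A minimal sketch of this equivalence (the 4-state setup is a hypothetical example): with one-hot features, the linear SGD step $\mathbf{w} \mathrel{+}= \alpha \, (U - \mathbf{w}^\top \mathbf{x}(s)) \, \mathbf{x}(s)$ touches exactly one weight, which is precisely the tabular update $V(s) \mathrel{+}= \alpha (U - V(s))$.

```python
import numpy as np

n_states = 4

def one_hot(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)
alpha, target, s = 0.5, 1.0, 2
x = one_hot(s)
w += alpha * (target - w @ x) * x   # linear SGD step
print(w)                            # only w[2] changed, to 0.5
```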

3. The Prediction Objective: Mean Squared Value Error ($\overline{VE}$)

In approximation, we cannot match the true value $v_\pi(s)$ exactly for all states. We must decide which states matter more using a state distribution $\mu(s) \ge 0$, where $\sum_s \mu(s) = 1$.

The Mean Squared Value Error ($\overline{VE}$) is defined as:

$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2$$

  • $\mu$ (On-policy distribution): Usually the fraction of time spent in state $s$ under policy $\pi$.
    • In continuing tasks: The stationary distribution under $\pi$.
    • In episodic tasks: Depends on the start-state distribution and the transition probabilities.
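
The objective is a $\mu$-weighted squared error, which is a one-liner to compute. A minimal sketch with a hypothetical 3-state example ($\mu$, $v_\pi$, and $\hat{v}$ are made-up numbers):

```python
import numpy as np

mu    = np.array([0.5, 0.3, 0.2])   # on-policy state distribution
v_pi  = np.array([1.0, 2.0, 3.0])   # true values
v_hat = np.array([1.0, 1.5, 2.0])   # approximate values

# VE(w) = sum_s mu(s) * (v_pi(s) - v_hat(s, w))^2
ve = np.sum(mu * (v_pi - v_hat) ** 2)
print(ve)   # 0.5*0 + 0.3*0.25 + 0.2*1.0 = 0.275
```

Note how the error in the rarely visited state 3 is discounted by its small $\mu$: states the policy visits often dominate the objective.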

4. Stochastic Gradient Descent (SGD)

To minimize $\overline{VE}$, we adjust weights in the direction of the negative gradient.

Ideal Gradient Update

If the true value $v_\pi(S_t)$ were known:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$

where $\nabla \hat{v}(S_t, \mathbf{w}_t)$ is the gradient vector of partial derivatives with respect to $\mathbf{w}$:

$$\nabla \hat{v}(s, \mathbf{w}) = \left( \frac{\partial \hat{v}(s, \mathbf{w})}{\partial w_1}, \frac{\partial \hat{v}(s, \mathbf{w})}{\partial w_2}, \ldots, \frac{\partial \hat{v}(s, \mathbf{w})}{\partial w_d} \right)^\top$$

General SGD with Targets

Since $v_\pi(S_t)$ is unknown, we substitute a target $U_t$:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$

  • Monte Carlo Target: $U_t = G_t$ (unbiased; converges to a local optimum).
  • TD Target: $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$ (biased, because it bootstraps on the current estimate).
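
For a linear approximator the gradient is simply the feature vector, so one gradient-MC step reduces to a few array operations. A minimal sketch (the feature vector and return are hypothetical values):

```python
import numpy as np

def gradient_mc_step(w, x, G, alpha=0.1):
    # For linear v_hat(s, w) = w.x(s), the gradient is just x(s),
    # so the update is w += alpha * (G - w.x(s)) * x(s).
    v_hat = w @ x
    return w + alpha * (G - v_hat) * x

w = np.zeros(3)
x = np.array([1.0, 0.5, 0.0])      # hypothetical feature vector x(S_t)
w = gradient_mc_step(w, x, G=2.0)  # observed return G_t = 2.0
print(w)                           # [0.2 0.1 0. ]
```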

5. Semi-Gradient TD(0)

When using a bootstrapping target such as $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$ in TD, the target depends on the current weights $\mathbf{w}_t$. A true gradient would have to differentiate both the prediction AND the target.

Semi-gradient methods ignore the dependence of the target on $\mathbf{w}$. They take the gradient only of the prediction $\hat{v}(S_t, \mathbf{w}_t)$.

Why “Semi”?

  • True Gradient: would also include the term $\gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)$ from differentiating the target.
  • Semi-Gradient: Treats $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$ as a constant during the update.

Pseudocode: Semi-gradient TD(0)

Algorithm: Semi-gradient TD(0) for estimating $\hat{v} \approx v_\pi$

Input: policy $\pi$, a differentiable function $\hat{v}: \mathcal{S}^+ \times \mathbb{R}^d \to \mathbb{R}$ with $\hat{v}(\text{terminal}, \cdot) = 0$
Parameters: step size $\alpha > 0$
Initialize: $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)

Loop for each episode:
    Initialize $S$
    Loop for each step of the episode:
        Choose $A \sim \pi(\cdot \mid S)$
        Take action $A$, observe $R, S'$
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R + \gamma \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}) \right] \nabla \hat{v}(S, \mathbf{w})$
        $S \leftarrow S'$
    until $S$ is terminal
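
The pseudocode above can be sketched in Python. This is a minimal, assumed setup (not from the lecture): a 20-state random walk with rewards $-1$/$+1$ at the left/right terminals, $\gamma = 1$, and state aggregation into 5 groups, so the gradient is a one-hot group indicator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_groups = 20, 5
group_size = n_states // n_groups
gamma, alpha = 1.0, 0.1
w = np.zeros(n_groups)

def v_hat(s, w):
    return w[s // group_size]       # aggregated (piecewise-constant) value

def grad(s):
    g = np.zeros(n_groups)          # gradient = one-hot group indicator
    g[s // group_size] = 1.0
    return g

for _ in range(500):                # episodes
    s = n_states // 2               # start in the middle
    while True:
        s_next = s + rng.choice([-1, 1])
        if s_next < 0:
            r, v_next, done = -1.0, 0.0, True
        elif s_next >= n_states:
            r, v_next, done = 1.0, 0.0, True
        else:
            r, v_next, done = 0.0, v_hat(s_next, w), False
        # semi-gradient TD(0): the target is treated as a constant
        w += alpha * (r + gamma * v_next - v_hat(s, w)) * grad(s)
        if done:
            break
        s = s_next

print(np.round(w, 2))   # roughly increasing from negative to positive
```

The learned group values approximate the true linear ramp of the random walk, rising from about $-1$ on the left to about $+1$ on the right.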

Convergence Properties

  • Gradient MC: Guaranteed to converge to a local optimum (the global optimum in the linear case).
  • Semi-Gradient TD: Converges (in the linear, on-policy case) to the TD fixed point $\mathbf{w}_{TD}$, which is near the global optimum.
  • Error Bound: $\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1 - \gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$.
    • As $\gamma \to 1$, the bound becomes loose.

6. Visualizing Approximation (1000-state Random Walk)

Based on the lecture slides and Chapter 9 figures:

State Aggregation (Slide/Book Fig 9.1)

This is a form of function approximation where states are grouped.

  • States 1–1000: Divided into 10 groups of 100 states each.
  • Staircase effect: The learned value function is constant within each group.
  • Distribution bias: Because states near the middle (around state 500) are visited more often ($\mu(s)$ is higher there), the approximation is more accurate there.
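
The grouping amounts to a one-line feature map. A minimal sketch (bins are 0-indexed here, while the worked example below numbers them 1–10):

```python
# State-aggregation map for the 1000-state random walk:
# states 1..1000 grouped into 10 bins of 100 states each.
def group_of(s):            # s in 1..1000
    return (s - 1) // 100   # bin index 0..9

assert group_of(1) == 0 and group_of(100) == 0     # first bin
assert group_of(101) == 1 and group_of(1000) == 9  # bin boundaries
```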

Learning Targets Comparison (Slide/Book Fig 9.2)

  • Monte Carlo: Lower asymptotic error (converges to the global minimum of $\overline{VE}$ in the linear case).
  • n-step TD: Faster initial learning, but a potentially higher asymptotic error (the TD fixed point).
  • Linearity: In the linear case, on-policy TD(0) converges stably, but off-policy TD can diverge (the "Deadly Triad" of approximation, bootstrapping, and off-policy learning).

7. Worked Example Summary

Trajectory: 500 → 501 → 502 → … With initial weights $\mathbf{w} = \mathbf{0}$, step size $\alpha$, and state-aggregation features (10 bins of 100 states):

  1. State 500 (Bin 5): $\hat{v}(500, \mathbf{w}) = w_5 = 0$.
  2. Observed Reward $R = 0$, Next State 501 (Bin 6).
  3. Target: $R + \gamma \hat{v}(501, \mathbf{w}) = 0 + \gamma w_6 = 0$.
  4. Error: $0 - 0 = 0$, so the update is $\alpha \cdot 0 = 0$. The weight $w_5$ stays $0$.
  5. Later in the episode, when a terminal state is reached with reward $+1$ (or $-1$ on the left), the weights for the bins visited near the end change first; the information then propagates backwards over subsequent episodes (one step at a time for TD via bootstrapping, or via the full return $G_t$ for MC).
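
The zero-update arithmetic in steps 1–4 can be checked in a few lines. A minimal sketch ($\alpha = 0.1$ and $\gamma = 1$ are assumptions; bins are 0-indexed here, versus the 1-indexed "Bin 5"/"Bin 6" above):

```python
import numpy as np

w, alpha, gamma = np.zeros(10), 0.1, 1.0

def bin_of(s):              # states 1..1000 -> bins 0..9
    return (s - 1) // 100

s, r, s_next = 500, 0.0, 501
td_error = r + gamma * w[bin_of(s_next)] - w[bin_of(s)]
w[bin_of(s)] += alpha * td_error
print(td_error, w[bin_of(s)])   # both 0.0: no change until a nonzero reward
```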

Summary Table: Tabular vs. Function Approx

Feature         | Tabular                      | Function Approximation
----------------|------------------------------|---------------------------------------------
Resolution      | Single-state level           | Grouped/functional level
Generalization  | None                         | High (through shared $\mathbf{w}$)
State Space     | Small/discrete               | Large/continuous
Memory          | $\mathcal{O}(|\mathcal{S}|)$ | $\mathcal{O}(d)$, with $d \ll |\mathcal{S}|$
Update Impact   | Local to state $s$           | Global (affects many states)