RL Lecture 5: From Tabular Learning to Approximation
0. The Big Picture: Where we are
So far, we have covered model-free methods that learn directly from experience:
- Monte Carlo (MC): Uses the full episode return $G_t$ as the target.
- Temporal Difference (TD): Uses the one-step bootstrapping target $R_{t+1} + \gamma V(S_{t+1})$.
- Q-learning/SARSA: Control methods for finding optimal policies.
The Taxonomy
- Model-based (DP): Requires a model of the environment, $p(s', r \mid s, a)$. Uses “Full Backup” (lookahead tree).
- Model-free (RL): Only requires data.
- Monte Carlo: Sample episodes (Horizontal chain).
- Temporal Difference: Sample steps (1-step arrow).
Why no Importance Sampling in Q-learning?
In off-policy Monte Carlo, we need importance weights $\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ to correct the return sampled under the behavior policy $b$ so that it estimates values for the target policy $\pi$.
In Q-learning, we do not need importance weights because the target is computed using the target policy (by taking the max over actions), rather than using a sampled return from the behavior policy.
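The contrast can be sketched numerically; the toy Q-values, return, and importance ratio below are invented for illustration:

```python
import numpy as np

# Toy values (all made up) contrasting the off-policy MC correction
# with the Q-learning target.

q = np.array([[1.0, 2.0],    # Q(s0, ·)
              [0.5, 3.0]])   # Q(s1, ·)
gamma, alpha = 0.9, 0.1

# Off-policy MC: the sampled return G came from behavior policy b,
# so it must be reweighted by rho = prod_k pi(A_k|S_k) / b(A_k|S_k).
G, rho = 4.0, 1.5
q_mc = q[0, 1] + alpha * rho * (G - q[0, 1])   # importance-weighted update

# Q-learning: the target bootstraps with max_a Q(S', a) -- the greedy
# (target-policy) action -- so no importance ratio appears.
s, a, r, s_next = 0, 1, 1.0, 1
target = r + gamma * q[s_next].max()        # 1.0 + 0.9 * 3.0 = 3.7
q[s, a] += alpha * (target - q[s, a])       # 2.0 + 0.1 * 1.7 = 2.17
```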
The RL Methodology Space (Slide 15)
The 2D Space of RL
Reinforcement learning methods can be mapped along two axes:
- Width of backup (Horizontal): Does the method back up from sampled transitions (MC/TD) or perform exhaustive backups over all possible next states (DP)?
- Depth of backup (Vertical): Does the method bootstrap after one step (TD/DP) or use full-depth returns (MC)?
VFA (Function Approximation) sits atop this space, allowing us to apply these concepts to larger, continuous domains where neither a full table nor exhaustive backups are possible.
1. Why Function Approximation?
Tabular methods, where each state has a dedicated entry in a lookup table, fail in most real-world applications due to:
- Curse of Dimensionality: The number of states grows exponentially with the state variables (e.g., roughly $10^{20}$ states in Backgammon, $10^{170}$ in Go).
- Generalization: In large/continuous spaces, we almost never see the exact same state twice. We need a way to generalize from limited experience to “similar” unseen states.
Key Shift
Instead of a table, we use a parameterized functional form $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$, where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector with significantly fewer parameters than states ($d \ll |\mathcal{S}|$).
2. Value Function Approximation (VFA) Setup
The approximate value function is represented as a differentiable function $\hat{v}(s, \mathbf{w})$ of the weight vector $\mathbf{w} \in \mathbb{R}^d$.
- Linear Methods: $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s)$
- $\mathbf{x}(s)$ is a feature vector representing state $s$.
- Non-linear Methods: e.g., Neural Networks, where $\mathbf{w}$ represents the connection weights.
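A minimal sketch of a linear value function, assuming a hypothetical polynomial feature map `x(s)`:

```python
import numpy as np

# Minimal sketch of a linear value function v̂(s, w) = wᵀ x(s).
# The feature map x(s) here is an illustrative choice, not prescribed
# by the lecture.

def x(s):
    """Feature vector for state s (bias, position, position squared)."""
    return np.array([1.0, s, s**2])

def v_hat(s, w):
    return w @ x(s)

# For linear methods the gradient is just the feature vector:
# ∇_w v̂(s, w) = x(s), independent of w.
def grad_v_hat(s, w):
    return x(s)

w = np.array([0.5, 0.1, -0.01])
print(v_hat(3.0, w))   # 0.5 + 0.3 - 0.09 = 0.71
```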
Tabular as a Special Case
Tabular learning is a special case of linear function approximation where:
- $\mathbf{x}(s)$ is a one-hot (indicator) vector: $x_i(s) = 1$ if $s = s_i$, else $x_i(s) = 0$.
- Updating one state does not affect others (zero generalization).
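The one-hot special case can be checked directly; the table size, step size, and target below are illustrative:

```python
import numpy as np

# Tabular learning as linear VFA with one-hot features:
# x(s) is an indicator vector, so wᵀx(s) = w[s], and updating one
# state's weight leaves every other state's value untouched.

n_states = 5

def x(s):
    e = np.zeros(n_states)
    e[s] = 1.0
    return e

w = np.zeros(n_states)
alpha, target = 0.5, 2.0   # illustrative step size and target

s = 2
w += alpha * (target - w @ x(s)) * x(s)   # gradient is just x(s)

print(w)   # only entry 2 changed: [0. 0. 1. 0. 0.]
```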
3. The Prediction Objective: Mean Squared Value Error ($\overline{VE}$)
In approximation, we cannot match the true value $v_\pi(s)$ exactly for all states. We must decide which states matter more using a state distribution $\mu(s) \ge 0$, where $\sum_s \mu(s) = 1$.
The Mean Squared Value Error ($\overline{VE}$) is defined as:
$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2$$
- $\mu(s)$ (On-policy distribution): Usually the fraction of time spent in state $s$ under policy $\pi$.
- In continuing tasks: The stationary distribution.
- In episodic tasks: Depends on the start state distribution and transition probability.
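The objective can be evaluated directly on a toy problem (all values below are made up for illustration):

```python
import numpy as np

# Computing VE(w) = Σ_s μ(s) [v_π(s) − v̂(s,w)]² for a toy 3-state
# problem. The distribution and values are invented.

mu    = np.array([0.5, 0.3, 0.2])   # on-policy state distribution, sums to 1
v_pi  = np.array([1.0, 2.0, 3.0])   # true values
v_hat = np.array([1.1, 1.8, 3.0])   # approximate values

ve = np.sum(mu * (v_pi - v_hat) ** 2)
print(ve)   # 0.5*0.01 + 0.3*0.04 + 0.2*0 = 0.017
```

Note how the error in the frequently visited state ($\mu = 0.5$) counts for more than the same error would in a rare state.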
4. Stochastic Gradient Descent (SGD)
To minimize $\overline{VE}$, we adjust the weights in the direction of the negative gradient.
Ideal Gradient Update
If the true value $v_\pi(S_t)$ were known:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$
where $\nabla \hat{v}(S_t, \mathbf{w}_t)$ is the vector of partial derivatives with respect to $\mathbf{w}$: $\left( \frac{\partial \hat{v}}{\partial w_1}, \dots, \frac{\partial \hat{v}}{\partial w_d} \right)^\top$.
General SGD with Targets
Since $v_\pi(S_t)$ is unknown, we substitute a target $U_t$:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$
- Monte Carlo Target: $U_t = G_t$ (unbiased; converges to a local optimum).
- TD Target: $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$ (biased, because it bootstraps).
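One gradient-MC step can be written out concretely; the features, step size, and sampled return below are illustrative:

```python
import numpy as np

# One gradient-MC SGD step: w ← w + α [G_t − v̂(S_t,w)] ∇v̂(S_t,w).
# Linear features assumed, so ∇v̂ = x(s). All numbers are made up.

def x(s):
    return np.array([1.0, float(s)])  # bias + raw state (illustrative)

alpha = 0.1
w = np.zeros(2)

S_t, G_t = 2, 5.0                 # sampled state and full episode return
v = w @ x(S_t)                    # current prediction (0 here)
w += alpha * (G_t - v) * x(S_t)   # unbiased SGD step toward G_t

print(w)   # [0.5 1. ]
```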
5. Semi-Gradient TD(0)
When using bootstrapping targets like $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$ in TD, the target depends on the current weights $\mathbf{w}_t$. A true gradient would differentiate both the prediction AND the target.
Semi-gradient methods ignore the dependence of the target on $\mathbf{w}_t$. They only take the gradient of the prediction $\hat{v}(S_t, \mathbf{w}_t)$.
Why “Semi”?
- True Gradient: would include the term $\gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)$ arising from differentiating the target.
- Semi-Gradient: Treats the target $U_t$ as a constant during the update.
Pseudocode: Semi-gradient TD(0)
```
Algorithm: Semi-gradient TD(0) for estimating v̂ ≈ v_π

Input: a policy π; a differentiable function v̂(s, w) with v̂(terminal, ·) = 0
Parameters: step size α > 0
Initialize: weights w ∈ ℝᵈ arbitrarily (e.g., w = 0)

Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A ~ π(·|S)
        Take action A, observe R, S'
        w ← w + α [R + γ v̂(S', w) − v̂(S, w)] ∇v̂(S, w)
        S ← S'
    until S is terminal
```
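The pseudocode can be sketched as runnable Python on a small 5-state random walk (a scaled-down version of the 1000-state walk discussed below); the episode count, step size, and one-hot features are illustrative choices:

```python
import numpy as np

# Semi-gradient TD(0) on a 5-state random walk: terminal on both ends,
# reward +1 for exiting right, 0 otherwise. Linear v̂ with one-hot
# features, so this reduces to tabular TD(0).

rng = np.random.default_rng(0)
n, gamma, alpha = 5, 1.0, 0.1

def x(s):
    e = np.zeros(n)
    e[s] = 1.0
    return e

w = np.zeros(n)
for _ in range(2000):
    s = n // 2                        # start in the middle
    while True:
        s2 = s + rng.choice([-1, 1])  # uniformly random policy
        r = 1.0 if s2 == n else 0.0
        done = s2 < 0 or s2 == n
        v2 = 0.0 if done else w @ x(s2)   # v̂(terminal, ·) = 0
        # Semi-gradient step: the target is treated as a constant.
        w += alpha * (r + gamma * v2 - w @ x(s)) * x(s)
        if done:
            break
        s = s2

print(np.round(w, 2))  # approaches the true values [1/6, 2/6, 3/6, 4/6, 5/6]
```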
Convergence Properties
- Gradient MC: Guaranteed to converge to a local optimum (the global optimum in the linear case).
- Semi-Gradient TD: Converges to the TD fixed point $\mathbf{w}_{TD}$, which is near the global optimum.
- Error Bound (linear case): $\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$.
- As $\gamma \to 1$, the factor $\frac{1}{1-\gamma}$ grows without bound and the guarantee becomes loose.
6. Visualizing Approximation (1000-state Random Walk)
Based on the lecture slides and Chapter 9 figures:
State Aggregation (Slide/Book Fig 9.1)
This is a form of function approximation where states are grouped.
- States 1-1000: Divided into 10 groups of 100.
- Staircase effect: The learned value function is constant within each group.
- Distribution bias: Because states in the center (near 500) are visited more often ($\mu(s)$ is higher), the approximation is more accurate there.
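The aggregation feature map can be sketched as follows; the weight values are illustrative, not learned:

```python
import numpy as np

# State aggregation on the 1000-state walk: states 1..1000 mapped to
# 10 groups of 100. The feature vector is one-hot over the group index,
# so v̂ is a 10-step "staircase" in the state variable.

n_groups, group_size = 10, 100

def x(s):                 # s in 1..1000
    e = np.zeros(n_groups)
    e[(s - 1) // group_size] = 1.0
    return e

w = np.linspace(-0.9, 0.9, n_groups)   # stand-in for learned weights

# Every state in a group shares the same approximate value:
assert (w @ x(450)) == (w @ x(401)) == w[4]
print(w @ x(500), w @ x(501))          # staircase jump: bin 5 vs bin 6
```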
Learning Targets Comparison (Slide/Book Fig 9.2)
- Monte Carlo: Asymptotic error is lower (can reach global optimum).
- n-step TD: Faster initial learning but potentially higher asymptotic error.
- Linearity: In the linear case, TD(0) convergence is stable on-policy but can diverge off-policy (the “Deadly Triad”).
7. Worked Example Summary
Trajectory: 500 → 501 → 502 → … With initial $\mathbf{w} = \mathbf{0}$, $\gamma = 1$, and linear features over the 10 aggregation bins:
- State 500 (Bin 5): $\hat{v}(500, \mathbf{w}) = w_5 = 0$.
- Observed Reward $R = 0$ (the walk pays out only at the terminals), Next State 501 (Bin 6).
- Target: $R + \gamma \hat{v}(501, \mathbf{w}) = 0 + 0 = 0$.
- Error: $0 - 0 = 0$. Weight $w_5$ stays $0$.
- Later in the episode, when reaching a terminal state with $R = +1$ (or $-1$ on the left): the weights for the bins leading to the end will change based on the propagated rewards (bootstrapping for TD, or the full return for MC).
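The update above can be checked numerically; the step size here is an arbitrary illustrative choice:

```python
import numpy as np

# Reproducing the worked update: w = 0, aggregated features,
# transition 500 → 501 with reward 0, gamma = 1, illustrative alpha.

alpha, gamma = 0.1, 1.0
w = np.zeros(10)

def bin_of(s):                    # states 1..1000 → bins 0..9
    return (s - 1) // 100

s, r, s2 = 500, 0.0, 501
v, v2 = w[bin_of(s)], w[bin_of(s2)]
delta = r + gamma * v2 - v        # target − prediction = 0 + 0 − 0
w[bin_of(s)] += alpha * delta     # the weight for bin 5 stays 0.0

# Only once a nonzero terminal reward enters a target does learning
# start propagating value back through the bins.
print(delta, w[bin_of(500)])      # 0.0 0.0
```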
Summary Table: Tabular vs. Function Approx
| Feature | Tabular | Function Approximation |
|---|---|---|
| Resolution | Single state level | Grouped/Functional level |
| Generalization | None | High (through shared $\mathbf{w}$) |
| State Space | Small/Discrete | Large/Continuous |
| Memory | $\mathcal{O}(\lvert\mathcal{S}\rvert)$ | $\mathcal{O}(d)$ with $d \ll \lvert\mathcal{S}\rvert$ |
| Update Impact | Local to state | Global (affects many states) |