RL Lecture 6: On-Policy TD Learning with Approximation
Overview
This lecture explores how to extend Temporal Difference Learning to large or continuous state spaces using Function Approximation. We focus on on-policy prediction, where the goal is to estimate the value function for a fixed policy using parameterized functional forms instead of tables.
1. Value Function Approximation
In large state spaces, we cannot store a value for every state. Instead, we represent the value function with a parameterized form $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$, where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector. Typically $d \ll |\mathcal{S}|$, meaning changing one weight affects many states (generalization).
Mean Squared Value Error (VE)
To evaluate the approximation, we use the weighted mean squared error over the state distribution $\mu$:

$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2$$

where $\mu$ is usually the on-policy distribution (the fraction of time spent in state $s$ under $\pi$).
2. Linear Function Approximation
A common and tractable case is Linear Function Approximation, where the estimate is a linear combination of features:
Linear Value Function
$$\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i \, x_i(s)$$

where $\mathbf{x}(s) \in \mathbb{R}^d$ is a feature vector representing state $s$.
Gradient Descent Updates
For linear methods, the gradient with respect to $\mathbf{w}$ is simply the feature vector:

$$\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$$

The general Stochastic Gradient Descent (SGD) update rule is:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$

where $U_t$ is the update target (e.g., the Monte Carlo return $G_t$). For linear methods, this simplifies to:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \mathbf{x}(S_t)$$
3. Semi-Gradient TD(0)
When the target depends on the current weights (e.g., in Bootstrapping), the update does not follow the true gradient of the error. We call these Semi-Gradient Methods.
Semi-Gradient TD(0) Update

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \mathbf{x}(S_t)$$
The TD Fixed Point
In the linear case, TD(0) converges to the TD Fixed Point $\mathbf{w}_{TD} = \mathbf{A}^{-1} \mathbf{b}$, which satisfies the following system of linear equations:

$$\mathbf{A} \mathbf{w}_{TD} = \mathbf{b}$$

Where:

$$\mathbf{A} = \mathbb{E}\left[ \mathbf{x}_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \right], \qquad \mathbf{b} = \mathbb{E}\left[ R_{t+1} \mathbf{x}_t \right]$$
Convergence Bound
While Monte Carlo Methods converge to the global minimum of the $\overline{VE}$, linear TD(0) converges to a point whose error is bounded relative to the best possible error:

$$\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1 - \gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$$

This expansion factor $\frac{1}{1-\gamma}$ can be large if $\gamma$ is close to 1.
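To make the update concrete, here is a minimal sketch of linear semi-gradient TD(0). The environment (a 5-state random walk with a +1 reward on the right exit) and the one-hot features, which reduce the method to tabular TD(0), are illustrative choices of mine, not part of the lecture:

```python
import numpy as np

# Linear semi-gradient TD(0) on a 5-state random walk (illustrative toy).
# States 1..5 are non-terminal; stepping left of 1 terminates with reward 0,
# stepping right of 5 terminates with reward +1. With one-hot features,
# v_hat(s, w) = w @ x(s) is just the tabular estimate.

rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 1.0, 0.05

def x(s):                      # one-hot feature vector for state s (1-indexed)
    v = np.zeros(n_states)
    v[s - 1] = 1.0
    return v

w = np.zeros(n_states)         # weight vector

for episode in range(5000):
    s = 3                      # start in the middle
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        if s_next == 0:                    # terminated on the left
            r, done = 0.0, True
        elif s_next == n_states + 1:       # terminated on the right
            r, done = 1.0, True
        else:
            r, done = 0.0, False
        target = r if done else r + gamma * (w @ x(s_next))
        w += alpha * (target - w @ x(s)) * x(s)   # semi-gradient TD(0) update
        if done:
            break
        s = s_next

# True values for this walk are [1/6, 2/6, 3/6, 4/6, 5/6]
print(np.round(w, 2))
```

Note that the bootstrapped target is treated as a constant when forming the update; only the prediction $\hat{v}(S_t, \mathbf{w})$ is differentiated.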
Pseudocode: Linear Semi-Gradient TD(0)
Algorithm: Linear Semi-Gradient TD(0) for estimating v̂ ≈ v_π
──────────────────────────────────────────────────────────────
Input: policy π, step-size α > 0
Input: differentiable v̂(s,w) = w^T · x(s)
Initialize: w arbitrarily (e.g., w = 0)
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A ~ π(·|S)
        Take action A, observe R, S'
        If S' is terminal:
            w ← w + α[R - v̂(S,w)] · x(S)
            Go to next episode
        w ← w + α[R + γ·v̂(S',w) - v̂(S,w)] · x(S)
        S ← S'

4. Feature Construction
The performance of linear methods depends entirely on the choice of Feature Construction.
4.1 Polynomials
States are represented as powers and products of state variables.
- Example for 2D state $s = (s_1, s_2)$: features such as $(1,\ s_1,\ s_2,\ s_1 s_2,\ s_1^2,\ s_2^2, \dots)$
- Allows modeling interactions but doesn’t scale well to high dimensions
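As a sketch, these features can be enumerated directly for a low-dimensional state (the function name and interface here are my own):

```python
import numpy as np
from itertools import product

# Degree-2 polynomial features for a k-dimensional state: one feature
# s1**c1 * ... * sk**ck per exponent tuple with entries in {0, ..., degree}.

def poly_features(s, degree=2):
    s = np.asarray(s, dtype=float)
    exponents = product(range(degree + 1), repeat=len(s))
    return np.array([np.prod(s ** np.array(c)) for c in exponents])

# For s = (s1, s2) this yields, in order:
# 1, s2, s2^2, s1, s1*s2, s1*s2^2, s1^2, s1^2*s2, s1^2*s2^2
print(poly_features([2.0, 3.0]))
```

The $(degree+1)^k$ feature count makes the scaling problem explicit: the representation grows exponentially with the state dimension $k$.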
4.2 Fourier Basis
Uses cosine functions of different frequencies (with the state normalized to $[0, 1]^k$):

$$x_i(s) = \cos\left( \pi\, \mathbf{s}^\top \mathbf{c}^i \right)$$

where $\mathbf{c}^i$ is an integer vector specifying frequencies along each dimension.
Step-Size Scaling for Fourier
Konidaris et al. (2011) suggest per-feature step sizes:

$$\alpha_i = \frac{\alpha}{\lVert \mathbf{c}^i \rVert_2}$$

(except when all $c^i_j = 0$, use $\alpha_i = \alpha$).
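Both the basis and the step-size scaling fit in a short sketch; the function name and defaults below are illustrative:

```python
import numpy as np
from itertools import product

# Order-n Fourier basis over states normalized to [0, 1]^k, plus the
# per-feature step-size scaling of Konidaris et al. (2011).

def fourier_basis(order, k):
    # All integer frequency vectors c^i with entries in {0, ..., order}
    C = np.array(list(product(range(order + 1), repeat=k)), dtype=float)

    def features(s):
        return np.cos(np.pi * C @ np.asarray(s, dtype=float))

    norms = np.linalg.norm(C, axis=1)
    alphas = np.ones(len(C))                     # alpha_i = alpha / ||c^i||_2,
    alphas[norms > 0] = 1.0 / norms[norms > 0]   # with alpha_i = alpha at c = 0
    return features, alphas

features, alphas = fourier_basis(order=2, k=2)
x = features([0.5, 0.0])
print(len(x))   # (order + 1)^k = 9 features
```

Multiplying a base step size $\alpha$ element-wise by `alphas` gives the per-feature step sizes from the rule above.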
4.3 Coarse Coding
Binary features representing overlapping “receptive fields” (e.g., circles in 2D). A state activates a feature if it falls inside the corresponding region.
- Large receptive fields → broad generalization, low resolution
- Small receptive fields → narrow generalization, high resolution
- More features → finer discrimination but slower learning
4.4 Radial Basis Functions (RBF)
Continuous version of coarse coding. The value of feature $i$ depends on the distance between the state $s$ and the feature's center $c_i$:

RBF Feature

$$x_i(s) = \exp\left( -\frac{\lVert s - c_i \rVert^2}{2 \sigma_i^2} \right)$$
Provides smooth, differentiable approximation. Continuous-valued features (unlike binary coarse coding).
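A one-dimensional sketch makes the "smooth receptive field" picture concrete; the centers and width $\sigma$ below are arbitrary illustrative choices:

```python
import numpy as np

# Gaussian RBF features over [0, 1] with evenly spaced centers and a
# shared width sigma (both chosen for illustration).

centers = np.linspace(0.0, 1.0, 5)   # 5 receptive-field centers
sigma = 0.25                         # shared width

def rbf_features(s):
    return np.exp(-((s - centers) ** 2) / (2 * sigma ** 2))

# Activation decays smoothly with distance from each center
x = rbf_features(0.5)
print(np.round(x, 3))
```

Unlike binary coarse coding, every feature is active to some degree, which gives a differentiable but denser (more expensive) representation.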
5. Tile Coding
Tile Coding is the most practically important feature construction method for RL.
Tilings and Tiles
The state space is partitioned into a grid called a tiling. Each grid cell is a tile (a binary feature). Multiple overlapping tilings, each offset from the others, are used to achieve both generalization and fine resolution.
How It Works
- Define $n$ tilings over the state space, each a regular grid
- Each tiling is offset from the others by a fraction of the tile width
- For a given state $s$: exactly one tile per tiling is active → $n$ active features total
- $\hat{v}(s, \mathbf{w}) = \sum_{i \in \text{active}(s)} w_i$ (sum of active tile weights)
Key Properties
- Binary features: Updates are just additions to active tile weights
- Fixed cost: Always exactly $n$ active features, regardless of state space size
- Step-size scaling: Use $\alpha = \frac{1}{n}$ (or a fraction of it, e.g. $\frac{1}{10n}$) to account for $n$ tilings contributing to each estimate
- Hashing: Map large tile spaces to smaller arrays using hash function — handles curse of dimensionality
Displacement Vectors
Uniform offsets (equal in all dimensions) create diagonal artifacts. Asymmetric offsets using displacement vectors of odd integers, e.g. $(1, 3)$ in 2D, times the fundamental unit $w/n$ (tile width over number of tilings) produce better, more isotropic generalization.
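A minimal one-dimensional sketch of the mechanism (real implementations, such as Sutton's `tiles3`, also handle multiple dimensions, displacement vectors, and hashing; this toy does not):

```python
# Tile coding over [0, 1): n tilings, each a grid of equal-width tiles,
# each tiling shifted by a fraction of the tile width. Returns the flat
# indices of the n active binary features for a state s.

def active_tiles(s, n_tilings=8, tiles_per_tiling=10):
    width = 1.0 / tiles_per_tiling
    idx = []
    for t in range(n_tilings):
        offset = t * width / n_tilings           # tiling t shifted by t*w/n
        tile = int((s + offset) / width)         # which tile in this tiling
        idx.append(t * (tiles_per_tiling + 1) + tile)  # flat feature index
    return idx

# Nearby states share most of their active tiles (generalization) ...
print(len(set(active_tiles(0.40)) & set(active_tiles(0.41))))  # → 8
# ... while distant states share none (discrimination).
print(len(set(active_tiles(0.40)) & set(active_tiles(0.90))))  # → 0
```

With binary features, evaluating $\hat{v}$ is just summing the $n$ weights at these indices, and the TD update touches only those $n$ entries.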
6. Least-Squares TD (LSTD)
Instead of iterative updates, LSTD estimates the matrix $\mathbf{A}$ and vector $\mathbf{b}$ directly from data and solves $\mathbf{w} = \widehat{\mathbf{A}}^{-1} \widehat{\mathbf{b}}$.
The Algorithm

$$\widehat{\mathbf{A}}_t = \sum_{k=0}^{t-1} \mathbf{x}_k (\mathbf{x}_k - \gamma \mathbf{x}_{k+1})^\top + \varepsilon \mathbf{I}, \qquad \widehat{\mathbf{b}}_t = \sum_{k=0}^{t-1} R_{k+1} \mathbf{x}_k, \qquad \mathbf{w}_t = \widehat{\mathbf{A}}_t^{-1} \widehat{\mathbf{b}}_t$$

($\varepsilon \mathbf{I}$, with small $\varepsilon > 0$, ensures $\widehat{\mathbf{A}}_t$ is always invertible.)
Sherman-Morrison Update
To avoid a full matrix inversion every step, maintain $\widehat{\mathbf{A}}^{-1}$ directly and update it in $O(d^2)$:

$$\widehat{\mathbf{A}}_t^{-1} = \widehat{\mathbf{A}}_{t-1}^{-1} - \frac{\widehat{\mathbf{A}}_{t-1}^{-1} \mathbf{x}_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \widehat{\mathbf{A}}_{t-1}^{-1}}{1 + (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \widehat{\mathbf{A}}_{t-1}^{-1} \mathbf{x}_t}$$
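The whole algorithm fits in a small class; the interface and the single-state demo MDP below are illustrative choices of mine:

```python
import numpy as np

# LSTD maintaining inv_A = A_hat^{-1} via the Sherman-Morrison identity,
# so each transition costs O(d^2) instead of an O(d^3) matrix inversion.

class LSTD:
    def __init__(self, d, epsilon=1.0):
        self.inv_A = np.eye(d) / epsilon   # inverse of the eps*I initialization
        self.b = np.zeros(d)

    def update(self, x, reward, x_next, gamma):
        # Rank-1 update: A_hat grows by x (x - gamma x')^T
        u = x
        v = x - gamma * x_next
        Au = self.inv_A @ u
        vA = v @ self.inv_A
        self.inv_A -= np.outer(Au, vA) / (1.0 + vA @ u)
        self.b += reward * x

    def weights(self):
        return self.inv_A @ self.b         # w = A_hat^{-1} b_hat

# Illustrative demo: a single state looping to itself with reward 1.
est = LSTD(d=1, epsilon=1.0)
x = np.array([1.0])
for _ in range(1000):
    est.update(x, reward=1.0, x_next=x, gamma=0.9)
print(est.weights())   # approaches the true value 1 / (1 - 0.9) = 10
```

The `epsilon` regularizer keeps `inv_A` well defined before enough transitions have been seen; its influence washes out as data accumulates.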
LSTD Trade-offs
| | LSTD | Semi-Gradient TD |
|---|---|---|
| Step-size α? | No (direct solution) | Yes (sensitive to tuning) |
| Data efficiency | Higher (no data wasted) | Lower (iterative) |
| Per-step computation | $O(d^2)$ | $O(d)$ |
| Memory | $O(d^2)$ (stores $\widehat{\mathbf{A}}^{-1}$) | $O(d)$ |
LSTD "Never Forgets"
LSTD uses all past transitions equally — the TD fixed point depends on all data ever seen. This is sample efficient but problematic if the policy or environment changes (non-stationarity).
7. Neural Network Function Approximation
Neural Network Function Approximation allows for nonlinear value functions: $\hat{v}(s, \mathbf{w})$ is no longer linear in the weights $\mathbf{w}$.
Architecture
A feedforward network maps state features through hidden layers:

$$\mathbf{h}^{(l+1)} = f\left( \mathbf{W}^{(l)} \mathbf{h}^{(l)} + \mathbf{b}^{(l)} \right)$$

where $f$ is a non-linear activation function (ReLU, sigmoid, etc.).
Semi-Gradient Update with Neural Nets
Same update rule as the linear case, but the gradient is computed via backpropagation:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$
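The following sketch applies this update with a one-hidden-layer ReLU network, computing the gradient by hand (a framework's autodiff would normally do this). The two-state deterministic MDP, network size, and hyperparameters are all illustrative assumptions:

```python
import numpy as np

# Semi-gradient TD(0) with a one-hidden-layer ReLU network (hand backprop).
# Illustrative MDP: state 0 yields reward 1 and moves to state 1, which
# yields reward 0 and moves back. True values: v0 = 1/(1 - g^2), v1 = g*v0.

rng = np.random.default_rng(0)
n_in, n_hidden = 2, 16
W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, n_hidden)

def v_hat(s):
    """Value estimate and hidden activations for one-hot state s."""
    h = np.maximum(0.0, W1 @ s + b1)      # ReLU hidden layer
    return W2 @ h, h

alpha, gamma = 0.005, 0.9
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
rewards = [1.0, 0.0]

s_idx = 0
for _ in range(20000):
    s, r = states[s_idx], rewards[s_idx]
    next_idx = 1 - s_idx
    v, h = v_hat(s)
    v_next, _ = v_hat(states[next_idx])
    delta = r + gamma * v_next - v        # TD error; target treated as constant
    dh = W2 * (h > 0)                     # backprop through the ReLU
    W2 += alpha * delta * h               # semi-gradient update, layer by layer
    W1 += alpha * delta * np.outer(dh, s)
    b1 += alpha * delta * dh
    s_idx = next_idx

# Should approach v0 = 1/(1 - 0.81) ≈ 5.26 and v1 ≈ 4.74
print(v_hat(states[0])[0], v_hat(states[1])[0])
```

Only the prediction $\hat{v}(S_t, \mathbf{w})$ is differentiated; the bootstrapped target is held fixed, which is exactly what makes this a semi-gradient rather than a true gradient method.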
Challenges of Neural Networks in RL
- Non-stationarity: Targets change as the network learns (bootstrapping moves the goal)
- Correlated data: Sequential RL data violates i.i.d. assumption of SGD
- Catastrophic forgetting: Learning new states can degrade performance on previously learned states
- No convergence guarantees: Unlike linear semi-gradient TD, non-linear methods have no guaranteed convergence
These challenges motivate the stabilization techniques in Deep Q-Network (DQN): Experience Replay and Target Network.
8. Summary: Method Comparison
| Feature Type | Representation | Key Property |
|---|---|---|
| State Aggregation | One-hot over partitions | Simplest; piecewise constant |
| Polynomials | Powers of state variables | Global; poor scaling |
| Fourier Basis | Cosine functions | Good for smooth functions |
| Coarse Coding | Binary overlapping regions | Local generalization |
| Tile Coding | Multiple offset grids | Efficient; tunable; practical |
| RBF | Gaussian bumps | Smooth; computationally expensive |
| Neural Networks | Learned non-linear features | Most expressive; least stable |