Fourier Basis

Definition

Fourier Basis

The Fourier basis is a family of feature functions for Linear Function Approximation that represents the value function as a weighted sum of cosines of different frequencies. For a state (normalized), each feature is , where the integer vector selects the frequency and the interactions among the state dimensions. It comes from the Fourier series: any sufficiently smooth periodic function can be reconstructed as a linear combination of sinusoids, so the learner only needs to fit the weights on a fixed set of cosine features.

Intuition

A value function over a continuous state is just some smooth curve (or surface). A Fourier series says any such function can be built by stacking cosines of increasing frequency: low-frequency terms capture the broad shape, high-frequency terms capture the fine detail. Instead of learning what features to use (as a neural net would), you fix the set of cosine waves and only learn how much of each to add — that is the weight vector .

Why cosines and not sines? On a bounded interval we only care about the function on , not its periodic extension. Using only cosines lets us treat the function as if it were even (mirrored about ), which avoids forcing discontinuities at the boundaries and means we don’t have to model the function’s behaviour outside the region of interest.

The frequency vector is the key design knob: a component means feature ignores state dimension , a large means feature oscillates rapidly along dimension , and multiple nonzero components let a single feature model interactions between dimensions.

Mathematical Formulation

For a -dimensional normalized state with each , the order- Fourier cosine basis has features:

The value estimate is then linear in these features:

where:

  • — the state, with each dimension normalized to
  • — integer frequency vector for feature ; its -th entry sets the oscillation rate along dimension
  • — the order of the basis; allowing each gives features in total
  • — scaling that maps the domain onto one half-period of the cosine
  • — learned weight on feature (the only thing trained)

Since the approximator is linear, the gradient is just the feature vector, , so the standard linear update applies:

where is the TD Error.

Key Properties / Variants

  • Global features: each cosine spans the whole state space, so the Fourier basis generalizes globally — unlike Tile Coding and coarse/RBF coding, which generalize locally. One update changes the estimate everywhere.
  • Order vs. dimensionality: an order- basis over dimensions needs features — it grows exponentially in the number of state dimensions, so it is practical only for low-dimensional states.
  • Per-feature step sizes: high-frequency features oscillate faster, so they benefit from smaller learning rates. A common trick is to scale each feature’s step size by (with for the constant feature) to keep learning stable across frequencies.
  • Outperforms polynomials: empirically the Fourier basis tends to learn faster and more accurately than the polynomial basis () on RL prediction tasks, while being just as simple to set up.
  • Cosine-only: using only cosines (not sines) suits aperiodic functions on a bounded interval by implicitly assuming an even extension, avoiding boundary discontinuities.
  • Coupling vs. decoupling: setting only one nonzero entry in each (a “decoupled” basis) ignores dimension interactions and keeps the feature count linear in ; allowing multiple nonzero entries (the “full” basis) models interactions at exponential cost.

Sketch of constructing the feature vector:

Build order-n Fourier feature vector x(s) for state s in [0,1]^k
──────────────────────────────────────────────────────────────
Precompute frequency set C = { c : c in {0,...,n}^k }   # (n+1)^k vectors
  (decoupled variant: keep only c with at most one nonzero entry)
 
function FEATURES(s):
    for each c_i in C:
        x_i ← cos( π · (s · c_i) )      # dot product s·c_i, then cosine
    return vector x = (x_0, x_1, ..., x_{d-1})
 
# Use x(s) inside any linear semi-gradient method:
#   v̂(s,w) = wᵀ x(s)
#   w ← w + α · δ · x(s)        # δ = TD error

Connections

Appears In