State Aggregation

Definition

State Aggregation

State aggregation is the simplest form of Function Approximation: the state space is partitioned into disjoint groups, and a single estimated value is shared by every state in a group. The estimate for a state is just the component of the weight vector $w$ assigned to its group; updating one state updates the value of every state in that group.

Intuition

One Weight Per Bucket

Imagine 1000 ordered states divided into 10 buckets of 100. Instead of storing 1000 values, you store 10 — one per bucket. All states in a bucket are treated as indistinguishable. The learned value function therefore looks like a staircase: flat within each group, jumping only at group boundaries.

This is exactly the tabular case “coarsened”: with one group per state you recover an exact lookup table; with fewer groups you trade fidelity for compactness and generalization.

Mathematical Formulation

State aggregation is a special case of Linear Function Approximation in which the feature vector $x (s)$ is one-hot: it has a single $1$ in the position of $s$ ‘s group and $0$ elsewhere.

Value Estimate

$\overset{v}{^} (s, w) = w^{⊤} x (s) = w_{g (s)}$

where:

$g (s) \in {1, \dots, d}$ — the group (aggregate) that state $s$ belongs to

$x (s)$ — one-hot feature vector, $x_{i} (s) = 1 [g (s) = i]$

$w \in R^{d}$ — one weight per group ( $d$ = number of groups $≪ ∣ S ∣$ )

Because the feature is one-hot, the gradient is also one-hot and the semi-gradient update touches only the active group’s weight:

Semi-Gradient Update

$w_{g (S_{t})} \leftarrow w_{g (S_{t})} + α [U_{t} - \overset{v}{^} (S_{t}, w)]$

where:

$U_{t}$ — update target (e.g., the Monte Carlo return $G_{t}$ , or the bootstrapped TD target $R_{t + 1} + γ \overset{v}{^} (S_{t + 1}, w)$ )

$α$ — step size

only $w_{g (S_{t})}$ changes; $\nabla_{w} \overset{v}{^} (S_{t}, w) = x (S_{t})$ is one-hot

What It Converges To

Under Monte Carlo / gradient updates, each group’s weight converges to the $μ$ -weighted average of the true values of the states in that group: $w_{i}^{\*} = \frac{\sum _{s : g (s) = i} μ ( s ) v _{π} ( s )}{\sum _{s : g (s) = i} μ ( s )}$

where $μ (s)$ is the on-policy distribution (fraction of time spent in $s$ ). Minimizing the MSVE within a group thus yields a $μ$ -weighted mean.

Key Properties / Variants

Special case of linear FA: one-hot features make every result for Linear Function Approximation apply — in particular semi-gradient TD(0) converges to the TD Fixed Point.
Staircase value function: $\overset{v}{^}$ is piecewise-constant across groups; it cannot represent variation within a group, so resolution is limited by the partition.
Distribution bias: accuracy follows $μ (s)$ . In the 1000-state random-walk example, central states (visited most often) are approximated best, and the staircase is visibly skewed near the edges of each group.
Generalization vs. discrimination trade-off: coarse groups generalize strongly (one update affects many states) but cannot distinguish states that need different values; fine groups discriminate but generalize little and approach the tabular cost.
No overlap (unlike Tile Coding): groups are disjoint, so exactly one feature is active. Tile coding generalizes state aggregation by using multiple overlapping partitions, giving smoother generalization while keeping binary features.

Algorithm: Gradient MC with State Aggregation
──────────────────────────────────────────────
Input: aggregation map g(·) into d groups, step size α
Initialize w ∈ R^d  (e.g., w = 0)
 
Loop for each episode:
  Generate S0, A0, R1, ..., S_T following π
  Loop for t = 0, 1, ..., T-1:
    G ← return from step t  (Σ_{k} γ^k R_{t+1+k})
    i ← g(S_t)                      # active group
    w[i] ← w[i] + α (G - w[i])      # only this weight updates

Resolution Is Hard-Coded

State aggregation cannot represent any structure finer than its partition: two states forced into the same group are permanently assigned the same value, no matter how much data you collect. Choosing a good aggregation is itself feature engineering, and a poor partition caps achievable accuracy regardless of $α$ or sample count.

Connections

Special case of: Linear Function Approximation (one-hot features), hence of Function Approximation
Coarsens: Tabular Methods (one group per state recovers the exact table)
Generalized by: Tile Coding (multiple overlapping partitions)
Trained with: Semi-Gradient Methods, Monte Carlo Methods / Temporal Difference Learning targets
Accuracy governed by: On-Policy Distribution $μ (s)$ and MSVE
Partial-observability analogue: using observations as state ( $S = O$ ) is state aggregation where the aggregation feature is the observation over the latent state — see Partial Observability

Study Notes

Explorer

State Aggregation

State Aggregation

Definition

Intuition

Mathematical Formulation

Key Properties / Variants

Connections

Appears In

Graph View

Table of Contents

Backlinks