Adagrad
Definition
Adagrad (Adaptive Gradient Algorithm)
Adagrad is an adaptive learning-rate optimization method that gives each parameter its own learning rate, scaled inversely to the square root of the sum of all past squared gradients for that parameter. Parameters with frequently large gradients get small effective steps; rarely-updated (sparse) parameters get large steps.
Intuition
Per-Parameter Rates from Accumulated History
Plain Gradient Descent uses a single global step size for every weight. But in problems with sparse or differently-scaled features, some directions need big steps and others tiny ones. Adagrad tracks, for each parameter, how much gradient “energy” it has seen so far (the running sum of squared gradients ). A parameter that has accumulated large gradients is in a steep, frequently-active direction, so its step is shrunk; a parameter rarely touched keeps a large step. This automatic per-coordinate scaling is what made Adagrad effective for sparse data (e.g. text/NLP features).
Mathematical Formulation
For each parameter , Adagrad accumulates the squared gradients and divides the global step size by their square root.
Adagrad Update
where:
- — current gradient of the loss w.r.t. parameter
- — running sum of all squared gradients up to step (i.e. ), one accumulator per parameter
- — global (base) learning rate
- — small constant for numerical stability (avoids division by zero, e.g. )
The factor is the effective per-parameter learning rate. Because is monotonically non-decreasing, this effective rate only ever shrinks over time.
Key Properties / Variants
- Per-parameter adaptation: each weight gets an individual learning rate without manual tuning.
- Great for sparse features: infrequently-updated parameters retain large effective steps, so rare-but-informative features still learn quickly.
- Decaying learning rate (the core weakness): accumulates all history and never shrinks, so eventually drives the effective step toward zero, halting learning prematurely. This is the failure mode RMSProp and Adam were designed to fix by replacing the cumulative sum with an exponentially decaying average of squared gradients.
- Less common in deep RL/DL today: superseded in practice by RMSProp and Adam, though it remains a strong baseline for convex / sparse problems.
Algorithm: Adagrad
──────────────────────────────────────────────
Require: base learning rate α, small ε
Initialize parameters w
Initialize accumulator G ← 0 (same shape as w)
Loop for each step t:
Sample / receive data, compute gradient g ← ∂L/∂w
G ← G + g ⊙ g # elementwise accumulate squared grads
w ← w - α * g / (sqrt(G) + ε) # elementwise per-parameter update
until convergedConnections
- Type of: Gradient Descent (adaptive learning-rate variant), alternative to plain SGD
- Improved by: RMSProp (decaying average instead of full sum), Adam (decaying average + Momentum)
- Contrast: Momentum adapts the direction; Adagrad adapts the per-parameter magnitude
- Used for: training Neural Networks in Deep RL