Adagrad

Definition

Adagrad (Adaptive Gradient Algorithm)

Adagrad is an adaptive learning-rate optimization method that gives each parameter its own learning rate, scaled inversely to the square root of the sum of all past squared gradients for that parameter. Parameters with frequently large gradients get small effective steps; rarely-updated (sparse) parameters get large steps.

Intuition

Per-Parameter Rates from Accumulated History

Plain Gradient Descent uses a single global step size $α$ for every weight. But in problems with sparse or differently-scaled features, some directions need big steps and others tiny ones. Adagrad tracks, for each parameter, how much gradient “energy” it has seen so far (the running sum of squared gradients $G_{t}$ ). A parameter that has accumulated large gradients is in a steep, frequently-active direction, so its step is shrunk; a parameter rarely touched keeps a large step. This automatic per-coordinate scaling is what made Adagrad effective for sparse data (e.g. text/NLP features).

Mathematical Formulation

For each parameter $w$ , Adagrad accumulates the squared gradients and divides the global step size by their square root.

Adagrad Update

$G_{t} = G_{t - 1} + g_{t}^{2} w \leftarrow w - \frac{α}{G _{t} + ϵ} g_{t}$

where:

$g_{t} = \frac{\partial L}{\partial w}$ — current gradient of the loss w.r.t. parameter $w$

$G_{t}$ — running sum of all squared gradients up to step $t$ (i.e. $G_{t} = \sum_{τ = 1}^{t} g_{τ}^{2}$ ), one accumulator per parameter

$α$ — global (base) learning rate

$ϵ$ — small constant for numerical stability (avoids division by zero, e.g. $1 0^{- 8}$ )

The factor $\frac{α}{G _{t} + ϵ}$ is the effective per-parameter learning rate. Because $G_{t}$ is monotonically non-decreasing, this effective rate only ever shrinks over time.

Key Properties / Variants

Per-parameter adaptation: each weight gets an individual learning rate without manual tuning.
Great for sparse features: infrequently-updated parameters retain large effective steps, so rare-but-informative features still learn quickly.
Decaying learning rate (the core weakness): $G_{t}$ accumulates all history and never shrinks, so $\frac{α}{G _{t}}$ eventually drives the effective step toward zero, halting learning prematurely. This is the failure mode RMSProp and Adam were designed to fix by replacing the cumulative sum with an exponentially decaying average of squared gradients.
Less common in deep RL/DL today: superseded in practice by RMSProp and Adam, though it remains a strong baseline for convex / sparse problems.

Algorithm: Adagrad
──────────────────────────────────────────────
Require: base learning rate α, small ε
Initialize parameters w
Initialize accumulator G ← 0   (same shape as w)
 
Loop for each step t:
  Sample / receive data, compute gradient g ← ∂L/∂w
  G ← G + g ⊙ g                 # elementwise accumulate squared grads
  w ← w - α * g / (sqrt(G) + ε) # elementwise per-parameter update
until converged

Connections

Type of: Gradient Descent (adaptive learning-rate variant), alternative to plain SGD
Improved by: RMSProp (decaying average instead of full sum), Adam (decaying average + Momentum)
Contrast: Momentum adapts the direction; Adagrad adapts the per-parameter magnitude
Used for: training Neural Networks in Deep RL

Appears In

Adam
RMSProp

Study Notes

Explorer

Adagrad

Adagrad

Definition

Intuition

Mathematical Formulation

Key Properties / Variants

Connections

Appears In

Graph View

Table of Contents

Backlinks