Momentum

Definition

Momentum

Momentum is a modification of Gradient Descent that accelerates optimization by accumulating an exponentially decaying moving average of past gradients (the velocity) and stepping in that direction, instead of stepping in the raw gradient direction. This smooths out noisy SGD updates and builds up speed along directions of persistent descent.

Intuition

A Ball Rolling Down the Loss Surface

Plain gradient descent is like a memoryless walker: at every step it looks only at the local slope. Momentum is like a heavy ball rolling downhill — it carries inertia from previous steps.

  • In a steep, narrow ravine (ill-conditioned curvature), plain SGD oscillates back and forth across the walls and crawls slowly along the valley floor. Momentum cancels the oscillating components (they alternate sign and average out) while reinforcing the consistent down-valley component (it keeps the same sign and accumulates).
  • Near small local bumps or noisy gradients, the accumulated velocity lets the optimizer coast through, smoothing stochastic noise.

The decay rate $\beta$ controls how much “memory” the ball has: a larger $\beta$ means heavier inertia and more smoothing.
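
The ravine picture can be checked numerically. The sketch below is a toy illustration, not a benchmark: the quadratic surface, the starting point, the step count, and the helper name `distance_after` are all assumptions chosen for the demonstration. It runs the same loop with and without momentum and reports how far each run ends from the minimum.

```python
def distance_after(lr, beta, steps=100):
    """Run (momentum) gradient descent on f(x, y) = 0.5 * (x**2 + 100 * y**2).

    The surface is a narrow ravine: 100x more curvature along y than along x.
    beta = 0.0 reduces to plain gradient descent. The velocity here uses the
    accumulating update v <- beta * v + g.
    Returns the Euclidean distance to the minimum (the origin) after `steps`.
    """
    x, y = 10.0, 1.0      # start far out along the flat valley direction
    vx, vy = 0.0, 0.0     # velocity starts at rest
    for _ in range(steps):
        gx, gy = x, 100.0 * y        # gradient of f at (x, y)
        vx = beta * vx + gx          # same-sign gradients accumulate
        vy = beta * vy + gy          # sign-flipping gradients partly cancel
        x -= lr * vx
        y -= lr * vy
    return (x * x + y * y) ** 0.5


plain = distance_after(lr=0.015, beta=0.0)
heavy = distance_after(lr=0.015, beta=0.9)
print(f"plain GD: {plain:.4f}   momentum: {heavy:.4f}")
```

With the learning rate capped by the steep direction, the plain run is still far out along the valley floor after 100 steps, while the momentum run ends much closer to the origin. Part of that gain comes from the larger effective step that the accumulated velocity produces along the valley.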

Mathematical Formulation

Maintain a velocity vector $v_t$ (the moving average of gradients) and update the parameters with it. Let $g_t = \nabla_w L(w_t)$ be the gradient at step $t$.

Momentum Update (velocity form)

$$
v_t = \beta\, v_{t-1} + (1 - \beta)\, g_t, \qquad w_{t+1} = w_t - \alpha\, v_t
$$

where:

  • $w_t$: parameters (weights) at step $t$
  • $g_t = \nabla_w L(w_t)$: gradient of the loss w.r.t. the weights
  • $v_t$: velocity, the exponential moving average of gradients (the first moment)
  • $\beta \in [0, 1)$: momentum / decay coefficient (typically $0.9$); larger = more inertia
  • $\alpha$: learning rate (step size)
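
To see why the velocity is an exponentially decaying average, the recursion can be unrolled. The following is a short derivation sketch, assuming the moving-average recursion $v_t = \beta v_{t-1} + (1 - \beta) g_t$ and the usual initialization $v_0 = 0$ (the step index $k$ is introduced here only for the sum):

```latex
v_t = (1 - \beta) \sum_{k=1}^{t} \beta^{\,t-k}\, g_k ,
\qquad
\sum_{k=1}^{t} (1 - \beta)\, \beta^{\,t-k} = 1 - \beta^{t}
```

The gradient from $j$ steps back carries weight $(1 - \beta)\beta^{j}$, so older gradients fade geometrically. The weights sum to $1 - \beta^{t}$ rather than $1$, which is why the first few velocities are smaller than a true average would be.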

An equivalent and very common formulation uses an accumulated update rather than a normalized average:

Momentum Update (accumulation form)

$$
v_t = \beta\, v_{t-1} + g_t, \qquad w_{t+1} = w_t - \alpha\, v_t
$$

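The two forms differ only by a constant factor of $1 - \beta$, which is absorbed into the effective learning rate: the velocity form with learning rate $\alpha$ takes the same steps as the accumulation form with learning rate $(1 - \beta)\,\alpha$. In the accumulation form, in the steady state where the gradient is constant ($g_t = g$), the velocity converges to $g / (1 - \beta)$, so momentum effectively scales the step by $1 / (1 - \beta)$ along persistent directions (e.g. $10\times$ for $\beta = 0.9$).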
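
Both claims are easy to verify on a one-dimensional problem. The sketch below is illustrative only: the helper `run`, the `averaged` flag, and the toy gradient `quad_grad` are hypothetical names, and the test function is an arbitrary quadratic.

```python
def run(grad_fn, w0, lr, beta, steps, averaged):
    """Run momentum from w0 and return (final w, final v).

    averaged=True  : v <- beta * v + (1 - beta) * g   (velocity form)
    averaged=False : v <- beta * v + g                (accumulation form)
    """
    w, v = w0, 0.0
    scale = (1.0 - beta) if averaged else 1.0
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v + scale * g
        w = w - lr * v
    return w, v


beta = 0.9
quad_grad = lambda w: w - 3.0        # gradient of 0.5 * (w - 3)**2

# Same trajectory once the learning rate absorbs the (1 - beta) factor.
w_avg, v_avg = run(quad_grad, 0.0, lr=0.1, beta=beta, steps=50, averaged=True)
w_acc, v_acc = run(quad_grad, 0.0, lr=0.1 * (1 - beta), beta=beta, steps=50,
                   averaged=False)
print(abs(w_avg - w_acc) < 1e-9)

# Constant gradient g = 1: the accumulated velocity approaches 1 / (1 - beta).
_, v_const = run(lambda w: 1.0, 0.0, lr=0.01, beta=beta, steps=200,
                 averaged=False)
print(v_const, 1.0 / (1.0 - beta))
```

The practical consequence is that a learning rate tuned for one form does not transfer directly to the other; it has to be rescaled by $1 - \beta$.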

Key Properties / Variants

  • First moment estimate: $v_t$ in the moving-average form is exactly the first-moment (mean-of-gradients) term reused by Adam; Adam pairs it with the second-moment (RMSProp) term for per-parameter adaptive scaling.
  • Bias at start-up: with the moving-average form, the initialization $v_0 = 0$ biases early velocities toward zero. Adam corrects this with the bias correction $\hat{v}_t = v_t / (1 - \beta^t)$; plain momentum usually ignores it.
  • Damps oscillation, accelerates valleys: cancels alternating-sign gradient components, accumulates consistent ones — the main reason it speeds up ill-conditioned problems.
  • Nesterov Accelerated Gradient (NAG): a variant that evaluates the gradient at the look-ahead point $w_t - \alpha \beta\, v_{t-1}$ rather than at $w_t$, giving a correction term and often faster, more stable convergence.
  • Hyperparameter coupling: in the accumulation form the effective step size grows with $\beta$ (as $\alpha / (1 - \beta)$), so $\alpha$ often needs reducing when momentum is added.
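
The Nesterov variant changes only where the gradient is evaluated. The sketch below is one common way to write it, assuming the accumulating velocity update $v \leftarrow \beta v + g$; the function name `nesterov_step` and the toy objective are hypothetical, and libraries often implement an algebraically rearranged version instead.

```python
def nesterov_step(w, v, grad_fn, lr, beta):
    """One Nesterov momentum step (accumulation-form velocity).

    The gradient is taken at the look-ahead point, i.e. where the
    momentum part of the step alone would carry the parameters.
    """
    lookahead = w - lr * beta * v    # position after the momentum-only move
    g = grad_fn(lookahead)           # gradient at the look-ahead point
    v = beta * v + g                 # update the velocity with that gradient
    w = w - lr * v                   # step along the new velocity
    return w, v


# Toy usage: minimize 0.5 * w**2, whose gradient is simply w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, lambda p: p, lr=0.1, beta=0.9)
print(w)
```

Replacing `grad_fn(lookahead)` with `grad_fn(w)` recovers classical momentum, which makes the difference between the two methods a one-line change.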

Algorithm: SGD with Momentum
─────────────────────────────────────────────
Input: learning rate α, momentum coefficient β
Initialize weights w, velocity v ← 0

Repeat for each step t until converged:
  Sample mini-batch; compute gradient  g ← ∇_w L(w)
  v ← β·v + (1 - β)·g          # accumulate velocity (moving avg)
  w ← w - α·v                  # step along velocity
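
The algorithm above translates almost line for line into code. The following is a minimal pure-Python sketch, not a reference implementation: the function names, the fixed step count used in place of a convergence test, and the toy regression problem are all assumptions made for illustration.

```python
import random


def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=1000):
    """SGD with momentum in the moving-average (velocity) form.

    grad_fn(w) must return a stochastic gradient as a list with the
    same length as w. A fixed number of steps stands in for "until converged".
    """
    v = [0.0] * len(w)                                   # velocity starts at 0
    for _ in range(steps):
        g = grad_fn(w)                                   # mini-batch gradient
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]       # step along velocity
    return w


def make_minibatch_grad(xs, ys, batch_size, rng):
    """Mini-batch gradient of mean squared error for the model y = w[0]*x + w[1]."""
    def grad_fn(w):
        g_slope, g_bias = 0.0, 0.0
        for _ in range(batch_size):
            i = rng.randrange(len(xs))                   # sample with replacement
            err = w[0] * xs[i] + w[1] - ys[i]
            g_slope += 2.0 * err * xs[i]
            g_bias += 2.0 * err
        return [g_slope / batch_size, g_bias / batch_size]
    return grad_fn


# Toy usage: recover slope 2 and intercept 1 from noise-free data.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + 1.0 for x in xs]
w = sgd_momentum(make_minibatch_grad(xs, ys, 8, rng), [0.0, 0.0])
print(w)
```

Keeping the optimizer separate from the gradient function mirrors the pseudocode: the loop only needs some way to obtain $g$, so the same `sgd_momentum` works for any model that can supply a gradient list.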

Connections

Appears In