Adam
Definition
Adam (Adaptive Moment Estimation)
Adam is an optimization algorithm for gradient-based optimization of stochastic objective functions. It combines the advantages of Momentum (keeping track of the moving average of gradients) and RMSProp (scaling gradients by a moving average of squared gradients).
The Update Rule
Adam maintains two moving averages (moments) of the gradient $g_t$:
- First Moment ($m_t$): Mean of gradients (Momentum): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
- Second Moment ($v_t$): Uncentered variance of gradients (RMSProp): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
After bias correction ($\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$), the weights are updated:
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
where:
- $\alpha$ — learning rate (step size)
- $\beta_1, \beta_2$ — decay rates for the moment estimates (typically 0.9 and 0.999)
- $\epsilon$ — small constant to prevent division by zero (e.g., $10^{-8}$)
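The update rule above can be sketched as a single NumPy function. This is a minimal illustration, not a specific library's API; the function name `adam_step` and its signature are hypothetical.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta, given gradient grad at step t (t >= 1).

    Illustrative sketch: names and signature are not from any particular library.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment: moving average of gradients (Momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: moving average of squared gradients (RMSProp)
    m_hat = m / (1 - beta1**t)                # bias correction for the first moment
    v_hat = v / (1 - beta2**t)                # bias correction for the second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

For example, running this step repeatedly on the gradient of $f(\theta) = \theta^2$ moves $\theta$ toward the minimum at 0, with each coordinate's effective step size adapted by its own $\hat{v}_t$.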
Key Advantages
- Individual Learning Rates: Each parameter gets its own adaptive learning rate.
- Robustness: Handles noisy gradients and non-stationary objectives well.
- Efficiency: Computationally efficient and requires little memory.
- Default Choice: One of the most widely used optimizers in Deep Learning, and a common default.
Connections
- Combines: Momentum and RMSProp
- Alternative to: SGD, Adagrad
- Used for: Training Neural Networks
Appears In
- Deep Learning Foundations
- RL and IR optimization sections