Adam

Definition

Adam (Adaptive Moment Estimation)

Adam is an optimization algorithm for gradient-based optimization of stochastic objective functions. It combines the advantages of Momentum (keeping track of the moving average of gradients) and RMSProp (scaling gradients by a moving average of squared gradients).

The Update Rule

Adam maintains two moving averages (moments):

  1. First Moment ($m_t$): Mean of gradients (Momentum)
  2. Second Moment ($v_t$): Uncentered variance of gradients (RMSProp)

After bias correction ($\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$), the weights are updated:

Adam Update

$$\theta_{t+1} = \theta_t - \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where:

  • $\alpha$ — learning rate (step size)
  • $\beta_1, \beta_2$ — decay rates for the moment estimates (typically 0.9 and 0.999)
  • $\epsilon$ — small constant to prevent division by zero (e.g., $10^{-8}$)
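The full update rule can be sketched as a single step function; this is a minimal NumPy sketch, not a production implementation, and the names (`adam_step`, `theta`, `g`, `m`, `v`) are illustrative:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g        # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction (m_0 = v_0 = 0)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(x) = x^2 (gradient 2x) starting from x = 5
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    g = 2 * theta
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.1)
```

After a few hundred steps the iterate settles near the minimum at 0.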

Key Advantages

  • Individual Learning Rates: Each parameter gets its own adaptive learning rate.
  • Robustness: Handles noisy gradients and non-stationary objectives well.
  • Efficiency: Computationally efficient and requires little memory.
  • Default Choice: A common default and one of the most widely used optimizers in Deep Learning.
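The per-parameter adaptation can be seen directly from the formulas: at $t = 1$ the update is roughly $\alpha \cdot g / |g|$, so parameters with very different gradient scales take steps of similar size. A small self-contained check (variable names are illustrative):

```python
import math

lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8
steps = {}
for g in (0.001, 10.0):            # tiny vs. large constant gradient
    m = (1 - b1) * g               # first-moment update at t = 1 (m_0 = 0)
    v = (1 - b2) * g ** 2          # second-moment update at t = 1 (v_0 = 0)
    m_hat = m / (1 - b1 ** 1)      # bias correction cancels the (1 - beta) factors
    v_hat = v / (1 - b2 ** 1)
    steps[g] = lr * m_hat / (math.sqrt(v_hat) + eps)
```

Both effective step sizes come out close to the learning rate $\alpha$, regardless of the raw gradient magnitude.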

Connections

Appears In

  • Deep Learning Foundations
  • RL and IR optimization sections