Adam
Definition
Adam (Adaptive Moment Estimation)
Adam is an optimization algorithm for gradient-based optimization of stochastic objective functions. It combines the advantages of Momentum (keeping track of the moving average of gradients) and RMSProp (scaling gradients by a moving average of squared gradients).
The Update Rule
Adam maintains two moving averages (moments) of the gradient $g_t$:
- First Moment ($m_t$): Mean of gradients (Momentum): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
- Second Moment ($v_t$): Uncentered variance of gradients (RMSProp): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
After bias correction ($\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$), the weights are updated:
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
where:
- $\alpha$ — learning rate (step size)
- $\beta_1, \beta_2$ — decay rates for the moment estimates (typically 0.9 and 0.999)
- $\epsilon$ — small constant to prevent division by zero (e.g., $10^{-8}$)
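The update rule above can be sketched as a single NumPy function. This is a minimal illustration, not a specific library's API; the function name `adam_step` and its signature are hypothetical.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta, given gradient grad at step t (t >= 1).

    Illustrative sketch: names and signature are not from any particular library.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment: moving average of gradients (Momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: moving average of squared gradients (RMSProp)
    m_hat = m / (1 - beta1**t)                # bias correction for the first moment
    v_hat = v / (1 - beta2**t)                # bias correction for the second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

For example, running this step repeatedly on the gradient of $f(\theta) = \theta^2$ moves $\theta$ toward the minimum at 0, with each coordinate's effective step size adapted by its own $\hat{v}_t$.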
Key Advantages
- Individual Learning Rates: Each parameter gets its own adaptive learning rate.
- Robustness: Handles noisy gradients and non-stationary objectives well.
- Efficiency: Computationally efficient and requires little memory.
- Default Choice: One of the most widely used optimizers in Deep Learning, and a common default.
Connections
- Combines: Momentum and RMSProp
- Alternative to: SGD, Adagrad
- Used for: Training Neural Networks
Appears In
- Deep Learning Foundations
- RL and IR optimization sections