Gradient Descent

Gradient Descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The algorithm takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

Update Rule

θ_{t+1} = θ_t − η ∇L(θ_t)

where:

  • θ_t — parameters at iteration t,
  • η — learning rate (step size),
  • ∇L(θ_t) — gradient of the loss function with respect to θ at θ_t.
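The update rule can be sketched in a few lines of Python. This is a minimal illustration, not a production optimizer; the function names and the example loss L(θ) = (θ − 3)² are chosen here for demonstration.

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, steps=100):
    """Repeatedly apply theta <- theta - eta * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)  # step against the gradient
    return theta

# Example loss: L(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
# The minimum is at theta = 3.
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

Each iteration multiplies the distance to the minimum by (1 − 2η), so with η = 0.1 the iterate converges geometrically toward 3.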

Moving Downhill

Imagine standing on a hilly landscape in a fog. To find the bottom, you check the slope under your feet and take a step in the steepest downward direction. The learning rate determines the size of your steps—too large and you might overstep the valley; too small and it will take forever to reach the bottom.

Key Considerations

  • Learning Rate (η): Crucial hyperparameter. Small steps ensure convergence but are slow. Large steps can cause oscillation or divergence.
  • Local Minima: On non-convex functions, GD can get stuck in a local minimum rather than reaching the global minimum.
  • Saddle Points: Points where the gradient is zero but that are not local extrema. These can significantly slow down GD.
  • Vanishing/Exploding Gradients: Issues in deep networks where gradients become extremely small or large, preventing effective updates.
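The learning-rate trade-off is easy to see numerically. The sketch below (assumed example, using L(θ) = θ² with gradient 2θ) runs the same number of steps at three step sizes: too small is slow, moderate converges, and too large diverges because each step overshoots and grows the error.

```python
def run(eta, steps=50, theta=1.0):
    # Minimize L(theta) = theta^2; its gradient is 2*theta.
    for _ in range(steps):
        theta -= eta * 2 * theta
    return theta

small    = run(0.01)  # slow: |theta| shrinks by factor 0.98 per step
moderate = run(0.1)   # converges quickly: factor 0.8 per step
large    = run(1.1)   # diverges: factor |1 - 2.2| = 1.2 > 1 per step
```

For this quadratic, each update scales θ by (1 − 2η), so any η above 1 makes the iterates grow without bound instead of converging.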
