Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is a variation of Gradient Descent that replaces the actual gradient (calculated from the entire dataset) with an estimate thereof (calculated from a randomly selected subset or a single sample). This reduces the computational burden, allowing for faster iterations and online learning.

SGD Update Rule

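For a single sample $(x_i, y_i)$: $w \leftarrow w - \alpha \nabla L(w; x_i, y_i)$

For a mini-batch $B$: $w \leftarrow w - \alpha \frac{1}{|B|} \sum_{i \in B} \nabla L(w; x_i, y_i)$

Here $w$ is the parameter vector, $\alpha$ is the learning rate, and $\nabla L$ is the gradient of the loss with respect to $w$.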
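As a minimal sketch of both update rules, the NumPy code below fits a linear model with squared loss. The loss choice, the synthetic data, and the helper names (`grad_squared_loss`, `sgd_step`, `run_sgd`) are assumptions made for illustration and are not part of the note. Setting `batch_size=1` gives the single-sample rule; larger values give the mini-batch rule.

```python
import numpy as np

def grad_squared_loss(w, X, y):
    """Mean gradient of 0.5 * (x.w - y)^2 over the rows of X."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def sgd_step(w, X_batch, y_batch, alpha):
    """One update: w <- w - alpha * (1/|B|) * sum of per-sample gradients."""
    return w - alpha * grad_squared_loss(w, X_batch, y_batch)

def run_sgd(X, y, alpha=0.05, batch_size=1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        # Assumption: reshuffle every epoch and walk through the data in
        # batches (sampling without replacement within an epoch).
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = sgd_step(w, X[idx], y[idx], alpha)
    return w

# Hypothetical noise-free synthetic data, so both runs can recover true_w.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w_single = run_sgd(X, y, batch_size=1)   # single-sample rule
w_batch = run_sgd(X, y, batch_size=16)   # mini-batch rule
print(w_single, w_batch)
```

Both runs use the same `sgd_step`; only the number of rows passed in changes, which mirrors how the two formulas differ only in the averaging over $B$.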

Efficiency through Estimation

Batch gradient descent requires a full pass over the dataset for every single update, which is prohibitively slow for large datasets. SGD relies on the fact that the gradient computed from a small, uniformly sampled batch is an unbiased estimate of the full gradient: any single estimate is noisy, but on average it points in the same direction, which is usually good enough to make progress toward the minimum. The noise introduced by sampling can also help the optimizer escape shallow local minima.
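The claim that a sampled gradient points in roughly the right direction can be checked numerically. The sketch below assumes a linear model with squared loss and synthetic Gaussian data (all names and sizes are illustrative choices, not from the note). It compares many mini-batch gradients against the full gradient at a fixed point $w$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # point at which gradients are evaluated

def gradient(w, X, y):
    """Mean squared-loss gradient over the given rows."""
    return X.T @ (X @ w - y) / len(y)

def minibatch_grad(batch_size):
    """Gradient estimate from a uniformly sampled mini-batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    return gradient(w, X[idx], y[idx])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

full_grad = gradient(w, X, y)  # needs all n samples
estimates = np.array([minibatch_grad(32) for _ in range(2000)])

# Averaging many estimates should approach the full gradient (unbiasedness),
# while each individual estimate is only approximately aligned with it.
mean_estimate = estimates.mean(axis=0)
rel_error = np.linalg.norm(mean_estimate - full_grad) / np.linalg.norm(full_grad)
mean_cosine = float(np.mean([cosine(g, full_grad) for g in estimates]))
print(rel_error, mean_cosine)
```

Each estimate here touches 32 rows instead of 10,000, which is where the per-update savings come from.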

Properties

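Speed: Each update costs time proportional to the batch size rather than the dataset size, so iterations are much cheaper than in batch gradient descent.
Noise: The path to the minimum is noisy and zig-zags. With a constant learning rate the iterates typically keep fluctuating around a minimum; with a suitably decreasing learning rate they converge (to a minimum for convex losses, or to a stationary point in general).
Online Learning: Naturally supports learning from a continuous stream of data, since each sample can be used for an update and then discarded without storing the whole dataset.
Regularization Effect: The inherent noise in SGD can act as a form of implicit regularization, which often leads to better generalization.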
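The online-learning and decreasing-learning-rate properties can be sketched together. The code below assumes a hypothetical data stream (`sample_stream`), a linear model with squared loss, and a schedule $\alpha_t = \alpha_0 / (1 + t/\tau)$. The schedule is one common choice and is not prescribed by the note. Each sample is consumed once and never stored.

```python
import numpy as np

def sample_stream(true_w, n_samples, noise=0.1, seed=0):
    """Hypothetical data source: yields one (x, y) pair at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        x = rng.normal(size=true_w.shape[0])
        yield x, float(x @ true_w) + noise * rng.normal()

def online_sgd(stream, dim, alpha0=0.1, decay=100.0):
    """Single-sample SGD over a stream with a decaying learning rate."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream):
        alpha = alpha0 / (1.0 + t / decay)  # assumed schedule, shrinks like 1/t
        grad = (x @ w - y) * x              # gradient of 0.5 * (x.w - y)^2
        w = w - alpha * grad
    return w

true_w = np.array([1.5, -0.5])
w_hat = online_sgd(sample_stream(true_w, 20_000), dim=2)
print(w_hat)
```

Because the step size shrinks over time, the late updates average out the sampling noise instead of bouncing around the minimum, while memory use stays constant regardless of how long the stream runs.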

Connections

Variant of: Gradient Descent
Optimization backbone for: Neural Networks, Deep Reinforcement Learning
RL context: Used to update function approximators from states sampled under the On-policy distribution, which keeps learning tractable in large state spaces.
