Introduction
Stochastic Gradient Descent (SGD) is an iterative optimization algorithm widely used in machine learning for minimizing objective functions, especially loss functions in supervised learning. It approximates gradients using randomly sampled subsets of data, enabling scalable training on large datasets. The method balances computational efficiency and convergence speed by trading exact gradient computation for noisy but fast updates.
"The power of stochastic gradient descent lies in its simplicity and scalability, powering modern machine learning from linear models to deep neural networks." -- Léon Bottou
Background and Motivation
Gradient Descent Basics
Gradient descent iteratively updates parameters to minimize a differentiable function. Full-batch gradient descent computes gradients over entire datasets, ensuring exact descent directions but incurring high computational cost.
Challenges with Large Datasets
Full-batch methods become impractical with large-scale data due to memory and time constraints. Incremental updates using subsets can alleviate these bottlenecks.
Stochastic Approximation
SGD replaces the full gradient with the gradient of a single randomly selected sample or a small batch. This stochastic approximation reduces per-iteration cost and introduces noise that can help the iterates escape shallow local minima.
Algorithm Description
Basic Algorithm Steps
Initialize parameters θ₀. For iteration t = 1 to T:
- Randomly select sample (xᵢ, yᵢ)
- Compute gradient gₜ = ∇θ L(θₜ₋₁; xᵢ, yᵢ)
- Update θₜ = θₜ₋₁ - ηₜ gₜ
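A minimal sketch of these steps in Python for a linear model with squared loss; the data, learning rate, and iteration count are illustrative assumptions, not part of the formulation:

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, n_iters=1000, seed=0):
    """Single-sample SGD for least squares: L(theta; x_i, y_i) = 0.5 * (x_i @ theta - y_i)**2."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])             # initialize theta_0
    for t in range(n_iters):
        i = rng.integers(len(X))             # randomly select sample (x_i, y_i)
        grad = (X[i] @ theta - y[i]) * X[i]  # gradient of the per-sample loss
        theta = theta - eta * grad           # theta_t = theta_{t-1} - eta * g_t
    return theta

# Illustrative usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)
print(sgd_linear_regression(X, y))
```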
Mathematical Formulation
θₜ₊₁ = θₜ - ηₜ ∇θ L(θₜ; xᵢ, yᵢ)
Loss Function Dependence
Loss L measures prediction error. SGD minimizes expected loss by iterative noisy gradient steps.
Convergence Properties
Conditions for Convergence
Classical convergence guarantees require diminishing learning rates ηₜ → 0 satisfying ∑ ηₜ = ∞ and ∑ ηₜ² < ∞ (the Robbins–Monro conditions), together with a convex loss and unbiased gradient estimates.
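For example, the schedule ηₜ = η₀ / (1 + t) satisfies both conditions: the step sizes sum to infinity while their squares sum to a finite value. A small sketch, with η₀ chosen arbitrarily for illustration:

```python
def eta(t, eta0=0.1):
    """Diminishing step size eta0 / (1 + t): the sum of eta_t diverges, the sum of eta_t**2 converges."""
    return eta0 / (1 + t)

# eta(0) = 0.1, eta(9) = 0.01, eta(99) = 0.001
```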
Convergence Rate
SGD achieves a convergence rate of O(1/√T) on convex problems, slower per iteration count than full gradient descent but far more scalable in dataset size.
Non-Convex Settings
In non-convex optimization (e.g., deep learning), SGD may converge to local minima or saddle points; gradient noise helps the iterates explore and escape such regions.
Learning Rate Strategies
Constant Learning Rate
Fixed η throughout training. Simple but may cause oscillations or slow convergence.
Decay Schedules
Gradually reduce η over iterations: step decay, exponential decay, or inverse scaling.
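Minimal sketches of the three schedules; the initial rate, decay factors, and step interval are illustrative assumptions:

```python
import math

def step_decay(t, eta0=0.1, drop=0.5, every=10):
    """Multiply the rate by `drop` every `every` iterations."""
    return eta0 * drop ** (t // every)

def exponential_decay(t, eta0=0.1, k=0.01):
    """Smooth exponential decay eta0 * exp(-k * t)."""
    return eta0 * math.exp(-k * t)

def inverse_scaling(t, eta0=0.1, power=0.5):
    """Inverse scaling eta0 / (1 + t)^power."""
    return eta0 / (1 + t) ** power
```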
Adaptive Learning Rates
Methods like AdaGrad, RMSProp, Adam adjust η based on past gradients to improve convergence.
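As one example, an AdaGrad-style update accumulates squared gradients and scales the step per parameter; a simplified sketch, not any library's exact implementation:

```python
import numpy as np

def adagrad_update(theta, grad, accum, eta=0.01, eps=1e-8):
    """One AdaGrad-style step: accumulate squared gradients, scale the step per parameter."""
    accum = accum + grad ** 2
    theta = theta - eta * grad / (np.sqrt(accum) + eps)
    return theta, accum
```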
Mini-Batch Gradient Descent
Definition
Compromise between SGD and batch gradient descent: compute gradient over small random subset ("mini-batch") of size m.
Benefits
Reduces variance of gradient estimate, enables vectorized computation, balances speed and stability.
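A sketch of one vectorized mini-batch update for the squared-loss model used earlier; the batch size and data shapes are illustrative assumptions:

```python
import numpy as np

def minibatch_step(theta, X, y, eta=0.01, batch_size=32, rng=None):
    """One update using the averaged (vectorized) gradient over a random mini-batch."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / batch_size   # average squared-loss gradient over the batch
    return theta - eta * grad
```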
Typical Batch Sizes
Common mini-batch sizes range from 32 to 512 samples, depending on hardware and problem scale.
| Batch Size | Characteristics |
|---|---|
| 1 (SGD) | High noise, fast updates, low memory |
| 32-512 | Balanced noise and stability, efficient on GPUs |
| Full batch | Exact gradients, slow, high memory |
Advantages and Limitations
Advantages
Low per-iteration cost, scalable to large data, simple implementation, ability to escape shallow local minima due to noise.
Limitations
High variance in gradient estimates, sensitivity to learning rate tuning, slower convergence on some problems.
Mitigation Techniques
Learning rate schedules, momentum, adaptive methods, and mini-batching mitigate these limitations effectively.
Applications in Machine Learning
Supervised Learning
Training linear regression, logistic regression, support vector machines, and neural networks.
Deep Learning
Core optimizer for training convolutional, recurrent, and transformer-based neural architectures.
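For instance, deep learning frameworks expose SGD directly; a minimal sketch assuming PyTorch is installed, with the model, data, and hyperparameters as placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)            # one illustrative mini-batch
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                                           # compute gradients
optimizer.step()                                          # apply the SGD (with momentum) update
```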
Reinforcement Learning
Used in policy gradient methods, actor-critic algorithms for parameter updates with sampled trajectories.
Variants and Extensions
Momentum SGD
Incorporates velocity term to accelerate convergence and smooth updates.
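A sketch of the classical momentum update, with the momentum coefficient as an illustrative assumption:

```python
def momentum_step(theta, grad, velocity, eta=0.01, beta=0.9):
    """Classical momentum: v_t = beta * v_{t-1} + g_t; theta_t = theta_{t-1} - eta * v_t."""
    velocity = beta * velocity + grad
    return theta - eta * velocity, velocity
```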
Adaptive Methods
AdaGrad, RMSProp, Adam adjust learning rates per parameter based on gradient statistics.
Proximal and Projected SGD
Handles constraints and regularization via proximal operators or projection steps.
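For an L1 penalty, for example, the proximal operator is soft-thresholding applied after each gradient step; a sketch, with the regularization strength as an assumption:

```python
import numpy as np

def proximal_sgd_step(theta, grad, eta=0.01, lam=0.001):
    """Gradient step on the smooth loss, then the soft-thresholding prox of an L1 penalty."""
    z = theta - eta * grad                                    # ordinary SGD step
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)  # prox of eta*lam*||theta||_1
```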
Implementation Details
Algorithm Pseudocode
Initialize θ₀
For t = 1 to T:
- Sample (xᵢ, yᵢ) from data
- Compute gradient gₜ = ∇θ L(θₜ₋₁; xᵢ, yᵢ)
- Update θₜ = θₜ₋₁ - ηₜ gₜ
Return θ_T
Computational Complexity
Per-iteration cost is O(p), where p is the number of parameters; for a single-sample update it is independent of dataset size.
Practical Considerations
Shuffle the data each epoch, implement early stopping, and monitor validation loss to avoid overfitting.
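A sketch of an epoch-based training loop applying these practices to the squared-loss model from earlier; the validation split, patience value, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def train_sgd(X_tr, y_tr, X_val, y_val, eta=0.01, epochs=100, patience=5, seed=0):
    """Single-sample SGD with per-epoch shuffling and early stopping on validation loss."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X_tr.shape[1])
    best_loss, best_theta, bad_epochs = np.inf, theta.copy(), 0
    for epoch in range(epochs):
        for i in rng.permutation(len(X_tr)):                     # shuffle data each epoch
            theta -= eta * (X_tr[i] @ theta - y_tr[i]) * X_tr[i]
        val_loss = 0.5 * np.mean((X_val @ theta - y_val) ** 2)   # monitor validation loss
        if val_loss < best_loss:
            best_loss, best_theta, bad_epochs = val_loss, theta.copy(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                           # early stopping
                break
    return best_theta
```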
Performance Considerations
Effect of Noise
Noise in gradient estimates aids escaping saddle points but may slow convergence near optimum.
Hardware Acceleration
Mini-batching enables vectorized GPU computations, increasing throughput substantially.
Parallel and Distributed SGD
Variants exist to parallelize SGD across multiple processors or machines with synchronous or asynchronous updates.
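A toy sketch of the synchronous case, simulated in a single process: each "worker" computes a gradient on its own data shard and the averaged gradient is applied once. The shard layout is an assumption; real deployments rely on distributed frameworks for communication:

```python
import numpy as np

def synchronous_step(theta, shards, eta=0.01):
    """One synchronous data-parallel step: average per-shard squared-loss gradients, then update once."""
    grads = [Xs.T @ (Xs @ theta - ys) / len(ys) for Xs, ys in shards]  # one "worker" per shard
    return theta - eta * np.mean(grads, axis=0)
```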
Comparisons with Other Optimizers
Batch Gradient Descent
Uses full dataset gradients, more stable but slower per iteration.
Momentum-Based Methods
Accelerate SGD by using past updates, reducing oscillations in ravines.
Adaptive Optimizers
Adjust learning rates dynamically, often outperform SGD on sparse or noisy data.
| Optimizer | Strengths | Weaknesses |
|---|---|---|
| SGD | Simple, scalable, good generalization | Requires careful tuning, noisy updates |
| Momentum SGD | Faster convergence, reduced oscillations | Additional hyperparameters |
| Adam | Adaptive rates, good for sparse data | May generalize worse on some tasks |
References
- L. Bottou, "Large-Scale Machine Learning with Stochastic Gradient Descent," Proceedings of COMPSTAT, 2010, pp. 177-186.
- S. Shalev-Shwartz and S. Ben-David, "Understanding Machine Learning: From Theory to Algorithms," Cambridge University Press, 2014, pp. 89-110.
- Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller, "Efficient BackProp," Neural Networks: Tricks of the Trade, Springer, 2012, pp. 9-48.
- D. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," ICLR, 2015, pp. 1-15.
- R. Johnson and T. Zhang, "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction," NIPS, 2013, pp. 315-323.