An iterative optimization algorithm that minimizes a function by following its steepest downhill direction.
Gradient descent is the workhorse optimization algorithm of modern machine learning, used to tune the parameters of models—from simple linear regression to deep neural networks with billions of weights. The core idea is straightforward: given a loss function that measures how poorly a model performs, compute the gradient of that loss with respect to each parameter. The gradient points in the direction of steepest increase, so moving in the opposite direction—downhill—reduces the loss. This update is scaled by a learning rate, a hyperparameter that controls step size. Repeat this process across many iterations, and the parameters converge toward values that minimize the loss.
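The update rule described above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the function names and the quadratic example loss are chosen for clarity only.

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Repeatedly step opposite the gradient, scaled by the learning rate."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta=0.0)
```

Each step shrinks the distance to the minimum at theta = 3 by a constant factor, so after 100 iterations `theta_star` sits essentially at the optimum.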
In practice, three main variants dominate. Batch gradient descent computes the gradient over the entire dataset before each update, which is stable but computationally expensive. Stochastic gradient descent (SGD) updates parameters after each individual training example, introducing noise that can help escape shallow local minima. Mini-batch gradient descent strikes a balance, computing gradients over small random subsets of data—this is the standard approach in deep learning, combining computational efficiency with enough stochasticity to navigate complex loss landscapes. Modern optimizers such as AdaGrad, RMSProp, and Adam build on SGD by adapting the learning rate per parameter (Adam additionally incorporates momentum), dramatically improving convergence in practice.
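The mini-batch variant can be sketched with ordinary NumPy on a toy linear-regression problem. The synthetic data, learning rate, and batch size below are illustrative choices, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2x + 1 plus small noise (illustrative setup).
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(200):
    idx = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # one random mini-batch
        xb, yb = X[batch, 0], y[batch]
        err = w * xb + b - yb
        # Gradients of mean squared error over the mini-batch only.
        w -= lr * 2.0 * np.mean(err * xb)
        b -= lr * 2.0 * np.mean(err)
```

After training, `w` and `b` land close to the true slope and intercept; shrinking `batch_size` to 1 recovers SGD, and setting it to the dataset size recovers batch gradient descent.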
Gradient descent became central to machine learning through its pairing with backpropagation, the algorithm that efficiently computes gradients through layered neural networks using the chain rule of calculus. Without an efficient way to compute gradients, training deep networks would be computationally intractable. Together, backpropagation and gradient descent form the foundation of nearly all neural network training pipelines used today.
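The chain-rule bookkeeping that backpropagation performs can be seen in a hand-derived backward pass for a tiny two-layer network. The shapes and the tanh activation here are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny network: x -> W1 -> tanh -> W2 -> scalar output (illustrative shapes).
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=4)
target = 1.0

# Forward pass, caching intermediates for the backward pass.
z = W1 @ x           # pre-activation
h = np.tanh(z)       # hidden activation
y_hat = W2 @ h       # scalar prediction
loss = (y_hat - target) ** 2

# Backward pass: the chain rule applied layer by layer, output to input.
dy = 2.0 * (y_hat - target)        # dL/dy_hat
dW2 = dy * h                       # dL/dW2
dh = dy * W2                       # dL/dh
dz = dh * (1.0 - np.tanh(z) ** 2)  # through the tanh derivative
dW1 = np.outer(dz, x)              # dL/dW1
```

Each quantity computed in the forward pass is reused exactly once going backward, which is why backpropagation costs only a small constant factor more than the forward pass itself.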
Despite its ubiquity, gradient descent has well-known limitations. It can get trapped in local minima or saddle points, and its performance is highly sensitive to the choice of learning rate. In high-dimensional non-convex loss landscapes—typical of deep learning—convergence is not guaranteed to a global optimum. Nevertheless, empirical results consistently show that gradient-based optimization finds solutions that generalize remarkably well, making it indispensable to the field.
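The sensitivity to the learning rate is easy to demonstrate even on a convex one-dimensional loss; the specific rates below are illustrative.

```python
def descend(lr, theta=1.0, steps=50):
    # Minimize f(theta) = theta**2, whose gradient is 2 * theta.
    for _ in range(steps):
        theta -= lr * 2.0 * theta
    return theta

small = descend(lr=0.1)  # each step multiplies theta by 0.8: converges
large = descend(lr=1.1)  # each step multiplies theta by -1.2: diverges
```

With the small rate the iterate decays geometrically toward the minimum at zero; with the large rate it overshoots the minimum on every step and oscillates outward, so even on this simplest possible loss a poorly chosen learning rate destroys convergence.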