An adaptive optimizer that computes per-parameter learning rates using gradient moment estimates.
Adam, short for Adaptive Moment Estimation, is an optimization algorithm widely used to train deep learning models by computing adaptive learning rates for each parameter individually. Introduced by Diederik P. Kingma and Jimmy Ba in their 2014 paper, it synthesizes ideas from two earlier gradient descent variants: AdaGrad, which adapts learning rates based on historical gradient magnitudes and excels with sparse gradients, and RMSProp, which uses an exponentially decaying average of squared gradients to handle non-stationary objectives. By combining these approaches, Adam achieves robust performance across a broad range of architectures and problem types.
The algorithm maintains two running estimates, referred to as the first and second moments of the gradients. The first moment (m) is an exponentially decaying average of past gradients, capturing the mean direction of recent updates. The second moment (v) is an exponentially decaying average of past squared gradients, capturing the uncentered variance. At each training step, these estimates are used to scale the learning rate for each parameter: parameters with consistently large gradients receive smaller effective updates, while parameters with small or infrequent gradients receive larger ones. Because both estimates are initialized at zero, they are biased toward zero during the first steps; a bias-correction term, which depends on the step count and fades as training proceeds, is applied to both estimates so that early updates are not distorted.
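The full per-parameter update can be written in a few lines. The following NumPy sketch (with illustrative names, not the authors' reference implementation) shows one step for a single parameter array, assuming the gradient has already been computed externally:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t counts steps from 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaled update
    return param, m, v
```

The correction factors (1 - β₁ᵗ) and (1 - β₂ᵗ) approach 1 as t grows, so the bias correction matters mainly during the first steps after the moments are initialized at zero.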
Adam's practical appeal lies in its combination of computational efficiency, modest memory requirements, and strong out-of-the-box performance with minimal hyperparameter tuning. The default settings (a learning rate of 0.001, decay rates β₁ = 0.9 and β₂ = 0.999, and a small constant ε for numerical stability) work well across many tasks, making it a reliable default choice for practitioners. It has become especially prevalent in natural language processing, computer vision, and generative modeling, underpinning the training of large-scale models such as transformers and diffusion networks.
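As an illustration of how these defaults are used in practice, the following shows a typical construction in PyTorch, with the default hyperparameters written out explicitly (the model and data here are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)                    # placeholder model
optimizer = torch.optim.Adam(model.parameters(),  # defaults shown explicitly
                             lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

x, y = torch.randn(32, 10), torch.randn(32, 1)    # placeholder batch
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()        # compute gradients
optimizer.step()       # apply the Adam update
optimizer.zero_grad()  # reset gradients for the next step
```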
Despite its widespread adoption, Adam is not without limitations. Research has shown it can converge to suboptimal solutions in some settings compared to well-tuned stochastic gradient descent with momentum, and it may generalize less well on certain tasks. These observations have spurred the development of variants such as AdamW, which decouples weight decay from the gradient update, and RAdam, which addresses variance issues during early training. Nonetheless, Adam remains one of the most influential and commonly used optimizers in modern machine learning.
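The distinction AdamW draws can be sketched briefly: with classical L2 regularization the penalty is folded into the gradient and therefore rescaled by Adam's adaptive denominator, whereas AdamW applies the decay directly to the weights. A minimal sketch with illustrative names, where `update` stands for the bias-corrected m̂ / (√v̂ + ε) term from above:

```python
def l2_coupled_gradient(grad, param, weight_decay=1e-2):
    # Classical L2 regularization: the penalty enters the gradient and is then
    # rescaled by Adam's adaptive denominator like any other gradient component.
    return grad + weight_decay * param

def adamw_step(param, update, lr=1e-3, weight_decay=1e-2):
    # AdamW: the adaptive update and the weight decay are applied separately,
    # so the decay is not divided by the second-moment term.
    return param - lr * update - lr * weight_decay * param
```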