An adaptive optimizer combining momentum and per-parameter gradient scaling for efficient deep learning training.
Adam (Adaptive Moment Estimation) is a stochastic gradient descent optimizer introduced by Diederik Kingma and Jimmy Ba in 2014 that has become one of the most widely used training algorithms in deep learning. It maintains two running statistics for each parameter: an exponentially decaying average of past gradients (the first moment, capturing direction and momentum) and an exponentially decaying average of past squared gradients (the second moment, capturing per-parameter gradient magnitude). At each step t, given the gradient g_t, the moments are updated as m_t = β1·m_{t−1} + (1−β1)·g_t and v_t = β2·v_{t−1} + (1−β2)·g_t², bias-corrected versions are computed as m̂_t = m_t/(1−β1^t) and v̂_t = v_t/(1−β2^t), and parameters are updated via θ_{t+1} = θ_t − α · m̂_t / (√v̂_t + ε). This effectively gives each parameter its own adaptive learning rate: updates are scaled down for parameters with large or frequent gradients and scaled up for parameters with small or infrequent ones.
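To make the update concrete, here is a minimal NumPy sketch of one Adam step; the function name `adam_step` and the toy quadratic at the end are illustrative, not part of the original paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. theta, grad, m, v are arrays of the same shape; t is the 1-based step index."""
    # Exponentially decaying averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction; matters most early on, while m and v are still close to their zero init.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter scaled update.
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize the quadratic ||theta - 3||^2 from a zero start.
theta, m, v = np.zeros(5), np.zeros(5), np.zeros(5)
for t in range(1, 5001):
    grad = 2.0 * (theta - 3.0)      # gradient of the toy objective
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                         # entries approach 3.0
```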
The practical appeal of Adam lies in its robustness and low sensitivity to hyperparameter choice. Default settings (learning rate α ≈ 1e-3, β1 = 0.9, β2 = 0.999, ε = 1e-8) work well across a remarkably wide range of architectures and tasks, from convolutional networks to recurrent models to Transformers, making it a reliable default in most ML workflows. It conceptually unifies two earlier ideas—momentum-based SGD, which smooths updates along consistent gradient directions, and RMSprop, which normalizes updates by recent gradient magnitude—into a single coherent algorithm with bias correction that matters most in early training steps.
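Those defaults are typically exposed directly by deep learning libraries; as one hedged example, a minimal PyTorch setup (the small linear model and random batch below are placeholders) might look like this:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)                         # placeholder model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,              # alpha
                             betas=(0.9, 0.999),   # beta1, beta2
                             eps=1e-8)

# One standard training step: forward pass, loss, backprop, parameter update.
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```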
Despite its popularity, Adam has well-documented limitations. It maintains two extra moment vectors, each the size of the model's parameters, so its optimizer state is substantially larger than that of vanilla SGD; it can fail to converge on certain convex problems (motivating the AMSGrad variant); and it sometimes generalizes worse than SGD with momentum on image classification benchmarks. A particularly impactful refinement is AdamW, which decouples weight decay from the gradient update to correct a harmful interaction between L2 regularization and the adaptive scaling. In practice, Adam is often paired with learning rate warmup and scheduling, and it remains the default optimizer for training large language models and other large-scale deep learning systems.
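To make the AdamW distinction concrete, here is a hedged sketch contrasting the two ways the decay term can enter the update, reusing the notation from above; the function name and the `decoupled` flag are illustrative only, not a reference implementation.

```python
import numpy as np

def adam_with_decay(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, weight_decay=1e-2, decoupled=True):
    """One step of Adam with L2 regularization (decoupled=False) or AdamW-style decay (decoupled=True)."""
    if not decoupled:
        # Classic L2 regularization: the decay term is folded into the gradient, so it
        # gets divided by sqrt(v_hat) like everything else, weakening it for parameters
        # with a large gradient history.
        grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: decay is applied to the parameters directly, outside the adaptive
        # scaling, so every weight shrinks at the same relative rate.
        theta = theta - alpha * weight_decay * theta
    return theta, m, v
```

For reference, PyTorch exposes the decoupled variant directly as torch.optim.AdamW.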