Envisioning is an emerging technology research institute and advisory.



Adam (Adaptive Moment Estimation)

An adaptive optimizer that computes per-parameter learning rates using gradient moment estimates.

Year: 2015
Generality: 795

Adam, short for Adaptive Moment Estimation, is an optimization algorithm widely used to train deep learning models by computing adaptive learning rates for each parameter individually. Introduced by Diederik P. Kingma and Jimmy Ba in their 2014 paper (published at ICLR 2015), it synthesizes ideas from two earlier gradient descent variants: AdaGrad, which adapts learning rates based on historical gradient magnitudes and excels with sparse gradients, and RMSProp, which uses an exponentially decaying average of squared gradients to handle non-stationary objectives. By combining these approaches, Adam achieves robust performance across a broad range of architectures and problem types.
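The contrast between the two predecessors can be sketched in a few lines of NumPy (an illustrative toy, not the original implementations; the function names are chosen here for clarity): AdaGrad accumulates every squared gradient, so its effective step can only shrink over time, while RMSProp's decaying average forgets old gradients and keeps the step size from collapsing.

```python
import numpy as np

def adagrad_scale(grads, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate ALL squared gradients; steps monotonically shrink."""
    G = np.zeros_like(grads[0])
    steps = []
    for g in grads:
        G += g ** 2
        steps.append(lr * g / (np.sqrt(G) + eps))
    return steps

def rmsprop_scale(grads, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: exponentially decaying average, so old gradients are forgotten."""
    v = np.zeros_like(grads[0])
    steps = []
    for g in grads:
        v = rho * v + (1 - rho) * g ** 2
        steps.append(lr * g / (np.sqrt(v) + eps))
    return steps

# For a constant unit gradient, AdaGrad's step decays like lr / sqrt(t),
# while RMSProp's step settles near lr.
grads = [np.array([1.0])] * 10
adagrad_steps = adagrad_scale(grads)
rmsprop_steps = rmsprop_scale(grads)
```

Adam keeps RMSProp's decaying second-moment average and adds a decaying first-moment (momentum) average on top of it.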

The algorithm maintains two running estimates, referred to as the first and second moments of the gradients. The first moment (m) is an exponentially decaying average of past gradients, capturing the mean direction of recent updates. The second moment (v) is an exponentially decaying average of past squared gradients, capturing the uncentered variance. At each training step, these estimates are used to scale the learning rate for each parameter: parameters with consistently large gradients receive smaller effective updates, while parameters with small or infrequent gradients receive larger ones. A bias-correction step is applied to both moment estimates early in training to counteract their initialization at zero, which would otherwise cause artificially small updates.
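The update rule described above can be written out as a minimal NumPy sketch (illustrative only; `adam_step` is a name chosen here, not a library function, and the hyperparameter defaults are the common ones from the paper):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count); returns new state."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: decaying mean of squares
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.0.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Note how the bias correction matters at step 1: without it, m and v would be tiny (they start at zero), and the first updates would be far smaller than intended. With it, a near-constant gradient produces steps of magnitude close to the learning rate, regardless of the gradient's scale.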

Adam's practical appeal lies in its combination of computational efficiency, low memory requirements, and strong out-of-the-box performance with minimal hyperparameter tuning. The default settings — a learning rate of 0.001 and decay rates β₁ = 0.9, β₂ = 0.999 — work well across many tasks, making it a reliable default choice for practitioners. It has become especially prevalent in natural language processing, computer vision, and generative modeling, underpinning the training of large-scale models such as transformers and diffusion networks.

Despite its widespread adoption, Adam is not without limitations. Research has shown it can converge to suboptimal solutions in some settings compared to well-tuned stochastic gradient descent with momentum, and it may generalize less well on certain tasks. These observations have spurred the development of variants such as AdamW, which decouples weight decay from the gradient update, and RAdam, which addresses variance issues during early training. Nonetheless, Adam remains one of the most influential and commonly used optimizers in modern machine learning.
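AdamW's key change can be shown by modifying the update above (a hedged sketch; `adamw_step` is an illustrative name, not a library API): the weight-decay term is subtracted from the parameters directly rather than folded into the gradient, so it is not rescaled by the second-moment estimate.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: standard Adam moments plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: applied to the weights directly, NOT added to grad
    # (which Adam would divide by sqrt(v_hat), distorting the decay strength).
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# With a zero gradient, the update reduces to pure decay: theta shrinks
# by the factor (1 - lr * weight_decay) each step.
theta0 = np.array([1.0])
zeros = np.zeros_like(theta0)
theta1, m, v = adamw_step(theta0, zeros, zeros, zeros, t=1)
```

In plain Adam, an L2 penalty added to the gradient gets divided by the per-parameter second-moment term, so weights with large gradients are effectively decayed less; decoupling removes that interaction.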

Related

Adam Optimizer
An adaptive optimizer combining momentum and per-parameter gradient scaling for efficient deep learning training.
Generality: 796

Gradient Descent
An iterative optimization algorithm that minimizes a function by following its steepest downhill direction.
Generality: 909

Stochastic Optimization
Optimization methods that use randomness to efficiently find solutions in complex, uncertain problems.
Generality: 820

Step Size
A hyperparameter controlling how large each parameter update is during optimization.
Generality: 720

Autodiff
A method to compute exact derivatives automatically, enabling efficient neural network training.
Generality: 795

Autograd
An automatic differentiation engine that computes gradients for training machine learning models.
Generality: 752