Envisioning is an emerging technology research institute and advisory.


Adam Optimizer

An adaptive optimizer combining momentum and per-parameter gradient scaling for efficient deep learning training.

Year: 2014 · Generality: 796

Adam (Adaptive Moment Estimation) is a stochastic gradient descent optimizer introduced by Diederik Kingma and Jimmy Ba in 2014 that has become one of the most widely used training algorithms in deep learning. It works by maintaining two running statistics for each parameter: an exponentially decaying average of past gradients (the first moment, capturing direction and momentum) and an exponentially decaying average of past squared gradients (the second moment, capturing per-parameter gradient variance). At each step, bias-corrected versions of these estimates are computed as m̂_t = m_t/(1−β1^t) and v̂_t = v_t/(1−β2^t), and parameters are updated via θ_{t+1} = θ_t − α · m̂_t / (√v̂_t + ε). This formulation effectively gives each parameter its own adaptive learning rate, automatically scaling down updates for parameters with large or frequent gradients and scaling up updates for those with small or infrequent ones.
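The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration of one Adam step, not a production optimizer; the function name `adam_step` and the toy quadratic objective are inventions for this sketch:

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector theta given its gradient.

    m, v are the running first and second moment estimates; t is the
    1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment: decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (matters early, when m, v start at 0)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive update
    return theta, m, v

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 10_001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Note how the denominator `sqrt(v_hat) + eps` normalizes each coordinate's step by its recent gradient magnitude: on the toy quadratic the effective step size stays close to `alpha` while the gradient direction is consistent, regardless of the gradient's scale.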

The practical appeal of Adam lies in its robustness and low sensitivity to hyperparameter choice. Default settings (learning rate α ≈ 1e-3, β1 = 0.9, β2 = 0.999, ε = 1e-8) work well across a remarkably wide range of architectures and tasks, from convolutional networks to recurrent models to Transformers, making it a reliable default in most ML workflows. It conceptually unifies two earlier ideas—momentum-based SGD, which smooths updates along consistent gradient directions, and RMSprop, which normalizes updates by recent gradient magnitude—into a single coherent algorithm with bias correction that matters most in early training steps.

Despite its popularity, Adam has well-documented limitations. It requires double the memory of vanilla SGD to store the moment vectors, can fail to converge in certain convex settings (motivating the AMSGrad variant), and sometimes generalizes worse than SGD with momentum on image classification benchmarks. A particularly impactful fix was AdamW, which decouples weight decay from the gradient update to correct an interaction between L2 regularization and adaptive scaling. In practice, Adam is often paired with learning rate warmup and scheduling, and remains the default optimizer for training large language models and other large-scale deep learning systems.
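The AdamW fix mentioned above is a one-line change to the update: weight decay is applied directly to the parameters instead of being added to the gradient, so it is not rescaled by the adaptive denominator. A hedged NumPy sketch (the name `adamw_step` and the default `weight_decay=0.01` are choices for this illustration):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t,
               alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
               weight_decay=0.01):
    """One AdamW update: weight decay decoupled from the adaptive step."""
    # Moment estimates use the raw gradient only; decay is NOT folded into grad,
    # which is what distinguishes AdamW from Adam-with-L2-regularization.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decay shrinks the weights directly, outside the adaptive scaling.
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

With a zero gradient the adaptive term vanishes but the decay term still shrinks the weights, which is exactly the behavior that folding L2 regularization into the gradient fails to guarantee under adaptive scaling.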

Related

Adam (Adaptive Moment Estimation)
An adaptive optimizer that computes per-parameter learning rates using gradient moment estimates.
Generality: 795

Gradient Descent
An iterative optimization algorithm that minimizes a function by following its steepest downhill direction.
Generality: 909

Stochastic Optimization
Optimization methods that use randomness to efficiently find solutions in complex, uncertain problems.
Generality: 820

Step Size
A hyperparameter controlling how large each parameter update is during optimization.
Generality: 720

Autograd
An automatic differentiation engine that computes gradients for training machine learning models.
Generality: 752

Autodiff
A method to compute exact derivatives automatically, enabling efficient neural network training.
Generality: 795