
Envisioning is an emerging technology research institute and advisory.


2011 — 2026


Gradient Clipping

A training technique that prevents exploding gradients by capping gradient magnitudes.

Year: 2013 · Generality: 694

Gradient clipping is a technique used during neural network training to prevent the exploding gradient problem, where gradients grow exponentially large during backpropagation and cause catastrophically large parameter updates. Left unchecked, exploding gradients can destabilize training, produce NaN values, or prevent a model from converging altogether. Gradient clipping addresses this by imposing a hard constraint on gradient magnitudes before they are applied to model weights.
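A toy illustration of the problem (a hypothetical sketch, not from the source): backpropagation multiplies the gradient by each layer's local derivative, so when those factors exceed 1 the gradient grows exponentially with depth, and capping its magnitude before the weight update keeps the step bounded.

```python
import numpy as np

# Backpropagate through a 60-layer chain where each local derivative
# is 1.5: the gradient scales as 1.5**60 and explodes.
grad = 1.0
for _ in range(60):
    grad *= 1.5

print(grad)  # on the order of 1e10

# Capping the magnitude before the weight update keeps the step bounded.
clipped = float(np.clip(grad, -1.0, 1.0))
print(clipped)  # 1.0
```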

There are two primary variants of gradient clipping. The first, value clipping, simply clips each individual gradient component to lie within a fixed range, such as [-1, 1]. The second and more widely used approach, norm clipping, computes the global L2 norm of all gradients and rescales the entire gradient vector if that norm exceeds a specified threshold. Norm clipping preserves the direction of the gradient while controlling its magnitude, making it a more principled intervention that avoids distorting the relative scale of individual parameter updates.
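Both variants can be sketched in a few lines of NumPy (function names here are illustrative, not from any particular framework):

```python
import numpy as np

def clip_by_value(grads, limit=1.0):
    """Value clipping: cap each gradient component to [-limit, limit].
    Note that this can change the gradient's direction."""
    return [np.clip(g, -limit, limit) for g in grads]

def clip_by_global_norm(grads, max_norm=1.0):
    """Norm clipping: rescale all gradients together if their global
    L2 norm exceeds max_norm. The gradient's direction is preserved."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        return [g * scale for g in grads]
    return grads

# Example: two parameter groups with a global norm of
# sqrt(3^2 + 4^2 + 0^2 + 12^2) = 13.
grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]

clipped = clip_by_global_norm(grads, max_norm=1.0)
new_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(new_norm)  # 1.0 — scaled down by 1/13, direction unchanged

value_clipped = clip_by_value(grads, limit=1.0)
# Each component is capped to 1.0 independently, so the relative
# scale across components (and hence the direction) is distorted.
print(value_clipped)
```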

Gradient clipping is especially important for recurrent neural networks (RNNs), which process sequential data by unrolling through many time steps. This unrolling effectively creates very deep computational graphs, making RNNs particularly susceptible to both exploding and vanishing gradients. The technique became a standard component of RNN training pipelines and remains essential for modern sequence models, including LSTMs and transformer architectures trained on long contexts. It is also commonly applied in reinforcement learning, where reward signals can produce highly variable gradient magnitudes.

Beyond preventing numerical instability, gradient clipping has a regularizing effect that can improve generalization by preventing any single large gradient update from dominating the learning trajectory. Most modern deep learning frameworks implement gradient clipping as a built-in utility, making it straightforward to apply. Choosing the right clipping threshold is typically done empirically, with practitioners monitoring gradient norms during training to identify appropriate values. Its simplicity and effectiveness have made gradient clipping a near-universal tool in the training of large, deep models.
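One common heuristic for that empirical tuning (an illustrative sketch under assumed, simulated data — not a prescription from the source) is to log the global gradient norm over many steps and set the threshold near a high percentile of the observed values, so clipping only triggers on outlier batches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-step global gradient norms: mostly moderate values
# with occasional large spikes, as often seen in practice.
norms = np.concatenate([
    rng.lognormal(mean=0.0, sigma=0.5, size=980),  # typical steps
    rng.lognormal(mean=3.0, sigma=0.5, size=20),   # rare spikes
])

# Set the threshold at, say, the 90th percentile of observed norms,
# so only unusually large gradients get rescaled.
threshold = np.percentile(norms, 90)
clip_fraction = float(np.mean(norms > threshold))

print(threshold, clip_fraction)  # clipping triggers on ~10% of steps
```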

Related

  • Vanishing Gradient: A training failure where gradients shrink exponentially, preventing early network layers from learning. (Generality: 720)
  • Gradient Descent: An iterative optimization algorithm that minimizes a function by following its steepest downhill direction. (Generality: 909)
  • CLIP (Contrastive Language–Image Pre-training): OpenAI model that learns visual concepts by aligning images with natural language descriptions. (Generality: 703)
  • Gradient Noise Scale: A metric quantifying signal-to-noise ratio in stochastic gradient descent updates. (Generality: 339)
  • Dropout: A regularization technique that randomly deactivates neurons during training to prevent overfitting. (Generality: 796)
  • Saturating Non-Linearities: Activation functions whose outputs plateau and stop responding to large input values. (Generality: 581)