A training technique that prevents exploding gradients by capping gradient magnitudes.
Gradient clipping is a technique used during neural network training to mitigate the exploding gradient problem, in which gradients grow exponentially large during backpropagation and cause catastrophically large parameter updates. Left unchecked, exploding gradients can destabilize training, produce NaN values, or prevent a model from converging altogether. Gradient clipping addresses this by imposing a hard constraint on gradient magnitudes before they are used to update model weights.
There are two primary variants of gradient clipping. The first, value clipping, clamps each individual gradient component to a fixed range, such as [-1, 1]. The second and more widely used approach, norm clipping, computes the global L2 norm over all gradients and, if that norm exceeds a specified threshold, rescales the entire gradient vector by the ratio threshold / norm. Norm clipping preserves the direction of the gradient while controlling its magnitude, making it a more principled intervention that avoids distorting the relative scale of individual parameter updates.
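The two variants can be written out directly. The following is a minimal NumPy sketch for illustration only; the function names, threshold values, and example gradients are assumptions, not part of any particular library:

```python
import numpy as np

def clip_by_value(grads, clip_value=1.0):
    """Value clipping: clamp each gradient component to [-clip_value, clip_value]."""
    return [np.clip(g, -clip_value, clip_value) for g in grads]

def clip_by_global_norm(grads, max_norm=1.0):
    """Norm clipping: rescale all gradients if their global L2 norm exceeds max_norm."""
    # Global L2 norm across every parameter's gradient, treated as one long vector.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: one "exploding" gradient dominates the global norm.
grads = [np.array([0.5, -0.3]), np.array([250.0, -80.0])]
print(clip_by_value(grads))                       # each component clamped to [-1, 1]
print(clip_by_global_norm(grads, max_norm=5.0))   # direction preserved, norm scaled to 5
```

Note how value clipping alters the direction of the large gradient (both components hit the clamp), while norm clipping shrinks every component by the same factor, which is why it is described above as the more principled choice.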
Gradient clipping is especially important for recurrent neural networks (RNNs), which process sequential data by unrolling through many time steps. This unrolling effectively creates very deep computational graphs, making RNNs particularly susceptible to both exploding and vanishing gradients. The technique became a standard component of RNN training pipelines and remains essential for modern sequence models, including LSTMs and transformer architectures trained on long contexts. It is also commonly applied in reinforcement learning, where reward signals can produce highly variable gradient magnitudes.
Beyond preventing numerical instability, gradient clipping has a regularizing effect that can improve generalization by preventing any single large gradient update from dominating the learning trajectory. Most modern deep learning frameworks implement gradient clipping as a built-in utility, making it straightforward to apply. Choosing the right clipping threshold is typically done empirically, with practitioners monitoring gradient norms during training to identify appropriate values. Its simplicity and effectiveness have made gradient clipping a near-universal tool in the training of large, deep models.
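As a sketch of how this looks in practice, the loop below uses PyTorch's built-in `torch.nn.utils.clip_grad_norm_`; the model, data, and the threshold of 1.0 are illustrative assumptions rather than recommended settings:

```python
import torch
import torch.nn as nn

# Hypothetical sequence model and synthetic data, purely for illustration.
model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
max_norm = 1.0  # clipping threshold, typically tuned by watching gradient norms

inputs = torch.randn(8, 20, 32)    # (batch, sequence, features)
targets = torch.randn(8, 20, 64)

for step in range(100):
    optimizer.zero_grad()
    outputs, _ = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()

    # Rescales gradients in place if their global L2 norm exceeds max_norm.
    # The returned value is the norm *before* clipping, which is exactly the
    # quantity practitioners monitor when choosing a threshold empirically.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

    optimizer.step()
    if step % 20 == 0:
        print(f"step {step}: loss={loss.item():.4f}, grad norm={grad_norm.item():.2f}")
```

Logging the pre-clipping norm over many steps gives a direct picture of how often clipping fires; a threshold that clips nearly every step is usually too low, while one that never fires provides no protection.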