A regularization method that penalizes large weights to prevent overfitting.
Weight decay is a regularization technique applied during neural network training that discourages the model from learning overly large parameter values. It works by adding a penalty term to the loss function — typically the sum of squared weights multiplied by a small scalar hyperparameter (λ) — so that the optimizer must balance minimizing prediction error against keeping weights small. Under plain stochastic gradient descent this penalty is mathematically equivalent to L2 regularization, and in practice it biases the network toward simpler, smoother solutions that tend to generalize better to unseen data.
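As a minimal sketch of the penalized objective, assuming a simple linear model with a mean-squared-error loss (the names `l2_penalized_loss`, `w`, `X`, `y`, and `lam` are illustrative, not from any particular library):

```python
import numpy as np

def l2_penalized_loss(w, X, y, lam):
    """Mean-squared-error loss plus an L2 penalty on the weights."""
    preds = X @ w                          # predictions of a linear model
    data_loss = np.mean((preds - y) ** 2)  # ordinary prediction error
    penalty = lam * np.sum(w ** 2)         # sum of squared weights, scaled by lambda
    return data_loss + penalty
```

The optimizer now has to trade off the two terms: driving `data_loss` down while keeping the weight norm, and hence `penalty`, small.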
During gradient-based optimization, the effect of weight decay manifests as an additional gradient term that nudges each weight slightly toward zero at every update step. This is where the name originates: weights "decay" a little with each iteration unless the loss gradient actively pushes them in the opposite direction. The strength of this decay is controlled by λ — too small and regularization has little effect; too large and the model underfits by suppressing useful learned structure. Choosing an appropriate λ is typically done through cross-validation or hyperparameter search.
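A sketch of a single SGD update makes the "decay" visible, assuming `w` and `grad` are NumPy arrays and that the factor of 2 from differentiating the squared penalty has been folded into λ, as is common in practice:

```python
def sgd_step_with_decay(w, grad, lr, lam):
    """One SGD update: the lam * w term nudges every weight toward zero."""
    return w - lr * (grad + lam * w)

def sgd_step_with_decay_shrink_form(w, grad, lr, lam):
    """Algebraically identical view: shrink the weights, then take the usual step."""
    return (1.0 - lr * lam) * w - lr * grad
```

The second form shows why the name fits: each step multiplies the weights by (1 − lr·λ), so they shrink a little every iteration unless the data gradient pushes back.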
Weight decay is particularly valuable because overfitting is a pervasive challenge in deep learning, where models with millions of parameters can easily memorize training data rather than learn generalizable patterns. By constraining weight magnitudes, weight decay implicitly limits the effective complexity of the model, reducing variance at the cost of a small increase in bias — a trade-off that usually improves real-world performance. It is broadly applicable across architectures, including fully connected networks, convolutional neural networks, and transformers.
An important distinction emerged with modern optimizers: when using adaptive methods like Adam, applying L2 regularization to the loss is not equivalent to true weight decay applied directly to the weights. This subtlety, formalized in the AdamW algorithm, has practical consequences for training large models. Today, weight decay remains one of the most widely used and reliable regularization strategies in deep learning, often combined with techniques like dropout and data augmentation for best results.
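In PyTorch, for example, the two behaviors roughly correspond to the following optimizer choices; this is a sketch, and the learning rate and decay values are placeholders rather than recommendations:

```python
import torch

model = torch.nn.Linear(10, 1)

# L2-style regularization: the decay term is folded into the gradient,
# so Adam's adaptive rescaling also rescales the penalty.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Decoupled weight decay (AdamW): the decay is applied directly to the
# weights at each step, separate from the adaptive gradient statistics.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```

With SGD the two formulations coincide, but with adaptive optimizers the decoupled version typically regularizes more predictably, which is why AdamW is the common default for training large models.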