A technique that normalizes layer inputs to accelerate and stabilize neural network training.
Batch normalization is a technique for improving the training of deep neural networks by normalizing the inputs to each layer across a mini-batch of training examples. Specifically, for each layer, the activations are rescaled so that they have approximately zero mean and unit variance, computed over the current batch. Two learnable parameters — a scale factor and a shift — are then applied, allowing the network to recover any representational power lost through normalization. Batch statistics are used only during training; at inference time they are replaced with population estimates, in practice running averages of the mean and variance accumulated over the course of training.
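A minimal sketch of the forward pass in NumPy may make this concrete. The function name, the momentum-based running averages, and the 2D input shape are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       training=True, momentum=0.1, eps=1e-5):
    """Batch normalization over a mini-batch x of shape (N, D).

    gamma, beta: learnable scale and shift, shape (D,).
    running_mean, running_var: inference-time statistics, updated
    in place during training via an exponential moving average.
    """
    if training:
        # Statistics are computed over the current mini-batch (axis 0).
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Accumulate running estimates for use at inference time.
        running_mean *= 1 - momentum
        running_mean += momentum * mean
        running_var *= 1 - momentum
        running_var += momentum * var
    else:
        # At inference, the accumulated statistics replace batch statistics.
        mean, var = running_mean, running_var

    x_hat = (x - mean) / np.sqrt(var + eps)  # ~zero mean, unit variance
    return gamma * x_hat + beta              # learnable rescale and shift

# Usage: normalize one mini-batch of 32 examples with 4 features each.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
running_mean, running_var = np.zeros(4), np.ones(4)
y = batch_norm_forward(x, gamma, beta, running_mean, running_var)
print(y.mean(axis=0), y.std(axis=0))  # approximately 0 and 1 per feature
```

With gamma at one and beta at zero the output is simply the normalized activations; during training the network learns values of gamma and beta that may partially undo the normalization where that helps the loss.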
The primary motivation behind batch normalization is to reduce internal covariate shift — the tendency for the distribution of each layer's inputs to change as the weights of preceding layers are updated during training. By keeping activation distributions more stable, batch normalization allows gradients to flow more reliably through deep networks, mitigating the vanishing and exploding gradient problems that plagued earlier architectures. In practice, this enables the use of significantly higher learning rates and reduces sensitivity to weight initialization, both of which accelerate convergence.
Beyond training speed, batch normalization has a notable regularizing effect. By introducing noise through batch-level statistics, it reduces the network's reliance on any single training example, often allowing practitioners to reduce or eliminate dropout. This dual role — as both a normalization scheme and an implicit regularizer — made batch normalization an immediate and widespread success following its introduction by Sergey Ioffe and Christian Szegedy in 2015.
Batch normalization is now a standard component in a vast range of deep learning architectures, including convolutional networks for image recognition, recurrent models, and generative adversarial networks. Its success also inspired a family of related techniques — including layer normalization, group normalization, and instance normalization — each tailored to settings where mini-batch statistics are unreliable, such as small batch sizes or sequence modeling. Together, these normalization methods represent one of the most impactful classes of architectural innovations in modern deep learning.
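These variants differ mainly in which axes the statistics are computed over. The sketch below illustrates this for a (batch, channel, height, width) activation tensor; the normalize helper, the tensor shapes, and the group count are assumptions for illustration, and the learnable affine parameters are omitted:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(8, 16, 4, 4))  # (N, C, H, W)

def normalize(x, axes, eps=1e-5):
    # Normalize x to zero mean and unit variance over the given axes.
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

batch_norm    = normalize(x, (0, 2, 3))  # per channel, across the batch
layer_norm    = normalize(x, (1, 2, 3))  # per example, across all channels
instance_norm = normalize(x, (2, 3))     # per example, per channel

# Group norm: split channels into groups, normalize within each group.
groups = 4
g = x.reshape(8, groups, 16 // groups, 4, 4)
group_norm = normalize(g, (2, 3, 4)).reshape(x.shape)
```

Only batch normalization averages over the batch axis, which is why the other three remain usable when batches are small or when examples (such as variable-length sequences) are not directly comparable.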