Normalizes activations across features within a layer to stabilize neural network training.
Layer normalization is a technique that standardizes the inputs to a neural network layer by computing the mean and variance across all features for each individual training example, then rescaling the result using learned parameters. Unlike batch normalization, which normalizes across the batch dimension and therefore depends on statistics computed over multiple samples, layer normalization operates entirely within a single example. This makes it independent of batch size and well-suited to settings where batches are small, variable, or effectively size-one — as is common in recurrent neural networks and autoregressive language models.
In practice, for a given layer with activations x, layer normalization computes the mean μ and variance σ² across all neurons in that layer, then produces normalized outputs as (x − μ) / √(σ² + ε). Two learnable parameters — a scale γ and a shift β — are applied element-wise afterward, allowing the network to recover any representational capacity that raw normalization might remove. The small constant ε prevents division by zero. This process is applied independently at every layer and for every input, with no dependence on other examples in a batch.
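The computation above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation: `gamma` and `beta` are shown as plain arrays rather than trainable parameters, and the feature dimension is assumed to be the last axis.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example's features to zero mean and unit variance,
    then apply the learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=-1, keepdims=True)    # per-example mean over features
    var = x.var(axis=-1, keepdims=True)    # per-example variance over features
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize; eps guards against division by zero
    return gamma * x_hat + beta            # element-wise learned rescaling

# One example with 4 features; gamma/beta start as the identity transform
x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

Because each example is normalized using only its own statistics, the result is identical whether `x` holds one row or a full batch.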
Layer normalization became a cornerstone of transformer architectures after its introduction in 2016. In models like BERT, GPT, and their successors, it is applied either before or after the attention and feed-forward sublayers, helping to smooth the loss landscape and accelerate convergence. Its position within the residual stream — whether pre-norm or post-norm — has itself become an active area of study, with pre-norm variants generally offering more stable training at large scale.
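The pre-norm versus post-norm distinction can be made concrete with a small sketch. Here `sublayer` is a hypothetical stand-in for an attention or feed-forward block, and the learned scale and shift are omitted for brevity; only the placement of the normalization relative to the residual connection differs.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain normalization; learned scale/shift omitted for brevity
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer
    return np.tanh(x)

def post_norm_block(x):
    # Post-norm (original transformer): normalize after the residual add
    return layer_norm(x + sublayer(x))

def pre_norm_block(x):
    # Pre-norm: normalize the sublayer input; the residual path stays unnormalized
    return x + sublayer(layer_norm(x))
```

In the pre-norm arrangement the residual stream passes through unmodified, which is one informal explanation for its more stable gradients in very deep stacks.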
The broader significance of layer normalization lies in its role in enabling the training of very deep and very large models. By mitigating internal covariate shift — the tendency for the distribution of layer inputs to drift as earlier weights update — it permits higher learning rates and reduces sensitivity to weight initialization. These properties have made it effectively indispensable in modern large language models and vision transformers, where training stability at scale is a primary engineering concern.