Layer Normalization

Normalizes activations across features within a layer to stabilize neural network training.

Year: 2016 · Generality: 731

Layer normalization is a technique that standardizes the inputs to a neural network layer by computing the mean and variance across all features for each individual training example, then rescaling the result using learned parameters. Unlike batch normalization, which normalizes across the batch dimension and therefore depends on statistics computed over multiple samples, layer normalization operates entirely within a single example. This makes it independent of batch size and well-suited to settings where batches are small, variable, or effectively size-one — as is common in recurrent neural networks and autoregressive language models.
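The distinction is easiest to see in terms of which axis the statistics are computed over; the short NumPy sketch below uses illustrative shapes that are not taken from the source:

```python
import numpy as np

# Illustrative shapes only: a batch of 4 examples, each with 8 features.
x = np.random.randn(4, 8)

# Batch normalization statistics: one mean/variance per feature, shared across the batch,
# so each example's normalization depends on the other examples in the batch.
bn_mean, bn_var = x.mean(axis=0), x.var(axis=0)  # shapes (8,)

# Layer normalization statistics: one mean/variance per example, taken across its features,
# so the result is independent of batch size and batch composition.
ln_mean, ln_var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)  # shapes (4, 1)
```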

In practice, for a given layer with activations x, layer normalization computes the mean μ and variance σ² across all neurons in that layer, then produces normalized outputs as (x − μ) / √(σ² + ε). Two learnable parameters — a scale γ and a shift β — are applied element-wise afterward, allowing the network to recover any representational capacity that raw normalization might remove. The small constant ε prevents division by zero. This process is applied independently at every layer and for every input, with no dependence on other examples in a batch.
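A minimal sketch of this computation in NumPy, assuming the feature dimension is the last axis (the function and variable names here are ours, not those of any particular library):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Mean and variance are taken over the last (feature) axis, independently per example.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # (x - mu) / sqrt(sigma^2 + eps)
    return gamma * x_hat + beta              # learned element-wise scale and shift

# Example: one input with six features; gamma/beta start as identity scale and zero shift.
x = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
gamma, beta = np.ones(6), np.zeros(6)
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1), y.std(axis=-1))  # approximately 0 and 1
```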

Layer normalization became a cornerstone of transformer architectures after its introduction in 2016. In models like BERT, GPT, and their successors, it is applied either before or after the attention and feed-forward sublayers, helping to smooth the loss landscape and accelerate convergence. Its position within the residual stream — whether pre-norm or post-norm — has itself become an active area of study, with pre-norm variants generally offering more stable training at large scale.
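Schematically, the two placements differ only in where the normalization sits relative to the residual connection; in this sketch, `sublayer` stands in for either attention or the feed-forward block and `norm` for layer normalization (the function names are illustrative):

```python
def post_norm_block(x, sublayer, norm):
    # Post-norm (original Transformer ordering): normalize after the residual addition.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Pre-norm: normalize only the sublayer's input; the residual path itself is left
    # untouched, which tends to keep training more stable in very deep stacks.
    return x + sublayer(norm(x))
```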

The broader significance of layer normalization lies in its role enabling the training of very deep and very large models. By reducing internal covariate shift — the tendency for the distribution of layer inputs to drift as earlier weights update — it allows higher learning rates and reduces sensitivity to weight initialization. These properties have made it effectively indispensable in modern large language models and vision transformers, where training stability at scale is a primary engineering concern.

Related

LN (Layer Normalization)

A normalization technique that stabilizes neural network training by standardizing each layer's inputs.

Generality: 731

Batch Normalization

A technique that normalizes layer inputs to accelerate and stabilize neural network training.

Generality: 794

Variance Scaling

A weight initialization strategy that preserves consistent activation variance across neural network layers.

Generality: 620

Model Layer

A discrete computational stage in a neural network that transforms input representations progressively.

Generality: 794

Weight Initialization

Setting neural network weights before training to ensure stable, effective learning.

Generality: 650

Xavier Initialization

A weight initialization scheme that preserves activation variance across deep network layers.

Generality: 620