Envisioning is an emerging technology research institute and advisory.


Xavier Initialization

A weight initialization scheme that preserves activation variance across deep network layers.

Year: 2010 · Generality: 620

Xavier initialization (also called Glorot initialization) is a method for setting the initial weights of a neural network such that the variance of activations remains roughly constant from layer to layer. Without careful initialization, signals passing forward through a deep network tend to either shrink toward zero or explode toward infinity, making gradient-based learning slow or unstable. Xavier initialization addresses this by drawing weights from a distribution with zero mean and variance scaled by the number of incoming and outgoing connections: specifically, Var(W) = 2 / (fan_in + fan_out), where fan_in and fan_out are the number of input and output units for a given layer. This keeps the forward signal and the backward gradient in a stable range throughout training.
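The scaling rule above can be sketched in a few lines of NumPy (the function name and layer shapes here are illustrative, not from the original paper). Drawing weights with Var(W) = 2 / (fan_in + fan_out) keeps the variance of a signal roughly constant as it passes through a stack of linear layers, the regime the derivation assumes:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix with Xavier/Glorot scaling:
    zero mean, Var(W) = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Propagate a unit-variance signal through 10 linear layers and check that
# the activation variance neither collapses toward zero nor explodes.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 256))
for _ in range(10):
    x = x @ xavier_init(256, 256, rng)
print(float(x.var()))  # remains within a small constant factor of 1
```

By contrast, naive unit-variance weights would multiply the signal variance by fan_in at every layer, which is exactly the explosion the entry describes.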

The technique was introduced by Xavier Glorot and Yoshua Bengio in their 2010 paper "Understanding the difficulty of training deep feedforward neural networks," which provided both empirical and theoretical analysis of why naive random initialization fails for deep architectures. The derivation assumes linear activations (or approximately linear behavior near zero), making it particularly well-suited for symmetric activation functions like tanh and sigmoid. For ReLU activations, a related scheme called He initialization adjusts the scaling factor to account for the fact that ReLU zeroes out half its inputs.
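The difference between the two schemes shows up empirically in deep ReLU stacks. In this hedged sketch (the helper and constants are mine, for illustration), Xavier scaling lets the signal's variance roughly halve at each ReLU layer, while He scaling, Var(W) = 2 / fan_in, compensates for the half of the inputs that ReLU zeroes out:

```python
import numpy as np

rng = np.random.default_rng(1)
n, layers = 512, 10

def relu_depth_variance(weight_std):
    """Variance of the signal after `layers` ReLU layers whose weights are
    drawn i.i.d. from N(0, weight_std**2)."""
    x = rng.normal(size=(1000, n))
    for _ in range(layers):
        x = np.maximum(x @ rng.normal(0.0, weight_std, size=(n, n)), 0.0)
    return float(x.var())

xavier_std = np.sqrt(2.0 / (n + n))  # Var(W) = 2 / (fan_in + fan_out) = 1/n
he_std = np.sqrt(2.0 / n)            # Var(W) = 2 / fan_in (He et al., 2015)

v_xavier = relu_depth_variance(xavier_std)  # decays roughly as 2**-layers
v_he = relu_depth_variance(he_std)          # stays on the order of 1
print(v_xavier, v_he)
```

The factor-of-two correction in He initialization follows directly from ReLU discarding half of a zero-mean pre-activation's second moment.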

Xavier initialization became a standard default in most deep learning frameworks and remains widely used today. Its practical impact was significant: by enabling more stable training from the outset, it helped researchers successfully train deeper networks before techniques like batch normalization became prevalent. Understanding initialization is still relevant because even with normalization layers, poor initialization can slow convergence or cause early training instability. The broader principle — that weight scales should be chosen to preserve signal magnitude across layers — continues to inform initialization strategies for transformers, convolutional networks, and other modern architectures.

Related

Weight Initialization
Setting neural network weights before training to ensure stable, effective learning.
Generality: 650

Variance Scaling
A weight initialization strategy that preserves consistent activation variance across neural network layers.
Generality: 620

Initialization
Setting a neural network's starting parameter values before training begins.
Generality: 729

Vanishing Gradient
A training failure where gradients shrink exponentially, preventing early network layers from learning.
Generality: 720

Layer Normalization
Normalizes activations across features within a layer to stabilize neural network training.
Generality: 731

Batch Normalization
A technique that normalizes layer inputs to accelerate and stabilize neural network training.
Generality: 794