A weight initialization scheme that preserves activation variance across deep network layers.
Xavier initialization (also called Glorot initialization) is a method for setting the initial weights of a neural network such that the variance of activations remains roughly constant from layer to layer. Without careful initialization, signals passing forward through a deep network tend to either shrink toward zero or explode toward infinity, making gradient-based learning slow or unstable. Xavier initialization addresses this by drawing weights from a distribution with zero mean and variance scaled by the number of incoming and outgoing connections: specifically, Var(W) = 2 / (fan_in + fan_out), where fan_in and fan_out are the number of input and output units for a given layer. The two fan terms appear because preserving forward activation variance alone would require Var(W) = 1 / fan_in, while preserving backward gradient variance alone would require Var(W) = 1 / fan_out; the formula is a compromise between the two. This keeps both the forward signal and the backpropagated gradient at comparable scales across layers at the start of training.
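As a concrete illustration, here is a minimal sketch of this scaling in plain NumPy (the function names and layer sizes are illustrative, not from the original paper). It shows both the Gaussian form and the uniform form U[-a, a] with a = sqrt(6 / (fan_in + fan_out)), which has the same variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(fan_in, fan_out):
    # Zero-mean Gaussian with Var(W) = 2 / (fan_in + fan_out).
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def xavier_uniform(fan_in, fan_out):
    # Uniform variant U[-a, a]; the variance of U[-a, a] is a^2 / 3,
    # so a = sqrt(6 / (fan_in + fan_out)) gives the same variance.
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W = xavier_normal(512, 256)
print(W.var())  # close to 2 / (512 + 256) ≈ 0.0026
```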
The technique was introduced by Xavier Glorot and Yoshua Bengio in their 2010 paper "Understanding the difficulty of training deep feedforward neural networks," which provided both empirical and theoretical analysis of why naive random initialization fails for deep architectures. The derivation assumes linear activations (or approximately linear behavior near zero), making it particularly well-suited to activation functions that behave nearly linearly around zero, such as tanh and the logistic sigmoid. For ReLU activations, a related scheme called He initialization uses Var(W) = 2 / fan_in, adjusting the scaling factor to account for the fact that ReLU zeroes out half its inputs.
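A small numerical experiment, sketched below under simplifying assumptions (fully connected layers of equal width, unit-variance inputs, NumPy only), illustrates the difference: with tanh, Xavier scaling keeps the activation variance in a moderate range even after many layers, while with ReLU the signal decays with depth unless the He scaling factor is used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 20
x = rng.normal(size=(1000, width))  # unit-variance inputs

def final_variance(std_fn, act):
    # Propagate the batch through `depth` random linear layers followed by
    # an activation, and report the variance of the last layer's outputs.
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, std_fn(width, width), size=(width, width))
        h = act(h @ W)
    return h.var()

xavier_std = lambda fan_in, fan_out: np.sqrt(2.0 / (fan_in + fan_out))
he_std = lambda fan_in, fan_out: np.sqrt(2.0 / fan_in)
relu = lambda z: np.maximum(z, 0.0)

print(final_variance(xavier_std, np.tanh))  # stays in a moderate range
print(final_variance(xavier_std, relu))     # shrinks toward zero with depth
print(final_variance(he_std, relu))         # stable: the extra factor of 2 compensates for ReLU
```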
Xavier initialization became a standard default in most deep learning frameworks and remains widely used today. Its practical impact was significant: by enabling more stable training from the outset, it helped researchers successfully train deeper networks before techniques like batch normalization became prevalent. Understanding initialization is still relevant because even with normalization layers, poor initialization can slow convergence or cause early training instability. The broader principle — that weight scales should be chosen to preserve signal magnitude across layers — continues to inform initialization strategies for transformers, convolutional networks, and other modern architectures.