A weight initialization strategy that preserves consistent activation variance across neural network layers.
Variance scaling is a family of weight initialization strategies designed to ensure that the statistical variance of activations remains stable as signals propagate forward through a neural network, and that gradients remain stable as they propagate backward during training. Without careful initialization, deep networks are prone to the vanishing gradient problem—where gradients shrink exponentially toward zero—or the exploding gradient problem, where they grow uncontrollably. Both pathologies make learning slow, unstable, or impossible, and variance scaling directly addresses them by calibrating the initial magnitude of weights relative to the number of inputs and outputs of each layer.
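The stabilizing effect is easy to observe empirically. The sketch below (a hypothetical `forward_std` helper, assuming a plain fully connected tanh network) pushes a random signal through 30 layers and compares a poorly scaled initialization against a variance-scaled one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 30  # layer width and network depth

def forward_std(weight_std):
    """Propagate a random input through `depth` tanh layers whose weights
    are drawn with the given standard deviation; return the std of the
    final layer's activations."""
    x = rng.standard_normal((64, n))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(n, n))
        x = np.tanh(x @ W)
    return x.std()

# Arbitrary small weights: the signal shrinks toward zero layer by layer.
vanished = forward_std(0.01)

# LeCun-style scaling (std = sqrt(1/fan_in)): the signal stays usable
# even after 30 layers.
stable = forward_std(np.sqrt(1.0 / n))
```

With the unscaled weights the activation standard deviation collapses to numerical zero, while the scaled version keeps it at a trainable magnitude, which is exactly the forward-pass half of the stability argument above.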
The core idea is to draw initial weights from a distribution whose variance is a function of a layer's fan-in (number of incoming connections), fan-out (number of outgoing connections), or both. Xavier/Glorot initialization, introduced in 2010, sets the variance to 2/(fan_in + fan_out), making it well suited for layers with symmetric activation functions such as tanh. He initialization, introduced in 2015, sets the variance to 2/fan_in, accounting for ReLU activations, which zero out roughly half of all pre-activations. LeCun initialization uses a variance of 1/fan_in and was designed with sigmoid-like activations in mind. All three are specific parameterizations of the same general variance scaling principle, and modern deep learning frameworks expose them through a unified API that lets practitioners select a scaling mode and distribution shape.
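The unified API can be sketched as a single function. The `variance_scaling` helper below is hypothetical (loosely modeled on initializers such as Keras's `VarianceScaling`, and handling dense-layer shapes only); the named schemes fall out as particular `scale`/`mode` choices:

```python
import numpy as np

def variance_scaling(shape, scale=1.0, mode="fan_in",
                     distribution="normal", rng=None):
    """Hypothetical variance-scaling initializer (sketch).

    The named schemes are recovered by:
      Glorot/Xavier: scale=2.0, mode="fan_avg"  (i.e. 2/(fan_in+fan_out))
      He:            scale=2.0, mode="fan_in"   (i.e. 2/fan_in)
      LeCun:         scale=1.0, mode="fan_in"   (i.e. 1/fan_in)
    """
    if rng is None:
        rng = np.random.default_rng()
    fan_in, fan_out = shape[0], shape[1]  # dense layer: (inputs, outputs)
    n = {"fan_in": fan_in,
         "fan_out": fan_out,
         "fan_avg": (fan_in + fan_out) / 2.0}[mode]
    variance = scale / n
    if distribution == "normal":
        return rng.normal(0.0, np.sqrt(variance), size=shape)
    # A uniform distribution on [-limit, limit] has variance limit**2 / 3,
    # so choose the limit that matches the target variance.
    limit = np.sqrt(3.0 * variance)
    return rng.uniform(-limit, limit, size=shape)

# He initialization for a 256 -> 128 ReLU layer:
W = variance_scaling((256, 128), scale=2.0, mode="fan_in")
```

Note that Glorot appears here as `scale=2.0` with `mode="fan_avg"`: dividing 2 by the average of the fans is the same as the 2/(fan_in + fan_out) form given above.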
Variance scaling matters because the choice of initialization can determine whether a model trains at all, particularly in very deep architectures. Even with techniques like batch normalization that partially mitigate poor initialization, starting from a well-scaled distribution accelerates convergence and often improves final performance. The principle extends beyond fully connected layers to convolutional and recurrent architectures, where the effective fan-in and fan-out must be computed differently. As networks have grown deeper and more complex, variance scaling has remained a foundational tool, embedded by default in most modern frameworks and serving as a baseline against which more sophisticated initialization schemes—such as those derived from dynamical mean-field theory—are compared.
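For convolutional kernels, the effective fans fold in the size of the receptive field, since each output unit sees every input channel at every kernel position. A minimal sketch (a hypothetical `compute_fans` helper, assuming a channels-last kernel layout of `(kh, kw, in_channels, out_channels)`):

```python
import numpy as np

def compute_fans(shape):
    """Effective fan-in and fan-out for dense and conv kernels (sketch).

    Dense weights are (inputs, outputs). For conv kernels, every spatial
    position of the kernel contributes one connection per channel, so the
    fans are multiplied by the receptive-field size.
    """
    if len(shape) == 2:  # dense layer
        return shape[0], shape[1]
    receptive_field = int(np.prod(shape[:-2]))  # kh * kw (* kd, ...)
    fan_in = receptive_field * shape[-2]
    fan_out = receptive_field * shape[-1]
    return fan_in, fan_out

# A 3x3 conv with 64 input channels and 128 output channels:
print(compute_fans((3, 3, 64, 128)))  # → (576, 1152)
```

Here the 3×3 kernel contributes a factor of 9, so the fan-in is 9 × 64 = 576 rather than 64, which is why plugging raw channel counts into the dense-layer formulas would mis-scale convolutional weights.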