Setting neural network weights before training to ensure stable, effective learning.
Weight initialization is the process of assigning starting values to a neural network's learnable parameters before training begins. Because gradient-based optimization is highly sensitive to where in parameter space the search starts, these initial values can determine whether training converges quickly, converges slowly, or fails entirely. Initial weights that are too large cause activations and gradients to grow exponentially as they pass through the layers, while weights that are too small cause them to shrink toward zero; both pathologies prevent meaningful learning, especially in deep architectures.
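The explode-or-vanish behavior can be seen directly by pushing a random input through a stack of purely linear layers at different weight scales. This NumPy sketch (layer width, depth, and the particular scale values are arbitrary choices for illustration, not from any reference implementation) prints the activation scale after 50 layers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50
x = rng.standard_normal(n)

# Three weight standard deviations: too large, too small, and the
# variance-preserving choice 1/sqrt(n) for a linear layer of width n.
scales = {}
for std in (0.1, 0.01, 1.0 / np.sqrt(n)):
    h = x.copy()
    for _ in range(depth):
        # Fresh random weight matrix per layer, no nonlinearity.
        h = (rng.standard_normal((n, n)) * std) @ h
    scales[std] = np.std(h)
    print(f"std={std:.4f} -> activation scale after {depth} layers: {np.std(h):.3e}")
```

Each layer multiplies the activation scale by roughly `std * sqrt(n)`, so the first setting blows up, the second collapses toward zero, and only the third keeps the signal at its original order of magnitude.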
The core challenge is maintaining stable signal flow in both the forward pass (activations) and the backward pass (gradients) across many layers. Principled initialization schemes address this by scaling initial weights according to the number of incoming and outgoing connections in each layer. Xavier (Glorot) initialization, introduced in 2010, draws weights from a distribution whose variance is inversely proportional to the average of the fan-in and fan-out, keeping signal variance roughly constant across layers with symmetric activations such as tanh. He initialization, proposed in 2015, adapts this idea to ReLU activations, which zero out half their inputs, and compensates by doubling the variance while scaling it by the fan-in alone.
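The two schemes can be sketched in a few lines of NumPy. The function names here are my own; deep learning frameworks ship equivalents (e.g. PyTorch's `torch.nn.init.xavier_uniform_` and `torch.nn.init.kaiming_normal_`):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out).

    For a uniform draw U(-a, a), Var = a^2 / 3, so the bound is
    a = sqrt(6 / (fan_in + fan_out)).
    """
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """He/Kaiming: Var(W) = 2 / fan_in; the factor of 2 compensates
    for ReLU zeroing roughly half of each layer's pre-activations."""
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
W = he_normal(1024, 1024, rng)
print(f"empirical std {W.std():.4f} vs target {np.sqrt(2.0 / 1024):.4f}")
```

With a million entries, the empirical standard deviation lands within a fraction of a percent of the target, which is all these schemes promise: the right variance at the start, not any particular individual values.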
Beyond these standard schemes, researchers have explored orthogonal initialization — starting weight matrices as random orthogonal matrices to preserve gradient norms exactly — and more recent techniques like LSUV (layer-sequential unit-variance initialization), which iteratively rescales weights until each layer's output has unit variance on a mini-batch. For specific architectures such as transformers and residual networks, specialized initialization strategies account for skip connections and attention scaling to prevent instability at the start of training.
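A rough sketch of both ideas for fully connected layers, using a QR decomposition for the orthogonal draw and a simplified, linear-only LSUV-style rescaling loop. This is an illustrative reconstruction, not the published procedure, which also handles nonlinearities between layers:

```python
import numpy as np

def orthogonal(n, rng):
    # QR of a Gaussian matrix gives a random orthogonal Q; flipping
    # each column by the sign of R's corresponding diagonal entry
    # removes the QR sign ambiguity while keeping Q orthogonal.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def lsuv_rescale(weights, x, tol=0.01, max_iter=10):
    # Layer-sequential unit variance: walk the layers in order and
    # rescale each weight matrix (in place) until that layer's output
    # has unit standard deviation on the mini-batch x.
    h = x
    for W in weights:
        for _ in range(max_iter):
            s = (h @ W.T).std()
            if abs(s - 1.0) < tol:
                break
            W /= s
        h = h @ W.T
    return weights

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))  # mini-batch of 256 inputs
# Deliberately mis-scaled orthogonal layers for LSUV to correct.
weights = [orthogonal(64, rng) * 0.5 for _ in range(3)]
weights = lsuv_rescale(weights, x)

h = x
for W in weights:
    h = h @ W.T
print(f"output std after LSUV rescaling: {h.std():.2f}")  # ≈ 1.00
```

Because the layers here are linear, one division per layer already restores unit variance exactly; with nonlinearities in the forward pass, the inner loop genuinely iterates, which is why LSUV is specified as an iterative procedure.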
Weight initialization matters most in deep networks, where poor choices compound across layers and can waste significant compute before training stabilizes. With modern techniques like batch normalization and adaptive optimizers, networks have become somewhat more forgiving of suboptimal initialization, but careful initialization still accelerates convergence, reduces sensitivity to learning rate choice, and can meaningfully affect the quality of the final trained model.