Setting a neural network's starting parameter values before training begins.
Initialization refers to the process of assigning starting values to a neural network's learnable parameters — its weights and biases — before any training data is processed. Far from being a trivial detail, the choice of initial values profoundly shapes whether and how quickly a network learns. Optimization algorithms like stochastic gradient descent begin their search for good parameters from this starting point, and a poor initialization can doom training before it begins.
The central challenge initialization must address is the problem of vanishing and exploding gradients. In deep networks, gradients are propagated backward through many layers via the chain rule. If weights are initialized too small, these gradients shrink exponentially with depth, leaving early layers unable to learn. If weights are too large, gradients explode, causing unstable updates. Naive approaches, such as setting all weights to zero or sampling from a fixed-scale distribution that ignores layer width, reliably produce one of these failure modes. Zero initialization is especially problematic because it causes all neurons in a layer to compute identical outputs and receive identical gradient updates, preventing the network from ever breaking symmetry and learning diverse features.
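The scale problem is easy to see numerically. Below is a minimal NumPy sketch (the depth, width, and weight scales are illustrative assumptions, not values from the literature) that pushes a random batch through a deep stack of tanh layers under two poorly chosen weight scales:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256
x = rng.standard_normal((width, 64))  # a batch of 64 random input vectors

for scale, label in [(0.01, "too small"), (1.0, "too large")]:
    h = x
    for _ in range(depth):
        W = scale * rng.standard_normal((width, width))
        h = np.tanh(W @ h)
    # Too small: activations shrink by roughly scale * sqrt(width) per layer
    # and collapse toward zero. Too large: tanh saturates near +/-1, where its
    # derivative (and hence the backward gradient) is nearly zero.
    print(f"weights {label}: activation std after {depth} layers = {h.std():.2e}")
```

The small scale drives the activation standard deviation toward zero with depth, while the large scale pins activations at tanh's saturation points; in both cases, useful gradient signal is lost.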
The modern solution is to scale initial weights according to the architecture. Xavier (Glorot) initialization, introduced in 2010, draws weights from a distribution whose variance is calibrated to the number of incoming and outgoing connections in a layer, keeping signal variance roughly constant across layers. He initialization, proposed in 2015, adapts this principle for layers using ReLU activations, which zero out half their inputs and therefore require a different scaling factor. Both methods are now standard defaults in major deep learning frameworks and are applied automatically unless overridden.
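Concretely, Xavier/Glorot sets the weight variance to 2 / (fan_in + fan_out), while He initialization uses 2 / fan_in, doubling the variance to compensate for ReLU zeroing half of each pre-activation. Here is a minimal NumPy sketch of the normal-distribution variants (the function names are illustrative; frameworks also provide uniform versions):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(fan_in: int, fan_out: int) -> np.ndarray:
    """Glorot (2010): variance calibrated to incoming and outgoing fan."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def he_normal(fan_in: int, fan_out: int) -> np.ndarray:
    """He et al. (2015): doubled variance to offset ReLU's zeroed half."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

# Sanity check: with He scaling, the root-mean-square activation of a deep
# ReLU stack stays on the order of 1 instead of vanishing or exploding.
h = rng.standard_normal((512, 64))
for _ in range(20):
    h = np.maximum(0.0, he_normal(512, 512) @ h)
print(f"RMS activation after 20 ReLU layers: {np.sqrt(np.mean(h**2)):.2f}")
```

Swapping xavier_normal into the same check makes the RMS decay by roughly a factor of two in variance per layer, which is exactly the ReLU-induced halving that He scaling corrects.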
Beyond these classical schemes, initialization remains an active research area. Orthogonal initialization, which sets weight matrices to be orthogonal transformations, and data-driven approaches that calibrate scales with a small forward pass have both shown benefits in specific architectures. The success of very deep networks, including transformers and residual networks, depends partly on careful initialization strategies that keep training stable from the very first gradient step.
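As one concrete example, a common way to build an orthogonal initializer (this sketch is illustrative, not taken from any framework's source) is to orthonormalize a random Gaussian matrix with a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(fan_in: int, fan_out: int, gain: float = 1.0) -> np.ndarray:
    """Return a (fan_out, fan_in) weight matrix with orthonormal rows or columns."""
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    q, r = np.linalg.qr(a)        # Q has orthonormal columns
    q = q * np.sign(np.diag(r))   # fix column signs so the draw is unbiased
    w = q if fan_out >= fan_in else q.T
    return gain * w

# A square orthogonal matrix preserves the norm of every input exactly, so
# stacked linear layers initialized this way neither shrink nor amplify signals.
W = orthogonal_init(256, 256)
x = rng.standard_normal(256)
print(np.allclose(np.linalg.norm(W @ x), np.linalg.norm(x)))  # True
```

The gain parameter mirrors the idea behind He scaling: a nonlinearity that discards part of the signal can be offset by scaling the orthogonal matrix up slightly.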