
Weight Initialization

Setting neural network weights before training to ensure stable, effective learning.

Year: 2010 · Generality: 650

Weight initialization is the process of assigning starting values to a neural network's learnable parameters before training begins. Because gradient-based optimization is highly sensitive to where in parameter space the search starts, these initial values can determine whether training converges quickly, converges slowly, or fails entirely. Initial values that are too large cause activations and gradients to explode exponentially as they pass through the layers, while values that are too small cause them to vanish; both pathologies prevent meaningful learning, especially in deep architectures.
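To make the failure mode concrete, here is a minimal sketch (an illustration with assumed layer sizes, not drawn from the entry) that pushes a batch of random inputs through a deep stack of linear layers under three weight scales and tracks how the activation magnitude compounds with depth:

```python
import numpy as np

# Illustrative toy setup: a 30-layer stack of 256-unit linear layers.
# Each layer multiplies the activation variance by roughly fan_in * Var(W),
# so the initial weight scale compounds exponentially with depth.
rng = np.random.default_rng(0)
depth, width = 30, 256
x0 = rng.standard_normal((width, 1024))  # a batch of 1024 random inputs

for sigma, label in [(0.05, "too small"),
                     (0.20, "too large"),
                     (1 / np.sqrt(width), "variance-preserving")]:
    x = x0
    for _ in range(depth):
        W = rng.normal(0.0, sigma, size=(width, width))
        x = W @ x
    print(f"{label:>20}: final activation std ~ {x.std():.3e}")
```

With these assumed sizes, the "too small" scale collapses the signal toward zero and the "too large" scale blows it up by many orders of magnitude, while the variance-preserving scale keeps it near its starting level.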

The core challenge is maintaining stable signal flow in both the forward pass (activations) and the backward pass (gradients) across many layers. Principled initialization schemes address this by scaling initial weights according to the number of incoming and outgoing connections in each layer. Xavier (Glorot) initialization, introduced in 2010, draws weights from a distribution whose variance is 2 / (fan_in + fan_out), inversely proportional to the average of the fan-in and fan-out, keeping signal variance roughly constant across layers with symmetric activations like tanh. He initialization, proposed in 2015, adapts this idea for ReLU activations, which zero out half their inputs, and sets the variance to 2 / fan_in to compensate.
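A minimal sketch of both schemes, assuming NumPy and hypothetical helper names (`xavier_uniform`, `he_normal`) rather than any particular library's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Glorot & Bengio (2010): Var(W) = 2 / (fan_in + fan_out),
    # sampled here from a uniform distribution with that variance.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # He et al. (2015): Var(W) = 2 / fan_in, the extra factor of two
    # compensating for ReLU zeroing out roughly half of its inputs.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W_tanh = xavier_uniform(512, 256)   # suited to tanh / sigmoid layers
W_relu = he_normal(512, 256)        # suited to ReLU layers
print(W_tanh.std(), W_relu.std())   # roughly 0.051 and 0.0625
```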

Beyond these standard schemes, researchers have explored orthogonal initialization — starting weight matrices as random orthogonal matrices to preserve gradient norms exactly — and more recent techniques like LSUV (layer-sequential unit-variance initialization), which iteratively rescales weights until each layer's output has unit variance on a mini-batch. For specific architectures such as transformers and residual networks, specialized initialization strategies account for skip connections and attention scaling to prevent instability at the start of training.
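As a rough sketch of these two ideas, the snippet below builds an orthogonal weight matrix via QR decomposition and applies a simplified, single-layer version of the LSUV rescaling step. The function names are illustrative, and the real LSUV procedure walks through every layer of a network in sequence rather than rescaling one matrix in isolation:

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(rows, cols, gain=1.0):
    # Orthogonal initialization: orthogonalize a random Gaussian matrix
    # with a QR decomposition so the linear map preserves vector norms.
    flat = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(flat)
    q *= np.sign(np.diag(r))   # sign fix for a uniform orthogonal draw
    if rows < cols:
        q = q.T
    return gain * q

def lsuv_rescale(W, x, tol=0.05, max_iter=10):
    # Simplified LSUV-style step: repeatedly divide the weights by the
    # output standard deviation on a mini-batch until it is close to one.
    for _ in range(max_iter):
        s = (W @ x).std()
        if abs(s - 1.0) < tol:
            break
        W = W / s
    return W

W = orthogonal_init(256, 256)
W = lsuv_rescale(W, rng.standard_normal((256, 128)))  # 128-sample mini-batch
print((W @ rng.standard_normal((256, 128))).std())    # close to 1.0
```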

Weight initialization matters most in deep networks, where poor choices compound across layers and can waste significant compute before training stabilizes. With modern techniques like batch normalization and adaptive optimizers, networks have become somewhat more forgiving of suboptimal initialization, but careful initialization still accelerates convergence, reduces sensitivity to learning rate choice, and can meaningfully affect the quality of the final trained model.

Related

Initialization

Setting a neural network's starting parameter values before training begins.

Generality: 729
Xavier Initialization

A weight initialization scheme that preserves activation variance across deep network layers.

Generality: 620
Variance Scaling

A weight initialization strategy that preserves consistent activation variance across neural network layers.

Generality: 620
Weight

A learnable parameter that scales the influence of inputs within a model.

Generality: 850
Vanishing Gradient

A training failure where gradients shrink exponentially, preventing early network layers from learning.

Generality: 720
Batch Normalization

A technique that normalizes layer inputs to accelerate and stabilize neural network training.

Generality: 794