Double Descent

Test error drops, rises, then drops again as model complexity increases.

Year: 2019
Generality: 599

Double descent describes a counterintuitive pattern in how machine learning models behave as their complexity grows. Classical statistical learning theory predicted a U-shaped bias-variance tradeoff: as model capacity increases, test error first falls as the model learns meaningful structure, then rises as overfitting sets in. Double descent reveals that this picture is incomplete. Test error does rise sharply near the interpolation threshold, where the model has just enough parameters to fit the training data exactly, but as capacity continues to grow past that point it falls again, eventually reaching lower error than the first minimum.
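The following is a minimal sketch of that curve, assuming synthetic data, random Fourier features, and a minimum-norm least-squares fit; all names and parameter values are illustrative choices, not taken from this entry. Sweeping the number of random features past the number of training points typically reproduces the peak at the interpolation threshold followed by the second descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic regression problem so the interpolation threshold is easy to cross.
n_train, n_test, d = 40, 500, 5
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)  # noisy targets
y_test = X_test @ w_true

def random_features(X, W, b):
    """Fixed random Fourier features: a simple way to dial model capacity up or down."""
    return np.cos(X @ W + b)

for n_features in [5, 10, 20, 40, 80, 160, 640]:
    W = rng.normal(size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    Phi_train = random_features(X_train, W, b)
    Phi_test = random_features(X_test, W, b)

    # pinv returns the minimum-norm least-squares solution, which interpolates the
    # training data once n_features exceeds n_train (the interpolation threshold).
    coef = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"features={n_features:4d}  test MSE={test_mse:8.3f}")
```

In this sketch, test error usually climbs as the feature count approaches the number of training points and drops again well beyond it, mirroring the model-wise double descent curve described above.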

The mechanism behind this second descent involves how overparameterized models navigate the space of possible solutions. When a model has far more parameters than training examples, gradient-based optimization tends to find solutions that interpolate the training data while remaining as simple as possible in some implicit sense — a property related to the inductive biases of the optimizer and architecture. These "minimum-norm" solutions often generalize surprisingly well, even though they memorize training labels perfectly. This behavior has been observed across linear models, kernel methods, random forests, and deep neural networks, suggesting it reflects something fundamental about learning rather than a quirk of any particular architecture.
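A small illustration of that mechanism, again on synthetic data with assumed dimensions: in the overparameterized regime many weight vectors fit the training set exactly, and the pseudoinverse picks the smallest-norm one, which tends to generalize far better than other interpolators.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 200                                 # far more parameters than samples
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = 1.0                               # simple sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

# Minimum-norm solution among all weight vectors that interpolate the training data.
w_min_norm = np.linalg.pinv(X) @ y

# Adding any null-space direction of X leaves the training fit unchanged,
# producing a different (larger-norm) interpolator.
null_basis = np.linalg.svd(X)[2][n:]           # rows spanning the null space of X
w_large = w_min_norm + 5.0 * null_basis[0]

X_test = rng.normal(size=(1000, p))
y_test = X_test @ w_true
for name, w in [("min-norm", w_min_norm), ("larger-norm", w_large)]:
    train_residual = np.max(np.abs(X @ w - y))           # ~0 for both: perfect fit
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"{name:12s} train residual={train_residual:.1e}  test MSE={test_mse:.3f}")
```

Both solutions memorize the training labels, but only the minimum-norm one keeps test error low, which is the implicit-bias effect described above.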

The phenomenon was formally characterized and named in a 2019 paper by Belkin and colleagues, though related observations had appeared in earlier empirical work on neural networks and in classical results on interpolating estimators. The finding prompted a significant reassessment of when and why regularization is necessary, and helped explain why modern deep learning models — which are massively overparameterized by classical standards — generalize as well as they do in practice.

Double descent has practical implications for model selection and training. It suggests that stopping at intermediate model sizes can sometimes yield worse generalization than going much larger, and that the conventional wisdom of "simpler is better" does not always hold. Researchers have since documented epoch-wise double descent, where the same phenomenon appears over training time rather than model size, further enriching the theoretical picture of how overparameterized models learn.

Related

Bias-Variance Dilemma

The fundamental trade-off between model simplicity and sensitivity to training data.

Generality: 838
Overparameterization Regime

When a model has more parameters than training samples, yet still generalizes well.

Generality: 520
Bias-Variance Curve

A plot showing how model complexity affects the balance between bias and variance.

Generality: 694
Overparameterized

A model with more parameters than available training data points.

Generality: 590
Symbolic Descent

An optimization method that searches over symbolic programs instead of tuning neural network weights.

Generality: 264
Bias-Variance Trade-off

The fundamental tension between model complexity and generalization that governs prediction error.

Generality: 875