A model with more parameters than available training data points.
An overparameterized model is one in which the number of learnable parameters exceeds the number of training examples available. Classical statistical learning theory treats this regime as dangerous: a model with more degrees of freedom than data points can perfectly memorize the training set, producing zero training error while generalizing poorly to unseen inputs. This intuition, formalized through concepts like the bias-variance tradeoff, long discouraged practitioners from training very large models and motivated techniques such as early stopping, dropout, and weight decay as safeguards against overfitting.
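To make the memorization argument concrete, here is a minimal NumPy sketch (purely illustrative; the dimensions and random data are arbitrary choices) in which a linear model with 100 parameters is fit to 20 training points. Because the system is underdetermined, even pure-noise labels can be fit exactly, driving training error to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 100          # 20 training points, 100 parameters: overparameterized
X = rng.normal(size=(n, d))
y = rng.normal(size=n)  # pure-noise labels, no signal at all

# Minimum-norm interpolating solution via the pseudoinverse
w = np.linalg.pinv(X) @ y

train_mse = np.mean((X @ w - y) ** 2)
print(f"train MSE: {train_mse:.2e}")  # ~0: the model memorizes the training set
```

Zero training error here carries no information about generalization, which is exactly the classical worry.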
The deep learning era fundamentally challenged this view. AlexNet and its successors demonstrated that massively overparameterized neural networks can generalize remarkably well, even without aggressive regularization. Researchers observed a phenomenon now called "double descent": as model capacity grows, test error first follows the classical U-shaped curve, peaking near the interpolation threshold (the point at which the model can exactly fit the training data), and then falls again as capacity increases further, sometimes dropping below the error of smaller, well-regularized models. This behavior contradicts classical theory and has prompted significant theoretical work to explain it.
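Double descent can be reproduced in a few lines with random-feature regression. The sketch below is an illustration under arbitrary assumptions (the sine target, noise level, and width grid are invented for this example, not taken from any particular paper): it sweeps the number of fixed random ReLU features p across the interpolation threshold p = n_train and reports the test error of the minimum-norm least-squares fit. Test error typically spikes near the threshold and falls again in the overparameterized regime, though exact values depend on the random seed:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test, noise = 30, 500, 0.1
x_tr = rng.uniform(-1, 1, size=n_train)
y_tr = target(x_tr) + noise * rng.normal(size=n_train)
x_te = rng.uniform(-1, 1, size=n_test)
y_te = target(x_te)

def relu_features(x, W, b):
    # One layer of fixed random ReLU units; only the output weights are learned
    return np.maximum(0.0, np.outer(x, W) + b)

for p in [5, 10, 20, 28, 30, 32, 40, 100, 1000]:  # widths straddling p = n_train
    W = rng.normal(size=p)
    b = rng.uniform(-1, 1, size=p)
    Phi_tr = relu_features(x_tr, W, b)
    Phi_te = relu_features(x_te, W, b)
    w = np.linalg.pinv(Phi_tr) @ y_tr             # minimum-norm least-squares fit
    test_mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"p = {p:5d}   test MSE = {test_mse:8.3f}")
```

The minimum-norm fit matters: near p = n_train the feature matrix is nearly square and often ill-conditioned, so the interpolating solution has a huge norm and poor test error, while at much larger p the pseudoinverse spreads the fit across many features and the norm shrinks again.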
Several mechanisms appear to explain why overparameterized models generalize despite their capacity. Stochastic gradient descent and its variants exhibit implicit regularization: among the many global minima available in the overparameterized regime, they tend to find solutions with small norm or low effective complexity. The loss landscape of deep networks, while non-convex, contains vast flat regions where many near-equivalent solutions reside, and SGD empirically tends to settle in flatter minima, which correlate with better generalization. Additionally, overparameterized models may benefit from a form of implicit sparsity, in which only a small fraction of parameters actively contributes to any given prediction.
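The implicit-regularization claim has a precise analogue in the linear case, which the following NumPy sketch checks (the dimensions, learning rate, and iteration count are arbitrary illustration choices): on an underdetermined least-squares problem, plain gradient descent initialized at zero converges to the minimum-L2-norm interpolator, the same solution the pseudoinverse returns:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 100                      # underdetermined: infinitely many zero-error solutions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Plain gradient descent on mean squared error, initialized at the origin
w = np.zeros(d)
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y  # the minimum-L2-norm interpolator

print("||w_gd - w_min_norm|| =", np.linalg.norm(w - w_min_norm))  # ~0
print("train MSE =", np.mean((X @ w - y) ** 2))                   # ~0
```

The reason is simple: starting from zero, every gradient step lies in the row space of X, so gradient descent can only reach the one interpolator with no component outside that space, which is the minimum-norm one. How far this linear intuition extends to deep networks remains an active research question.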
Understanding overparameterization has become central to modern ML theory, with implications for how practitioners choose model size, interpret generalization bounds, and design training procedures. The empirical success of large language models and foundation models, which routinely carry billions of parameters and in many settings far more parameters than training examples, has made the phenomenon not just theoretically interesting but practically essential to understand.