Prefer the simplest model that adequately explains the data.
Occam's Razor is a philosophical principle holding that, among competing explanations of equal predictive power, the simplest one should be preferred. In machine learning, this translates directly into a bias toward models with fewer parameters, fewer assumptions, and lower structural complexity. Rather than a hard rule, it serves as a guiding heuristic for model selection — a reminder that unnecessary complexity is a liability, not an asset.
The practical motivation for this principle in ML is the problem of overfitting. A sufficiently complex model can memorize training data, capturing noise rather than the true underlying pattern, and consequently fail to generalize to new examples. Simpler models, by contrast, are less likely to fit spurious structure and tend to perform more reliably on held-out data. This intuition is formalized in several theoretical frameworks, including the Vapnik-Chervonenkis (VC) dimension, which quantifies model capacity and its relationship to generalization error, and minimum description length (MDL), which frames learning as compression — the best model is the one that most compactly encodes both the data and the model itself.
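To make the overfitting intuition concrete, here is a minimal sketch using synthetic data and arbitrarily chosen polynomial degrees (both assumptions made purely for illustration). It fits polynomials of increasing degree and compares training error against error on held-out points; the highest-degree fit typically achieves the lowest training error but the worst generalization.

```python
# Sketch: overfitting with polynomial regression on synthetic data.
# The data-generating function and degrees below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)  # true signal + noise
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # held-out set

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)       # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-9 model has enough capacity to chase the noise in the 20 training points, so its training error is smallest while its held-out error is usually the largest of the three, which is exactly the gap the simplicity bias aims to control.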
Occam's Razor also underpins many regularization techniques used throughout modern machine learning. L1 and L2 regularization, for instance, penalize model complexity by adding terms to the loss function that discourage large parameter values (with L1 additionally driving many weights exactly to zero, yielding sparse models), effectively enforcing parsimony during training. Bayesian model selection operationalizes the same idea through priors that favor simpler hypotheses, with more complex models required to earn their additional flexibility through a substantially better fit to the data.
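As a sketch of what an L2 penalty does in practice, the following example assumes a ridge-regression setting with a hypothetical penalty strength `lam`: the term lam * ||w||^2 is added to the squared-error objective, and increasing `lam` shrinks the learned weights toward zero.

```python
# Sketch: L2 regularization (ridge regression) in plain NumPy.
# The data and the values of `lam` are illustrative assumptions.
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 via the normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                 # only a few features matter
y = X @ w_true + rng.normal(scale=0.5, size=50)

for lam in (0.0, 1.0, 10.0):
    w = ridge_fit(X, y, lam)
    print(f"lam={lam:5.1f}  ||w|| = {np.linalg.norm(w):.3f}")
```

An L1 penalty would instead add the sum of absolute weight values; it cannot be solved with the normal equations and typically requires an iterative solver, but it pushes many coefficients exactly to zero, which is the sparsest form of parsimony.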
Despite its intuitive appeal, Occam's Razor is not without nuance. The rise of large neural networks has complicated the picture: massively overparameterized models can generalize surprisingly well, a phenomenon that challenges classical notions of the complexity-generalization tradeoff. Modern research into implicit regularization, the lottery ticket hypothesis, and double descent has revealed that the relationship between model size and generalization is richer than the simple principle suggests. Occam's Razor remains a valuable default, but contemporary ML has shown that simplicity and performance do not always align as neatly as the principle implies.