An information-theoretic principle that selects the model giving the shortest combined description of the data and the model itself.
Minimum Description Length (MDL) is a model selection principle grounded in information theory that operationalizes Occam's Razor: the best explanation for a dataset is the one that produces the shortest combined description of both the model and the data encoded under that model. Formally, given a set of candidate hypotheses, MDL selects the hypothesis H that minimizes L(H) + L(D|H), where L(H) is the code length needed to describe the hypothesis and L(D|H) is the code length needed to describe the data given that hypothesis. This two-part framing forces a principled trade-off: a more complex model may fit the data better and reduce L(D|H), but its increased complexity raises L(H), so only genuinely useful structure earns its representational cost.
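The trade-off is easiest to see with concrete code lengths. The sketch below is an illustrative two-part comparison for coin-flip data, not a canonical recipe: a zero-parameter fair-coin model versus a biased-coin model whose single parameter is quantized to a fixed number of bits. The specific models and the `precision_bits` grid are assumptions made for the example; all quantities are in bits.

```python
import math

def two_part_mdl_bernoulli(data, precision_bits=10):
    """Illustrative two-part MDL comparison L(H) + L(D|H) for coin-flip data.

    Model A: fair coin, no parameters  -> L(H) = 0,  L(D|H) = n bits.
    Model B: biased coin, p quantized to precision_bits bits
             -> L(H) = precision_bits, L(D|H) = -sum log2 P(x | p_hat).
    """
    n = len(data)
    k = sum(data)  # number of 1s observed

    # Model A: fair coin, every symbol costs exactly 1 bit under p = 0.5
    fair_total = 0.0 + n * 1.0

    # Model B: maximum-likelihood p rounded onto a grid of 2**precision_bits values
    grid = 2 ** precision_bits
    p_hat = round(k / n * grid) / grid
    p_hat = max(min(p_hat, 1 - 1e-9), 1e-9)  # guard against log(0)
    data_bits = -(k * math.log2(p_hat) + (n - k) * math.log2(1 - p_hat))
    biased_total = precision_bits + data_bits

    return {"fair": fair_total, "biased": biased_total}

# Strongly biased sample: paying ~10 bits for p_hat buys a much shorter data code,
# so the biased model wins; on near-fair data the extra 10 bits would not pay off.
print(two_part_mdl_bernoulli([1] * 90 + [0] * 10))
```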
MDL connects deeply to several foundational ideas in machine learning and statistics. It has strong ties to Bayesian inference: when code lengths are read as negative log probabilities, minimizing L(H) + L(D|H) corresponds to maximizing the posterior probability of the hypothesis (MAP estimation). It also relates to Kolmogorov complexity, which provides the theoretical ideal of the shortest possible description of any object. In practice, exact Kolmogorov complexity is uncomputable, so MDL implementations use tractable surrogate code lengths derived from probabilistic models, leading to practical variants such as the Normalized Maximum Likelihood (NML) and prequential (predictive sequential) formulations developed through the 1980s and 1990s.
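As a rough illustration of these tractable surrogates, the sketch below computes two code lengths for the same Bernoulli data: a prequential code that predicts each symbol with a simple plug-in (Laplace) estimator, and the NML code length obtained by adding the parametric complexity to the maximized likelihood. The Laplace estimator and the brute-force enumeration over counts are assumptions chosen for clarity, not the only or standard implementation.

```python
from math import comb, log2

def prequential_code_length(data):
    """Prequential (plug-in) code length in bits for a Bernoulli model:
    predict each symbol with the Laplace estimator (ones + 1) / (t + 2),
    then charge -log2 of the probability assigned to the observed symbol."""
    bits, ones = 0.0, 0
    for t, x in enumerate(data):
        p_one = (ones + 1) / (t + 2)
        bits += -log2(p_one if x == 1 else 1 - p_one)
        ones += x
    return bits

def nml_code_length(data):
    """NML code length in bits: -log2(maximized likelihood) plus the parametric
    complexity, computed here by summing the maximized likelihood over all
    possible counts of ones (exact but brute-force for a Bernoulli model)."""
    n, k = len(data), sum(data)

    def max_lik(j, n):
        if j in (0, n):
            return 1.0
        p = j / n
        return p ** j * (1 - p) ** (n - j)

    complexity = log2(sum(comb(n, j) * max_lik(j, n) for j in range(n + 1)))
    return -log2(max_lik(k, n)) + complexity

data = [1] * 90 + [0] * 10
print(prequential_code_length(data), nml_code_length(data))
```

Both quantities are computable directly from the model class, which is what makes them usable stand-ins for the uncomputable Kolmogorov ideal.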
In machine learning, MDL provides a theoretically motivated alternative to heuristic regularization techniques like L1/L2 penalties or held-out validation for model selection. It has been applied to decision tree pruning, neural architecture selection, feature selection, and clustering, offering a unified lens through which model complexity and generalization are jointly managed. Because MDL penalizes models that encode noise rather than signal, it naturally guards against overfitting without requiring a separate validation set. Its influence extends into modern deep learning discussions around compression-based generalization bounds, making it a durable and increasingly relevant framework as practitioners seek principled explanations for why large models generalize.
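One way to see the "no validation set" point is to score candidate models of increasing complexity directly on the training data. The sketch below applies a crude two-part MDL score to polynomial regression, assuming a Gaussian noise model and the common asymptotic parameter cost of (k/2) * log2(n) bits; these coding choices are illustrative assumptions rather than a standard recipe, but the minimum-MDL degree typically lands near the true one even though no data is held out.

```python
import numpy as np

def mdl_score(x, y, degree):
    """Crude two-part MDL score in bits for a least-squares polynomial fit:
    L(D|H) = -log2 Gaussian likelihood at the fitted coefficients and
             the maximum-likelihood noise variance,
    L(H)   = (number of parameters / 2) * log2(n), an asymptotic parameter cost."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(residuals.var(), 1e-12)            # ML estimate of noise variance
    data_bits = 0.5 * n * np.log2(2 * np.pi * np.e * sigma2)
    param_bits = 0.5 * (degree + 2) * np.log2(n)    # degree+1 coefficients + variance
    return data_bits + param_bits

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 80)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(scale=0.3, size=x.size)  # true degree 2

scores = {d: mdl_score(x, y, d) for d in range(9)}
# The minimum-MDL degree is typically ~2: higher degrees fit the noise slightly
# better but pay more in parameter bits, all without a held-out validation split.
print(min(scores, key=scores.get), scores)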