The length of the shortest program that produces a given string as output.
Kolmogorov complexity is a measure of the information content of a string or dataset, defined as the length of the shortest computer program—written in some fixed universal programming language—that produces that string as its output. A string that contains regular patterns can be described by a short program, giving it low Kolmogorov complexity, while a truly random string cannot be compressed below its own length and therefore has high complexity. This framework provides a rigorous, mathematical definition of randomness and simplicity that does not depend on any particular statistical model or prior assumptions.
One of the concept's most important properties is its near language-independence: the choice of universal programming language affects the complexity measure by at most a constant, making Kolmogorov complexity an intrinsic property of the data itself rather than an artifact of representation. This invariance theorem is what gives the measure its theoretical power. In practice, however, Kolmogorov complexity is uncomputable—no algorithm can calculate the exact shortest program for an arbitrary string—so researchers work with computable approximations, such as standard compression algorithms like gzip or LZW, which serve as practical upper bounds.
In machine learning and AI, Kolmogorov complexity underpins several foundational ideas. It is central to Solomonoff induction, a theoretically optimal framework for sequence prediction that assigns higher prior probability to simpler hypotheses. It also connects directly to the Minimum Description Length (MDL) principle, which formalizes Occam's razor as a model selection criterion: prefer the model that most compresses the data. These ideas influence modern regularization techniques, Bayesian model comparison, and the theoretical analysis of generalization in learning algorithms.
Beyond formal theory, Kolmogorov complexity shapes intuitions about what it means for a model to truly learn structure versus memorize noise. A model that captures genuine regularities in data effectively compresses it, while one that overfits essentially encodes random fluctuations. This perspective has influenced research into neural network compression, meta-learning, and the study of implicit biases in deep learning, where simpler functions are empirically favored even without explicit regularization.