A reference model used to benchmark whether new AI approaches actually improve performance.
In machine learning, a baseline is a reference point against which the performance of new models, algorithms, or techniques is measured. Rather than evaluating a model in isolation, researchers compare it to a baseline to determine whether their innovations produce genuine improvements. Baselines range from trivially simple heuristics — such as always predicting the majority class in a classification task — to strong prior methods that represent the current state of the art. Choosing an appropriate baseline is as important as designing the model itself: a baseline that is too weak makes modest gains look impressive, while one that is too strong can obscure real progress.
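To make the majority-class example concrete, here is a minimal sketch (using only the standard library, with made-up toy data) of why such a trivial baseline matters on an imbalanced task:

```python
from collections import Counter

def majority_class_baseline(train_labels, test_labels):
    """Predict the most common training label for every test example
    and return that label plus the resulting accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)

# Toy imbalanced binary task: 90% of examples are class 0.
train = [0] * 90 + [1] * 10
test = [0] * 45 + [1] * 5
pred, acc = majority_class_baseline(train, test)
# Always predicting class 0 already scores 0.9 accuracy here, so a
# proposed model must beat 0.9 -- not 0.5 -- to show real improvement.
```

The point of the sketch is that "high accuracy" is meaningless without this reference: on skewed data, the do-nothing heuristic sets a surprisingly high bar.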
Baselines function by establishing a well-understood, reproducible reference point on the same dataset and evaluation metric as the proposed approach. Common choices include random predictors, rule-based systems, linear models, or the best previously published result on a given benchmark. In reinforcement learning, a baseline often refers specifically to a value function subtracted from sampled returns to reduce variance in policy gradient estimates — a distinct but related usage of the term. In both cases, the baseline provides a stable reference that makes comparisons interpretable and scientifically meaningful.
The importance of baselines grew substantially with the proliferation of public benchmarks and ML competitions in the 2010s. Datasets like ImageNet, GLUE, and SQuAD formalized the practice of reporting results relative to established reference points, making it easier to track genuine progress across the research community. Without strong baselines, it becomes difficult to distinguish meaningful advances from results that merely exploit quirks in evaluation protocols or benefit from additional compute and data.
Beyond academic research, baselines are equally critical in applied ML settings. When deploying a new recommendation system or fraud detection model, practitioners compare against the existing production system as a baseline to justify the cost and risk of switching. A well-chosen baseline enforces intellectual honesty, guards against overfitting to benchmarks, and ensures that reported improvements translate to real-world value.