A statistical framework for making data-driven decisions by evaluating competing claims.
Hypothesis testing is a formal statistical procedure for determining whether observed data provide sufficient evidence to reject a baseline assumption, known as the null hypothesis, in favor of an alternative. The process involves defining the two competing hypotheses, selecting an appropriate test statistic, computing a p-value or confidence interval from the data, and comparing the result against a predetermined significance threshold (commonly α = 0.05). The p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. If the p-value falls below the threshold, the null hypothesis is rejected, because data this extreme would be unlikely if the null hypothesis held. A p-value above the threshold means only that the test failed to reject the null hypothesis, not that the null hypothesis has been shown to be true.
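The steps above can be sketched with an exact permutation test, one of several possible tests and chosen here only because it needs nothing beyond the Python standard library. The function name permutation_test and the two score lists are hypothetical illustrations, not data from any real experiment. The null hypothesis is that the group labels are interchangeable, the test statistic is the absolute difference in means, and the p-value is the fraction of all relabelings that produce a difference at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean

def permutation_test(group_a, group_b):
    """Exact two-sided permutation test for a difference in means.

    Null hypothesis: the group labels are interchangeable.
    Returns the fraction of relabelings whose absolute mean difference
    is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)

    extreme = 0
    splits = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so floating-point noise does not hide exact ties.
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits

# Hypothetical accuracy scores from two training setups (illustrative only).
scores_a = [0.81, 0.79, 0.84, 0.80, 0.83]
scores_b = [0.74, 0.76, 0.73, 0.77, 0.75]

alpha = 0.05
p_value = permutation_test(scores_a, scores_b)
reject_null = p_value < alpha
print(f"p = {p_value:.4f}, reject null: {reject_null}")  # p = 0.0079, reject null: True
```

Because every score in the first list exceeds every score in the second, only 2 of the 252 possible relabelings are as extreme as the observed split, which gives p = 2/252. Exhaustive enumeration is feasible only for small samples; larger samples would typically use random relabelings or a parametric test instead.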
In machine learning, hypothesis testing serves several important roles. It is used to compare model performance, for example to determine whether one classifier significantly outperforms another on a benchmark dataset rather than simply appearing better because of random variation in evaluation splits. Tests such as the paired t-test, McNemar's test, and the Wilcoxon signed-rank test are commonly applied in this context. Hypothesis testing also underpins feature selection, where statistical tests help identify which input variables have a meaningful relationship with the target variable, as well as the A/B testing frameworks used to evaluate deployed model updates in production environments.
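As an illustration of comparing two classifiers on the same test set, the following is a minimal sketch of the exact (binomial) form of McNemar's test, written with the standard library only. The helper names discordant_counts and mcnemar_exact and the synthetic predictions are assumptions made for this example. The test looks only at the examples on which the two classifiers disagree: under the null hypothesis, each disagreement is equally likely to favor either model.

```python
from math import comb

def discordant_counts(y_true, pred_a, pred_b):
    """Count examples where exactly one of the two classifiers is correct."""
    only_a = only_b = 0
    for y, a, b in zip(y_true, pred_a, pred_b):
        if a == y and b != y:
            only_a += 1
        elif a != y and b == y:
            only_b += 1
    return only_a, only_b

def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, the number of disagreements won by one
    classifier follows Binomial(n, 0.5) with n = only_a + only_b.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # the classifiers never disagree, so there is no evidence
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Synthetic predictions on 100 test examples (illustrative only):
# 70 both correct, 15 only A correct, 5 only B correct, 10 both wrong.
y_true = [1] * 100
pred_a = [1] * 70 + [1] * 15 + [0] * 5 + [0] * 10
pred_b = [1] * 70 + [0] * 15 + [1] * 5 + [0] * 10

only_a, only_b = discordant_counts(y_true, pred_a, pred_b)
p_mcnemar = mcnemar_exact(only_a, only_b)
print(only_a, only_b, f"{p_mcnemar:.4f}")  # 15 5 0.0414
```

In this synthetic case classifier A is correct on 85 examples and B on 75, and the 15-to-5 split among the 20 disagreements gives a p-value just under 0.05. The exact form is used here because it remains valid when the number of disagreements is small; a chi-squared approximation is a common alternative for larger counts.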
Beyond model evaluation, hypothesis testing supports the scientific rigor of ML research more broadly. As the field has grown, concerns about reproducibility and inflated performance claims have made principled statistical comparison increasingly important. Researchers use hypothesis tests to guard against overfitting to benchmark leaderboards and to check that reported improvements reflect genuine advances rather than noise. When many models or hyperparameter configurations are evaluated simultaneously, the chance of at least one false positive grows with the number of tests, so corrections for multiple comparisons, such as the Bonferroni correction, become relevant.
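The Bonferroni correction mentioned above can be sketched in a few lines: with m tests and an overall significance level α, each individual p-value is compared against α / m instead of α. The function name bonferroni and the p-values below are hypothetical, chosen only to show how the decision changes.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply the Bonferroni correction to a list of p-values.

    Returns one reject/keep decision per test and the per-test threshold.
    """
    m = len(p_values)
    threshold = alpha / m  # each test must clear a stricter bar
    decisions = [p <= threshold for p in p_values]
    return decisions, threshold

# Hypothetical p-values from comparing five model variants to a baseline.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]

uncorrected = [p <= 0.05 for p in p_values]
corrected, per_test_alpha = bonferroni(p_values, alpha=0.05)

print(sum(uncorrected), sum(corrected))  # 4 1
```

Four of the five comparisons look significant at the uncorrected 0.05 level, but only one survives the corrected per-test threshold of 0.01. Bonferroni is deliberately conservative; less strict procedures exist, but they are outside the scope of this sketch.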
Despite its utility, hypothesis testing has well-known limitations in ML contexts. P-values do not measure effect size or practical significance, and with large datasets even trivially small differences can become statistically significant. Modern ML practice increasingly complements hypothesis testing with effect size estimates, confidence intervals, and Bayesian alternatives to provide a fuller picture of model comparison and experimental results.
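The gap between statistical and practical significance can be illustrated with a small sketch that reports a p-value, an effect size (Cohen's d), and a confidence interval side by side. It is a simplification under stated assumptions: both groups share the same standard deviation and sample size, a normal approximation (z-test) is used, and the function name compare_means and the summary statistics are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def compare_means(mean_a, mean_b, sd, n, alpha=0.05):
    """Two-sample z-test from summary statistics.

    Assumes equal standard deviation `sd` and equal sample size `n`
    in both groups (a simplification for illustration).
    Returns (p_value, cohens_d, confidence_interval_for_difference).
    """
    diff = mean_a - mean_b
    se = sd * sqrt(2 / n)  # standard error of the difference in means
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    cohens_d = diff / sd  # effect size does not depend on n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return p_value, cohens_d, ci

# The same tiny difference (0.752 vs 0.750, sd 0.05) at two sample sizes.
p_small, d_small, ci_small = compare_means(0.752, 0.750, sd=0.05, n=100)
p_large, d_large, ci_large = compare_means(0.752, 0.750, sd=0.05, n=1_000_000)

print(f"n=100:       p={p_small:.3f}, d={d_small:.2f}")
print(f"n=1,000,000: p={p_large:.3g}, d={d_large:.2f}")
```

With 100 samples per group the difference is nowhere near significant, while with one million samples per group the p-value falls far below 0.05. Cohen's d is about 0.04 in both cases, which is conventionally regarded as a very small effect, and the confidence interval for the large sample shows that the difference is precisely estimated yet still tiny.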