Envisioning is an emerging technology research institute and advisory.

Hypothesis Testing

A statistical framework for making data-driven decisions by evaluating competing claims.

Year: 1984
Generality: 875

Hypothesis testing is a formal statistical procedure for determining whether observed data provides sufficient evidence to reject a baseline assumption, known as the null hypothesis, in favor of an alternative. The process involves defining two competing hypotheses, selecting an appropriate test statistic, computing a p-value or confidence interval from the data, and comparing the result against a predetermined significance threshold (commonly α = 0.05). The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. If it falls below the threshold, the null hypothesis is rejected; otherwise the test fails to reject it, which is not the same as showing that the null hypothesis is true.
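
The steps in this procedure can be sketched with an exact binomial test, using only the Python standard library. The scenario is an assumption made for illustration: a classifier gets 60 of 100 balanced binary examples right, and the null hypothesis is chance-level accuracy (50%). The function names are hypothetical, not taken from any library.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_binomial_p_value(successes, trials, null_rate=0.5):
    """Two-sided exact p-value: the total probability, under the null, of
    every outcome that is no more likely than the observed one."""
    observed = binomial_pmf(successes, trials, null_rate)
    cutoff = observed * (1 + 1e-9)  # small tolerance for floating-point ties
    total = sum(
        binomial_pmf(k, trials, null_rate)
        for k in range(trials + 1)
        if binomial_pmf(k, trials, null_rate) <= cutoff
    )
    return min(1.0, total)

ALPHA = 0.05  # significance threshold, fixed before looking at the data

# Illustrative data (assumed): 60 correct predictions out of 100.
p_value = exact_binomial_p_value(successes=60, trials=100)
reject_null = p_value < ALPHA
print(f"p = {p_value:.4f}, reject null: {reject_null}")
```

With these illustrative numbers the p-value comes out at roughly 0.057, just above the threshold, so the test fails to reject the null even though 60% looks better than chance.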

In machine learning, hypothesis testing serves several important roles. It is used to compare model performance — for example, determining whether one classifier significantly outperforms another on a benchmark dataset rather than simply appearing better due to random variation in evaluation splits. Tests such as the paired t-test, McNemar's test, and the Wilcoxon signed-rank test are commonly applied in this context. Hypothesis testing also underpins feature selection, where statistical tests help identify which input variables have a meaningful relationship with the target variable, and A/B testing frameworks used to evaluate deployed model updates in production environments.
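
As one illustration of comparing two classifiers on the same test set, here is a minimal sketch of the exact (binomial) form of McNemar's test, written with the standard library only. The function name `mcnemar_exact` and the toy predictions are assumptions for the example. The test looks only at examples where the two models disagree: under the null hypothesis, each disagreement is equally likely to favor either model.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (a_only, b_only, p_value), where a_only counts examples that
    only model A classified correctly and b_only those only B got right.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    a_only = sum(1 for y, a, b in rows if a == y and b != y)
    b_only = sum(1 for y, a, b in rows if a != y and b == y)
    n = a_only + b_only  # number of discordant pairs
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree
    # Under the null, a_only follows Binomial(n, 0.5); double the smaller tail.
    tail = sum(comb(n, k) for k in range(min(a_only, b_only) + 1)) / 2**n
    return a_only, b_only, min(1.0, 2 * tail)

# Toy data (assumed): model A is right on all 20 examples,
# model B is wrong on the first 10.
labels = [1] * 20
model_a = [1] * 20
model_b = [0] * 10 + [1] * 10
a_only, b_only, p_value = mcnemar_exact(labels, model_a, model_b)
print(f"A only: {a_only}, B only: {b_only}, p = {p_value:.6f}")
```

Examples that both models get right or both get wrong do not enter the statistic, which is why the test suits paired comparisons on a single shared evaluation set.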

Beyond model evaluation, hypothesis testing informs the broader scientific rigor of ML research. As the field has grown, concerns about reproducibility and inflated performance claims have made principled statistical comparison increasingly important. Researchers use hypothesis tests to guard against overfitting to benchmark leaderboards and to ensure that reported improvements reflect genuine advances rather than noise. Corrections for multiple comparisons, such as the Bonferroni correction, are also relevant when many models or hyperparameter configurations are evaluated simultaneously.
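
The Bonferroni correction mentioned above is simple enough to sketch directly. In this minimal version, each p-value is multiplied by the number of tests (capped at 1) and then compared against the original threshold; the function name and the sample p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted_p_values, rejected_flags) under Bonferroni.

    Multiplying each p-value by the number of tests m is equivalent to
    comparing the raw p-value against alpha / m.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    rejected = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, rejected

# Illustrative p-values (assumed) from comparing four model variants
# against the same baseline.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted, rejected = bonferroni(raw)
for p, p_adj, flag in zip(raw, adjusted, rejected):
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  reject: {flag}")
```

Two of the four raw p-values fall below 0.05 on their own but no longer pass after correction, which is the intended effect: the chance of at least one false positive across all four comparisons stays at or below α.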

Despite its utility, hypothesis testing has well-known limitations in ML contexts. P-values do not measure effect size or practical significance, and with large datasets even trivially small differences can become statistically significant. Modern ML practice increasingly complements hypothesis testing with effect size estimates, confidence intervals, and Bayesian alternatives to provide a fuller picture of model comparison and experimental results.
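
The large-sample issue can be made concrete with a small sketch. It assumes a two-sample z-test with a known, equal standard deviation and equal group sizes, which is a simplification chosen for illustration; the function name and the numbers are hypothetical. The mean difference is held fixed at a tiny value while the sample size grows.

```python
from math import erfc, sqrt

def two_sample_z(mean_diff, sd, n_per_group):
    """Two-sided z-test p-value and Cohen's d for a difference in means.

    Assumes both groups share a known standard deviation `sd` and have
    the same size `n_per_group`.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    p_value = erfc(abs(z) / sqrt(2))  # two-sided standard normal tail
    cohens_d = mean_diff / sd         # effect size, independent of n
    return p_value, cohens_d

# Same tiny difference (0.01 standard deviations) at three sample sizes.
for n in (100, 10_000, 1_000_000):
    p, d = two_sample_z(mean_diff=0.01, sd=1.0, n_per_group=n)
    print(f"n = {n:>9,}  Cohen's d = {d:.2f}  p = {p:.3g}")
```

Cohen's d stays at 0.01 in every row while the p-value shrinks from far above 0.05 to far below it, so the p-value alone says nothing about whether the difference matters in practice.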

Related

A/B Testing

A controlled experiment comparing two variants to determine which performs better.

Generality: 774

Kaleidoscope Hypothesis

A framework for evaluating ML models through dynamic, context-sensitive, and semantically grounded testing.

Generality: 94

P-hacking

Manipulating statistical analysis to produce significant results through selective testing.

Generality: 531

Test Set

A held-out dataset used to evaluate a trained model's real-world generalization.

Generality: 820

Eval (Evaluation)

Measuring an AI model's performance against defined metrics and datasets.

Generality: 838

Universality Hypothesis

The claim that sufficiently expressive models can approximate any learnable function.

Generality: 720