P-hacking

Manipulating statistical analysis to produce significant results through selective testing.

Year: 2014
Generality: 531

P-hacking, also called data dredging, refers to the practice of conducting multiple statistical tests on a dataset and selectively reporting only those results that achieve a conventionally significant p-value, typically below 0.05. Rather than forming a hypothesis and testing it once, a researcher engaged in p-hacking explores many combinations of variables, subgroups, or analytical choices until a favorable result emerges. Because any single test has a baseline false-positive rate, running many tests dramatically inflates the probability that at least one will appear significant purely by chance — a phenomenon known as the multiple comparisons problem.
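To make the multiple comparisons problem concrete, here is a minimal simulation sketch (assuming NumPy and SciPy; the sample sizes and test counts are invented for illustration). Every comparison draws both groups from the same distribution, so any "significant" result is a false positive by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_experiments = 2_000    # simulated "studies"
n_tests = 20             # hypotheses tested per study
n_samples = 30           # observations per group
alpha = 0.05

false_positive_studies = 0
for _ in range(n_experiments):
    # Both groups come from the SAME distribution, so there is no
    # real effect anywhere; any low p-value is pure chance.
    significant = False
    for _ in range(n_tests):
        a = rng.normal(0, 1, n_samples)
        b = rng.normal(0, 1, n_samples)
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            significant = True
            break
    false_positive_studies += significant

print(f"Studies with >=1 'significant' result: "
      f"{false_positive_studies / n_experiments:.1%}")
# Expect roughly 1 - (1 - 0.05)**20 ≈ 64%, versus 5% for a single test.
```

Running this shows that a researcher free to test twenty hypotheses will "find" a significant result in roughly two thirds of studies even when nothing is there.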

In machine learning and data science, p-hacking manifests in several familiar forms: tuning model hyperparameters on a held-out test set and reporting the best run as the final result, cherry-picking evaluation metrics after seeing outcomes, or repeatedly splitting data until a validation score looks compelling. These practices are especially tempting in competitive benchmarking environments where small performance gains carry outsized professional rewards. The flexibility of modern ML pipelines — with dozens of architectural choices, preprocessing steps, and training decisions — creates a vast "garden of forking paths" where unconscious bias can quietly steer analysts toward favorable-looking results.
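A minimal sketch of the standard safeguard against the first of these failure modes, assuming scikit-learn and a synthetic dataset: all hyperparameter search happens via cross-validation on the training data, and the held-out test set is evaluated exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Carve off the test set FIRST; it plays no role in any tuning decision.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# All hyperparameter selection happens inside cross-validation on X_train.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "n_estimators": [50, 100]},
    cv=5)
search.fit(X_train, y_train)

# One evaluation on the untouched test set. Reporting the best of many
# such evaluations would reproduce exactly the p-hacking pattern above.
print("Test accuracy:", search.score(X_test, y_test))
```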

The consequences for reproducibility are severe. Models or findings that appear to outperform baselines in published work frequently fail to replicate on fresh data, contributing to a broader replication crisis that spans psychology, medicine, and increasingly machine learning research. Inflated benchmark scores mislead practitioners about real-world performance and distort resource allocation toward approaches that may not generalize.

Mitigating p-hacking requires structural safeguards: pre-registering hypotheses and evaluation protocols before data collection, enforcing strict train/validation/test separation, applying corrections for multiple comparisons such as the Bonferroni method, and reporting negative results alongside positive ones. Transparency norms — sharing code, data, and all experimental runs rather than just the best — are increasingly recognized as essential to trustworthy ML research. Awareness of p-hacking has grown substantially since the early 2010s as the machine learning community began grappling with reproducibility as a first-class scientific concern.
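As a concrete sketch of the Bonferroni method mentioned above (the p-values here are invented for illustration): each p-value is compared against α divided by the number of tests, which bounds the family-wise false-positive rate at α.

```python
# Hypothetical p-values from a batch of 8 exploratory tests.
p_values = [0.003, 0.012, 0.04, 0.049, 0.11, 0.21, 0.38, 0.74]
alpha = 0.05

# Bonferroni: a result counts as significant only if p < alpha / m,
# bounding the chance of ANY false positive across all m tests at alpha.
m = len(p_values)
threshold = alpha / m  # 0.00625 for m = 8

for i, p in enumerate(p_values):
    verdict = "significant" if p < threshold else "not significant"
    print(f"test {i}: p = {p:.3f} -> {verdict}")
```

Note how the values between 0.012 and 0.049 pass the naive 0.05 cutoff but fail the corrected threshold; that gap is exactly where p-hacked "findings" tend to live.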

Related

Cherry Picking

Selectively presenting only the most favorable outputs to misrepresent an AI system's true performance.

Generality: 398

Hypothesis Testing

A statistical framework for making data-driven decisions by evaluating competing claims.

Generality: 875

Reporting Bias

A systematic distortion in training data caused by selective omission of outcomes or observations.

Generality: 694

Kaggle Effect

Competition-optimized ML models that fail to generalize beyond their specific contest datasets.

Generality: 101

Unhobbling

Unlocking latent AI capabilities by removing constraints that limit real-world performance.

Generality: 420

Hyperparameter Tuning

Optimizing model configuration settings that are set before training begins.

Generality: 794