Manipulating statistical analysis to produce significant results through selective testing and reporting.
P-hacking, also called data dredging, refers to the practice of conducting multiple statistical tests on a dataset and selectively reporting only those results that achieve a conventionally significant p-value, typically below 0.05. Rather than forming a hypothesis and testing it once, a researcher engaged in p-hacking explores many combinations of variables, subgroups, or analytical choices until a favorable result emerges. Because every individual test carries a false-positive rate equal to its significance level, running many tests dramatically inflates the probability that at least one will appear significant purely by chance, a phenomenon known as the multiple comparisons problem.
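The inflation is easy to quantify: if each test has false-positive probability α, the chance that at least one of m independent tests comes out significant is 1 - (1 - α)^m, which is about 64% for twenty tests at α = 0.05. The sketch below (a minimal simulation assuming only numpy and scipy; the sample sizes and test counts are illustrative) demonstrates this on pure noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 1_000  # simulated studies, each running many tests
n_tests = 20       # tests per study; no real effect exists anywhere
alpha = 0.05

hacked = 0
for _ in range(n_studies):
    # Each test compares two samples drawn from the SAME distribution,
    # so any "significant" difference is a false positive.
    p_values = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ]
    # A p-hacker reports only the most favorable test.
    if min(p_values) < alpha:
        hacked += 1

print(f"Studies with a 'significant' finding: {hacked / n_studies:.0%}")
# Roughly 1 - (1 - 0.05)**20, about 64%, despite zero real effects.
```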
In machine learning and data science, p-hacking manifests in several familiar forms: tuning model hyperparameters on a held-out test set and reporting the best run as the final result, cherry-picking evaluation metrics after seeing outcomes, or repeatedly splitting data until a validation score looks compelling. These practices are especially tempting in competitive benchmarking environments where small performance gains carry outsized professional rewards. The flexibility of modern ML pipelines — with dozens of architectural choices, preprocessing steps, and training decisions — creates a vast "garden of forking paths" where unconscious bias can quietly steer analysts toward favorable-looking results.
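As a concrete illustration of the repeated-splitting pitfall, the following sketch (using scikit-learn; the dataset and the 200-seed search are invented for illustration) fits a classifier to labels that are pure noise, yet by shopping across splits finds a validation accuracy well above chance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)  # labels are pure noise: true skill ~50%

# P-hacked protocol: keep re-splitting until the validation score looks good.
best_score, best_seed = 0.0, None
for seed in range(200):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    score = LogisticRegression().fit(X_tr, y_tr).score(X_val, y_val)
    if score > best_score:
        best_score, best_seed = score, seed

# The cherry-picked split reads well above chance despite no real signal.
print(f"Best validation accuracy (seed={best_seed}): {best_score:.2f}")
```

The honest protocol fixes one split (or uses cross-validation) before any scores are inspected, and touches the held-out test set exactly once, for the final evaluation.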
The consequences for reproducibility are severe. Models or findings that appear to outperform baselines in published work frequently fail to replicate on fresh data, contributing to a broader replication crisis that spans psychology, medicine, and increasingly machine learning research. Inflated benchmark scores mislead practitioners about real-world performance and distort resource allocation toward approaches that may not generalize.
Mitigating p-hacking requires structural safeguards: pre-registering hypotheses and evaluation protocols before data collection, enforcing strict train/validation/test separation, applying corrections for multiple comparisons such as the Bonferroni method, and reporting negative results alongside positive ones. Transparency norms — sharing code, data, and all experimental runs rather than just the best — are increasingly recognized as essential to trustworthy ML research. Awareness of p-hacking has grown substantially since the early 2010s as the machine learning community began grappling with reproducibility as a first-class scientific concern.
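For the Bonferroni method mentioned above, the correction is simply to test each of m p-values against α/m rather than α, which caps the family-wise error rate at α. A minimal sketch, reusing the noise-only setup from the earlier simulation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
n_tests = 20

# Twenty independent tests on pure noise, as in the earlier simulation.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(n_tests)
])

# Uncorrected: each p-value is compared to alpha directly.
print("Uncorrected rejections:", int((p_values < alpha).sum()))

# Bonferroni: compare to alpha / m, holding the family-wise error
# rate across all m tests at alpha.
print("Bonferroni rejections: ", int((p_values < alpha / n_tests).sum()))
```

The same adjustment is available in statsmodels via multipletests(p_values, method='bonferroni'), alongside less conservative alternatives such as Holm's step-down procedure.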