Selectively presenting only the most favorable outputs to misrepresent an AI system's true performance.
Cherry picking in AI and machine learning refers to the practice of selectively reporting or showcasing only the most favorable results from a set of outputs, while deliberately omitting less impressive or failed runs. Rather than presenting a representative sample of a model's behavior, cherry picking curates results to create an inflated impression of capability. This can occur in research papers, product demos, benchmark comparisons, or public showcases, and it fundamentally distorts the perceived performance of an AI system.
The mechanism is straightforward: stochastic training, random seeds, and varied hyperparameter settings cause a model to produce different results across runs, so a practitioner can simply run it many times and report only the best outcome. In generative AI, this might mean selecting the most coherent or visually impressive sample from hundreds of generations. In classification or regression tasks, it might mean reporting the peak validation score rather than the average or median across multiple training runs. Without disclosure of how many attempts were made, the reported result is statistically misleading.
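A minimal sketch of this effect, using synthetic accuracy values rather than any real model or benchmark: each "run" draws a score from a noise distribution around a fixed underlying skill level, and reporting only the maximum over many runs produces a noticeably higher number than the honest mean.

```python
# Illustrative sketch (not from the source): how reporting only the best of
# many runs inflates apparent performance. Scores are synthetic draws chosen
# purely for demonstration.
import random
import statistics

def run_experiment(seed: int) -> float:
    """Stand-in for one training run; returns a synthetic validation accuracy."""
    rng = random.Random(seed)
    # Underlying skill ~0.80, with run-to-run noise from seeds and tuning.
    return min(1.0, max(0.0, rng.gauss(0.80, 0.03)))

n_runs = 50
accuracies = [run_experiment(seed) for seed in range(n_runs)]

honest_mean = statistics.mean(accuracies)
honest_std = statistics.stdev(accuracies)
cherry_picked = max(accuracies)  # only the single best run is reported

print(f"Mean over {n_runs} runs: {honest_mean:.3f} +/- {honest_std:.3f}")
print(f"Best single run (cherry-picked): {cherry_picked:.3f}")
```

The gap between the two printed numbers is exactly the distortion cherry picking introduces: the best-of-50 figure reflects the selection process as much as the model itself.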
Cherry picking poses serious problems for scientific reproducibility and the trustworthiness of AI benchmarks. When researchers or companies selectively report results, downstream consumers — including other researchers, policymakers, and the public — form inaccurate beliefs about what AI systems can reliably do. This contributes to hype cycles and can lead to poor deployment decisions. The problem has grown more acute as generative models produce highly variable outputs, making selective curation both easier and more tempting.
Addressing cherry picking requires methodological discipline and community norms around transparency. Best practices include reporting results averaged over multiple random seeds, disclosing the number of experimental runs, using held-out test sets evaluated only once, and publishing negative results alongside positive ones. Initiatives promoting reproducibility in machine learning — such as reproducibility checklists at major conferences like NeurIPS and ICML — directly target this issue. Recognizing cherry picking as a form of evaluation bias is essential for maintaining the integrity of AI research and ensuring that reported capabilities translate to real-world reliability.
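As one possible way to operationalize these practices, the sketch below aggregates every run into a single reported summary that discloses the run count alongside the mean and standard deviation. The function and class names are hypothetical, not drawn from any particular framework or checklist.

```python
# Illustrative sketch (not from the source): reporting results over all seeds
# rather than a single selected run. Names here are hypothetical.
import statistics
from dataclasses import dataclass

@dataclass
class ReportedResult:
    metric_name: str
    n_runs: int
    mean: float
    std: float
    all_scores: list  # keep every run, including the weak ones

def summarize_runs(metric_name: str, scores: list) -> ReportedResult:
    """Summarize every run; refuses to accept a single hand-picked score."""
    if len(scores) < 2:
        raise ValueError("Report multiple runs, not a single selected result.")
    return ReportedResult(
        metric_name=metric_name,
        n_runs=len(scores),
        mean=statistics.mean(scores),
        std=statistics.stdev(scores),
        all_scores=sorted(scores),
    )

# Example: test accuracy from five training runs with different random seeds.
result = summarize_runs("test_accuracy", [0.78, 0.81, 0.80, 0.76, 0.82])
print(f"{result.metric_name}: {result.mean:.3f} +/- {result.std:.3f} "
      f"over {result.n_runs} runs (all scores: {result.all_scores})")
```

Keeping the full list of scores in the reported object makes it harder to quietly drop weak runs later, which is the behavior the transparency norms above are meant to prevent.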