
Cherry Picking

Selectively presenting only the most favorable outputs to misrepresent an AI system's true performance.

Year: 2012 · Generality: 398

Cherry picking in AI and machine learning refers to the practice of selectively reporting or showcasing only the most favorable results from a set of outputs, while deliberately omitting less impressive or failed runs. Rather than presenting a representative sample of a model's behavior, cherry picking curates results to create an inflated impression of capability. This can occur in research papers, product demos, benchmark comparisons, or public showcases, and it fundamentally distorts the perceived performance of an AI system.

The mechanism is straightforward: because stochastic training processes, random seeds, and varied hyperparameter settings cause models to produce different results across runs, a practitioner can simply run a model many times and report only the best outcome. In generative AI, this might mean selecting the most coherent or visually impressive sample from hundreds of generations. In classification or regression tasks, it might mean reporting peak validation accuracy rather than average or median performance across multiple training runs. Without disclosure of how many attempts were made, the reported result becomes statistically misleading.
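To make the mechanism concrete, here is a minimal Python sketch using a toy simulation (the numbers are illustrative and `train_and_evaluate` is a hypothetical stand-in for a full training run, not any real model). It contrasts the cherry-picked "best of N" score with the mean and spread over the same runs:

```python
import random
import statistics

def train_and_evaluate(seed: int) -> float:
    """Stand-in for a full training run; returns a validation accuracy.

    Toy model: accuracies drawn from a fixed distribution (~72% +/- 3%),
    so run-to-run variation comes only from the random seed.
    """
    rng = random.Random(seed)
    return rng.gauss(mu=0.72, sigma=0.03)

scores = [train_and_evaluate(seed) for seed in range(50)]

# Cherry-picked report: only the single best run is disclosed.
print(f"best of 50 runs:  {max(scores):.3f}")

# Representative report: central tendency and spread across all runs.
print(f"mean of 50 runs:  {statistics.mean(scores):.3f}")
print(f"stdev of 50 runs: {statistics.stdev(scores):.3f}")
```

Because the maximum of many draws sits in the upper tail of the distribution, the "best of 50" figure will reliably exceed the mean, even though nothing about the underlying model changed.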

Cherry picking poses serious problems for scientific reproducibility and the trustworthiness of AI benchmarks. When researchers or companies selectively report results, downstream consumers — including other researchers, policymakers, and the public — form inaccurate beliefs about what AI systems can reliably do. This contributes to hype cycles and can lead to poor deployment decisions. The problem has grown more acute as generative models produce highly variable outputs, making selective curation both easier and more tempting.

Addressing cherry picking requires methodological discipline and community norms around transparency. Best practices include reporting results averaged over multiple random seeds, disclosing the number of experimental runs, using held-out test sets evaluated only once, and publishing negative results alongside positive ones. Initiatives promoting reproducibility in machine learning — such as reproducibility checklists at major conferences like NeurIPS and ICML — directly target this issue. Recognizing cherry picking as a form of evaluation bias is essential for maintaining the integrity of AI research and ensuring that reported capabilities translate to real-world reliability.
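As an illustration of these reporting norms, the sketch below shows a hypothetical summary helper that aggregates per-seed scores into a mean with standard deviation and discloses the number of runs; the function name and output format are assumptions for this example, not any conference's official checklist tooling:

```python
import statistics

def report(scores: list[float]) -> str:
    """Aggregate per-seed results into a disclosure-friendly summary.

    Follows common reproducibility advice: state the number of runs and
    report mean +/- standard deviation instead of a single best score.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if n > 1 else 0.0
    return f"{mean:.3f} +/- {std:.3f} (n={n} seeds, all runs reported)"

# Example with illustrative per-seed accuracies:
print(report([0.71, 0.74, 0.69, 0.73, 0.72]))
```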

Related

Reporting Bias
A systematic distortion in training data caused by selective omission of outcomes or observations.
Generality: 694

P-hacking
Manipulating statistical analysis to produce significant results through selective testing.
Generality: 531

Sampling Bias
A data flaw where training samples misrepresent the true population, distorting model behavior.
Generality: 794

Adversarial Evaluation
Testing AI systems by deliberately crafting inputs designed to expose failures.
Generality: 694

Kaggle Effect
Competition-optimized ML models that fail to generalize beyond their specific contest datasets.
Generality: 101

AI Effect
Achieved AI tasks are dismissed as 'not real intelligence,' perpetually moving the goalposts.
Generality: 520