Envisioning is an emerging technology research institute and advisory.



A/B Testing

A controlled experiment comparing two variants to determine which performs better.

Year: 2000 · Generality: 774

A/B testing, also known as split testing, is a controlled experimental method used to compare two versions of a system, interface, or model to determine which produces superior outcomes. Participants or data points are randomly assigned to either a control group (A) or a treatment group (B), with each group exposed to a different variant. Performance is then measured against predefined metrics — such as click-through rates, conversion rates, or prediction accuracy — and statistical tests are applied to determine whether observed differences are meaningful or simply due to chance.
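The statistical comparison described above can be sketched with a standard two-proportion z-test on conversion rates. This is a minimal illustration, not a production experimentation framework; the function name and the example counts are invented for demonstration.

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates
    between control (A) and treatment (B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return z, p_value

# Hypothetical experiment: A converts 200 of 10,000 users, B converts 260 of 10,000
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
```

With these illustrative numbers the p-value falls below a conventional 0.05 threshold, so the observed lift would be judged unlikely to arise from chance alone.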

In machine learning contexts, A/B testing is a cornerstone of model evaluation and deployment. When a new model or algorithm is developed, it is rarely safe to assume that offline benchmark improvements will translate directly to real-world gains. A/B testing allows teams to route a portion of live traffic to the new model while the existing model continues serving the remainder, enabling direct comparison under identical real-world conditions. This approach catches failure modes that offline evaluation often misses, such as shifts in user behavior or data distribution drift.
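Routing a portion of live traffic to the new model is commonly done with deterministic hashing, so a given user always sees the same variant. The sketch below assumes string user IDs and an experiment name; both identifiers are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.1) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing (experiment, user_id) keeps each user's assignment stable
    across sessions and independent across different experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same bucket for a given experiment
variant = assign_variant("user-42", "ranker-v2")
```

Seeding the hash with the experiment name means that being in treatment for one test says nothing about a user's assignment in another, which avoids correlated exposure across concurrent experiments.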

The methodology relies on sound statistical principles, particularly hypothesis testing and the concept of statistical significance. Practitioners must carefully determine sample sizes in advance to ensure adequate statistical power, avoiding premature conclusions from underpowered experiments. Common pitfalls include peeking at results before sufficient data is collected, running too many simultaneous tests without correction, and conflating statistical significance with practical importance. Extensions like multi-armed bandits offer more adaptive alternatives that balance exploration and exploitation during the testing period itself.
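Determining sample size in advance can be sketched with the standard approximation for a two-proportion test, shown below. The parameter names are illustrative, and the formula is the common normal-approximation sizing rule, not a universal prescription.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size to detect an absolute lift `mde`
    over baseline rate `p_base` with a two-sided test at level `alpha`."""
    p_alt = p_base + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = NormalDist().inv_cdf(power)            # critical value for power
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a 0.5-point absolute lift over a 2% baseline (illustrative numbers)
n = sample_size_per_group(p_base=0.02, mde=0.005)
```

Running the experiment until this many users have been observed in each arm, rather than stopping when a difference first looks significant, is what guards against the "peeking" pitfall mentioned above.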

A/B testing has become a foundational practice at technology companies, where it underpins decisions ranging from ranking algorithm updates to recommendation system changes. Its value lies in grounding decisions in empirical evidence rather than intuition, reducing the risk of deploying changes that harm user experience or model performance. As ML systems grow more complex, rigorous experimentation frameworks built around A/B testing principles have become essential infrastructure for responsible model development and iteration.

Related

Hypothesis Testing
A statistical framework for making data-driven decisions by evaluating competing claims.
Generality: 875

Ablation
Systematically removing model components to measure their individual contribution to performance.
Generality: 700

Benchmark
A standardized test used to measure and compare AI model performance.
Generality: 796

Baseline
A reference model used to benchmark whether new AI approaches actually improve performance.
Generality: 795

Test Set
A held-out dataset used to evaluate a trained model's real-world generalization.
Generality: 820

Eval (Evaluation)
Measuring an AI model's performance against defined metrics and datasets.
Generality: 838