A controlled experiment comparing two variants to determine which performs better.
A/B testing, also known as split testing, is a controlled experimental method used to compare two versions of a system, interface, or model to determine which produces superior outcomes. Participants or data points are randomly assigned to either a control group (A) or a treatment group (B), with each group exposed to a different variant. Performance is then measured against predefined metrics — such as click-through rates, conversion rates, or prediction accuracy — and statistical tests are applied to determine whether observed differences are meaningful or simply due to chance.
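A minimal sketch of those mechanics, assuming hypothetical conversion rates purely for illustration: users are randomly assigned to group A or B, a binary outcome is recorded for each, and the per-group rates are compared.

```python
import random

random.seed(42)

# Hypothetical underlying conversion rates for the two variants (assumed values).
TRUE_RATE = {"A": 0.10, "B": 0.12}

def run_experiment(n_users: int) -> dict:
    """Randomly assign users to A or B and tally conversions per group."""
    results = {"A": [0, 0], "B": [0, 0]}  # [conversions, assignments]
    for _ in range(n_users):
        group = random.choice(["A", "B"])               # random assignment
        converted = random.random() < TRUE_RATE[group]  # observed outcome
        results[group][0] += int(converted)
        results[group][1] += 1
    return results

counts = run_experiment(20_000)
for group, (conv, n) in counts.items():
    print(f"{group}: {conv}/{n} converted ({conv / n:.3%})")
```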
In machine learning contexts, A/B testing is a cornerstone of model evaluation and deployment. When a new model or algorithm is developed, it is rarely safe to assume that offline benchmark improvements will translate directly to real-world gains. A/B testing allows teams to route a portion of live traffic to the new model while the existing model continues serving the remainder, enabling direct comparison under identical real-world conditions. This approach catches failure modes that offline evaluation often misses, such as shifts in user behavior or data distribution drift.
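One common way to implement this routing is to hash a stable user identifier so that each user consistently sees the same variant across requests. The sketch below assumes a 10% rollout fraction and the variant names "baseline" and "candidate"; both are illustrative choices, not a prescribed setup.

```python
import hashlib

# Fraction of live traffic routed to the new (candidate) model -- an assumed value.
CANDIDATE_TRAFFIC_FRACTION = 0.10

def assign_variant(user_id: str) -> str:
    """Deterministically map a user to the control or treatment model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # stable bucket in [0, 10_000)
    if bucket < CANDIDATE_TRAFFIC_FRACTION * 10_000:
        return "candidate"   # new model (treatment, B)
    return "baseline"        # existing model (control, A)

# The same user always lands in the same bucket, so their experience is consistent.
print(assign_variant("user-1234"))
print(assign_variant("user-1234"))
```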
The methodology relies on sound statistical principles, particularly hypothesis testing and the concept of statistical significance. Practitioners must carefully determine sample sizes in advance to ensure adequate statistical power, avoiding premature conclusions from underpowered experiments. Common pitfalls include peeking at results before sufficient data is collected, running too many simultaneous tests without correction, and conflating statistical significance with practical importance. Extensions like multi-armed bandits offer more adaptive alternatives that balance exploration and exploitation during the testing period itself.
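To make the statistical workflow concrete, the following sketch applies a standard two-proportion z-test to observed conversion counts and estimates the per-group sample size needed up front for a given significance level and power. The counts, rates, alpha, and power below are hypothetical defaults for illustration.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple:
    """Return (z statistic, two-sided p-value) comparing conv_a/n_a vs conv_b/n_b."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

def required_sample_size(p_a: float, p_b: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size to detect a shift from p_a to p_b at the given alpha and power."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return int(numerator / (p_b - p_a) ** 2) + 1

# Hypothetical results: 10.0% vs 11.2% conversion over 10,000 users per group.
z, p = two_proportion_z_test(conv_a=1_000, n_a=10_000, conv_b=1_120, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
print("users needed per group:", required_sample_size(0.10, 0.112))
```

Computing the required sample size before launch, rather than stopping when the p-value first dips below the threshold, is what guards against the peeking problem described above.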
A/B testing has become a foundational practice at technology companies, where it underpins decisions ranging from ranking algorithm updates to recommendation system changes. Its value lies in grounding decisions in empirical evidence rather than intuition, reducing the risk of deploying changes that harm user experience or model performance. As ML systems grow more complex, rigorous experimentation frameworks built around A/B testing principles have become essential infrastructure for responsible model development and iteration.