A data flaw where training samples misrepresent the true population, distorting model behavior.
Sampling bias occurs when the data collected to train or evaluate a machine learning model fails to accurately represent the broader population or real-world conditions the model is intended to serve. This mismatch can arise from many sources: convenience sampling that favors easily accessible data, systematic exclusion of underrepresented groups, historical data that encodes past inequities, or feedback loops where a model's own predictions influence future data collection. The result is a training distribution that diverges from the deployment distribution, causing models to perform well on paper but poorly—or unfairly—in practice.
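A toy simulation can make this divergence concrete. The sketch below (using numpy and scikit-learn) constructs a hypothetical two-group population in which the true label boundary differs by group, trains a classifier on a convenience sample that almost entirely omits group 1, and then evaluates on a representative sample. Every distribution and parameter here is invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def sample(n, p_group1):
    """Draw (features, labels, groups) from a made-up two-group population:
    the true label boundary is x > 0 for group 0 but x > 1 for group 1."""
    g = rng.binomial(1, p_group1, n)   # group membership (hidden from the model)
    x = rng.normal(size=n)             # one shared feature
    y = (x > g).astype(int)            # group-specific decision boundary
    return x.reshape(-1, 1), y, g

# Convenience sample for training: group 1 is nearly absent (2%).
X_tr, y_tr, _ = sample(5000, p_group1=0.02)
# Representative deployment population: the groups occur 50/50.
X_te, y_te, g_te = sample(5000, p_group1=0.50)

clf = LogisticRegression().fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("overall deployment accuracy:", accuracy_score(y_te, pred))
print("group 0 accuracy:", accuracy_score(y_te[g_te == 0], pred[g_te == 0]))
print("group 1 accuracy:", accuracy_score(y_te[g_te == 1], pred[g_te == 1]))
```

The classifier places its decision boundary near the group 0 threshold it saw during training, so group 1 accuracy at deployment falls well below group 0's.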
Sampling bias propagates into model behavior through varied mechanisms. A facial recognition system trained predominantly on lighter-skinned faces will exhibit higher error rates on darker-skinned individuals. A medical diagnostic model built on data from large urban hospitals may generalize poorly to rural populations with different disease prevalence. In each case, the model learns the statistical patterns present in the sample rather than the true underlying patterns of the world. This is particularly dangerous because standard evaluation metrics computed on a biased test set may mask the problem entirely, giving false confidence in model quality.
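Continuing the hypothetical sketch above, this masking effect is easy to reproduce: scoring the same model on a test split drawn with the same 2% skew as the training data yields a reassuringly high aggregate accuracy, even though the representative, per-group evaluation has already exposed the failure.

```python
# Reuses sample(), clf, and accuracy_score from the sketch above.
# A test split drawn with the training skew hides the group 1 failure:
X_bt, y_bt, _ = sample(5000, p_group1=0.02)   # biased test split
print("biased-test accuracy:", accuracy_score(y_bt, clf.predict(X_bt)))
```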
Addressing sampling bias requires deliberate intervention at multiple stages of the machine learning pipeline. During data collection, stratified sampling ensures that key subgroups are represented in proportion to their real-world prevalence, or deliberately oversampled when minority-group performance is a priority. Data augmentation and synthetic data generation can help balance skewed datasets. Fairness-aware evaluation, which disaggregates performance metrics across demographic and contextual subgroups, is essential for detecting bias that aggregate metrics obscure. Techniques such as importance weighting and domain adaptation can also correct for known distributional mismatches between training and deployment; importance weighting is sketched below.
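As one concrete correction, the sketch below applies importance weighting to the same hypothetical setup: each training point is weighted by the ratio of its group's deployment frequency to its training frequency, so that the weighted training objective approximates the deployment distribution. The group frequencies are assumed known here; in practice they usually have to be estimated.

```python
# Reuses sample(), X_te, y_te, g_te, and accuracy_score from the sketches above.
# Importance weighting: weight each training point by p_deploy(g) / p_train(g).
X_tr, y_tr, g_tr = sample(5000, p_group1=0.02)
p_train = np.array([0.98, 0.02])    # group frequencies in the biased sample
p_deploy = np.array([0.50, 0.50])   # assumed deployment frequencies
weights = (p_deploy / p_train)[g_tr]

clf_w = LogisticRegression().fit(X_tr, y_tr, sample_weight=weights)
pred_w = clf_w.predict(X_te)
print("weighted group 0 accuracy:", accuracy_score(y_te[g_te == 0], pred_w[g_te == 0]))
print("weighted group 1 accuracy:", accuracy_score(y_te[g_te == 1], pred_w[g_te == 1]))
```

Because the model in this toy setup never observes the group, reweighting can only move the single decision boundary toward a deployment-optimal compromise: group 1 accuracy rises at some cost to group 0. Collecting representative data, or giving the model access to the relevant covariates, remains the more fundamental fix.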
Sampling bias has become a central concern in applied machine learning as AI systems are increasingly deployed in high-stakes domains including criminal justice, hiring, lending, and healthcare. Biased training data in these contexts can systematically disadvantage already-marginalized groups, amplifying societal inequities at scale. Recognizing and mitigating sampling bias is therefore not only a technical imperative for model accuracy but a foundational requirement for building AI systems that are fair, trustworthy, and broadly beneficial.