A dataset imbalance in which the underrepresentation of certain groups, topics, or conditions skews model performance.
Coverage bias refers to the systematic underrepresentation of certain groups, topics, or conditions within a training dataset, causing machine learning models to perform unevenly across different segments of the real world. When a dataset captures some portions of a population or domain far more thoroughly than others, the model trained on it learns a distorted picture of reality — excelling where data is abundant and failing where it is sparse. This is distinct from labeling errors or measurement noise; the problem lies in which examples were collected in the first place.
The mechanism is straightforward: gradient-based learning algorithms optimize for average loss across the training distribution. If a subgroup constitutes only a small fraction of that distribution, errors on its examples contribute little to the total loss, so the model has weak incentive to represent it accurately. The result is a system that may achieve high aggregate accuracy while being unreliable or actively harmful for minority groups — a pattern documented in facial recognition, clinical risk scoring, natural language processing, and many other applied domains.
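A toy calculation makes this concrete. In the sketch below, every number (the 95/5 split, the per-group losses and accuracies) is an assumption chosen for illustration, not data from any study:

```python
# Toy arithmetic: how a small subgroup's errors vanish into the average loss.
# All numbers here are hypothetical, chosen only to make the mechanism visible.

majority_frac, minority_frac = 0.95, 0.05

# Suppose the model fits the majority well and the minority badly.
majority_loss, minority_loss = 0.10, 2.00
avg_loss = majority_frac * majority_loss + minority_frac * minority_loss
print(f"average loss: {avg_loss:.3f}")             # ~0.195, dominated by the majority

# The same masking appears in accuracy: a near-failing subgroup barely
# dents the aggregate number most validation pipelines report.
majority_acc, minority_acc = 0.98, 0.55
aggregate_acc = majority_frac * majority_acc + minority_frac * minority_acc
print(f"aggregate accuracy: {aggregate_acc:.3f}")  # ~0.96 despite 55% on the minority
```

Halving the minority's loss from 2.0 to 1.0 would move the average by only 0.05, which is roughly why the optimizer has so little incentive to fix it.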
Coverage bias often originates upstream of modeling, in decisions about what data to collect, from whom, and under what conditions. Medical datasets historically over-sampled certain demographics; image datasets scraped from the web reflect the demographics of internet users; speech corpora may exclude regional accents or non-native speakers. Because these gaps are invisible inside the dataset itself, they can persist undetected through standard validation pipelines that report only aggregate metrics.
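Because a gap is a pattern of absence, surfacing it takes an explicit comparison against the population the model is meant to serve. The following minimal audit sketch is one way to do that; the `coverage_report` helper, the group labels, the reference shares, and the underrepresentation threshold (observed share below half the expected share) are all hypothetical choices for illustration:

```python
from collections import Counter

def coverage_report(groups, reference_shares):
    """Compare observed subgroup shares in a dataset against expected
    population shares. `groups` holds one label per example;
    `reference_shares` maps each group to its share of the deployment
    population."""
    counts = Counter(groups)
    total = sum(counts.values())
    for group, expected in sorted(reference_shares.items()):
        observed = counts.get(group, 0) / total
        flag = "  <-- underrepresented" if observed < 0.5 * expected else ""
        print(f"{group:>12}: observed {observed:.1%} vs expected {expected:.1%}{flag}")

# Hypothetical example: a speech corpus audited against assumed population shares.
labels = ["US"] * 900 + ["UK"] * 80 + ["non-native"] * 20
coverage_report(labels, {"US": 0.60, "UK": 0.15, "non-native": 0.25})
```

Run against collection metadata rather than model outputs, a check like this can flag a coverage gap before any training happens.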
Addressing coverage bias requires deliberate auditing and remediation, starting at the data collection stage: stratified sampling and targeted data acquisition to close gaps, followed by disaggregated evaluation across subgroups. Techniques such as reweighting, oversampling underrepresented classes, and fairness-aware training objectives can partially compensate, but they are no substitute for representative data. As AI systems are deployed in high-stakes settings like hiring, lending, and healthcare, coverage bias has become a central concern in responsible AI development, directly linking data curation practices to questions of equity and accountability.
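To make those compensating techniques concrete, here is a minimal scikit-learn sketch combining group-based reweighting with disaggregated evaluation; everything in it (the synthetic data, the use of `compute_sample_weight` over group labels rather than class labels, the logistic model) is an illustrative assumption, not a prescribed recipe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
n = 2000
group = rng.choice(["majority", "minority"], size=n, p=[0.95, 0.05])
X = rng.normal(size=(n, 5))
# Hypothetical labels: the predictive feature differs by group, so a model
# fit to the skewed data leans on whichever feature serves the majority.
y = np.where(group == "majority", X[:, 0] > 0, X[:, 1] > 0).astype(int)

# Reweight each example inversely to its group's frequency, so the minority
# contributes as much total loss as the majority during training.
weights = compute_sample_weight(class_weight="balanced", y=group)
model = LogisticRegression().fit(X, y, sample_weight=weights)

# Disaggregated evaluation: report accuracy per group, never only in aggregate.
pred = model.predict(X)
for g in ("majority", "minority"):
    mask = group == g
    print(f"{g}: accuracy {(pred[mask] == y[mask]).mean():.2f}")
```

Note that reweighting typically trades some majority accuracy for minority accuracy; as stressed above, it compensates for, rather than fixes, the missing data.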