Skew introduced when survey non-respondents differ systematically from respondents.
Non-response bias is a form of selection bias that arises when individuals who do not participate in a survey or data collection process differ in meaningful ways from those who do. In machine learning, this matters because models are only as representative as the data they are trained on. If a training dataset is built from survey responses, user feedback, or opt-in interactions, the people who choose not to respond may hold systematically different characteristics, opinions, or behaviors — and their absence quietly distorts the dataset's picture of reality.
The mechanism is straightforward but insidious. Suppose a health app collects symptom data from users who voluntarily complete weekly check-ins. Users who feel well may be less motivated to respond, while those experiencing symptoms engage more consistently. A model trained on this data would overestimate disease prevalence and underweight healthy baselines. Similarly, customer satisfaction surveys tend to attract strongly positive or strongly negative respondents, leaving moderate opinions underrepresented. In each case, the model learns from a skewed slice of the population and generalizes poorly to the full distribution it is meant to serve.
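To make the distortion concrete, here is a minimal simulation of the health-app scenario; the population size, true prevalence, and response propensities are illustrative assumptions, not measured values.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
true_prevalence = 0.10                        # assumed: 10% of users symptomatic
symptomatic = rng.random(n) < true_prevalence

# Assumed response propensities: symptomatic users check in far more often.
p_respond = np.where(symptomatic, 0.70, 0.20)
responded = rng.random(n) < p_respond

# The respondent pool tells a very different story from the population.
print(f"true prevalence:     {symptomatic.mean():.3f}")             # ~0.10
print(f"observed prevalence: {symptomatic[responded].mean():.3f}")  # ~0.28
```

Under these assumptions, roughly 28% of check-ins come from symptomatic users even though only 10% of the population is symptomatic, purely because of who chooses to respond.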
Mitigating non-response bias typically involves a combination of data collection design and post-hoc statistical correction. On the collection side, strategies include follow-up outreach, incentives, and reduced survey burden, all aimed at improving response rates across diverse groups. After data collection, analysts may apply inverse probability weighting, which weights each response by the reciprocal of its estimated probability of being observed so that groups prone to non-response count proportionally more, or use imputation methods to estimate missing values for non-respondents. Multiple imputation and propensity score adjustments are common tools borrowed from survey methodology and adapted for ML pipelines.
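As a sketch of how inverse probability weighting can look in practice, the snippet below fits a simple propensity model and reweights respondents. The covariate, response rates, and outcome are all hypothetical, and a real pipeline would validate the propensity model rather than trust it blindly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

n = 50_000
age_band = rng.integers(0, 4, size=n)         # covariate known for everyone
# Assumed: response rates rise with age band (0.15, 0.30, 0.45, 0.60).
responded = rng.random(n) < (0.15 + 0.15 * age_band)

# Step 1: estimate each unit's probability of responding (a propensity
# model), using covariates observed for respondents and non-respondents alike.
X = age_band.reshape(-1, 1).astype(float)
p_hat = LogisticRegression().fit(X, responded).predict_proba(X)[:, 1]

# Step 2: weight each respondent by the inverse of that probability,
# so groups prone to non-response count proportionally more.
weights = 1.0 / p_hat[responded]

# Compare a naive respondent-only mean with the IPW-corrected mean
# for some quantity of interest (here, membership in the younger bands).
outcome = (age_band < 2).astype(float)
print(f"full-population mean:  {outcome.mean():.3f}")             # ~0.50
print(f"naive respondent mean: {outcome[responded].mean():.3f}")  # ~0.30
print(f"IPW-corrected mean:    "
      f"{np.average(outcome[responded], weights=weights):.3f}")   # ~0.50
```

The same weights can typically be carried into model training; many scikit-learn estimators, for instance, accept them through the sample_weight argument of fit.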
The stakes are particularly high when models inform decisions in sensitive domains such as healthcare, criminal justice, or credit scoring. A model trained on non-representative feedback can systematically disadvantage the very groups least likely to have participated in its training data, compounding existing inequities. Recognizing non-response bias as a data quality issue — not merely a statistical nuisance — is essential for building models that are both accurate and fair.