A fairness criterion ensuring model decisions are unchanged when sensitive attributes are hypothetically altered.
Counterfactual fairness is a formal criterion for algorithmic fairness that asks whether a model's decision would remain the same if an individual's sensitive attribute — such as race, gender, or age — had been different, while all other causally downstream variables were adjusted accordingly. Unlike simpler fairness metrics that operate on statistical distributions across groups, counterfactual fairness is grounded in causal inference and requires constructing an explicit causal model of the data-generating process. A model satisfies counterfactual fairness if, in the hypothetical world where only the sensitive attribute changes, the predicted outcome for an individual does not change.
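This criterion can be stated formally. In the notation commonly used in the causal fairness literature (following Kusner et al.'s formulation), a predictor \(\hat{Y}\) is counterfactually fair if, conditional on an individual's observed features \(X = x\) and sensitive attribute \(A = a\), the counterfactual distribution of the prediction is the same under every hypothetical value of the sensitive attribute:

\[
P\big(\hat{Y}_{A \leftarrow a}(U) = y \mid X = x, A = a\big)
= P\big(\hat{Y}_{A \leftarrow a'}(U) = y \mid X = x, A = a\big)
\]

for all outcomes \(y\) and all attribute values \(a, a'\). Here \(\hat{Y}_{A \leftarrow a'}(U)\) denotes the prediction in the counterfactual world where the sensitive attribute is set to \(a'\) while the exogenous background variables \(U\), inferred from the individual's observed data, are held fixed.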
The mechanism relies on structural causal models (SCMs), which represent variables as nodes in a directed acyclic graph with explicit functional relationships. To evaluate counterfactual fairness, practitioners intervene on the sensitive attribute within this causal graph and propagate the change through all variables that are causally influenced by it. This is more rigorous than simply removing the sensitive attribute from a model's inputs, because correlated proxy variables — such as zip code or name — can still encode sensitive information and introduce bias through indirect pathways.
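The intervene-and-propagate procedure can be sketched concretely. The toy SCM below (a hypothetical three-variable graph, with invented structural equations and coefficients chosen purely for illustration) has the sensitive attribute `A` influencing a proxy `Z`, which in turn influences a predictor. Evaluating counterfactual fairness means recomputing every descendant of `A` under the flipped attribute while holding the exogenous noise terms fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical SCM: A -> Z, with exogenous noise U_z.
# A: sensitive attribute (0/1); Z: correlated proxy (e.g. a zip-code feature).
A = rng.integers(0, 2, size=n)
U_z = rng.normal(size=n)

def f_z(a, u_z):
    # Structural equation for the proxy: Z depends on A plus its own noise.
    return 2.0 * a + u_z

Z = f_z(A, U_z)

# Counterfactual: intervene A <- 1 - A, keep the SAME noise terms,
# and propagate the change through every descendant of A.
A_cf = 1 - A
Z_cf = f_z(A_cf, U_z)

# A predictor that uses the proxy Z is counterfactually unfair even though
# it never sees A directly: flipping A changes Z, which changes the decision.
def predict(z):
    return (z > 1.0).astype(int)

flipped = np.mean(predict(Z) != predict(Z_cf))
print(f"fraction of decisions flipped under the counterfactual: {flipped:.2f}")
```

This illustrates the point about proxies in the paragraph above: simply dropping `A` from the model's inputs does not help, because `Z` still carries the causal influence of `A`. A predictor built only on non-descendants of `A` (here, the noise term `U_z`) would leave every decision unchanged under the intervention.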
Counterfactual fairness matters because it addresses a fundamental limitation of group-level fairness metrics: two models can satisfy demographic parity or equalized odds while still making decisions that are causally determined by sensitive attributes at the individual level. By operating at the level of individual causal counterfactuals, this criterion provides a stronger guarantee that sensitive characteristics are not driving outcomes, even through indirect correlations. This makes it particularly relevant in high-stakes domains like credit scoring, hiring, and criminal justice, where individual-level fairness is both ethically important and legally significant.
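The gap between group-level and individual-level criteria can be demonstrated with a deliberately contrived predictor (entirely hypothetical, constructed only to make the separation visible): it approves exactly half of each group, so demographic parity holds, yet which individuals it approves is determined entirely by the sensitive attribute, so every decision flips in the counterfactual world:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
A = rng.integers(0, 2, size=n)   # sensitive attribute
U = rng.normal(size=n)           # exogenous variable, independent of A

def predict(a, u):
    # Approves individuals with u > 0 in one group and u <= 0 in the other:
    # approval RATES match across groups, but the rule keys on A directly.
    return np.where(a == 1, u > 0, u <= 0).astype(int)

yhat = predict(A, U)
rate_0 = yhat[A == 0].mean()
rate_1 = yhat[A == 1].mean()
print(f"approval rate | A=0: {rate_0:.2f}, A=1: {rate_1:.2f}")  # roughly equal

# Counterfactual check: flip A, hold the exogenous U fixed.
yhat_cf = predict(1 - A, U)
print(f"decisions flipped: {np.mean(yhat != yhat_cf):.2f}")
```

Both group approval rates come out near 0.5, satisfying demographic parity, while the counterfactual check flips essentially every individual decision, which is exactly the failure mode that group-level metrics cannot detect.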
Despite its theoretical appeal, counterfactual fairness faces practical challenges. Specifying the correct causal graph requires domain expertise and is often contested, and the approach can be sensitive to modeling assumptions. Estimating counterfactual quantities from observational data is also statistically difficult. As a result, the framework is most useful as a conceptual standard and a tool for auditing, even when full implementation is not feasible.