A mathematical framework that protects individual privacy while enabling useful statistical analysis of datasets.
Differential privacy is a rigorous mathematical framework for quantifying and bounding the privacy risk to individuals when their data is used in statistical analyses or machine learning models. A mechanism satisfies differential privacy if the probability of any given output changes by at most a multiplicative factor of e^ε, where the parameter epsilon (ε) controls the strength of the guarantee, regardless of whether any single individual's record is included in or excluded from the dataset. This guarantee means that an adversary observing the output of a differentially private computation cannot reliably infer whether any particular person's data contributed to it, providing formal, provable privacy protection rather than an ad hoc one.
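Formally, a randomized mechanism M satisfies ε-differential privacy if, for every pair of datasets D and D′ that differ in a single individual's record and for every set of possible outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. Smaller values of ε give a stronger guarantee, because the output distributions on neighboring datasets are forced to be nearly indistinguishable.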
In practice, differential privacy is typically achieved by injecting carefully calibrated random noise into computations. The Laplace mechanism adds noise drawn from a Laplace distribution with scale equal to the query's sensitivity divided by ε, while the Gaussian mechanism adds Gaussian noise, satisfies a slight relaxation known as (ε, δ)-differential privacy, and is often preferred in deep learning contexts. For training neural networks, the DP-SGD algorithm clips per-sample gradients to bound their sensitivity and then adds Gaussian noise before each parameter update, allowing models to be trained with formal privacy guarantees; both ideas are sketched below. The cumulative information leakage across multiple queries or training steps is tracked against a total privacy budget ε, and managing this budget is central to practical deployments.
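As a minimal illustration rather than a production implementation, the sketch below shows the Laplace mechanism applied to a counting query and the clip-and-noise step at the heart of DP-SGD, assuming NumPy is available; the function names and parameter values are illustrative.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    # Laplace mechanism: add noise with scale = sensitivity / epsilon.
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, rng=None):
    # DP-SGD noise step: clip each per-sample gradient to clip_norm,
    # sum the clipped gradients, add Gaussian noise with standard
    # deviation noise_multiplier * clip_norm, then average over the batch.
    rng = rng or np.random.default_rng()
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=per_sample_grads.shape[1]
    )
    return noisy_sum / len(per_sample_grads)

# Example: a counting query has sensitivity 1, since adding or removing
# one person changes the count by at most 1.
ages = np.array([34, 29, 41, 56, 23])
true_count = float(np.sum(ages > 30))
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
```

In a full DP-SGD deployment, a privacy accountant converts the noise multiplier, sampling rate, and number of training steps into an overall (ε, δ) guarantee, which is how the privacy budget is tracked in practice.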
Differential privacy matters because it resolves a fundamental tension between data utility and individual privacy. Traditional anonymization techniques, such as removing names or aggregating records, have repeatedly proven insufficient against re-identification attacks, as demonstrated by high-profile de-anonymizations of supposedly anonymous datasets such as the Netflix Prize ratings and AOL search logs. Differential privacy's mathematical guarantees hold even against adversaries with arbitrary auxiliary information, making it substantially more robust. Major technology companies including Apple, Google, and Microsoft have deployed differentially private systems for collecting telemetry and training models at scale, and the U.S. Census Bureau applied it to the 2020 decennial census.
The framework is especially consequential for machine learning because trained models can inadvertently memorize and leak sensitive training data, a vulnerability exposed by membership inference and model inversion attacks. Differentially private training directly limits this leakage, enabling organizations to build and share models trained on sensitive data in healthcare, finance, and other regulated domains, while providing auditable, quantifiable privacy assurances that support compliance with regulations such as GDPR.