The modern world runs on data. Most organizations constantly collect data on individuals to ensure that their products resonate with users as much as possible. While this is a sensible goal, much of that data contains Personally Identifiable Information (PII) or other sensitive details that users would not be comfortable sharing, and that data privacy regulations often forbid organizations from sharing.
This conundrum was initially addressed through anonymization, but most anonymization techniques are far from secure: they can often be reverse-engineered by linking the released data with an additional dataset. Differential Privacy was introduced to protect individual privacy while still allowing an organization to collect accurate data.
In this post, we’ll explore Differential Privacy in detail to explain what it is, what it is used for, its advantages, and its limitations.
Differential Privacy is a mathematical definition that provides a rigorous framework for developing privacy-preserving technologies. It allows organizations to share information about a dataset by describing the patterns of groups within it while withholding all personal information about individuals. The concept aims to protect the privacy of individuals within a dataset while still extracting useful information from that dataset.
Differential Privacy is widely considered to be superior to other traditional de-identification and information concealment approaches: it provides provable, quantifiable privacy guarantees; it is immune to information leakage through post-processing; and it allows complex analyses to be composed from simpler ones. Other approaches lack some or all of these properties.
In recent years, researchers have developed differentially private mechanisms and methods that allow organizations to perform data analysis on aggregated sensitive data while ensuring statistical validity and maintaining the privacy of the individuals in the datasets. However, Differential Privacy is not a panacea: information leakage inevitably accumulates as the data is repeatedly accessed by differentially private mechanisms, so Differential Privacy is not suited to every use case requiring privacy. In particular, it cannot replace cryptographic protocols for communicating highly sensitive data.
Differential Privacy is implemented by applying a randomized mechanism, ℳ[D], to any information exposed from a dataset, D, to an exterior observer. The mechanism works by introducing controlled randomness, or “noise”, into the exposed data to protect privacy. A Differential Privacy mechanism can employ a range of techniques, such as randomized response, shuffling, or additive noise; the particular choice of mechanism is tailored to the nature and quality of the information sought by the observer. The mechanism is designed to provide an information-theoretic privacy guarantee: the output of a particular analysis remains roughly the same whether or not data about any particular individual is included.
For example, if you have a database D and you remove the data of an individual X to obtain database D’, the mechanism ℳ applied to both datasets should produce a similar result:
ℳ[D]~ℳ[D’],
such that the observer (the data analyst accessing the output of the mechanism) cannot tell with sufficient certainty whether any particular individual’s data was or was not removed from the dataset. If the results are not approximately the same, the analysis does not satisfy Differential Privacy.
The strength of the privacy guarantee in Differential Privacy is controlled by tuning the privacy parameter ε, also known as the privacy loss or privacy budget. The lower the value of the privacy budget, the more privacy afforded to each individual’s data. However, the amount of noise or “randomness” applied by the mechanism increases as ε is reduced; if the privacy budget becomes too small, the exposed information becomes useless for any practical purpose.
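In its standard form, this requirement can be stated precisely: a mechanism ℳ satisfies ε-Differential Privacy if, for every pair of datasets D and D’ differing in the data of a single individual, and for every set S of possible outputs,
Pr[ℳ[D] ∈ S] ≤ e^ε · Pr[ℳ[D’] ∈ S].
A smaller ε forces the two output distributions to be nearly indistinguishable, which is exactly what limits how much an observer can learn about any single individual.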
The privacy budget quantifies the privacy loss incurred by applying the mechanism ℳ to the data, and it is this quantification that delivers the three main advantages of Differential Privacy: provable, quantifiable privacy guarantees; immunity to post-processing; and composability, which allows complex analyses to be built from simpler differentially private ones.
The main challenge is to construct a ‘least noisy’ mechanism which retains utility yet guarantees privacy.
The practical design of systems for Differential Privacy must take into consideration both privacy and security. Security refers to who is allowed to access a piece of data, while privacy refers to what can be inferred from a data release. The two major Differential Privacy deployment types, each addressing a different threat model, are the central model, in which a trusted curator collects the raw data and applies the Differentially Private mechanism to the results it releases, and the local model, in which each individual perturbs their own data before it ever leaves their device, so that no trusted curator is required.
The central and local models each have their advantages and drawbacks. To overcome these limitations, Differential Privacy can be combined with fully homomorphic encryption (FHE). FHE allows computing on encrypted data without decrypting it first, thus enabling secure computation of Differentially Private functions. In this approach, the need for a trusted curator is eliminated and the accuracy of the central model is achieved with the security benefits of the local model. The Duality Privacy Preserving Data Collaboration Platform is the only enterprise-ready solution capable of combining multiple approaches and/or privacy enhancing technologies (PETs).
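To make the accuracy gap between the two models concrete, here is a minimal Python sketch of the same counting query answered under each model; the dataset size, privacy budget, and query are illustrative assumptions, not drawn from any particular deployment. In the central model, a trusted curator adds Laplace noise once to the aggregate; in the local model, each individual perturbs their own value before it is collected, so the noise accumulates across all contributions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: 10,000 individuals, each holding a private 0/1 value.
n = 10_000
true_values = rng.integers(0, 2, size=n)
true_count = true_values.sum()

epsilon = 0.5  # illustrative privacy budget

# Central model: a trusted curator sees the raw data and adds Laplace noise
# (scale = sensitivity / epsilon; a counting query has sensitivity 1)
# once, to the aggregate result.
central_estimate = true_count + rng.laplace(scale=1.0 / epsilon)

# Local model: each individual adds Laplace noise to their own value before
# sending it, so no curator ever sees raw data, but the noise accumulates.
local_reports = true_values + rng.laplace(scale=1.0 / epsilon, size=n)
local_estimate = local_reports.sum()

print(f"true count:       {true_count}")
print(f"central estimate: {central_estimate:.1f}")
print(f"local estimate:   {local_estimate:.1f}")
```

Running a sketch like this typically shows the central estimate landing within a few counts of the truth, while the local estimate deviates by a few hundred; that accuracy trade-off is what the combination with FHE described above aims to avoid.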
Differential Privacy was pioneered by Dwork, McSherry, Nissim, and Smith, who introduced it in 2006, but it wasn’t until 2016 that it started gaining traction, after Apple announced that it would be using it in iOS 10 and macOS Sierra. Apple has been using Differential Privacy technology ever since to improve QuickType and emoji suggestions, Spotlight deep link suggestions, and Lookup Hints in Notes, while minimizing the compromise of individual privacy.
Google had, in fact, already deployed a Differential Privacy tool called Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) to Chrome browsers in 2014. It helps Google analyze and draw insights from browser usage while preventing sensitive information from being traced back to individual users.
Since the adoption of Differentially Private tools by Google and Apple, Differential Privacy has become widely adopted, and today it is considered the de facto standard for achieving privacy in data analysis. However, the vast majority of commercial deployments of Differential Privacy, such as Apple’s and Google’s studies of usage trends, exclusively employ the Local Differential Privacy (LDP) approach, in which the key mechanism utilized is randomized response.
The randomized response mechanism has been commonly utilized in surveys that need to protect the privacy of user responses. The mechanism guarantees privacy to individuals by providing plausible deniability, which makes individuals more willing to provide data to an organization, knowing that it cannot be used to learn something harmful about them.
For example, an organization can administer a survey where individuals are asked whether or not they have ever cheated on their taxes. The randomized response mechanism can then be used to randomize responses so that individuals won’t be afraid of being held liable.
To do that, a coin can be tossed after Bob answers, let’s say, “Yes.” If the result is heads, the response is recorded correctly. If the result is tails, the coin is tossed once again: if it lands heads, “Yes” is recorded; otherwise the answer is recorded as “No.” In this case, there is a 75% chance that the correct answer was recorded, yet Bob retains plausible deniability. Because the noise is introduced with known probabilities, the true proportion of “Yes” answers can still be estimated accurately from the aggregated responses.
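A minimal Python sketch of this two-coin mechanism, together with the de-biasing step an analyst would apply to the aggregated responses, might look as follows; the survey size and the true “Yes” rate are made-up values for illustration.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Record the true answer only if the first coin lands heads;
    otherwise record the outcome of a second, independent coin flip."""
    if random.random() < 0.5:        # first flip: heads -> answer truthfully
        return true_answer
    return random.random() < 0.5     # tails -> second flip decides the record

def estimate_true_rate(responses: list[bool]) -> float:
    """Invert the known noise: P(recorded Yes) = 0.75*p + 0.25*(1 - p),
    where p is the true proportion of Yes answers, so p = 2*P(Yes) - 0.5."""
    recorded_yes_rate = sum(responses) / len(responses)
    return 2 * recorded_yes_rate - 0.5

# Hypothetical survey: 20% of 100,000 respondents truly answer "Yes".
truth = [random.random() < 0.2 for _ in range(100_000)]
responses = [randomized_response(answer) for answer in truth]
print(f"true rate ≈ 0.20, estimated rate ≈ {estimate_true_rate(responses):.3f}")
```

No individual response can be trusted on its own, yet the population-level estimate converges on the true rate as the number of respondents grows.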
On the other hand, Central Differential Privacy (CDP) is far less adopted for commercial deployments. One notable example is the use of CDP by the U.S. Census Bureau for releasing Census data since 2020. The Bureau is required to share an anonymized version of this data, such that private information cannot be traced back to individuals. However, the Bureau concluded that traditional anonymization techniques have become obsolete, because re-identification methods make it possible to reveal information about a specific individual from a non-Differentially Private anonymized dataset.
The most common mechanism for implementing the Central Differential Privacy framework is the Laplace mechanism, which has proven effective for masking the results of database queries. The mechanism simply adds noise drawn from a Laplace distribution to the query response: the more sensitive the query, and the stronger the desired privacy guarantee, the more noise is added.
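As a rough sketch of how this calibration works in practice, the noise scale is set to sensitivity/ε; the query, its sensitivity of 1, and the ε values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(query_result: float, sensitivity: float, epsilon: float) -> float:
    """Return the query result perturbed with Laplace noise of scale
    sensitivity / epsilon, the standard calibration for epsilon-DP."""
    return query_result + rng.laplace(scale=sensitivity / epsilon)

# Hypothetical counting query ("how many records satisfy X?"): its
# sensitivity is 1, because adding or removing one individual changes
# the count by at most 1.
true_count = 4_213
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1))  # more noise, stronger privacy
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0))  # less noise, weaker privacy
```

With ε = 0.1 the released count can easily be off by tens of records, whereas with ε = 1.0 it is usually within a few, illustrating the utility-versus-privacy trade-off controlled by the privacy budget.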
Organizations handle masses of data today and are responsible for ensuring that all of it is protected and that the appropriate privacy measures are taken. Differential Privacy helps accomplish this by allowing organizations to extract patterns from a dataset without compromising individual privacy. In many instances, Differential Privacy needs to be combined with emerging privacy-preserving technologies such as fully homomorphic encryption (FHE) so that organizations can still collaborate with third parties to gain business insights while protecting all sensitive information.