In recent posts, I explained why hashing and pseudonyms often fail to provide anonymity. These problems, and the well-known examples of people re-identifying supposedly anonymized data sets, might tempt you into believing that any data set can be re-identified given enough effort or that there is just no way to provide access to data in a privacy-preserving way. But those conclusions would be incorrect.
One of the most exciting developments in privacy research in recent years has been the emergence of a workable, formal definition of privacy-preserving data access, along with algorithms that can provide a mathematical proof of privacy preservation in certain cases. This work, credited to the computer scientist Cynthia Dwork and others, goes under the name “differential privacy”. It’s a name only a theorist could love, but the ideas are worth learning.
Let’s unpack what differential privacy means. This will be just a bit technical, but you don’t need advanced math to get the gist. So hang on and I’ll try to guide you through it.
Assume we have some kind of data set that is privacy-relevant–it includes or is based on potentially sensitive information about individuals. We want to let an analyst get answers to questions about the data in aggregate, but without letting analyst deduce private information about any individual. The trick is how to turn this intuitive goal into a precise definition.
Defining privacy-preserving data access turns out to be harder than you might expect, because many superficially attractive definitions break down. Some definitions work if the analyst can ask only one question, but get into trouble if the analyst can ask multiple questions–the analyst might be able to ask two “harmless” questions whose answers, taken together, reveal everything about an individual. Some definitions break down if the analyst has any access to “outside” information about the world, even basic information like the fact that the average human is less than ten feet tall. Some definitions turn out to be impossible to achieve in this universe. Differential privacy avoids all of these pitfalls.
The core of the definition is simple but subtle. Imagine two data sets, A and B, which are exactly identical except that A includes information about one additional person who is not included in B. Now we ask: If we let the analyst get answers to questions about our data set, can the analyst tell whether we are using data set A or data set B? If the analyst can’t tell the difference, and this result holds no matter which data set A we started with and no matter which person we excluded from B, then we have succeeded in preserving privacy. Why? Because any inference the analyst makes about you will have to be the same inference he would have made even if your information were not in the data set at all. To put it another way, including your information in the data set did nothing to harm your privacy.
Now there is one more itty-bitty problem, which is that the definition as stated so far can’t quite be achieved, so we have to modify the definition slightly. Instead of saying that the analyst’s results have to be exactly identical whether we’re using data set A or B, we will say that the results have to be almost identical. We will allow a very small probability that the analyst’s answers depend on whether we are using A or B. For example, if the probability that the analyst can tell the difference is less than 0.01%, then we’ll say that we have achieved differential privacy at the 0.01% level.
To give you an idea of what this means in practice, suppose we give an analyst controlled access to a medical database. And suppose the analyst wants to violate my privacy by figuring out whether I have diabetes. If the database doesn’t include any information about me, the analyst can still try to estimate the likelihood that I have diabetes, perhaps by using information about the prevalence of diabetes in the population at large. Let’s say his best guess, based on all of the available medical science and statistics about the population generally, is that there is a 2% chance that I have diabetes. Now if we give the analyst controlled access to my doctor’s database, via a method that guarantees differential privacy at the 0.01% level, then the analyst might be able to adjust his estimate of the odds that I have diabetes–but only by a tiny bit. His best estimate of the odds that I am diabetic, which was originally 2%, might go as low as 1.9998% or as high as 2.0002%. The tiny difference of 0.0002% is too small to worry about.
That’s differential privacy.
Why is this definition important? There are two reasons. First, the definition avoids the pitfalls I mentioned above–it lets us reason sensibly about situations with multiple queries; it is robust against the possibility that the analyst has access to “outside” information; and it is achievable in this universe. Second, in at least some important cases, there are known methods that can provably guarantee differential privacy. I’ll discuss some of those in future posts.
[Footnote for theory purists: I took some minor mathematical liberties above to simplify the explanation. The formal definition makes the queries randomized functions of the data set, and defines privacy in terms of the ratios of the probabilities (over the random bits used by the query functions) of sets of query results. Also the privacy parameter is exponentiated for technical reasons.]