When I wrote previously about differential privacy, a mathematical framework that allows rigorous reasoning about privacy preservation, I promised to work through an example to show how the theory works. Here goes.
Suppose that Alice has access to a detailed database about everyone in the United States. Bob wants to do some statistical analysis, to get aggregate statistics about the population. But Alice wants to make sure Bob can’t infer anything about an individual. Rather than giving Bob the raw data—which would surely undermine privacy—Alice will let Bob send her queries about the data, and Alice will answer Bob’s queries. The trick is to make sure that Alice’s answers don’t leak private information.
Recall from a previous post that even if Bob only asks for aggregate data, the result still might not be safe. But differential privacy gives Alice a method for answering the queries that is provably safe.
The key idea is for Alice to add random “noise” to the results of queries. If Bob asks how many Americans are left-handed, Alice will compute the true count, add a little bit of random noise to it, and return the altered result to Bob. Differential privacy tells Alice exactly how much noise she needs to add, and in exactly which form, to get the necessary privacy guarantees.
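To make this concrete, here is a minimal Python sketch of one standard way to add the noise, the Laplace mechanism. The text above does not say which noise distribution Alice uses, so the choice of Laplace noise, the function names `laplace_noise` and `noisy_count`, and the privacy parameter `epsilon` are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from a Laplace distribution centered at 0.

    The difference of two independent exponential draws, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng=random):
    """Return a count perturbed with Laplace noise of scale 1/epsilon.

    Assumption: adding or removing one person changes a count by at
    most 1, so the sensitivity of a counting query is 1.
    """
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

Alice would call something like `noisy_count(true_count, epsilon)` and hand Bob only the returned value. A smaller `epsilon` means more noise and stronger privacy.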
The intuition is that if Bob attempts to extract facts about you, as an individual, from the answers he gets, then the noise that Alice added will wash out the effect of your individual data. Like movie spies who turn on music to drown out the sound of a whispered conversation, and thereby frustrate listening devices, Alice adds noise to her answers to mask any one person’s contribution.
For this method to succeed, the noise must be just loud enough to drown out an individual’s records, but not so loud that Bob loses the ability to detect trends in the population. There is a rich technical literature about how to do this and when it tends to work well.
One interesting aspect of this approach is that it treats errors in data as a good thing. This might seem at first to be in tension with traditional privacy principles, which generally treat error as something to be avoided. For example, the Fair Information Practices include a principle of data accuracy, and a right to correct inaccurate data. But on further investigation, these two approaches to error turn out not to be in contradiction. One of the main points of the differential privacy approach is that Bob, who receives the erroneous information, knows that it is erroneous, and knows roughly how much error there is. So Bob knows that there is no point in trying to rely on this information to make decisions about individuals. By contrast, errors are problematic in traditional privacy settings because the data recipient doesn’t know much about the distribution of errors and is likely to assume that data are more accurate than they really are.
So how much noise will Alice need to add to Bob’s query about the number of left-handed Americans? If we assume that a 1% level of differential privacy is required (meaning that Bob can get no more than a 1% advantage over random guessing if we challenge him to guess whether or not your data is included in the data set), then the typical size of the error will have to be about 100. The error might be bigger or smaller in a particular case, but the magnitude of the added error will on average be about 100 people. Compared to the number of left-handed people in America, that is a small error.
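Here is a rough check of that arithmetic in Python. It assumes that the “1% level” corresponds to a privacy parameter of `epsilon = 0.01` and that the noise is Laplace-distributed. Neither assumption is spelled out above, so treat this as a sketch rather than the definitive calculation. For Laplace noise, the average magnitude of the error equals the scale, which is the sensitivity divided by `epsilon`.

```python
import math
import random

epsilon = 0.01      # assumed reading of the "1% level"
sensitivity = 1.0   # one person changes a count by at most 1

# Laplace scale; for Laplace noise this is also the mean absolute error.
scale = sensitivity / epsilon

# e^epsilon bounds how much the probability of any given answer can
# change when one person's record is added or removed.
likelihood_ratio_bound = math.exp(epsilon)

# Empirical check: draw Laplace noise as the difference of two
# exponential samples and average its magnitude.
rng = random.Random(42)
draws = [
    rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    for _ in range(100_000)
]
empirical_mean_abs_error = sum(abs(d) for d in draws) / len(draws)

print(f"noise scale: {scale:.0f}")
print(f"likelihood ratio bound: {likelihood_ratio_bound:.4f}")
print(f"empirical mean |error|: {empirical_mean_abs_error:.1f}")
```

The scale works out to 100, and the simulated average error should land close to that figure. The likelihood ratio bound is just over 1.01, which is where the roughly 1% advantage comes from under these assumptions.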
If Bob wants to ask a lot of questions, then the error in each response will have to be bigger—but the good news is that Bob can ask any question of the form “How many Americans are …” and this same mechanism can provide a provable level of privacy protection.
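One simple way to see why more questions mean more error is to split a fixed privacy budget evenly across the queries. The sketch below assumes basic sequential composition (answering k queries with `epsilon / k` each uses up a total of `epsilon`) and Laplace noise with sensitivity 1. The helper name `per_query_noise_scale` is hypothetical, and more refined composition results can do better than this even split.

```python
def per_query_noise_scale(total_epsilon, num_queries, sensitivity=1.0):
    """Noise scale for each query when the budget is split evenly.

    Under basic sequential composition, num_queries answers that each
    use total_epsilon / num_queries consume total_epsilon overall.
    """
    per_query_epsilon = total_epsilon / num_queries
    return sensitivity / per_query_epsilon


# The typical error grows in proportion to the number of queries.
for k in (1, 10, 100):
    print(k, per_query_noise_scale(0.01, k))
```

With a total budget of 0.01, the typical error per answer goes from about 100 for one query to about 1,000 for ten queries and about 10,000 for a hundred.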
Differential privacy can handle an ever-growing set of situations. One thing it can’t provide, though—because no privacy mechanism can—is a free lunch. In order to get privacy, you have to trade away some utility. The good news is that if you do things right, you might be able to get a lot of privacy without requiring Bob to give up much utility.