I have been writing recently about data and privacy. Today I want to continue by talking about aggregate data. A common intuition is that aggregate data (information averaged or summed over a large population) is inherently free of privacy implications. As we'll see, that isn't always right.
Suppose there is a database about all FTC employees, and you’re allowed to query it to get the total salary of all FTC employees as of any particular date. So you ask about the date January 4, 2011, and you get back a number that includes the salaries of the roughly 1100 FTC employees as of that date. The result seems to be privacy-safe, because it is aggregated over so many people.
Next, you ask another aggregate query: What was the total salary of all FTC employees on January 5, 2011? Again, you get a result aggregated over 1100-ish employees–aggregate data, which might seem safe.
But if you subtract the two aggregate values, what you get is the difference in total salary between January 4 and January 5 of 2011. Assuming that the only change in the employee roster in that one-day period is that I joined the FTC, the result will be equal to my salary, which is personal information about me.
What happened here is that subtraction caused the salaries of almost all employees to cancel out, leaving information about only one employee (me). Doing simple math on aggregate values can give you a non-aggregate result.
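The differencing attack above can be sketched in a few lines of code. The salary figures, employee names, and query interface here are all hypothetical, invented just for illustration; the point is that the attacker only ever sees aggregate totals.

```python
# Hypothetical per-employee salaries on each date. The attacker never
# sees this table directly -- only the aggregate totals below.
salaries_jan4 = {"alice": 95_000, "bob": 88_000, "carol": 102_000}
salaries_jan5 = dict(salaries_jan4, newhire=130_000)  # one person joins

def total_salary(salaries):
    """The only query the database answers: a sum over all employees."""
    return sum(salaries.values())

# Two perfectly "aggregate" queries, then one subtraction.
difference = total_salary(salaries_jan5) - total_salary(salaries_jan4)
print(difference)  # exactly the new hire's salary: 130000
```

Every other employee's salary appears in both totals and cancels out, which is why the result isolates a single person.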
You might think that you can solve this problem by watching for a sequence of queries that are too closely related and having the system refuse to answer the last one. But that turns out not to be feasible. A clever analyst can find ways to ask three, or four, or any number of queries that combine to cause trouble; and queries can be related in too many subtle and complicated ways. It turns out that there is no feasible procedure for deciding whether a sequence of aggregate queries allows inferences about an individual.
This is not meant to say that aggregate data is always dangerous, or that it is never safe to release aggregated data. Indeed, aggregated data is released safely all the time. What I am saying is more modest: the simple argument that “it’s aggregate data, therefore safe to release” is not by itself sufficient.
There are lots of examples of aggregate data turning out not to be safe. One example comes from my own research (done before I joined the FTC). Joe Calandrino, Ann Kilzer, Arvind Narayanan, Vitaly Shmatikov, and I published a paper titled “You Might Also Like: Privacy Risks of Collaborative Filtering” in which we showed that collaborative filtering systems, which recommend items based on the past activities of a population of users, can sometimes leak information about the activities of individual users. If a system tells you that people who watch the TV show “Alf” also watch “Dallas,” this fact is aggregate information–essentially a correlation that is calculated across the entire user population. But given enough of this aggregate data, over time, it can become possible (depending on the details of the system) to infer what individual users have purchased and watched. Our paper gave examples where we made individual inferences using data from real systems. In other words, aggregate data can be used to infer individual private information–sometimes.
Nowadays, many collaborative filtering systems have safeguards built in that try to address exactly this kind of inference. They make updates to the recommendations less frequent and less predictable; they show less precise information about correlations; they suppress items that have data from relatively few users; or they add random noise to the rankings. Sometimes they let users opt out from having their data used in these calculations. Done right, these kinds of precautions can protect privacy while maintaining the system’s usefulness.
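Two of the safeguards mentioned above, suppressing items with too few contributing users and adding random noise, can be sketched as follows. The threshold and noise scale here are made-up illustrative values, not parameters from any real recommender system.

```python
import random

def safe_count(user_count, threshold=20, noise_scale=5.0):
    """Release an aggregate count only if enough users contributed,
    and perturb it with random noise so small changes in the underlying
    data cannot be read off exactly."""
    if user_count < threshold:
        return None  # suppress: too few users to release safely
    return user_count + random.gauss(0, noise_scale)
```

With suppression, a correlation backed by only a handful of users is never shown at all; with noise, an analyst who differences two releases gets the true change plus random error, rather than an exact per-user signal.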
Are there general techniques that can make aggregate data provably safe to release? It turns out that there are, at least in some cases. I’ll give an example in a future post.