One of the most misunderstood topics in privacy is what it means to provide “anonymous” access to data. One often hears references to “hashing” as a way of rendering data anonymous. As it turns out, hashing is vastly overrated as an “anonymization” technique. In this post, I’ll talk about what hashing is, and why it often fails to provide effective anonymity.
What is hashing anyway? What we’re talking about is technically called a “cryptographic hash function” (or, to super hardcore theory nerds, a randomly chosen member of a pseudorandom function family–but I digress). I’ll just call it a “hash” for short. A hash is a mathematical function: you give it an input value and the function thinks for a while and then emits an output value; and the same input always yields the same output. What makes a hash special is that it is as unpredictable as a mathematical function can be–it is designed so that there is no rhyme or reason to its behavior, except for the iron rule that the same input always yields the same output. (In this post I’ll use a hash called SHA-1.)
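To make the "same input, same output" rule concrete, here is a minimal sketch in Python using the standard library's hashlib module. The helper name sha1_hex is my own invention for illustration, and I'm assuming the input text is encoded as UTF-8 before hashing.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 hash of a text string as 40 hex characters.

    Assumption: the text is encoded as UTF-8 before hashing.
    """
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


first = sha1_hex("abc")
second = sha1_hex("abc")
nearby = sha1_hex("abd")  # differs from "abc" in a single character

print(first)             # a9993e364706816aba3e25717850c26c9cd0d89d
print(first == second)   # True: the same input always yields the same output
print(first == nearby)   # False: a tiny change gives an unrelated-looking output
```

Note that nothing here is secret: anyone can run the same function on any input they care to try.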
With that out of the way, let’s consider whether hashing a Social Security Number renders it “anonymous”. If you hash my SSN, the result is b0254c86634ff9d0800561732049ce09a2d003e1. (Let’s call this the “b02 value” for short.) That looks nothing like my SSN–but that in itself does not make the value “anonymous”. The key question is whether a person who gets the b02 value can figure out what my SSN is.
How might an analyst who has the b02 value try to determine my SSN? One approach that doesn’t work is to try to run the hash function backward–or as a mathematician would say, to find its inverse. Many functions can be run backward. Consider the function that adds 17 to its input. To run that function backward, you just subtract 17. The hash has an inverse (of a sort) but nobody knows what it is, and as far as anyone knows it is not feasible to find the inverse. So a smart analyst will give up on the invert-the-hash approach.
But there is another trick available to the analyst, and this trick will work. The analyst simply guesses my SSN: he enumerates all of the possible nine-digit SSNs and hashes each one. When he hashes my correct SSN, the result will be equal to the b02 value, so he will know that he guessed right. You might think it would take a long time to run through all of the possible SSNs, but computers are very fast. There are “only” one billion possible SSNs, so your laptop can hash all of them in less time than it takes you to get a cup of coffee.
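The guessing attack can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tuned attack tool: the function names are hypothetical, I'm assuming the SSN was hashed as a bare nine-digit string with no dashes, and the "secret" below is a made-up value. The demo searches only a small slice of the space so that it finishes quickly; covering all one billion candidates is the same loop with wider bounds.

```python
import hashlib
from typing import Iterable, Iterator, Optional


def sha1_hex(text: str) -> str:
    """SHA-1 of a UTF-8 encoded string, as hex."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def nine_digit_strings(start: int = 0, stop: int = 10**9) -> Iterator[str]:
    """Yield zero-padded nine-digit candidates (assumed format: no dashes)."""
    for n in range(start, stop):
        yield f"{n:09d}"


def guess_input(target_hash: str, candidates: Iterable[str]) -> Optional[str]:
    """Hash each candidate; return the one whose hash matches the target."""
    for candidate in candidates:
        if sha1_hex(candidate) == target_hash:
            return candidate
    return None


# A made-up nine-digit value stands in for a real SSN.
secret = "000123456"
target = sha1_hex(secret)

# Search a small slice of the space for the demo.
recovered = guess_input(target, nine_digit_strings(0, 200_000))
print(recovered)  # 000123456
```

The analyst never inverts the hash. He only runs it forward, over and over, until an output matches.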
A clever analyst would do it even faster. He would hash all of the possible SSNs in advance, and build an index that allowed him to recover the SSN from its corresponding hash value in the blink of an eye. Hashing the SSN would offer no protection at all against an analyst who had built such an index.
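Here is a rough sketch of such an index, using an ordinary Python dictionary that maps each hash value back to the input that produced it. The names are hypothetical, the nine-digit format without dashes is an assumption, and the demo indexes only the first 100,000 candidates. A table covering all one billion SSNs would run to tens of gigabytes, so in practice it would more plausibly live on disk than in a dictionary in memory.

```python
import hashlib
from typing import Dict, Iterable


def sha1_hex(text: str) -> str:
    """SHA-1 of a UTF-8 encoded string, as hex."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_index(candidates: Iterable[str]) -> Dict[str, str]:
    """Precompute a hash -> original value lookup table."""
    return {sha1_hex(candidate): candidate for candidate in candidates}


# Build the index once, in advance (small slice of the space for the demo).
index = build_index(f"{n:09d}" for n in range(100_000))

# Later, each "anonymized" value is reversed with a single lookup.
leaked_hash = sha1_hex("000042424")
print(index.get(leaked_hash))            # 000042424
print(index.get(sha1_hex("999999999")))  # None: outside the demo slice
```

The cost of hashing is paid once, when the index is built. After that, every hashed value in a data set can be reversed with a single lookup.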
It should be clear by this point that hashing an SSN does not render it anonymous. The same is true for any data field, unless it is much, much, much harder to guess than an SSN–and bear in mind that in practice the analyst who is doing the guessing might have access to other information about the person in question, to help guide his guessing.
Does this mean that hashing always fails, and is never a good way to scrub data? Almost, but not quite. There are more advanced uses of hashing that can offer some protection in some settings. But the casual assumption that hashing is sufficient to anonymize data is risky at best, and usually wrong.
[In case you’re wondering, the b02 value is not really the hash of my SSN. It is the hash of the text string “my SSN”. There is no way I would publish the hash of my actual SSN.]