Let’s continue our discussion of “anonymous” data by talking about pseudonyms.
A pseudonym is any kind of identifier, other than a name, that is associated with a person or (what often amounts to the same thing) a device. Pseudonyms are very common. Examples include the random ID value in a tracking cookie; a device ID such as a WiFi MAC address or a phone’s UDID; a synthetic identifier such as an “OpenUDID”; a mobile phone number; or a Twitter handle.
Sometimes a pseudonym directly contains some information about the subject. For example, the Twitter handle “@TechFTC” is a pseudonym that conveys information about its subject, namely that it probably has something to do with technology and the FTC. That in itself could compromise anonymity (although in this particular case it’s no secret who is behind @TechFTC). But I want to talk about pseudonyms in general, so let’s assume from here on that we’re talking about a “pure pseudonym” that is chosen in an entirely random and unpredictable fashion.
You might think that a randomly chosen pure pseudonym conveys no information about anybody, but that is not right. As soon as you associate the pseudonym with somebody, the pseudonym gives you the ability to record information about that person, or associate information with them. For example, if you can observe the browsing habits of a person who has a known pseudonym, then you can build up a record of where that individual browsed–which conveys information about them.
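To make this concrete, here is a minimal Python sketch of how a collector might file observations under a pure pseudonym. The names `new_pseudonym` and `BrowsingLog`, and the example URLs, are made up for illustration and do not describe any particular tracker.

```python
import secrets
from collections import defaultdict


def new_pseudonym() -> str:
    """Return a random 128-bit hex string that encodes nothing about the user."""
    return secrets.token_hex(16)


class BrowsingLog:
    """Hypothetical collector that files every observed visit under a pseudonym."""

    def __init__(self) -> None:
        self._visits = defaultdict(list)  # pseudonym -> list of URLs seen

    def record(self, pseudonym: str, url: str) -> None:
        self._visits[pseudonym].append(url)

    def profile(self, pseudonym: str) -> list:
        # The identifier is random, but the record filed under it is not.
        return list(self._visits.get(pseudonym, []))


log = BrowsingLog()
visitor = new_pseudonym()
for url in ("clinic.example/appointments", "news.example/lilburn", "pets.example/dogs"):
    log.record(visitor, url)
print(log.profile(visitor))
```

The pseudonym itself carries no information, yet the list stored under it already says a fair amount about the person behind it.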
AOL was reminded about this, to their regret, after they released records of some of their users’ searches, with the user names replaced by numeric pseudonyms. The New York Times, famously, was able to use the search history associated with the pseudonym 4417749 to identify that pseudonym as belonging to Mrs. Thelma Arnold, of Lilburn, GA.
But behavioral history is not the only way a pseudonym can convey information about identity. As long as a pseudonym is still attached to the individual, there is the possibility that the pseudonym can be associated with the user’s real identity. Suppose, for example, that a site has implanted a tracking cookie, containing a unique number, into my browser. The site might not know exactly who I am. But if I then go to the site and log in–revealing my identity–then my identity becomes linkable to the pseudonym, and all of the information associated with the pseudonym becomes linkable to my identity.
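The cookie-then-login scenario can be sketched in a few lines of Python. This is an illustration under assumed names (`Site`, `page_view`, `login`, `history`, and the cookie value `c-81f3` are all hypothetical), not a description of how any real site is built.

```python
from collections import defaultdict


class Site:
    """Hypothetical site that logs activity by cookie ID and notes logins."""

    def __init__(self) -> None:
        self.activity = defaultdict(list)  # cookie ID -> pages viewed
        self.account_for = {}              # cookie ID -> account, once known

    def page_view(self, cookie_id: str, page: str) -> None:
        self.activity[cookie_id].append(page)

    def login(self, cookie_id: str, account: str) -> None:
        # A single login ties the cookie, and everything under it, to a person.
        self.account_for[cookie_id] = account

    def history(self, account: str) -> list:
        pages = []
        for cookie_id, owner in self.account_for.items():
            if owner == account:
                pages.extend(self.activity[cookie_id])
        return pages


site = Site()
site.page_view("c-81f3", "/symptoms/insomnia")  # before login: no name attached
site.login("c-81f3", "alice@example.com")
site.page_view("c-81f3", "/account/settings")
print(site.history("alice@example.com"))
```

Note that the page viewed before the login ends up in the account's history too: the link, once made, applies retroactively to everything already stored under the cookie.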
So it’s clear that pseudonyms are not “anonymous” and that attaching a pseudonym to a user, or gathering information about a pseudonymous user over time, can impact privacy. At least two factors increase the privacy impact of a pseudonym. First, a pseudonym has greater privacy impact if it is shared across data collectors, rather than being used by a single collector. Sharing a pseudonym (or “syncing” separate pseudonyms) allows collectors to connect more collected information, thereby increasing the privacy impact. Second, a pseudonym has greater privacy impact if it has a longer lifetime or cannot easily be changed or erased by the user.
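The first factor, sharing or syncing pseudonyms, amounts to a join between two collectors' records. A rough Python sketch, assuming each collector keeps a dictionary keyed by its own pseudonym and that a table of matching pairs has somehow been obtained (the function name, IDs, and records here are invented for illustration):

```python
def merge_profiles(match_table, records_a, records_b):
    """Join two collectors' records using pairs of pseudonyms known to match."""
    merged = {}
    for id_a, id_b in match_table:
        merged[(id_a, id_b)] = records_a.get(id_a, []) + records_b.get(id_b, [])
    return merged


# Hypothetical data: each collector knows the user only by its own pseudonym.
records_a = {"A-17": ["read article on mortgages"]}
records_b = {"B-904": ["searched for divorce lawyers"], "B-112": ["bought shoes"]}
match_table = [("A-17", "B-904")]  # the "sync": A-17 and B-904 are the same user

combined = merge_profiles(match_table, records_a, records_b)
print(combined[("A-17", "B-904")])
```

Neither collector learned anything new on its own, but the combined record is richer than either half, which is why syncing raises the privacy impact.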
A classic example of a pseudonym with high privacy impact is the Social Security number. The SSN might look like a pure pseudonym, a meaningless nine-digit number, but despite this it is rightly recognized as privacy-sensitive. The SSN is shared across many data collectors, persists for decades, and is very difficult to change.
It’s clear that we can’t write off pseudonyms as “anonymous” and free of privacy implications. But how should we think about pseudonyms and privacy? That question opens the door to a much broader discussion about information, identity, and privacy. I won’t jump into that conversation here–I have written enough already for today–but I hope we’ll shed more light in the comments and in future posts.

