A while back, I wrote about passwords and promised a later post on salting. This is it: a deeper look at how servers should accept and store passwords. This is a complement to the usual articles on passwords, which focus on the user (you know the ones: “pick strong passwords”); here, I’ll be looking at the server side, and in particular how to store passwords for web sites.
A prefatory note: I’ll be referring a lot to Robert Morris and Ken Thompson’s classic paper “Password Security: A Case History”. If you’re at all involved in password handling, you should read, understand, and cherish it; virtually all of what we can do to protect passwords, even today, is based on what they wrote almost 35 years ago.
The first rule of handling passwords is that you should never store them in the clear. There is no good excuse for it whatsoever. Why not? It’s a prime rule of security: something that doesn’t exist can’t be stolen. Conversely, if something does exist, it can be stolen or leaked in many, many ways. (Read the Morris and Thompson paper for some examples.) Why do some sites store passwords in the clear nevertheless? They think it’s user-friendly to send back the original password when people click the “I forgot” link. Resetting the password to something strong and random is a much better idea. (Password reset carries its own set of risks, but that’s a topic for another day.)
The alternative to “in the clear” is not “encrypted”. If a password is stored encrypted—and I’m using that word in its technical sense—that means that there’s a key that can decrypt it, which in turn brings us back to my previous observation: if something exists, it can be stolen, keys included. Instead, what we do is “hash” the passwords using a non-invertible function. When the user supplies a password at login time, it, too, is hashed; that value is compared with the stored one. Today, the best choice is a standardized cryptographic hash function such as SHA 512. (Why SHA 512 and not MD5 or SHA-3? They’re all fine here, as we shall see below.) Cryptographic hash functions have two very important properties for protecting passwords: they can’t be inverted (in the crypto world, we call that “preimage resistance”); and they take arbitrarily long input strings, i.e., passwords. This is an advance over Morris and Thompson’s solution; at the time they wrote their paper, there were no such functions, so they were forced to use an encryption algorithm in an unnatural way, which in turn limited password length to 8 characters. Given the functions and line speeds we have available today, there are no good reasons to limit passwords to anything less than 1024 characters or thereabouts. Restricting the character set is almost as bad. Why shouldn’t I be able to put an א or a θ in my password if I want to?
The second and third things you should do are to “salt” the password before hashing, and to iterate the hash. Why? To answer that we have to understand how hashed passwords are cracked. First, assume that the attacker has somehow obtained—stolen—a copy of your hashed password file. That shouldn’t be possible, of course, but “shouldn’t” is a dirty word in the security business. Next, the attacker makes lots and lots of guesses about possible passwords, hashes them, and sees if one of the hashed guesses matches the hashed value. This is why strong passwords help: they’re harder to guess algorithmically. Of course, many people still pick really bad passwords. (And how do we know that? Some sites stored their passwords in the clear; some have been stolen and the files posted online. See above for why that’s a bad idea—and was known to be a bad idea in 1979…)
If the attacker is trying lots of guesses, one very useful defensive technique is slow down the process by making the hash function expensive. Suppose, for example, that you want to use SHA512. On my laptop, handling password-size strings, it runs at about 750,000 SHA512 operations per second. If the function is applied 75,000 times to the password, it will take 1/10 seconds to validate a login—or a guess—compared with 1/750,000 seconds without the iteration. No matter how much computing power the bad guys have (and they have more than you do, since they’re stealing it via botnets and using GPUs (graphics processing units) to do the guessing), it cuts their guess rate by a factor of 75,000. That means they can try many fewer guesses or attack many fewer people’s passwords in any given amount of time. You can’t make the iteration count too high or it will take you too long to validate legitimate login attempts, but you can certainly afford a slower response time than 1/750,000 of a second. This also explains why the speed of the underlying hash function isn’t very important: you just adjust the iteration count to match. Again using my laptop as an example, MD5 runs at roughly twice the speed of SHA512. That means I’d use 150,000 iterations instead. (Btw—you may have heard that MD5 has been cracked. It has been, but not in a way that matters here. MD5 has poor “collision resistance”; for password hashing, we only need preimage resistance. For other aspects of cryptography, though, both matter. For password storage, any hash function for which one cannot compute preimages is appropriate; you just have to set the iteration count properly.)
An implementation hint: store the iteration count with the hashed password. That way, as computers speed up and you decide to increase it, your old hashed passwords will remain usable.
Password salting is a more subtle concept and in certain circumstances is not usable. It defends against two different attacks, both of which are quite important. (Morris and Thompson also intended it to defend against custom encryption chips—read the paper—but that’s not seen as a major threat today, at least as compared with botnets and GPUs.)
What is “salt”, when it comes to passwords? The salt is a random number selected when you set the password. This number is stored, unencrypted and unhashed, with the hashed password. It becomes another input to the hash function. This is best explained by a formula. If P is the user’s password, H is some hash function (e.g., SHA 512 repeated 75,000 times), and S is the salt, the system will calculate H(S,P) and store S,H(S,P) in its password database. Since S is stored in the clear, the system can calculate H(S,P) any time someone tries to log in. The question, though, is why we’d do this.
Hashing all of these guessed passwords is expensive, especially for iterated hashes. Attackers would rather save the time that it takes to crack each new site. Accordingly, they’ve built up tables—giant files—of password/hash pairs. Cracking a password, then, is simply a matter of searching this file, and any programmer knows how that can be done very efficiently. The obvious objection is that the file is too large—and it would be, if were stored naively. Is it? Let’s crunch some numbers.
Suppose that people choose 8-character passwords made up of lower-case letters, with just a few tweaks added. There are about 2∙1011 possibilities. Done in a straightforward fashion, we’d need 8 bytes for each password and 16 for an MD5 hash; that comes to about 5 terabytes. That’s not prohibitively large, but larger than any of (today’s) commonly available disks. Suppose, though, that users optionally add a single digit to the end of their 8-letter password. 55 terabytes is a fair amount of storage, even today. Move to 9-letter passwords with that extra optional digit and it’s game over—or is it?
Using a space-time tradeoff, in particular, a technique known as rainbow tables, can drastically reduce the amount of storage necessary. You spend a bit more time cracking each password, but you make it back on space. How much is the reduction? Assume an 8-character password chosen from the entire keyboard: all letters, upper and lower case, all digits, and all special characters. Using the same straightforward storage algorithm, of 24 bytes per password, you’d need about 160 petabytes of disk space. That’s a lot… However, rainbow table for the Windows Vista hash takes only 2 terabytes, enough to fit on a single drive.
Now let’s add some salt. Because the hash of a salted password is different for every possible salt value, you need a lot more storage, even with rainbow tables. Using Morris and Thompson’s 12-bit salt—4096 (212) possible values—that Windows Vista rainbow table would need 8192 terabytes (8 petabytes). And if you used a modern salt, with 232 or even 2128 possible values—well, it would take the proverbial pot of gold from the end of the rainbow table to afford that much disk space.
There’s another advantage to salting, especially on large password files. Suppose that two users have chosen the same password. Shocking, I know, but that study of password choices I mentioned shows just how often it happens. Without salt, identical passwords have identical hashes; with salt, both the salt and the password would have to be the same. With even a 32-bit salt, that won’t happen to any noticeable extent; the attacker would have to go after each of them individually.
The first two defenses I mentioned, hashing and iterating the hash, are always applicable. If you’re not doing these, you’d better have a really good reason—and as I noted, being able to send out a forgotten password doesn’t qualify. Salting, however, isn’t always possible. If the password is used to produce a cryptographic key for the user, and if there’s no way to get the salt securely from the server to the user’s software before the encrypted connection is established, you can’t use salting. This is rarely the case on the wide-area Internet, especially for websites (websites almost always use SSL for encryption, and authenticate later), but there are certainly situations where it occurs, especially within organizations. There may be a solution in the future (you can find my attempt here), but for now we’re stuck without salt in these situations. If you’re running a website with SSL and password authentication, though, you should always salt.
Is there anything else you should do to protect passwords? Yes—lock down the server that holds your users’ passwords. I mean really lock it down; make every open service beg for its life. Don’t just lock down the Internet-facing side while leaving it open to your Intranet, completely lock it down. If that data is stolen without you knowing, you—or, more accurately, your users—are in for a world of hurt. This post is long enough already, though; I’ll save that for another day.