Which Cryptographic Hash Function Should I Choose

Which cryptographic hash function should I choose?

In cryptography, hash functions provide three separate functions.

Collision resistance: How hard is it for someone to find two messages (any two messages) that hash the same.
Preimage Resistance: Given a hash, how hard is it to find another message that hashes the same? Also known as a one way hash function.
Second preimage resistance: Given a message, find another message that hashes the same.

These properties are related but independent. For example, collision resistance implies second preimage resistance, but not the other way around. For any given application, you will have different requirements, needing one or more of these properties. A hash function for securing passwords on a server will usually only require preimage resistance, while message digests require all three.

It has been shown that MD5 is not collision resistant, however, that does not preclude its use in applications that do not require collision resistance. Indeed, MD5 is often still used in applications where the smaller key size and speed are beneficial. That said, due to its flaws, researchers recommend the use of other hash functions in new scenarios.

SHA1 has a flaw that allows collisions to be found in theoretically far less than the 2^80 steps a secure hash function of its length would require. The attack is continually being revised and currently can be done in ~2^63 steps - just barely within the current realm of computability (as of April, 2009). For this reason NIST is phasing out the use of SHA1, stating that the SHA2 family should be used after 2010.

SHA2 is a new family of hash functions created following SHA1. Currently there are no known attacks against SHA2 functions. SHA256, 384 and 512 are all part of the SHA2 family, just using different key lengths.

RIPEMD I can't comment too much on, except to note that it isn't as commonly used as the SHA families, and so has not been scrutinized as closely by cryptographic researchers. For that reason alone I would recommend the use of SHA functions over it. In the implementation you are using it seems quite slow as well, which makes it less useful.

In conclusion, there is no one best function - it all depends on what you need it for. Be mindful of the flaws with each and you will be best able to choose the right hash function for your scenario.

⚠️ WARNING

August, 2022

DO NOT USE SHA-1 OR MD5 FOR CRYPTOGRAPHIC APPLICATIONS. Both of these algorithms are broken (MD5 can be cracked in 30 seconds by a cell phone).

choosing a hash function

Your question isn't clear. By "maximum number of bytes" you mean "maximum number of items"? The size of the files being hashed has no relation with the number of collisions (assuming that all files are different, of course).

And what do you mean by "maintaining the expected collision count"? Taken literally, the answer is "infinite", but after a certain number you will aways have collisions, as expected.

As for the answer to the question "How many items I can hash while maintaining the probability of a collision under x%?", take a look at the following table:

http://en.wikipedia.org/wiki/Birthday_problem#Probability_table

From the link:

For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more.

This assumes a hash function that outputs a uniform distribution. You may assume that, given enough items to be hashed and cryptographic hash functions (like md5 and sha) or good hashes (like Murmur3, Jenkins, City, and Spooky Hash).

And also assumes no malevolent adversary actively fabricating collisions. Then you really need a secure cryptographic hash function, like SHA-2.

And be careful: CRC and Adler are checksums, designed to detect data corruption, NOT minimizing expected collisions. They have proprieties like "detect all bit zeroing of sizes < X or > Y for inputs up to Z kbytes", but not as good statistical proprieties.

EDIT: Don't forget this is all about probabilities. It is entirely possible to hash only two files smaller than 0.5kb and get the same SHA-512, though it is extremely unlikely (no collision has ever been found for SHA hashes till this date, for example).

Good cryptographic hash functions

You shouldn't store encrypted (or even unencryped) passwords at all. Instead, use salted hashes (stretched, e.g. with PBKDF2), preferably SHA2-512.

For reference, here is a classification of the listed hashes (See wikipedia for details):

Encryption (not a hash function): DES

Non-cryptographic checksums (laughable): adler32, crc32, crc32b

Broken: MD2, MD4, MD5,SHA1

Probably broken: Tiger, snefru, GOST, HAVAL*

Probably safe: SHA2-256/384/512, RIPEMD-128/256, RIPEMD-160/320, WHIRLPOOL

Note that the strength refers to the attack of finding any password that matches a known hash (preimage attack). Also, the above sorting is paranoid, instantly discarding any hash with any known vulnerabilities.

Choosing a hash function for best performance

It depends on the number of files you have.

The chance of a collision P(collision) = c/2^N (in a perfect hash function), where c is your number of messages (files) and N is the number of bits in your collision algorithm.

As real-world hash functions aren't perfect so you have two options: optimize for speed and optimize for collision avoidance.

In the first case you will want to use CRC32. CRC32 is very common but, depending on the number of files you have, might not be enough: you're guaranteed to have a collision at ~4,3 billion messages (32 effective bits), but in practice you might encounter your first collision at ~10 million messages. CRC32 has very fast implementations (SSE 4.2 even has a hardware instruction for it). CRC64 has a lot lower chance of a collision but is not widely used, hence if you want more collision avoidance than CRC32 you better look at cryptographic hash functions.

If you want to avoid collisions while sacrificing speed you will want cryptographic hash functions, of which MD5 (128 bits), SHA-1 (160 bits) and SHA-2 (usually SHA-256 or SHA-512) are the most widely used and have fast implementations. Very efficient hash collision finding algorithms for MD5 are available, but if you input random messages you'll get as close to the P(collision) = c/2^128 as you're ever going to get while still running in reasonable time.

What is a good Hash Function?

For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.

http://www.azillionmonkeys.com/qed/hash.html

If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.

Should I store the hash function name in the DB when hashing passwords?

Yes, this is a good idea. It costs you very little (a few bytes per entry), and means that you can change and improve how you store passwords in future. For instance, suppose you'd started using this method with MD5 some years ago - it would now make it a trivial matter to upgrade to SHA1 or something else more secure by updating each user's password hash when they next log in.

Note you should be using something like PBKDF2 to hash your passwords, not just a salted hash.

What are the important points about cryptographic hash functions?

You may be confused, because the answer to the question you cite is confusing.
One of the requirements for a cryptographic hash function is that it should be preimage resistant. That is, if you know MD5(x) but not the message x, then it is difficult to find any x' (either equal x or different from x) such that MD5(x') = MD5(x).

Being preimage resistant is a different property than being reversible. A function is reversible if given y = f(x) there is exactly one x which fits (whether this is easy or not). For example define f(x) = x mod 10.
Then f is not reversible. From f(x) = 7 you can't determine whether x was 17, 27 or something else. But f is not preimage resistant, since values x' such that f(x) = 7 are easy to find. x' = 17, 27, 12341237 etc all work.

When doing crypto you usually need functions that are preimage resistant (and other properties such as collision resistance), not just something that is not reversible.

Do cryptographic hashes provide really unique results?

(Note: You're asking about hashing functions, not encryption).

It's impossible for them to be unique, by definition. They take a large input and reduce its size. It obviously follows, then, that they can't represent all the information they have compressed. So no, they don't provide "truly unique" results.

What they do provide, however, is "collision resistant" results. I.e. they try and show that two slightly different datas produce a significantly different hash.

Which Cryptographic Hash Function Should I Choose