Best Hashing Algorithm in Terms of Hash Collisions and Performance for Strings

Forget about the term "best". No matter which hash algorithm anyone comes up with, unless you have a very limited set of data that needs to be hashed, every algorithm that performs very well on average can become completely useless when fed just the right (or, from your perspective, "wrong") data.

Instead of wasting too much time thinking about how to make the hash more collision-free without using too much CPU time, I'd rather start thinking about how to make collisions less problematic.

For example, if every hash bucket is in fact a table, and all strings in that table (the ones that collided) are sorted alphabetically, you can search within a bucket using binary search, which is only O(log n). That means that even when every second hash bucket has 4 collisions, your code will still have decent performance: a bit slower than with a collision-free table, but not by much. One big advantage here is that if your table is big enough and your hash is not too simple, two strings that produce the same hash value will usually look completely different, so the binary search can stop comparing strings after maybe one or two characters on average, making every comparison very fast.
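A minimal sketch of that idea in Python (not anyone's production code; the class and bucket count are made up for illustration), using `bisect` for the in-bucket binary search:

```python
import bisect

class SortedBucketTable:
    """Hash table whose buckets stay sorted, so lookups inside a
    bucket can use binary search instead of a linear scan."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Any reasonably fast string hash will do here.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, key)
        if i == len(bucket) or bucket[i] != key:
            bucket.insert(i, key)  # keep the bucket sorted

    def contains(self, key):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, key)  # O(log n) within the bucket
        return i < len(bucket) and bucket[i] == key
```

Even when many keys collide into one bucket, lookups there cost O(log n) string comparisons, and colliding strings tend to differ early, so each comparison is cheap.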

Actually, I once had a situation myself where searching directly within a sorted table using binary search turned out to be faster than hashing! Even though my hash algorithm was simple, it took quite some time to hash the values. Performance testing showed that hashing only beats the binary search once there are more than about 700-800 entries. However, since the table could never grow beyond 256 entries anyway, and since the average table held fewer than 10, benchmarking clearly showed that on every system and every CPU the binary search was faster. Here, the fact that comparing the first byte of the data was usually already enough to move on to the next bsearch iteration (because the data tended to differ within the first one or two bytes) turned out to be a big advantage.

So to summarize: I'd take a decent hash algorithm that doesn't cause too many collisions on average and is rather fast (I'd even accept a few more collisions if it's just very fast!), and then optimize my code for the smallest possible performance penalty once collisions do occur. And they will occur, unless your hash space is at least as big as your data space and you can map a unique hash value to every possible piece of data.

Fast String Hashing Algorithm with low collision rates with 32 bit integer

One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs.
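As a sketch, the 32-bit FNV-1a variant looks like this in Python; the two constants are the standard published FNV offset basis and FNV prime:

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep it a 32-bit value
    return h
```

For reference, `fnv1a_32(b"")` is the offset basis itself, and the published test vector for `fnv1a_32(b"a")` is `0xE40C292C`.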

Choosing a hash function for best performance

It depends on the number of files you have.

With a perfect hash function, the chance that one more message collides with any of those already hashed is P(collision) = c/2^N, where c is your number of messages (files) and N is the number of bits of hash output.
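As a quick sanity check of that formula (the file counts here are hypothetical, just to show the scale):

```python
def collision_chance(c: int, bits: int) -> float:
    """Chance that one more message collides with any of c existing
    hashes, assuming a perfect (uniform) N-bit hash function."""
    return c / 2.0 ** bits

# One million files hashed to 32 bits vs. 128 bits:
p32 = collision_chance(1_000_000, 32)    # ~2.3e-4
p128 = collision_chance(1_000_000, 128)  # vanishingly small
```

The same number of files that makes a 32-bit hash noticeably risky is utterly negligible against a 128-bit hash.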

Real-world hash functions aren't perfect, so you have two options: optimize for speed or optimize for collision avoidance.

In the first case you will want to use CRC32. CRC32 is very common but, depending on the number of files you have, might not be enough: by the pigeonhole principle you're guaranteed a collision at ~4.3 billion messages (32 effective bits), but by the birthday bound you should expect a 50% chance of your first collision after only about 77,000 messages. CRC32 has very fast implementations (SSE 4.2 even has a hardware instruction for it). CRC64 has a much lower chance of a collision but is not widely used, so if you want more collision avoidance than CRC32 offers, you are better off looking at cryptographic hash functions.
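In Python, for instance, CRC32 is available from the standard `zlib` module (a usage sketch; the masking is just to get an unsigned 32-bit value across Python versions):

```python
import zlib

def crc32_of(data: bytes) -> int:
    """Return the CRC32 of the data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF

checksum = crc32_of(b"some file contents")
```

For large files you would feed `zlib.crc32` in chunks, passing the running value back in as its second argument.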

If you want to avoid collisions while sacrificing speed you will want cryptographic hash functions, of which MD5 (128 bits), SHA-1 (160 bits) and SHA-2 (usually SHA-256 or SHA-512) are the most widely used and have fast implementations. Very efficient hash collision finding algorithms for MD5 are available, but if you input random messages you'll get as close to the P(collision) = c/2^128 as you're ever going to get while still running in reasonable time.
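All of these are available in Python's standard `hashlib` module, for example (the input bytes here are made up):

```python
import hashlib

data = b"contents of some file"

md5_digest = hashlib.md5(data).digest()        # 16 bytes = 128 bits
sha1_digest = hashlib.sha1(data).digest()      # 20 bytes = 160 bits
sha256_digest = hashlib.sha256(data).digest()  # 32 bytes = 256 bits
```

The digest length is what drives the collision math above: the wider the output, the smaller c/2^N gets for the same number of files.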

What is the best 32bit hash function for short strings (tag names)?

If performance isn't important, simply take a secure hash such as MD5 or SHA1, and truncate its output to 32 bits. This will give you a distribution of hash codes that's indistinguishable from random.
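A sketch of that truncation in Python using `hashlib` (the function name is made up; any 4 bytes of the digest work equally well):

```python
import hashlib

def hash32(tag: str) -> int:
    """32-bit hash: first 4 bytes of a SHA-1 digest of the string."""
    digest = hashlib.sha1(tag.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")
```

Because SHA-1's output is indistinguishable from random, any fixed 32-bit slice of it is as well.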

What is a good Hash Function?

For doing "normal" hash table lookups on basically any kind of data, this one by Paul Hsieh (SuperFastHash) is the best I've ever used.

http://www.azillionmonkeys.com/qed/hash.html

If you care about cryptographic security or anything else more advanced, then YMMV. If you just want a kick-ass general-purpose hash function for hash table lookups, then this is what you're looking for.

What hash function produces the maximum number of collisions when hashing n keys?

The maximum number of collisions is equal to the number of items you hash.


Example:

hash function: h(x) = 3

All items will be hashed to key 3.


Notice that the number of keys, n in your case, doesn't affect the answer: no matter how many keys you have, your items will always be hashed to key 3 with the h(x) I provided above.


Visualization:

Usually, hashing looks like this:

[figure: several names hashed into different buckets of a hash table]

but if I want the maximum number of collisions, then, by using the h(x) provided above, I get all my items (the names in the picture) hashed to the very same key, i.e. key 3.

So in that case the maximum number of collisions is the number of names, 5.
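A tiny sketch of that worst case in Python (the names are hypothetical stand-ins for the ones in the picture):

```python
from collections import defaultdict

def h(x):
    return 3  # the constant hash function from above

names = ["Alice", "Bob", "Carol", "Dave", "Eve"]  # hypothetical items

buckets = defaultdict(list)
for name in names:
    buckets[h(name)].append(name)

# Every item lands in bucket 3, so the number of items that collide
# there equals the total number of items hashed.
print(len(buckets[3]))  # → 5
```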
