Why Use a Prime Number in Hashcode

Why use a prime number in hashCode?

Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations.

Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.
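As a hedged illustration (the two-field hash and the bucket count here are invented for the example), a Java sketch of that failure mode:

static int badHash(int a, int b) {
    return 8 * a + b;   // multiplier 8: a's contribution vanishes modulo 8
}

public static void main(String[] args) {
    int buckets = 8;
    for (int a = 0; a < 5; a++) {
        // No matter what 'a' is, the bucket depends only on 'b':
        System.out.println("a=" + a + " -> bucket " + Math.floorMod(badHash(a, 5), buckets));
    }   // prints "bucket 5" five times
}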

31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern Java HashMap implementations keep the number of buckets at a power of 2).

Why should hash functions use a prime number modulus?

Usually a simple hash function works by taking the "component parts" of the input (characters in the case of a string), multiplying them by powers of some constant, and adding them together in some integer type. For example, a typical (although not especially good) hash of a string might be:

(first char) + k * (second char) + k^2 * (third char) + ...
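In Java, a minimal sketch of such a hash might look like this (k is an arbitrary constant, not any particular library's choice):

// (first char) + k * (second char) + k^2 * (third char) + ...
// The int arithmetic silently wraps modulo 2^32 on overflow.
static int simpleHash(String s, int k) {
    int h = 0;
    int power = 1;
    for (int i = 0; i < s.length(); i++) {
        h += power * s.charAt(i);
        power *= k;
    }
    return h;
}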

If a bunch of strings all having the same first char are fed in, the results will all be equal modulo k, at least until the integer type overflows.

[As an example, Java's String.hashCode is eerily similar to this; it applies the powers to the characters in reverse order, with k = 31. So you get striking relationships modulo 31 between strings that end the same way, and striking relationships modulo 2^32 between strings that are the same except near the end. This doesn't seriously mess up hashtable behaviour.]
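For reference, String.hashCode is documented as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], conventionally computed with Horner's rule:

// Equivalent to Java's documented String.hashCode():
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);   // the first char ends up with the highest power of 31
    }
    return h;
}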

A hashtable works by taking the modulus of the hash over the number of buckets.
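In Java terms, roughly (bucketFor is a made-up name, not a real API):

// Roughly what a hashtable does to pick a bucket; floorMod keeps the
// index non-negative even when hashCode() is negative.
static int bucketFor(Object key, int bucketCount) {
    return Math.floorMod(key.hashCode(), bucketCount);
}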

It's important in a hashtable not to produce collisions for likely cases, since collisions reduce the efficiency of the hashtable.

Now, suppose someone puts a whole bunch of values into a hashtable that have some relationship between the items, like all having the same first character. This is a fairly predictable usage pattern, I'd say, so we don't want it to produce too many collisions.

It turns out that, "because of the nature of maths", if the constant used in the hash and the number of buckets are coprime, then collisions are minimised in some common cases. If they are not coprime, then there are some fairly simple relationships between inputs for which collisions are not minimised: all the hashes come out equal modulo the common factor, which means they'll all fall into the 1/n-th of the buckets that have that value modulo the common factor. You get n times as many collisions, where n is the common factor. Since n is at least 2, I'd say it's unacceptable for a fairly simple use case to generate at least twice as many collisions as normal. If some user is going to break our distribution into buckets, we want it to be a freak accident, not some simple predictable usage.
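A worked example (the multiplier 6 and the 8-bucket table are invented to exhibit a common factor of 2):

// Polynomial hash with multiplier 6, as in the string hash above.
static int hash6(String s) {
    int h = 0, power = 1;
    for (int i = 0; i < s.length(); i++) {
        h += power * s.charAt(i);
        power *= 6;
    }
    return h;
}

public static void main(String[] args) {
    // 'a' is 97 (odd) and every multiplied term is even, so the hash of any
    // string starting with 'a' is odd: only 4 of the 8 buckets are ever used.
    for (String s : new String[] { "apple", "avocado", "apricot" }) {
        System.out.println(s + " -> bucket " + Math.floorMod(hash6(s), 8));
    }
}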

Now, hashtable implementations obviously have no control over the items put into them. They can't prevent the items being related. So the thing to do is to ensure that the constant and the bucket count are coprime. That way you aren't relying on the "last" component alone to determine the bucket index modulo some small common factor. As far as I know they don't have to be prime to achieve this, just coprime.

But if the hash function and the hashtable are written independently, then the hashtable doesn't know how the hash function works. It might be using a constant with small factors. If you're lucky it might work completely differently and be nonlinear. If the hash is good enough, then any bucket count is just fine. But a paranoid hashtable can't assume a good hash function, so should use a prime number of buckets. Similarly a paranoid hash function should use a largeish prime constant, to reduce the chance that someone uses a number of buckets which happens to have a common factor with the constant.

In practice, I think it's fairly normal to use a power of 2 as the number of buckets. This is convenient and saves having to search around or pre-select a prime number of the right magnitude. So you rely on the hash function not to use even multipliers, which is generally a safe assumption. But you can still get occasional bad hashing behaviours based on hash functions like the one above, and a prime bucket count could help further.
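As an illustration of why powers of 2 are so convenient: the modulus collapses to a single bit mask (Java's HashMap indexes its buckets this way):

// With a power-of-two table size n, hash % n is just a mask:
static int bucketIndex(int hash, int n) {   // n must be a power of 2
    return hash & (n - 1);
}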

Putting about the principle that "everything has to be prime" is, as far as I know, a sufficient but not a necessary condition for good distribution over hashtables. It allows everybody to interoperate without needing to assume that the others have followed the same rule.

[Edit: there's another, more specialized reason to use a prime number of buckets, which is if you handle collisions with open addressing and a stride derived from the hashcode (double hashing). If that stride comes out to be a factor of the bucket count, then you can only do (bucket_count / stride) probes before you're back where you started. The case you most want to avoid is stride = 0, of course, which must be special-cased, but to avoid also special-casing bucket_count / stride equal to a small integer, you can just make the bucket_count prime and not care what the stride is, provided it isn't 0.]
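A minimal sketch of that probing scheme (the method name and the stride formula are illustrative only):

// Open addressing with a hash-derived stride. When bucketCount is prime,
// every stride in [1, bucketCount-1] is coprime to it, so the probe
// sequence visits all buckets before repeating.
static void printProbeOrder(int hash, int bucketCount) {
    int slot = Math.floorMod(hash, bucketCount);
    int stride = 1 + Math.floorMod(hash, bucketCount - 1);   // never 0
    for (int i = 0; i < bucketCount; i++) {
        System.out.print(slot + " ");
        slot = (slot + stride) % bucketCount;
    }
    System.out.println();
}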

What is a sensible prime for hashcode calculation?

I recommend using 92821. Here's why.

To give a meaningful answer to this you have to know something about the possible values of i and j. (The context is a two-field hash of the form hash = prime * i + j, computed in wrapping int arithmetic.) The only thing I can think of in general is that in many cases small values will be more common than large values. (The odds of 15 appearing as a value in your program are much better than, say, 438281923.) So it seems a good idea to make the smallest hashcode collision as large as possible by choosing an appropriate prime. For 31 this is rather bad: already for i = -1 and j = 31 you have the same hash value as for i = 0 and j = 0.

Since this is interesting, I've written a little program that searches the whole int range for the best prime in this sense. That is, for each prime I searched for the minimum value of Math.abs(i) + Math.abs(j) over all values of i, j that have the same hashcode as (0, 0), and then took the prime for which this minimum value is as large as possible.
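The inner loop of such a search might look like this (a sketch assuming the hash prime * i + j in wrapping int arithmetic, as above):

// Smallest |i| + |j| with hash(i, j) == hash(0, 0), where
// hash(i, j) = prime * i + j wraps modulo 2^32.
// (i, j) collides exactly when (-i, -j) does, so i > 0 suffices.
static long smallestCollision(int prime) {
    long best = Long.MAX_VALUE;
    for (int i = 1; i < best; i++) {
        int j = -prime * i;                 // the unique wrapping solution for this i
        long size = i + Math.abs((long) j);
        if (size < best) best = size;
    }
    return best;
}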

Drumroll: the best prime in this sense is 486187739 (with the smallest collision being i=-25486, j=67194). Nearly as good and much easier to remember is 92821 with the smallest collision being i=-46272 and j=46016.
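You can check the 92821 collision directly; the multiplication overflows and wraps so that the two terms cancel:

public static void main(String[] args) {
    // 92821 * 46272 = 4295013312 = 2^32 + 46016, so in int arithmetic
    // 92821 * (-46272) wraps to -46016, and the +46016 cancels it exactly.
    int h = 92821 * (-46272) + 46016;
    System.out.println(h);   // prints 0, the same hash as (0, 0)
}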

If you give "small" another meaning and instead want the minimum of Math.sqrt(i*i + j*j) over collisions to be as large as possible, the results are a little different: the best would be 1322837333 with i=-6815 and j=70091, but my favourite 92821 (smallest collision -46272, 46016) is again almost as good as the best value.

I do acknowledge that it is quite debatable whether these calculations make much sense in practice. But I do think that taking 92821 as the prime makes much more sense than 31, unless you have good reasons not to.

How to pick prime numbers to calculate the hash code?

The comments on the answer you link to already briefly try to explain why 17 and 23 are not good primes to use here.

A lot of .NET classes that make use of hash codes store elements in buckets. Suppose there are three buckets. Then all objects with hash code 0, 3, 6, 9, ... get stored in bucket 0. All objects with hash code 1, 4, 7, 10, ... get stored in bucket 1. All objects with hash code 2, 5, 8, 11, ... get stored in bucket 2.

Now suppose that your GetHashCode() uses hash = hash * 3 + field3.GetHashCode();. This would mean that, unless hash is large enough for the multiplication to wrap around, which bucket an object ends up in (in a hash set with three buckets) depends only on field3.
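A Java sketch of the same effect (the seed 17 and the field names are made up):

// Combining three fields with multiplier 3. Expanded, the result is
// 17*27 + 9*f1 + 3*f2 + f3, so modulo 3 only f3 survives.
static int combine(int f1, int f2, int f3) {
    int hash = 17;
    hash = hash * 3 + f1;
    hash = hash * 3 + f2;
    hash = hash * 3 + f3;
    return hash;   // Math.floorMod(hash, 3) == Math.floorMod(f3, 3) until the int wraps
}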

With an uneven distribution of objects across buckets, HashSet<T> cannot give good performance.

You want a factor that is coprime to all possible numbers of buckets. The number of buckets itself will be prime, for the same reasons, so if your factor is prime, the only risk is that it's equal to the number of buckets.

.NET uses a fixed list of allowed numbers of buckets:

public static readonly int[] primes = {
3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};

Your factor should be one that .NET doesn't use, and that other custom implementations are equally unlikely to use. This means 23 is a bad factor. 31 could be okay with .NET's own containers, but could be equally bad with custom implementations.

At the same time, it should not be so low that it gives lots of collisions for common uses. This is a risk with 3 and 5: suppose you have a custom Tuple<int, int> implementation used with lots of small integers. Keep in mind that int.GetHashCode() just returns that int itself. Suppose your multiplication factor is 3. That means that (0, 9), (1, 6), (2, 3) and (3, 0) all give the same hash code.
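For instance, with a pair hash of the hypothetical form 3 * i + j:

public static void main(String[] args) {
    int[][] pairs = { {0, 9}, {1, 6}, {2, 3}, {3, 0} };
    for (int[] p : pairs) {
        System.out.println("(" + p[0] + ", " + p[1] + ") -> " + (3 * p[0] + p[1]));
    }   // every pair hashes to 9
}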

Both of the problems can be avoided by using sufficiently large primes, as pointed out in a comment that Jon Skeet had incorporated into his answer:

EDIT: As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good...

Once upon a time, large primes for multiplication may have been bad because multiplication by large integers was sufficiently slow that the performance difference was noticeable. Multiplication by 31 would be good in that case because it can be implemented as x * 31 => x * 32 - x => (x << 5) - x. Nowadays, though, the multiplication is far less likely to cause any performance problems, and then, generally speaking, the bigger the better.

Why multiply by a prime before XORing in many GetHashCode implementations?

There's a good article on the Computing Life blog that discusses this topic in detail. It was originally posted as a response to the Java hashCode() question I linked to in the question. According to the article:

Primes are unique numbers. They are unique in that the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself, of course) due to the fact that a prime is used to compose it. This property is used in hashing functions.

Given a string "Samuel", you can generate a unique hash by multiplying each of the constituent digits or letters by a prime number and adding them up. This is why primes are used.

However, using primes is an old technique. The key here is to understand that as long as you can generate a sufficiently unique key, you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.

Why do we use the number 31 when calculating hashcode?

From Joshua Bloch's Effective Java, Chapter 3, Item 9:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

A few words on replacing multiplication with a shift: multiplying by 2 is a cheap operation in binary arithmetic; you just shift the number one bit to the left, appending a 0. 4 * 2 = b100 << 1 = b1000 = 8. If a factor is a power of 2, you shift the binary number left by the exponent: 4 * 8 = 4 * 2^3 = b100 << 3 = b100000 = 32.

The same logic works for division: 8 / 4 = 8 / 2^2 = b1000 >> 2 = b10 = 2.
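These identities are easy to check in Java:

public static void main(String[] args) {
    System.out.println(4 << 1);                    // 4 * 2 -> 8
    System.out.println(4 << 3);                    // 4 * 8 -> 32
    System.out.println(8 >> 2);                    // 8 / 4 -> 2
    int i = 7;
    System.out.println(31 * i == (i << 5) - i);    // true: the 31-multiplier trick
}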

Why does Java's hashCode() in String use 31 as a multiplier?

According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on Stack Overflow):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

(from Chapter 3, Item 9: Always override hashCode when you override equals, page 48)


