Why Does Java'S Hashcode() in String Use 31 as a Multiplier

Why does Java's hashCode() in String use 31 as a multiplier?

According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on stackoverflow):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

(from Chapter 3, Item 9: Always override hashcode when you override equals, page 48)

Why do we use number 31 while calculating hashcode

from Joshua Bloch, Effective Java, Chapter 3, Item 9

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

Some words about multiplication can be replaced by a shift. Multiplying to 2 is pretty easy operations in binary algebra. You just need to shift the number to the left and add 0 to the end. 4*2 = b100 << 1 = b1000 = 8. If a factor is a power of 2, you need to shift the binary number by the power value. 4*8 = 4 * 2^3 = b100 << 3 = b100000 = 32.

Also the same logic works for dividing: 8/4 = 8/2^2 = b1000 >> 2 = b10 = 2

Why chose 31 to do the multiplication in the hashcode() implementation ?

Shifting left just introduces a zero on the right and loses a bit on the left of the number's binary representation, so it's a clear information loss. Repeating this process gradually loses all information that was accumulated from earlier computation. That means that the more fields enter your hashcode calculation, the less effect on the final result the early fields have.

Why java hashcode implementation 31 * x + y is better than x + y?

If you use x+y then how to distinguish points (3,4) and (4,3)? Both will have the same hashcode...

Now while 31 * x + y will not be perfect, in the same case, it will be much much better.

Note: by definition of hashing there is no perfect hashing. The only thing is to analyse what kind of collisions occurs for a given hash function. In the geometrical case the first one introduces collisions for a very simple and usual symmetry property. Thus in very common cases there could be too much collisions.

Multiplication should be suboptimal. Why is it used in hashCode?

The answer to this is a mixture of different factors:

  • On modern architecture, the time taken to perform a multiplication versus a shift may not end up being measurable overall within a given pipeline of instructions-- it has more to do with the availability of the relevant execution unit on the CPU than the "raw" time taken;
  • In practice when integrating with standard collections libraries in day-to-day programming, it's often more important that a hash function is correct, "good enough" and easy to automate in an IDE than for it to be as perfect as possible;
  • The collections libraries generally add secondary hash functions and potentially other techniques behind the scenes to overcome some of the weaknesses of what would otherwise be a poor hash function;
  • With resizable collections, an effective hash function has the goal of dispersing its hashes across the available range for arbitrary sizes of hash tables (though as I say, it will get help from the built-in secondary function): multiplying by a "magic" constant is often a cheap way to achieve this (or, even if multiplication turned out to be a bit more expensive than a shift: still cheap enough, given the benefit); addition rather than XOR may help to allow this 'avalanche' effect slightly. (In most practical cases, you will probably find that they work equally well.)
  • You can generally assume that the JIT compiler "knows" about equivalents such as shifting 5 places and subtracting 1 rather than multiplying by 31. Just because you write "*31" in the source code doesn't mean that it will literally be compiled to a multiplication instruction. (In practice, it might be, though, because despite what you think, the multiply instruction may well be "faster" on average on the architecture in question... It's usually better to make your code stick to the required logic and let the JIT compiler handle the low level optimisations in a case such as this.)

How is the Arrays.hashCode implemented?

From the book Effective Java:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

Why String hashCode doesn't have size limitation?

Several possible reasons spring to mind:

  1. It is common for Strings to vary only at the start or end, e.g. all StackOverflow question URLs start with "https://stackoverflow.com/questions/". Limiting the hashCode to only a subset of characters will therefore cause unnecessary collisions, and for some sets of strings cause many collisions. Your proposed algorithm would cause every stackoverflow question URL to have the same hashCode!

  2. hashCode is fast and memoized, it's not clear that limiting the hashCode to some constant length will bring noticeable performance improvements, especially since it is always preceded by creating the String (an O(n) operation), and often followed by a call to equals (also O(n)).

  3. Legacy reasons. String.hashcode is specified to use a specific algorithm. Existing applications rely on this specification. Even if this optimisation was now deemed necessary it couldn't be made without breaking backwards compatibility.

Why use a prime number in hashCode?

Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations.

Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.

31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern java HashMap implementations keep the number of buckets to a power of 2).



Related Topics



Leave a reply



Submit