Understanding Strange Java Hash Function

Understanding strange Java hash function

Here is some code and the sample output:

public static void main ( String[] args ) {
    int h = 0xffffffff;
    int h1 = h >>> 20;
    int h2 = h >>> 12;
    int h3 = h1 ^ h2;
    int h4 = h ^ h3;
    int h5 = h4 >>> 7;
    int h6 = h4 >>> 4;
    int h7 = h5 ^ h6;
    int h8 = h4 ^ h7;

    printBin ( h );
    printBin ( h1 );
    printBin ( h2 );
    printBin ( h3 );
    printBin ( h4 );
    printBin ( h5 );
    printBin ( h6 );
    printBin ( h7 );
    printBin ( h8 );

}

static void printBin ( int h ) {
    System.out.println ( String.format ( "%32s", 
        Integer.toBinaryString ( h ) ).replace ( ' ', '0' ) );
}

Which prints:

11111111111111111111111111111111
00000000000000000000111111111111
00000000000011111111111111111111
00000000000011111111000000000000
11111111111100000000111111111111
00000001111111111110000000011111
00001111111111110000000011111111
00001110000000001110000011100000
11110001111100001110111100011111

So, the code breaks down the hash function into steps so that you can see what is happening. The first shift of 20 positions xor with the second shift of 12 positions creates a mask that can flip 0 or more of the bottom 20 bits of the int. So you can get some randomness inserted into the bottom bits that makes use of the potentially better distributed higher bits. This is then applied via xor to the original value to add that randomness to the lower bits. The second shift of 7 positions xor the shift of 4 positions creates a mask that can flip 0 or more of the bottom 28 bits, which brings some randomness again to the lower bits and to some of the more significant ones by capitalizing on the prior xor which already addressed some of the distribution at the lower bits. The end result is a smoother distribution of bits through the hash value.

Since the hashmap in java computes the bucket index by combining the hash with the number of buckets you need to have an even distribution of the lower bits of the hash value to spread the entries evenly into each bucket.

As to proving the statement that this bounds the number of collisions, that one I don't have any input on. Also, see here for some good information on building hash functions and a few details on why the xor of two numbers tends towards random distribution of bits in the result.

Can anyone explain me the logic behind this hash-function?

Unsigned right shift (>>>), unlike signed right shift (>>), will convert negative numbers into positive numbers before preforming the right shift operation ensuring that the result returns an unsigned positive integer.
A right shift, for example h >>> 20, essentially means return the floored integer for h/Math.pow(2,20).
For example for your input 79847235, as it is a positive number, both unsigned right shift and signed right shift will return a positive integer.
79847235>>>20 will thus preform:

Math.floor(79847235/Math.pow(2,20)) which returns 76.

Next up h >>> 12 for 79847235 is:
Math.floor(79847235/Math.pow(2,12)) which returns 19493
(which is bigger because we are dividing by a smaller number).

Now we preform an exclusive OR on 76 and 19493.

For example 1^0 is 1

1^1 is 0

If we wanted the AND included, we'd have to use an inclusive OR which is |

Thus 1|1 is 1

1|0 is 0

etc

1001100 is the binary representation of 76

100110000100101 is the binary representation of 19493

An exclusive OR operation would look like this:

000000001001100 (76 in binary)

100110000100101 (19493 in binary)

---------------

100110001101001 (our result: 19561)

We're almost done for our first line:

We got this:

h ^= (h >>> 20) ^ (h >>> 12);

down to:

h ^= 19561

Which is the same as:

h = h^19561

Fill in our value for h which is 79847235:

79847235^19561 is 79827754

h = 79827754

It is important to remember that our new value for h is now 79827754

Next line:

return h ^ (h >>> 7) ^ (h >>> 4);

h>>>7 is Math.floor(79827754/Math.pow(2,7)) which returns 623654

h>>>4 is Math.floor(79827754/Math.pow(2,4)) which returns 4989234

Now that the parentheses are out of the way we have:

return h ^ 623654 ^ 4989234;

It is important to preform this left to right.

Fill in h and regroup:

return (79827754 ^ 623654) ^ 4989234;

79827754 ^ 623654 is:

100110000100001001100101010 (79827754 in binary)

000000010011000010000100110 (623654 in binary)

---------------------------

100110010111001011100001100 (80451340 in binary)

Finally we have:

return 80451340 ^ 4989234;

80451340 ^ 4989234 is:

100110010111001011100001100 (80451340 in binary)

000010011000010000100110010 (4989234 in binary)

---------------------------

100100001111011011000111110 (76002878 in binary)

Thus our final result is:

return 76002878;

Feel free to check my answer, I double checked my work already.

Due to the nature of exclusive bitwise ORs it can be very difficult to predict whether the result for the hash function will be greater or smaller than our parameter.
For example:

11^2 is 9 (smaller than our parameter 11)

17^2 is 19 (greater than our parameter 17)

What hashing function does Java use to implement Hashtable class?

When a key is added to or requested from a HashMap in OpenJDK, the flow of execution is the following:

The key is transformed into a 32-bit value using the developer-defined hashCode() method.
The 32-bit value is then transformed by a second hash function (of which Andrew's answer contains the source code) into an offset inside the hash table. This second hash function is provided by the implementation of HashMap and cannot be overridden by the developer.
The corresponding entry of the hash table contains a reference to a linked list or null, if the key does not yet exist in the hash table. If there are collisions (several keys with the same offset), the keys together with their values are simply collected in a singly linked list.

If the hash table size was chosen appropriately high, the number of collisions will be limited. Thus, a single lookup takes only constant time on average. This is called expected constant time. However, if an attacker has control over the keys inserted into a hash table and knowledge of the hash algorithm in use, he can provoke a lot of hash collisions and therefore force linear lookup time. This is why some hash table implementations have been changed recently to include a random element that makes it harder for an attacker to predict which keys will cause collisions.

Some ASCII art

key.hashCode()
     |
     | 32-bit value
     |                              hash table
     V                            +------------+    +----------------------+
HashMap.hash() --+                | reference  | -> | key1 | value1 | null |
                 |                |------------|    +----------------------+
                 | modulo size    | null       |
                 | = offset       |------------|    +---------------------+
                 +--------------> | reference  | -> | key2 | value2 | ref |
                                  |------------|    +---------------------+
                                  |    ....    |                       |
                                                      +----------------+
                                                      V
                                                    +----------------------+
                                                    | key3 | value3 | null |
                                                    +----------------------+

Why and how does HashMap have its own internal implementation of hashCode() called hash()?

The hash() derives the "improved" hash code from the actual hash code, so equal input will always be equal output (from jdk1.8.0_51):

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

As to why the hash code needs improvement, read the javadoc of the method:

Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.) So we apply a transform that spreads the impact of higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (so don't benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.

Hashing function used in Java Language

Java allows you to override the hashCode() method for your Classes to use a hashing algorithm that is not only well suited to your application, but to your individual types:

public class Employee {

   private int id;
   // Default implementation might want to use "name" for as part of hashCode
   private String name; 

   @Override
   public int hashCode() {
     // We know that ID is always unique, so don't use name in calculating 
     // the hash code.
     return id;
   }
}

Understanding Strange Java Hash Function