HashMap Implementation in Java: How Does the Bucket Index Calculation Work?

HashMap implementation in Java. How does the bucket index calculation work?

It's not calculating the hash; it's calculating the bucket index.

The expression h & (length-1) performs a bitwise AND of h with length-1, which acts as a bit mask that keeps only the low-order bits of h. Because length is a power of two, this is a very fast equivalent of h % length.
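As a quick illustration (my own snippet, not HashMap's code), the following shows that the mask and the modulo give the same result for a few non-negative hash values and a power-of-two table length:

// A minimal check (not HashMap's code) that h & (length - 1) equals
// h % length whenever length is a power of two and h is non-negative.
public class BucketIndexDemo {
    public static void main(String[] args) {
        int length = 16;                      // power of two, as HashMap uses
        int[] hashes = {0, 5, 17, 123456, Integer.MAX_VALUE};
        for (int h : hashes) {
            int byMask = h & (length - 1);    // keep only the low-order 4 bits
            int byMod  = h % length;          // conventional modulo
            System.out.printf("h=%d  mask=%d  mod=%d%n", h, byMask, byMod);
        }
    }
}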

How does HashMap make sure the index calculated using hashcode of key is within the available range?

For your first question: the map always uses a power of two for the table size (if you give it a capacity of 10, it will actually use 16), which means hash & (length - 1) always yields a value in the range [0, length), so the index is always valid.
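For illustration, here is a rough sketch (my own helper, not part of the HashMap API) of how a requested capacity can be rounded up to the next power of two, which is what turns a requested 10 into 16:

// Illustrative only: rounds a requested capacity up to the next power of two,
// similar in spirit to what HashMap does internally when you pass 10 and get 16.
public class CapacityRounding {
    static int nextPowerOfTwo(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : n + 1;   // simplified: ignores the overflow clamp real code needs
    }

    public static void main(String[] args) {
        System.out.println(nextPowerOfTwo(10)); // 16
        System.out.println(nextPowerOfTwo(16)); // 16
        System.out.println(nextPowerOfTwo(17)); // 32
    }
}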

It's not clear what your second and third questions relate to. I don't think HashMap reallocates everything unless it needs to.

Why the index is calculated from the hashcode in HashMap

An object's hash code can be any int value, between -2^31 and 2^31-1. The underlying array used by a hash table is not going to have the same range (no negatives, for one, and likely not that large), therefore there must be some operation transforming the hash code from its original range to one between 0 and the array's length.

Because HashMap always uses arrays sized to a power of 2 (e.g. 16, 32, 64, etc.), using & is an efficient way to map hash codes to indices, as it simply strips the high-order bits. Other hash table implementations might use modulo to similar effect, if they don't limit their array sizes to powers of two.
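To make the bit-stripping visible, here is a small standalone example (values are arbitrary, not from the question) that prints the hash, the mask, and the resulting index in binary:

// Shows how & with (length - 1) simply strips the high-order bits of the hash,
// leaving an index in [0, length).
public class MaskingDemo {
    public static void main(String[] args) {
        int length = 16;                       // table size, a power of two
        int hash = "hello".hashCode();         // some int hash code
        int index = hash & (length - 1);       // keep only the low 4 bits

        System.out.println("hash  = " + Integer.toBinaryString(hash));
        System.out.println("mask  = " + Integer.toBinaryString(length - 1));
        System.out.println("index = " + index + " (binary " + Integer.toBinaryString(index) + ")");
    }
}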

See also https://en.wikipedia.org/wiki/Hash_table#Collision_resolution

What exactly is bucket in hashmap?

No, a bucket is each element in the array you are referring to. In earlier Java versions, each bucket contained a linked list of Map entries. In newer Java versions (Java 8 and later), each bucket contains either a linked list of entries or, once the list grows too long, a tree structure of entries.

From the implementation notes in Java 8:

/*
* Implementation notes.
*
* This map usually acts as a binned (bucketed) hash table, but
* when bins get too large, they are transformed into bins of
* TreeNodes, each structured similarly to those in
* java.util.TreeMap. Most methods try to use normal bins, but
* relay to TreeNode methods when applicable (simply by checking
* instanceof a node). Bins of TreeNodes may be traversed and
* used like any others, but additionally support faster lookup
* when overpopulated. However, since the vast majority of bins in
* normal use are not overpopulated, checking for existence of
* tree bins may be delayed in the course of table methods.
...
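As a simplified sketch (my own code, not the actual JDK source), a single bucket entry can be pictured like this, loosely modeled on the Node class in the OpenJDK HashMap source:

// Simplified sketch of a single bucket entry: each entry caches the key's hash
// and links to the next entry that landed in the same bucket.
class BucketNode<K, V> {
    final int hash;        // cached hash of the key, compared before equals()
    final K key;
    V value;
    BucketNode<K, V> next; // next entry in the same bucket (linked-list case)

    BucketNode(int hash, K key, V value, BucketNode<K, V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}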

Why is the AND (&) operator, i.e. hashValue & (length-1), used for calculating the final index in HashMap rather than the modulo (%) operator?

& runs faster, in exchange for working only for powers of two. (Specifically, x & (n - 1) == x % n if x is nonnegative and n is a power of two. x & (n - 1) also does what you want for a hash table -- even if x is negative, x & (n - 1) isn't -- unlike x % n.)

That's the complete and only reason.
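To make the negative-value point concrete, here is a small standalone example (my own, not from the JDK):

// Demonstrates why & is also safer than % here: for a negative hash code,
// x % n can be negative (an invalid array index), while x & (n - 1) is not.
public class NegativeHashDemo {
    public static void main(String[] args) {
        int n = 16;                 // power-of-two table size
        int x = -37;                // a negative hash code

        System.out.println(x % n);        // -5  -> would be an invalid index
        System.out.println(x & (n - 1));  // 11  -> always in [0, n)
    }
}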

What is meant by number of buckets in the HashMap?

Yes, exactly, each bucket can have multiple key-value pairs.

The object's hashCode() determines which bucket it goes into, conceptually via the expression object.hashCode() % n, where n is the total number of buckets and % is the modulus operator. (Java's HashMap actually uses hash & (n - 1), which is equivalent when n is a power of two and the hash is non-negative.)

Most often the objects will be well distributed across buckets, but you have no guarantee where they go. This depends on the data and the hashCode function.

Obviously, when the hashCode implementation is poor, the performance of the HashMap will suffer.

Also read up on the equals/hashCode contract, which is relevant here.
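As a quick, hypothetical illustration of that contract (the Point class below is my own example, not from the question): equal keys must return the same hash code so they always land in the same bucket.

import java.util.Objects;

// Illustrative key class honoring the equals/hashCode contract: equal objects
// return the same hash code, so equal keys always map to the same bucket.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);   // consistent with equals()
    }
}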

Java HashMap key hashing

The HashMap will typically perform modulo-style arithmetic on the value supplied by the hashCode() method, in order to map the full range of integers onto the range of valid indices of its internal array.

This can lead to collisions (different hash values mapping to the same index), which is a separate topic, and collisions are what degrade the expected O(1) time of a HashMap. One caveat: the expected O(1) time assumes that your hashCode() implementation distributes values uniformly.

In a good HashMap API, you don't need to worry about the internal array size or collisions. That is handled internally by the implementation in various ways.

There is no "threshold" publicly visible in Java's HashMap API. The closest you can get is:

HashMap(int initialCapacity, float loadFactor)

Here, the threshold you mention is controlled by the loadFactor. According to the Java docs:

"As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur."

Basically this means that once the number of entries exceeds capacity × loadFactor, the HashMap will expand its internal array and re-hash all the current keys, resulting in a (hopefully) more uniform distribution. This is what keeps the expected O(1) complexity across a series of additions and deletions from the HashMap.
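As a sketch of how to apply this (the sizes and names below are arbitrary), pre-sizing the map so that the expected number of entries stays below capacity × loadFactor avoids intermediate resizes and rehashing while filling it:

import java.util.HashMap;
import java.util.Map;

// Illustrative sizing: if you know roughly how many entries you will store,
// choose an initial capacity with expectedSize <= capacity * loadFactor
// so no resize happens while the map is populated.
public class PresizedMapDemo {
    public static void main(String[] args) {
        int expectedSize = 1000;
        float loadFactor = 0.75f;
        int initialCapacity = (int) Math.ceil(expectedSize / loadFactor);

        Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedSize; i++) {
            map.put("key" + i, i);   // no resize expected while filling
        }
        System.out.println(map.size());
    }
}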

How does HashMap really work?

The implementation is open source, so I would encourage you to just read the code for any specific questions. But here's the general idea:

  • The primary responsibility for good hashCode distribution lies with the keys' class, not with the HashMap. If the keys have a hashCode() method with bad distribution (for instance, return 0;), then the HashMap will perform badly.
  • HashMap does do a bit of "re-hashing" to ensure a slightly better distribution, but not much (see HashMap::hash).
  • On the get side of things, a couple of checks are made on each element in the bucket (which, yes, is implemented as a linked list):

    • First, the HashMap compares the element's cached hashCode with the incoming key's hashCode. This operation is quick because the element's hashCode was cached at put time. It guards against elements that have different hashCodes (and are thus unequal, by the contract of hashCode and equals established by Object) but happen to fall into the same bucket (remember, bucket indexes are basically hashCode % buckets.length).
    • If that check succeeds, the HashMap then calls equals explicitly to ensure the keys are really equal. Remember that equality implies the same hashCode, but the same hashCode does not imply equality (and can't, since some classes have a potentially infinite number of different values -- like String -- but there are only a finite number of possible hashCode values).

The reason for the double-checking of both hashCode and equals is to be both fast and correct. Consider two keys that have a different hashCode, but end up in the same HashMap bucket. For instance, if key A has hashCode=7 and B has hashCode=14, and there are 7 buckets, then they'll both end up in bucket 0 (7 % 7 == 0, and 14 % 7 == 0). Checking the hashCodes there is a quick way of seeing that A and B are unequal. If you find that the hashCodes are equal, then you make sure it's not just a hashCode collision by calling equals. This is just an optimization, really; it's not required by the general hash map algorithm.
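Putting the two checks together, here is a rough sketch of a bucket lookup (again my own code, reusing the BucketNode sketch from earlier, not the real HashMap source): the cached hash is compared first, and equals() is only called when the hashes match.

// Sketch of a bucket lookup: cheap hash comparison first, equals() only on a match.
class BucketLookup {
    static <K, V> V getFromBucket(BucketNode<K, V> head, K key, int keyHash) {
        for (BucketNode<K, V> node = head; node != null; node = node.next) {
            // quick reject: a different cached hash means the keys cannot be equal
            if (node.hash == keyHash && (key == node.key || key.equals(node.key))) {
                return node.value;
            }
        }
        return null;   // no matching key in this bucket
    }
}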


