What Is the Significance of Load Factor in HashMap

What is the significance of load factor in HashMap?

The documentation explains it pretty well:

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

As with all performance optimizations, it is a good idea to avoid optimizing things prematurely (i.e. without hard data on where the bottlenecks are).
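As an illustration of that last rule from the documentation, here is a minimal sketch of pre-sizing a HashMap so that no rehash ever occurs. The withExpectedSize helper is not a JDK method, just a convenience for the example.

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {

    // Hypothetical helper: pick an initial capacity large enough that
    // expectedEntries / loadFactor never exceeds the capacity, so no rehash occurs.
    static <K, V> Map<K, V> withExpectedSize(int expectedEntries, float loadFactor) {
        int initialCapacity = (int) Math.ceil(expectedEntries / (double) loadFactor);
        return new HashMap<>(initialCapacity, loadFactor);
    }

    public static void main(String[] args) {
        // Room for 1000 entries with the default 0.75 load factor:
        // ceil(1000 / 0.75) = 1334, which HashMap rounds up to the next power of
        // two (2048), and 2048 * 0.75 = 1536 > 1000, so the table never resizes.
        Map<String, Integer> map = withExpectedSize(1000, 0.75f);
        for (int i = 0; i < 1000; i++) {
            map.put("key-" + i, i);
        }
        System.out.println(map.size());
    }
}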

Why is loadFactor in HashMap set to 0.75 by default?

Because even with a very good hashCode implementation there will be collisions (a collision means two elements landing in the same bucket). The more elements in the map, the greater the chance of collisions. Once the map is about 75% full, the frequency of collisions starts to rise noticeably, so a load factor of around 0.75 is advisable for keeping the put and get complexity around O(1).
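To get a feel for how collisions grow as the table fills up, here is a small, self-contained simulation. It is not HashMap itself: it just throws random hash codes at a power-of-two array of buckets using the same (n - 1) & hash masking a HashMap table uses (ignoring HashMap's extra bit-spreading step), and counts how many entries land in an already-occupied bucket.

import java.util.Random;

public class CollisionDemo {

    // Count how many of n random hash codes land in an already-occupied slot
    // when spread over 'buckets' slots with (buckets - 1) & hash masking.
    static int collisions(int n, int buckets, Random rnd) {
        int[] counts = new int[buckets];
        int collided = 0;
        for (int i = 0; i < n; i++) {
            int idx = (buckets - 1) & rnd.nextInt();
            if (counts[idx]++ > 0) collided++;   // bucket was already occupied
        }
        return collided;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int buckets = 1 << 16;                   // 65536 buckets
        for (double fill : new double[]{0.5, 0.75, 1.0, 2.0}) {
            int n = (int) (buckets * fill);
            System.out.printf("load %.2f -> %d of %d entries collide%n",
                    fill, collisions(n, buckets, rnd), n);
        }
    }
}

The counts rise with the load factor; past 1.0 there are more entries than buckets, so some sharing is unavoidable.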

What happens if the load factor of a HashMap is greater than 1?

What if I set the load factor to a value greater than 1, for example, let's say 2 (super(capacity+1, 2.0f, true);)?

You already have the answer:

... it will rehash the HashMap once 200% of its capacity is filled.

The hashing works the same; the map just operates with a smaller capacity, which impacts performance. If you make the initial capacity large enough, the load factor never comes into play: it only applies when the map is resized.

Note: the actual capacity is always a power of 2.

I suggest you try it.

BTW, changing the load factor can change the order in which the elements appear, as there are fewer buckets. Try printing out the Set or Map and comparing.
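For example, a quick experiment along those lines might look like this. The exact orders printed are implementation-dependent; the point is only that the two maps typically print their keys in different orders because they end up with different numbers of buckets.

import java.util.HashMap;
import java.util.Map;

public class LoadFactorExperiment {
    public static void main(String[] args) {
        // Same keys, same insertion order, different load factors.
        // With loadFactor = 4.0 the table stays at 16 buckets (threshold 64),
        // while the default map resizes several times as it grows.
        Map<Integer, String> defaultMap = new HashMap<>(16, 0.75f);
        Map<Integer, String> packedMap  = new HashMap<>(16, 4.0f);
        for (int i = 0; i < 50; i++) {
            defaultMap.put(i * 7, "v" + i);
            packedMap.put(i * 7, "v" + i);
        }
        System.out.println(defaultMap.keySet());
        System.out.println(packedMap.keySet());   // typically a different order
    }
}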

Why would a higher load factor in HashMap increase lookup cost?

A hash table's load factor is defined as

n/s, the ratio of the number of stored entries n and the size s of the table's array of buckets.

High performance of a hash table is maintained when the number of collisions is low. When the load factor is high, the same number of entries is stored in fewer buckets, which increases the probability of collisions.

Impact of load factor on lookup time?

So how is collision related to free buckets? Do you mean that even if the hash codes are different and the HashMap is full, it will try to fit entries into an existing bucket?

The hash code of a key can be any int value from 0 to 2^31 - 1. This means that the value of the hashCode() method is usually larger than the number of buckets. Therefore two keys of different hash codes may be mapped to the same bucket.

For example, if the number of buckets is 16, hash codes 1, 17, 33, 49, etc... will all be mapped to bucket #1.

If there are more buckets, fewer distinct hash codes map to the same bucket, so there is less chance of a single bucket holding multiple entries.

For example, if the number of buckets is increased to 32, now hash codes 1 & 33 will still be mapped to bucket #1, but hash codes 17, 49, etc... will be mapped to a different bucket (#17) - hence less chance of collision.
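A small sketch of that index computation: HashMap keeps a power-of-two number of buckets and masks the (spread) hash with n - 1, which for small non-negative hash codes is the same as hash % n.

public class BucketIndexDemo {

    // Bucket index for a power-of-two table of size 'buckets'.
    static int bucketIndex(int hash, int buckets) {
        return (buckets - 1) & hash;
    }

    public static void main(String[] args) {
        int[] hashes = {1, 17, 33, 49};
        for (int h : hashes) {
            System.out.printf("hash %2d -> bucket %2d of 16, bucket %2d of 32%n",
                    h, bucketIndex(h, 16), bucketIndex(h, 32));
        }
        // With 16 buckets all four hashes share bucket #1;
        // with 32 buckets they split between buckets #1 and #17.
    }
}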

When the load factor < 1, the average number of entries in each bucket is < 1, which gives you a good chance of constant lookup time (unless your key has a terrible hashCode implementation that returns few unique values).

Why is the load factor of the hash map in Redis as large as 5?

The load factor at which rehashing starts is ~1. The dict_force_resize_ratio value is a safety measure: even if resizing is disabled, once the load factor reaches that ratio, a resize is forced.

You can see this in _dictExpandIfNeeded(dict *d) in dict.c

/* If we reached the 1:1 ratio, and we are allowed to resize the hash
 * table (global setting) or we should avoid it but the ratio between
 * elements/buckets is over the "safe" threshold, we resize doubling
 * the number of buckets. */
if (d->ht[0].used >= d->ht[0].size &&
    (dict_can_resize ||
     d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
{
    return dictExpand(d, d->ht[0].used*2);
}

Redis can start rehashing at ~1 because the rehashing is not done all at once; it is done progressively by maintaining two hash tables.

See dict.h:

/* This is our hash table structure. Every dictionary has two of this as we
 * implement incremental rehashing, for the old to the new table. */
typedef struct dictht {
    dictEntry **table;
    unsigned long size;
    unsigned long sizemask;
    unsigned long used;
} dictht;

typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    unsigned long iterators; /* number of iterators currently running */
} dict;

And in dict.c:

/* Performs N steps of incremental rehashing. Returns 1 if there are still
* keys to move from the old to the new hash table, otherwise 0 is returned.
*
* Note that a rehashing step consists in moving a bucket (that may have more
* than one key as we use chaining) from the old to the new hash table, however
* since part of the hash table may be composed of empty spaces, it is not
* guaranteed that this function will rehash even a single bucket, since it
* will visit at max N*10 empty buckets in total, otherwise the amount of
* work it does would be unbound and the function may block for a long time. */
int dictRehash(dict *d, int n) {...

And there is some additional insight in redis.conf, for the activerehashing setting.

# Active rehashing uses 1 millisecond every 100 milliseconds of CPU time in
# order to help rehashing the main Redis hash table (the one mapping top-level
# keys to values). The hash table implementation Redis uses (see dict.c)
# performs a lazy rehashing: the more operation you run into a hash table
# that is rehashing, the more rehashing "steps" are performed, so if the
# server is idle the rehashing is never complete and some more memory is used
# by the hash table.
#
# The default is to use this millisecond 10 times every second in order to
# actively rehash the main dictionaries, freeing memory when possible.
#
# If unsure:
# use "activerehashing no" if you have hard latency requirements and it is
# not a good thing in your environment that Redis can reply from time to time
# to queries with 2 milliseconds delay.
#
# use "activerehashing yes" if you don't have such hard requirements but
# want to free memory asap when possible.
activerehashing yes

Performance of HashMap with different initial capacity and load factor

In the absence of a perfect hashing function for your data, and assuming this is not just a micro-optimization of something that doesn't matter, I would try the following:

Assume the default load factor (.75) used by HashMap is a good value in most situations. That being the case, you can use it and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold: set it so that initial capacity × 0.75 = number of items (rounding the capacity up).

For a larger map, in a situation where high-speed lookup is really critical, I would suggest some sort of trie rather than a hash map. For long string keys in large maps, a more string-oriented data structure such as a trie can save space and some time.

Should you define initial capacity and load factor of a HashMap, if the number of elements is known in advance?

Use an EnumMap instead, which is a high-performance Map specifically designed for enum keys:

Map<EnumType, Integer> counts = new EnumMap<>(EnumType.class);
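Note that EnumMap takes the key type's Class object in its constructor. A minimal, self-contained sketch, using a hypothetical Color enum in place of the EnumType above:

import java.util.EnumMap;
import java.util.Map;

public class EnumCountDemo {

    // Hypothetical enum standing in for "EnumType" from the answer above.
    enum Color { RED, GREEN, BLUE }

    public static void main(String[] args) {
        // EnumMap is backed by a plain array indexed by the enum's ordinal,
        // so initial capacity and load factor do not apply at all.
        Map<Color, Integer> counts = new EnumMap<>(Color.class);
        for (Color c : new Color[]{Color.RED, Color.BLUE, Color.RED}) {
            counts.merge(c, 1, Integer::sum);
        }
        System.out.println(counts);   // {RED=2, BLUE=1}
    }
}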

