Why Doesn't String's Hashcode() Cache 0

Why doesn't String's hashCode() cache 0?

You're worrying about nothing. Here's a way to think about this issue.

Suppose you have an application that does nothing but sit around hashing Strings all year long. Let's say it takes a thousand strings, all in memory, calls hashCode() on them repeatedly in round-robin fashion, a million times through, then gets another thousand new strings and does it again.

And suppose that the likelihood of a string's hash code being zero were, in fact, much greater than 1/2^32. I'm sure it is somewhat greater than 1/2^32, but let's say it's a lot worse than that, like 1/2^16 (the square root! now that's a lot worse!).

In this situation, you have more to benefit from Oracle's engineers improving how these strings' hash codes are cached than anyone else alive. So you write to them and ask them to fix it. And they work their magic so that whenever s.hashCode() is zero, it returns instantaneously (even the first time! a 100% improvement!). And let's say that they do this without degrading the performance at all for any other case.

Hooray! Now your app is... let's see... 0.0015% faster!

What used to take an entire day now takes only 23 hours, 57 minutes and 48 seconds!

And remember, we set up the scenario to give every possible benefit of the doubt, often to a ludicrous degree.

Does this seem worth it to you?

EDIT: since posting this a couple hours ago, I've let one of my processors run wild looking for two-word phrases with zero hash codes. So far it's come up with: bequirtle zorillo, chronogrammic schtoff, contusive cloisterlike, creashaks organzine, drumwood boulderhead, electroanalytic exercisable, and favosely nonconstruable. This is out of about 2^35 possibilities, so with perfect distribution we'd expect to see only 8. Clearly by the time it's done we'll have a few times that many, but not outlandishly more. What's more significant is that I've now come up with a few interesting band names/album names! No fair stealing!

Why might a System.String object not cache its hash code?

Obvious potential answer: because that will cost memory.

There's a cost/benefit analysis here:

Cost: 4 bytes for every string (and a quick test on each call to GetHashCode). Also make the string object mutable, which would obviously mean you'd need to be careful about the
implementation - unless you always compute the hash code up-front, which is a cost of computing it once for every string, regardless of whether you ever hash it at all.

Benefit: Avoid recomputing the hash for string values hashed more than once

I would suggest that in many cases, there are many, many string objects and very few of them are hashed more than once - leading to a net cost. For some cases, obviously that won't be the case.

I don't think I'm in a good position to judge which comes up more often... I would hope that MS has instrumented various real apps. (I'd also hope that Sun did the same for Java, which does cache the hash...)

EDIT: I've just spoken to Eric Lippert about this (NDC is awesome :) and basically it is about the extra memory hit vs the limited benefits.

Why String hashCode doesn't have size limitation?

Several possible reasons spring to mind:

  1. It is common for Strings to vary only at the start or end, e.g. all StackOverflow question URLs start with "https://stackoverflow.com/questions/". Limiting the hashCode to only a subset of characters will therefore cause unnecessary collisions, and for some sets of strings cause many collisions. Your proposed algorithm would cause every stackoverflow question URL to have the same hashCode!

  2. hashCode is fast and memoized, it's not clear that limiting the hashCode to some constant length will bring noticeable performance improvements, especially since it is always preceded by creating the String (an O(n) operation), and often followed by a call to equals (also O(n)).

  3. Legacy reasons. String.hashcode is specified to use a specific algorithm. Existing applications rely on this specification. Even if this optimisation was now deemed necessary it couldn't be made without breaking backwards compatibility.

Java String hashcode caching

In this case how cache is handled for string which has same hashcode?

I don't understand the first part of your question. Cache is handled for all Strings the same, whether hashcodes are the same or not (since two different Strings can theoretically have the same hashCode, and so if hashCodes are equal, it doesn't mean that the Strings are equal). But if the same String object is used, the hashCode does not have to be recalculated since it is cached.

Does it really improve the performance?

Unequivocally YES

java String hashcode caching mechanism

Simply because hash value changes in the loop and your solution without intermediate temporary variable is not thread-safe. Consider that this method is invoked in several threads.

Say thread-1 started hash computation and it is not 0 anymore. Some small moment later thread-2 invokes the same method hashCode() on the same object and sees that hash is not 0, but thread-1 hasn't yet finished its computation. As the result, in the thread-2 wrong hash (not fully computed) value will be used.



Related Topics



Leave a reply



Submit