General Advice and Guidelines on How to Properly Override Object.Gethashcode()

General advice and guidelines on how to properly override object.GetHashCode()

Table of contents

  • When do I override object.GetHashCode?

  • Why do I have to override object.GetHashCode()?

  • What are those magic numbers seen in GetHashCode implementations?


Things that I would like to be covered, but haven't been yet:

  • How to create the integer (How to "convert" an object into an int wasn't very obvious to me anyways).
  • What fields to base the hash code upon.

    • If it should only be on immutable fields, what if there are only mutable ones?
  • How to generate a good random distribution. (MSDN Property #3)

    • Part to this, seems to choose a good magic prime number (have seen 17, 23 and 397 been used), but how do you choose it, and what is it for exactly?
  • How to make sure the hash code stays the same all through the object lifetime. (MSDN Property #2)

    • Especially when the equality is based upon mutable fields. (MSDN Property #1)
  • How to deal with fields that are complex types (not among the built-in C# types).

    • Complex objects and structs, arrays, collections, lists, dictionaries, generic types, etc.
    • For example, even though the list or dictionary might be readonly, that doesn't mean the contents of it are.
  • How to deal with inherited classes.

    • Should you somehow incorporate base.GetHashCode() into your hash code?
  • Could you technically just be lazy and return 0? Would heavily break MSDN guideline number #3, but would at least make sure #1 and #2 were always true :P
  • Common pitfalls and gotchas.

When to override GetHashCode()?

When you override Equals, basically. When you want to provide a different idea of equality than simple reference equality.

String is a good example of this - two strings are equal (under a simple Equals call) if they represent the same sequence of characters. The hash code reflects this, such that if two strings are equal they will have the same hash code. (The reverse isn't necessarily true - two unequal strings can have the same hash code, but it's unlikely.)

(Strings are tricky in other ways, mind you - there are lots of different ideas of equality based on culture and casing, but String.Equals just looks at the UTF-16 code points which make up the string, and compares them in the simplest conceivable fashion.)

Do I need to override GetHashCode() on reference types?

You only need to override GetHashCode() on reference types if you override Object.Equals().

The reason for this is simple - normally, 2 references will always be distinct (a.Equals(b)==false, unless they're the same object). The default implementation of GetHashCode() will provide 2 distinct hashes in this case, so all is good.

If you override Equals(), though, this behavior is not guaranteed. If two objects are equal (as per Equals()), you need to guarantee that they'll have the same hash code with GetHashCode, so you should override it.

how to implement override of GetHashCode() with logic of overriden Equals()

Firstly, as I think you understand, wherever you implement Equals you MUST also implement GetHashCode. The implementation of GetHashCode must reflect the behaviour of the Equals implementation but it doesn't usually use it.

See http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx - especially the "Notes to Implementers"

So if you take your example of the Item implementation of Equals, you're considering both the values of id and name to affect equality. So both of these must contribute to the GetHashCode implementation.

An example of how you could implement GetHashCode for Item would be along the lines of the following (note you may need to make it resilient to a nullable name field):

public override GetHashCode()
{
return id.GetHashCode() ^ name.GetHashCode();
}

See Eric Lippert's blog post on guidelines for GetHashCode - http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/

As for whether you need to re-implement GetHashCode in subclasses - Yes if you also override Equals - as per the first (and main) point - the implementation of the two must be consistent - if two items are considered equal by Equals then they must return the same value from GetHashCode.

Side note:
As a performance improvement on your code (avoid multiple casts):

if ( obj is Param){
Param p = (Param)(obj);

Param p = obj as Param;
if (p != null) ...

What is the best algorithm for overriding GetHashCode?

I usually go with something like the implementation given in Josh Bloch's fabulous Effective Java. It's fast and creates a pretty good hash which is unlikely to cause collisions. Pick two different prime numbers, e.g. 17 and 23, and do:

public override int GetHashCode()
{
unchecked // Overflow is fine, just wrap
{
int hash = 17;
// Suitable nullity checks etc, of course :)
hash = hash * 23 + field1.GetHashCode();
hash = hash * 23 + field2.GetHashCode();
hash = hash * 23 + field3.GetHashCode();
return hash;
}
}

As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good... and although most examples I've seen with small numbers tend to use primes, there are at least similar algorithms where non-prime numbers are often used. In the not-quite-FNV example later, for example, I've used numbers which apparently work well - but the initial value isn't a prime. (The multiplication constant is prime though. I don't know quite how important that is.)

This is better than the common practice of XORing hashcodes for two main reasons. Suppose we have a type with two int fields:

XorHash(x, x) == XorHash(y, y) == 0 for all x, y
XorHash(x, y) == XorHash(y, x) for all x, y

By the way, the earlier algorithm is the one currently used by the C# compiler for anonymous types.

This page gives quite a few options. I think for most cases the above is "good enough" and it's incredibly easy to remember and get right. The FNV alternative is similarly simple, but uses different constants and XOR instead of ADD as a combining operation. It looks something like the code below, but the normal FNV algorithm operates on individual bytes, so this would require modifying to perform one iteration per byte, instead of per 32-bit hash value. FNV is also designed for variable lengths of data, whereas the way we're using it here is always for the same number of field values. Comments on this answer suggest that the code here doesn't actually work as well (in the sample case tested) as the addition approach above.

// Note: Not quite FNV!
public override int GetHashCode()
{
unchecked // Overflow is fine, just wrap
{
int hash = (int) 2166136261;
// Suitable nullity checks etc, of course :)
hash = (hash * 16777619) ^ field1.GetHashCode();
hash = (hash * 16777619) ^ field2.GetHashCode();
hash = (hash * 16777619) ^ field3.GetHashCode();
return hash;
}
}

Note that one thing to be aware of is that ideally you should prevent your equality-sensitive (and thus hashcode-sensitive) state from changing after adding it to a collection that depends on the hash code.

As per the documentation:

You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:

  • You can compute the hash code from fields that are not mutable; or
  • You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.

The link to the FNV article is broken but here is a copy in the Internet Archive: Eternally Confuzzled - The Art of Hashing

GetHashCode Guidelines in C#

The answer is mostly, it is a valid guideline, but perhaps not a valid rule. It also doesn't tell the whole story.

The point being made is that for mutable types, you cannot base the hash code on the mutable data because two equal objects must return the same hash code and the hash code has to be valid for the lifetime of the object. If the hash code changes, you end up with an object that gets lost in a hashed collection because it no longer lives in the correct hash bin.

For example, object A returns hash of 1. So, it goes in bin 1 of the hash table. Then you change object A such that it returns a hash of 2. When a hash table goes looking for it, it looks in bin 2 and can't find it - the object is orphaned in bin 1. This is why the hash code must not change for the lifetime of the object, and just one reason why writing GetHashCode implementations is a pain in the butt.

Update
Eric Lippert has posted a blog that gives excellent information on GetHashCode.

Additional Update
I've made a couple of changes above:

  1. I made a distinction between guideline and rule.
  2. I struck through "for the lifetime of the object".

A guideline is just a guide, not a rule. In reality, GetHashCode only has to follow these guidelines when things expect the object to follow the guidelines, such as when it is being stored in a hash table. If you never intend to use your objects in hash tables (or anything else that relies on the rules of GetHashCode), your implementation doesn't need to follow the guidelines.

When you see "for the lifetime of the object", you should read "for the time the object needs to co-operate with hash tables" or similar. Like most things, GetHashCode is about knowing when to break the rules.

What are the rules I should follow to ensure GetHashCode() method returns unique value for an object?

You shouldn't even aim for GetHashCode() returning a unique value for each object. That's not the point of GetHashCode().

Eric Lippert has a great post about hash codes which you should read thoroughly. Basically you want to end up with something which will always return the same value for two equal objects (and you need to work out what you mean by equal) and is likely to return different values for two non-equal objects.

Personally I tend to use an implementation like this:

public override int GetHashCode()
{
int hash = 17;
hash = hash * 31 + field1.GetHashCode();
hash = hash * 31 + field2.GetHashCode();
hash = hash * 31 + field3.GetHashCode();
...
return hash;
}

Things to watch out for:

  • If you have mutable objects, be careful! You shouldn't mutate an object after using it as the key in a hash map.
  • If your fields can be null, you need to check for that while calculating your hash. For example:

    hash = hash * 31 + (field2 == null ? 0 : field2.GetHashCode());

C# – How to override GetHashCode for ListT to calculate ETag with T being a record

A little extension method and HashCode could help with this:

internal static class EnumerableExtensions {
public static int GetCombinedHashCode<T>(this IEnumerable<T> source) =>
source.Aggregate(typeof(T).GetHashCode(), (hash, t) => HashCode.Combine(hash, t));
}

Seeding the hash with typeof(T).GetHashCode is a rather arbitrary, but ensures that empty collections of different types do not all "look equal", since they would not normally compare equal either. Whether this matters or is even desirable will depend on your scenario.

Of course the result of this is only usable if T has a meaningful GetHashCode implementation, but that's true of hashes in general. For extra peace of mind a where T : IEquatable<T> constraint could be added, although that's not the standard approach for methods involving hashes. Adding the ability to use a custom IEqualityComparer<T> for the hash is left as an exercise.

Object.GetHashCode

The most important property a hash code implementation must have is this:

If two objects compare as equal then they must have identical hash codes.

If you have a class where instances of the class compare by reference equality, then you do not need to override GetHashCode; the default implementation guarantees that two objects that are the same reference have the same hash code. (You're calling the same method twice on the same object, so of course the result is the same.)

If you have written a class which implements its own equality that is different from reference equality then you are REQUIRED to override GetHashCode such that two objects that compare as equal have equal hash codes.

Now, you could do so by simply returning zero every time. That would be a lousy hash function, but it would be legal.

Other properties of good hash functions are:

  • GetHashCode should never throw an exception

  • Mutable objects which compare for equality on their mutable state, and therefore hash on their mutable state, are dangerously bug-prone. You can put an object into a hash table, mutate it, and be unable to get it out again. Try to never hash or compare for equality on mutable state.

  • GetHashCode should be extremely fast -- remember, the purpose of a good hash algorithm is to improve the performance of lookups. If the hash is slow then the lookups can't be made fast.

  • Objects which do not compare as equal should have dissimilar hash codes, well distributed over the whole range of a 32 bit integer

What can go wrong if one fails to override GetHashCode() when overriding Equals()?

The most visible way is for mapping structures.

Any class which does this will have unpredictable behavior when used as the Key for a Dictionary or HashTable. The reason being is that the implementation uses both GetHashCode and Equals to properly find a value in the table. The short version of the algorithm is the following

  1. Take the modulus of the HashCode by the number of buckets and that's the bucket index
  2. Call .Equals() for the specified Key and every Key in the particular bucket.
  3. If there is a match that is the value, no match = no value.

Failing to keep GetHashCode and Equals in sync will completely break this algorithm (and numerous others).



Related Topics



Leave a reply



Submit