What's the Best Strategy for Equals and Gethashcode

What's the best strategy for Equals and GetHashCode?

Assuming that the instances are equal because the hash codes are equal is wrong.

I guess your implementation of GetHashCode is OK, but I usually use things similar to this:

public override int GetHashCode() {
return object1.GetHashCode ^ intValue1 ^ (intValue2 << 16);
}

What is the best algorithm for overriding GetHashCode?

I usually go with something like the implementation given in Josh Bloch's fabulous Effective Java. It's fast and creates a pretty good hash which is unlikely to cause collisions. Pick two different prime numbers, e.g. 17 and 23, and do:

public override int GetHashCode()
{
unchecked // Overflow is fine, just wrap
{
int hash = 17;
// Suitable nullity checks etc, of course :)
hash = hash * 23 + field1.GetHashCode();
hash = hash * 23 + field2.GetHashCode();
hash = hash * 23 + field3.GetHashCode();
return hash;
}
}

As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good... and although most examples I've seen with small numbers tend to use primes, there are at least similar algorithms where non-prime numbers are often used. In the not-quite-FNV example later, for example, I've used numbers which apparently work well - but the initial value isn't a prime. (The multiplication constant is prime though. I don't know quite how important that is.)

This is better than the common practice of XORing hashcodes for two main reasons. Suppose we have a type with two int fields:

XorHash(x, x) == XorHash(y, y) == 0 for all x, y
XorHash(x, y) == XorHash(y, x) for all x, y

By the way, the earlier algorithm is the one currently used by the C# compiler for anonymous types.

This page gives quite a few options. I think for most cases the above is "good enough" and it's incredibly easy to remember and get right. The FNV alternative is similarly simple, but uses different constants and XOR instead of ADD as a combining operation. It looks something like the code below, but the normal FNV algorithm operates on individual bytes, so this would require modifying to perform one iteration per byte, instead of per 32-bit hash value. FNV is also designed for variable lengths of data, whereas the way we're using it here is always for the same number of field values. Comments on this answer suggest that the code here doesn't actually work as well (in the sample case tested) as the addition approach above.

// Note: Not quite FNV!
public override int GetHashCode()
{
unchecked // Overflow is fine, just wrap
{
int hash = (int) 2166136261;
// Suitable nullity checks etc, of course :)
hash = (hash * 16777619) ^ field1.GetHashCode();
hash = (hash * 16777619) ^ field2.GetHashCode();
hash = (hash * 16777619) ^ field3.GetHashCode();
return hash;
}
}

Note that one thing to be aware of is that ideally you should prevent your equality-sensitive (and thus hashcode-sensitive) state from changing after adding it to a collection that depends on the hash code.

As per the documentation:

You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:

  • You can compute the hash code from fields that are not mutable; or
  • You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.

The link to the FNV article is broken but here is a copy in the Internet Archive: Eternally Confuzzled - The Art of Hashing

Why should GetHashCode implement the same logic as Equals?

When storing a value in a hash table, such as Dictionary<>, the framework will first call GetHashCode() and check if there's already a bucket in the hash table for that hash code. If there is, it will call .Equals() to see if the new value is indeed equal to the existing value. If not (meaning the two objects are different, but result in the same hash code), you have what's known as a collision. In this case, the items in this bucket are stored as a linked list and retrieving a certain value becomes O(n).

If you implemented GetHashCode() but did not implement Equals(), the framework would resort to using reference equality to check for equality which would result in every instance creating a collision.

If you implemented Equals() but did not implement GetHashCode(), you might run into a situation where you had two objects that were equal, but resulted in different hash codes meaning they'd maintain their own separate values in your hash table. This would potentially confuse anyone using your class.

As far as what objects are considered equal, that's up to you. If I create a hash table based on temperature, should I be able to refer to the same item using either its Celsius or Fahrenheit value? If so, they need to result in the same hash value and Equals() needs to return true.

Update:

Let's step back and take a look at the purpose of a hash code in the first place. Within this context, a hash code is used as a quick way to identify if two objects are most likely equal. If we have two objects that have different hash codes, we know for a fact they are not equal. If we have two objects that have the same hash code, we know they are most likely equal. I say most likely because an int can only be used to represent a few billion possible values, and strings can of course contain the complete works of Charles Dickens, or any number of possible values. Much in the .NET framework is based on these truths, and developers that use your code will assume things work in a way that is consistent with the rest of the framework.

If you were to have two instances that have different hash codes, but have an implementation of Equals() that returns true, you're breaking this convention. A developer that compares two objects might then use one of of those objects to refer to a key in a hash table and expect to get an existing value out. If all of a sudden the hash code is different, this code might result in a runtime exception instead. Or perhaps return a reference to a completely different object.

Whether 295.15k and 22C are equal within the domain of your program is your choice (In my opinion, they are not). However, whatever you decide, objects that are equal must return the same has code.

General advice and guidelines on how to properly override object.GetHashCode()

Table of contents

  • When do I override object.GetHashCode?

  • Why do I have to override object.GetHashCode()?

  • What are those magic numbers seen in GetHashCode implementations?


Things that I would like to be covered, but haven't been yet:

  • How to create the integer (How to "convert" an object into an int wasn't very obvious to me anyways).
  • What fields to base the hash code upon.

    • If it should only be on immutable fields, what if there are only mutable ones?
  • How to generate a good random distribution. (MSDN Property #3)

    • Part to this, seems to choose a good magic prime number (have seen 17, 23 and 397 been used), but how do you choose it, and what is it for exactly?
  • How to make sure the hash code stays the same all through the object lifetime. (MSDN Property #2)

    • Especially when the equality is based upon mutable fields. (MSDN Property #1)
  • How to deal with fields that are complex types (not among the built-in C# types).

    • Complex objects and structs, arrays, collections, lists, dictionaries, generic types, etc.
    • For example, even though the list or dictionary might be readonly, that doesn't mean the contents of it are.
  • How to deal with inherited classes.

    • Should you somehow incorporate base.GetHashCode() into your hash code?
  • Could you technically just be lazy and return 0? Would heavily break MSDN guideline number #3, but would at least make sure #1 and #2 were always true :P
  • Common pitfalls and gotchas.

Equals vs GetHashCode when comparing objects

Here's what you need to understand about the relationship between Equals and GetHashCode.

Hash codes are used by hashtables to quickly find a "bucket" in which an element is expected to exist. If elements are in two different buckets, the assumption is they cannot be equal.

The outcome of this is that you should view a hash code, for purposes of determining uniqueness, as a quick negative check: i.e., if two objects have different hash codes, they are not the same (regardless of what their Equals methods return).

If two objects have the same hash code, they will reside in the same bucket of a hash table. Then their Equals methods will be called to determine equality.

So GetHashCode must return the same value for two objects that you want to be considered equal.

Correct way to override Equals() and GetHashCode()

You can override Equals() and GetHashCode() on your class like this:

public override bool Equals(object obj)
{
var item = obj as RecommendationDTO;

if (item == null)
{
return false;
}

return this.RecommendationId.Equals(item.RecommendationId);
}

public override int GetHashCode()
{
return this.RecommendationId.GetHashCode();
}

Why use GetHashCode() over Equals()?

For some types, an Equals test may be relatively expensive. It typically has to compare every field of the class. In other words, it takes linear time in the size of the class. Bigger classes are more expensive to compare for equality.

Now, what happens if you need to need to compare one object against 1000 others? Calling Equals 1000 times might get expensive. You need to do N*2000 field accesses, if N is the size of the class

GetHashCode instead generates a "mostly unique" integer based on the contents of the class. In other words, the class fields are accessed once. And once you have that, you can compare this integer to the 1000 integers that make up the other objects' hash codes.

Even in such a naive use case, we now only need N*1000 field accesses.

But what if we store the hash code? When we insert an object into a hash set, its hash code is computed once. Now, any time we want to do a lookup in the hash set, we just need to compute one hash code (that of the external object), and then you just have to compare simple integers.
So N class field accesses (for the new object whose hash code we need to compute), plus a number of integer comparisons, which vary, depending on the algorithm, but are 1) relatively few, and 2) cheap.

Why is it important to override GetHashCode when Equals method is overridden?

Yes, it is important if your item will be used as a key in a dictionary, or HashSet<T>, etc - since this is used (in the absence of a custom IEqualityComparer<T>) to group items into buckets. If the hash-code for two items does not match, they may never be considered equal (Equals will simply never be called).

The GetHashCode() method should reflect the Equals logic; the rules are:

  • if two things are equal (Equals(...) == true) then they must return the same value for GetHashCode()
  • if the GetHashCode() is equal, it is not necessary for them to be the same; this is a collision, and Equals will be called to see if it is a real equality or not.

In this case, it looks like "return FooId;" is a suitable GetHashCode() implementation. If you are testing multiple properties, it is common to combine them using code like below, to reduce diagonal collisions (i.e. so that new Foo(3,5) has a different hash-code to new Foo(5,3)):

In modern frameworks, the HashCode type has methods to help you create a hashcode from multiple values; on older frameworks, you'd need to go without, so something like:

unchecked // only needed if you're compiling with arithmetic checks enabled
{ // (the default compiler behaviour is *disabled*, so most folks won't need this)
int hash = 13;
hash = (hash * 7) + field1.GetHashCode();
hash = (hash * 7) + field2.GetHashCode();
...
return hash;
}

Oh - for convenience, you might also consider providing == and != operators when overriding Equals and GetHashCode.


A demonstration of what happens when you get this wrong is here.

Using GetHashCode to test equality in Equals override

The others are right; your equality operation is broken. To illustrate:

public static void Main()
{
var c1 = new Class1() { A = "apahaa", B = null };
var c2 = new Class1() { A = "abacaz", B = null };
Console.WriteLine(c1.Equals(c2));
}

I imagine you want the output of that program to be "false" but with your definition of equality it is "true" on some implementations of the CLR.

Remember, there are only about four billion possible hash codes. There are way more than four billion possible six letter strings, and therefore at least two of them have the same hash code. I've shown you two; there are infinitely many more.

In general you can expect that if there are n possible hash codes then the odds of getting a collision rise dramatically once you have about the square root of n elements in play. This is the so-called "birthday paradox". For my article on why you shouldn't rely upon hash codes for equality, click here.



Related Topics



Leave a reply



Submit