How Is GetHashCode() Implemented for Int32

How is GetHashCode() implemented for Int32?

According to Reflector:

public override int GetHashCode()
{
    return this;
}

Makes sense, doesn't it?

Definition of GetHashCode() in C#

GetHashCode is implemented on your TKey type, not on the Dictionary class.

For int.GetHashCode(), which your example uses, it's defined as:

public override int GetHashCode()
{
    return this;
}

The number of buckets has nothing to do with the hash code value.

new KeyValuePair&lt;UInt32, UInt32&gt;(i, j).GetHashCode(): High Rate of Duplicates

Firstly, we can dispense with the timing aspect of this - it feels to me like this is really just about hash collisions, as obviously those will kill the performance.

So, the question is really why there are more hash collisions for KeyValuePair<uint, uint> than KeyValuePair<ushort, ushort>. To help find out a bit more about that, I've written the following short program:

using System;
using System.Collections.Generic;

class Program
{
    const int Sample1 = 100;
    const int Sample2 = 213;

    public static void Main()
    {
        Display<uint, ushort>();
        Display<ushort, ushort>();
        Display<uint, uint>();
        Display<ushort, uint>();
    }

    static void Display<TKey, TValue>()
    {
        TKey key1 = (TKey) Convert.ChangeType(Sample1, typeof(TKey));
        TValue value1 = (TValue) Convert.ChangeType(Sample1, typeof(TValue));
        TKey key2 = (TKey) Convert.ChangeType(Sample2, typeof(TKey));
        TValue value2 = (TValue) Convert.ChangeType(Sample2, typeof(TValue));

        Console.WriteLine("Testing {0}, {1}", typeof(TKey).Name, typeof(TValue).Name);
        Console.WriteLine(new KeyValuePair<TKey, TValue>(key1, value1).GetHashCode());
        Console.WriteLine(new KeyValuePair<TKey, TValue>(key1, value2).GetHashCode());
        Console.WriteLine(new KeyValuePair<TKey, TValue>(key2, value1).GetHashCode());
        Console.WriteLine(new KeyValuePair<TKey, TValue>(key2, value2).GetHashCode());
        Console.WriteLine();
    }
}

The output on my machine is:

Testing UInt32, UInt16
-1888265981
-1888265981
-1888265806
-1888265806

Testing UInt16, UInt16
-466800447
-459525951
-466800528
-459526032

Testing UInt32, UInt32
958334947
958334802
958334802
958334947

Testing UInt16, UInt32
-1913331935
-1913331935
-1913331935
-1913331935

You can obviously try varying the sample values to see where there are collisions.

The results of KeyValuePair<ushort, uint> are particularly worrying, and the results of KeyValuePair<ushort, ushort> are surprisingly good.

In fact, KeyValuePair<ushort, uint> isn't just bad - it's ludicrously bad as far as I can see. I have yet to find any value which doesn't have the same hash code of -1913331935 when running the 64-bit CLR. Running the 32-bit CLR I get a different hash code, but still the same hash code for all values.

It appears that in .NET 4.5 (which is what I'm running) the default implementation of GetHashCode doesn't just take the first instance field of the struct, as previously documented. I suspect that for at least some types, it just uses the first 4 bytes of memory beyond the header in the boxed value (and there will be boxing for every call here), and that ends up sometimes being just the first field (if that field is a uint), sometimes being more than one field (e.g. for ushort, ushort where both fields can fit "inside" 4 bytes) and sometimes being no fields at all (ushort, uint).

(Actually, this doesn't explain why you get 1024 different hash codes in the uint, uint case instead of just 1000. I'm still unsure on that.)

Ultimately, using a value type which doesn't override GetHashCode as a dictionary key seems like it's just a bad idea, unless you've tested to ensure that it's suitable for your specific requirements. There's just too much which is black magic to be confident about it, IMO.
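One way to sidestep the default struct hashing entirely is to define your own key struct that overrides GetHashCode and Equals. Here's a minimal sketch using the 17/23 combining pattern discussed later in this page (the UIntPair type name is illustrative, not from the original question):

```csharp
using System;
using System.Collections.Generic;

// Illustrative key struct; overriding GetHashCode/Equals avoids the
// default reflection/memory-based hashing for value types.
struct UIntPair : IEquatable<UIntPair>
{
    public readonly uint A;
    public readonly uint B;

    public UIntPair(uint a, uint b) { A = a; B = b; }

    public bool Equals(UIntPair other) => A == other.A && B == other.B;
    public override bool Equals(object obj) => obj is UIntPair p && Equals(p);

    public override int GetHashCode()
    {
        unchecked // overflow is fine, just wrap
        {
            int hash = 17;
            hash = hash * 23 + (int) A;
            hash = hash * 23 + (int) B;
            return hash;
        }
    }
}

class Demo
{
    static void Main()
    {
        var dict = new Dictionary<UIntPair, string>
        {
            [new UIntPair(100, 213)] = "first",
            [new UIntPair(213, 100)] = "second"
        };
        Console.WriteLine(dict[new UIntPair(100, 213)]); // prints "first"
    }
}
```

Because the hash combines both fields order-sensitively, (100, 213) and (213, 100) get distinct hash codes rather than colliding the way the default KeyValuePair hashing can.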

How do you write a GetHashCode method for an object made of a string and a collection of int32?

It's really up to you. I personally would go for something like

public int GetHashCode( ProductWithFeatures obj )
{
    string toHash = obj.Name;
    foreach( var feature in obj.Features )
        toHash += feature.GetHashCode();

    return toHash.GetHashCode();
}

It's not the nicest code ever, but it does what it's supposed to do.

What is the best algorithm for overriding GetHashCode?

I usually go with something like the implementation given in Josh Bloch's fabulous Effective Java. It's fast and creates a pretty good hash which is unlikely to cause collisions. Pick two different prime numbers, e.g. 17 and 23, and do:

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good... and although most examples I've seen with small numbers tend to use primes, there are at least similar algorithms where non-prime numbers are often used. In the not-quite-FNV example later, for example, I've used numbers which apparently work well - but the initial value isn't a prime. (The multiplication constant is prime though. I don't know quite how important that is.)

This is better than the common practice of XORing hashcodes for two main reasons. Suppose we have a type with two int fields:

XorHash(x, x) == XorHash(y, y) == 0 for all x, y
XorHash(x, y) == XorHash(y, x) for all x, y
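Both weaknesses are easy to demonstrate in a few lines (XorHash here is a hypothetical XOR-combining helper, written out for illustration):

```csharp
using System;

class XorDemo
{
    // Hypothetical XOR-combining hash for a type with two int fields.
    public static int XorHash(int a, int b) => a.GetHashCode() ^ b.GetHashCode();

    // Multiply-add combining, as in the 17/23 pattern above.
    public static int AddHash(int a, int b)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + a.GetHashCode();
            hash = hash * 23 + b.GetHashCode();
            return hash;
        }
    }

    static void Main()
    {
        Console.WriteLine(XorHash(5, 5));                  // 0: every (x, x) pair hashes to 0
        Console.WriteLine(XorHash(3, 7) == XorHash(7, 3)); // True: field order is lost
        Console.WriteLine(AddHash(3, 7) == AddHash(7, 3)); // False: order matters
    }
}
```

Types where equal or swapped field values are common (coordinates, ranges, pairs) therefore collide heavily under XOR but not under the multiply-add scheme.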

By the way, the earlier algorithm is the one currently used by the C# compiler for anonymous types.

This page gives quite a few options. I think for most cases the above is "good enough" and it's incredibly easy to remember and get right. The FNV alternative is similarly simple, but uses different constants and XOR instead of ADD as a combining operation. It looks something like the code below, but the normal FNV algorithm operates on individual bytes, so this would require modifying to perform one iteration per byte, instead of per 32-bit hash value. FNV is also designed for variable lengths of data, whereas the way we're using it here is always for the same number of field values. Comments on this answer suggest that the code here doesn't actually work as well (in the sample case tested) as the addition approach above.

// Note: Not quite FNV!
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = (int) 2166136261;
        // Suitable nullity checks etc, of course :)
        hash = (hash * 16777619) ^ field1.GetHashCode();
        hash = (hash * 16777619) ^ field2.GetHashCode();
        hash = (hash * 16777619) ^ field3.GetHashCode();
        return hash;
    }
}

One thing to be aware of: ideally, you should prevent your equality-sensitive (and thus hashcode-sensitive) state from changing after the object has been added to a collection that depends on the hash code.

As per the documentation:

You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:

  • You can compute the hash code from fields that are not mutable; or
  • You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.
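The hazard those rules guard against is easy to reproduce. In this sketch, MutableKey is a deliberately bad key type whose hash depends on mutable state; mutating it after insertion makes the dictionary look in the wrong bucket:

```csharp
using System;
using System.Collections.Generic;

// Deliberately bad: the hash code depends on mutable state.
class MutableKey
{
    public int Value;
    public override int GetHashCode() => Value;
    public override bool Equals(object obj) =>
        obj is MutableKey k && k.Value == Value;
}

class MutationDemo
{
    static void Main()
    {
        var key = new MutableKey { Value = 1 };
        var dict = new Dictionary<MutableKey, string> { [key] = "hello" };

        key.Value = 2; // mutate after insertion: the hash code changes

        // The entry still occupies a slot, but lookup now probes a
        // different bucket, so the key can no longer be found.
        Console.WriteLine(dict.ContainsKey(key)); // False
        Console.WriteLine(dict.Count);            // 1 - the entry is orphaned
    }
}
```

The entry is effectively leaked: it counts toward Count and can't be removed via the key, which is exactly why the documentation restricts GetHashCode on mutable types to immutable fields.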

The link to the FNV article is broken but here is a copy in the Internet Archive: Eternally Confuzzled - The Art of Hashing

How is Object.GetHashCode() implemented in CLR & JVM?

No, not the address - that can't work with a garbage collector that moves objects. The idea is intuitively simple: the hash code can be a random number, as long as it is stored once it has been generated. It does get stored in the object, in the syncblk field. That field stores more than one object property; it is replaced by an index into an allocated syncblk if more than one such property needs to be stored.

The .NET algorithm uses the managed thread ID so that threads are not likely to generate the same sequence:

inline DWORD GetNewHashCode()
{
    // Every thread has its own generator for hash codes so that we won't get into a situation
    // where two threads consistently give out the same hash codes.
    // Choice of multiplier guarantees period of 2**32 - see Knuth Vol 2 p16 (3.2.1.2 Theorem A)
    DWORD multiplier = m_ThreadId*4 + 5;
    m_dwHashCodeSeed = m_dwHashCodeSeed*multiplier + 1;
    return m_dwHashCodeSeed;
}

The seed is stored per-thread so no lock is required. At least that's what is used in the SSCLI20 version. No idea about Java, I imagine it is similar.

GetHashCode() Behaviour When a Dictionary Has More Than Int.MaxValue Elements

There is no requirement that if object1.GetHashCode() == object2.GetHashCode(), then object1.Equals(object2). Any container type that uses hash codes must be prepared to deal with hash collisions. One possible way to do that is to store all different objects with the same hash code in a list, and when looking up an object, first look up the hash code, and then iterate over the objects in the associated list, calling Equals for every object until you find a match.
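A toy version of that collision-handling scheme (separate chaining) might look like the following. This is a sketch to illustrate the lookup logic described above, not how Dictionary<TKey, TValue> is actually laid out internally:

```csharp
using System;
using System.Collections.Generic;

// Toy hash table using separate chaining: each bucket holds a list of
// entries, and Equals disambiguates keys whose hash codes collide.
class ChainedTable<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] buckets;

    public ChainedTable(int size) =>
        buckets = new List<KeyValuePair<TKey, TValue>>[size];

    private List<KeyValuePair<TKey, TValue>> Bucket(TKey key)
    {
        // Mask off the sign bit so the modulo result is non-negative.
        int index = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        return buckets[index] ??= new List<KeyValuePair<TKey, TValue>>();
    }

    public void Add(TKey key, TValue value) =>
        Bucket(key).Add(new KeyValuePair<TKey, TValue>(key, value));

    public bool TryGet(TKey key, out TValue value)
    {
        // The hash narrows the search to one bucket; Equals picks the match.
        foreach (var pair in Bucket(key))
        {
            if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
            {
                value = pair.Value;
                return true;
            }
        }
        value = default;
        return false;
    }
}

class TableDemo
{
    static void Main()
    {
        var table = new ChainedTable<int, string>(4);
        table.Add(1, "one");
        table.Add(5, "five"); // 1 and 5 land in the same bucket when size is 4
        Console.WriteLine(table.TryGet(5, out var v) && v == "five"); // True
    }
}
```

Keys 1 and 5 collide (both map to bucket 1 of 4), yet lookup still returns the right entry because Equals is called on every candidate in the chain.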

How to implement GetHashCode() in a C# struct

Always the latter. The former isn't sufficient because most bits are 0 (your numbers are most likely small), and those zeroes are in the most significant bits. You'd be wasting a lot of the hash code, thus getting a lot more collisions.

Another common way of doing it is to multiply each item by a prime number and relying on overflows:

return unchecked(FolderID.GetHashCode() * 23 * 23
    + SubItemKind.GetHashCode() * 23
    + SubItemID.GetHashCode());

Edit: Updated to use unchecked for explicit overflow support as per stakx's comment.
