How Is GetHashCode() of C# String Implemented

How is GetHashCode() of C# string implemented?

Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the version that matches your preferred .NET target, because the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs

public override int GetHashCode() {

#if FEATURE_RANDOMIZED_STRING_HASHING
    if(HashHelpers.s_UseRandomizedStringHashing)
    {
        return InternalMarvin32HashString(this, this.Length, 0);
    }
#endif // FEATURE_RANDOMIZED_STRING_HASHING

    unsafe {
        fixed (char *src = this) {
            Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
            Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");

#if WIN32
            int hash1 = (5381<<16) + 5381;
#else
            int hash1 = 5381;
#endif
            int hash2 = hash1;

#if WIN32
            // 32 bit machines.
            int* pint = (int *)src;
            int len = this.Length;
            while (len > 2)
            {
                hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                pint += 2;
                len -= 4;
            }

            if (len > 0)
            {
                hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
            }
#else
            int c;
            char *s = src;
            while ((c = s[0]) != 0) {
                hash1 = ((hash1 << 5) + hash1) ^ c;
                c = s[1];
                if (c == 0)
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ c;
                s += 2;
            }
#endif
#if DEBUG
            // We want to ensure we can change our hash function daily.
            // This is perfectly fine as long as you don't persist the
            // value from GetHashCode to disk or count on String A
            // hashing before string B. Those are bugs in your code.
            hash1 ^= ThisAssembly.DailyBuildNumber;
#endif
            return hash1 + (hash2 * 1566083941);
        }
    }
}

This is possibly more than you bargained for, so I'll annotate the code a bit:

  • The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off wholesale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework; the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
  • The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble when they do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file.
  • The fixed statement keeps indexing the string cheap; it avoids the bounds checking done by the regular indexer.
  • The first Assert ensures that the string is zero-terminated as it should be, which is required to allow the optimization in the loop.
  • The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, which is required to keep the loop performant.
  • The loop is unrolled by hand, consuming 4 characters per iteration in the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in an int (32 bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash; it won't be if the length is even. It looks at all the characters in the string, answering your question.
  • The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so it doesn't look at all the characters. Strings with an embedded zero are very uncommon, though. That's pretty odd; I can only guess that this has something to do with strings potentially being very large, but I can't think of a practical example.
  • The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
  • The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.
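To make the int* trick from the 32-bit loop a bit more concrete, here is a minimal sketch without pointers. PackTwoChars is a hypothetical helper, not something in the framework, and it assumes a little-endian machine, where the first character ends up in the low 16 bits of the int.

```csharp
using System;

public static class CharPacking
{
    // Hypothetical helper (not part of the framework): builds the int that a
    // little-endian int* read would see over two adjacent UTF-16 chars.
    public static int PackTwoChars(char first, char second)
    {
        // The first char lands in the low 16 bits, the second in the high 16.
        return first | (second << 16);
    }
}
```

For a string of odd length, the last read pairs the final character with the zero terminator, so PackTwoChars('A', '\0') is simply 0x41. That is the case where the terminator takes part in the hash.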

How do you implement GetHashCode for a structure with two strings, when both strings are interchangeable?

MSDN:

A hash function must have the following properties:

  • If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.
  • The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.
  • For the best performance, a hash function must generate a random distribution for all input.

Taking that into account, the correct way is:

return str1.GetHashCode() ^ str2.GetHashCode();

^ can be substituted with another commutative operation.
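Put together with a matching Equals, this could look roughly like the sketch below. UnorderedPair and its field names are made up for illustration, and treating a null string as hash code 0 is an assumption of this sketch.

```csharp
using System;

// Hypothetical type: a pair whose two strings may appear in either order.
public readonly struct UnorderedPair : IEquatable<UnorderedPair>
{
    private readonly string _a;
    private readonly string _b;

    public UnorderedPair(string a, string b)
    {
        _a = a;
        _b = b;
    }

    // Equal when the strings match in either order.
    public bool Equals(UnorderedPair other)
    {
        return (_a == other._a && _b == other._b)
            || (_a == other._b && _b == other._a);
    }

    public override bool Equals(object obj)
    {
        return obj is UnorderedPair other && Equals(other);
    }

    // XOR is commutative, so swapping the strings cannot change the result.
    // A null string is treated as 0 (an assumption of this sketch).
    public override int GetHashCode()
    {
        return (_a?.GetHashCode() ?? 0) ^ (_b?.GetHashCode() ?? 0);
    }
}
```

One thing to be aware of with plain XOR: when both strings are equal, the hash is always 0, so all such pairs land in the same bucket. Addition inside an unchecked block is also commutative and does not have that particular weakness.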

GetHashCode() with string keys

You can call GetHashCode() on the non-numeric values that you use in your object.

private string m_foo;
public override int GetHashCode()
{
    return m_foo.GetHashCode();
}

GetHashCode for a type holding string fields

There's a fairly simple but effective way to do this:

public override int GetHashCode()
{
    unchecked // Hash code calculation can overflow.
    {
        int hash = 17;

        hash = hash * 23 + firstItem.GetHashCode();
        hash = hash * 23 + secondItem.GetHashCode();

        // ...and so on for each item.

        return hash;
    }
}

Where firstItem, secondItem and so on are the items which contribute to the hash code. (Larger primes can also be used instead of 17 and 23, but it really doesn't make much difference.)
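For a type that really does hold string fields, the same pattern might be filled in as below. Track and its fields are hypothetical names, and mapping a null string to 0 is an assumption of this sketch, since calling GetHashCode() on a null field would throw.

```csharp
using System;

// Hypothetical type used only for illustration.
public sealed class Track : IEquatable<Track>
{
    private readonly string title;
    private readonly string artist;

    public Track(string title, string artist)
    {
        this.title = title;
        this.artist = artist;
    }

    public bool Equals(Track other)
    {
        return other != null && title == other.title && artist == other.artist;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Track);
    }

    public override int GetHashCode()
    {
        unchecked // Overflow is expected and harmless here.
        {
            int hash = 17;
            // A null string contributes 0 (an assumption of this sketch).
            hash = hash * 23 + (title == null ? 0 : title.GetHashCode());
            hash = hash * 23 + (artist == null ? 0 : artist.GetHashCode());
            return hash;
        }
    }
}
```

Unlike XOR, this combination depends on field order, so swapping title and artist will normally give a different hash code.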

However, note that if you're using .NET Core 2.1 or later (including .NET Core 3.1), you can do this instead:

public override int GetHashCode() => HashCode.Combine(firstItem, secondItem, ...etc);

Incidentally, if anyone wants to look at the implementation of HashCode.Combine(), it's in the System.HashCode source in the .NET runtime repository.

It's a lot more sophisticated than the code I posted. :)
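A minimal sketch of HashCode.Combine in a complete type, assuming a target where System.HashCode is available. Album and its properties are hypothetical.

```csharp
using System;

// Hypothetical type used only for illustration.
public sealed class Album : IEquatable<Album>
{
    public string Title { get; }
    public int Year { get; }

    public Album(string title, int year)
    {
        Title = title;
        Year = year;
    }

    public bool Equals(Album other)
    {
        return other != null && Title == other.Title && Year == other.Year;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Album);
    }

    // Combine accepts mixed types and treats a null argument as hash code 0,
    // so no explicit null check is needed for Title.
    public override int GetHashCode()
    {
        return HashCode.Combine(Title, Year);
    }
}
```

The values it returns are only stable within one process, so the usual warning about not persisting hash codes applies here as well.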

How to retrieve source code of String.GetHashCode() for a 64-bit application?

For those who made the same mistake, I've built this implementation from the commenters' documentation. It's the default implementation with the following characteristics:

  • .NET 4 - 4.7.2
  • 64-bit
  • Release mode

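public static class StringHashExtensions
{
    // Requires compiling with /unsafe. The method-level unsafe modifier
    // already covers the body, so no inner unsafe block is needed.
    public static unsafe int GetHashCode64BitsRelease(this string str)
    {
        fixed (char* src = str)
        {
            int hash1 = 5381;
            int hash2 = hash1;

            int c;
            char* s = src;
            // Two chars per iteration; stops at the first zero char,
            // which is normally the string's terminator.
            while ((c = s[0]) != 0)
            {
                hash1 = ((hash1 << 5) + hash1) ^ c;
                c = s[1];
                if (c == 0)
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ c;
                s += 2;
            }

            return hash1 + (hash2 * 1566083941);
        }
    }
}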
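If you can't or don't want to compile with unsafe code, the same loop can be sketched with the regular indexer. LegacyStringHash and Compute64BitRelease are names I made up; the idea is that the bounds checks stand in for reading the zero terminator, so the result should match the pointer version above.

```csharp
using System;

public static class LegacyStringHash
{
    // Pointer-free sketch of the 64-bit release loop shown above.
    public static int Compute64BitRelease(string str)
    {
        unchecked
        {
            int hash1 = 5381;
            int hash2 = hash1;

            for (int i = 0; i < str.Length; i += 2)
            {
                char c = str[i];
                if (c == '\0') break;               // embedded zero: stop early
                hash1 = ((hash1 << 5) + hash1) ^ c;

                if (i + 1 >= str.Length) break;     // pointer version reads the terminator here
                c = str[i + 1];
                if (c == '\0') break;               // embedded zero: stop early
                hash2 = ((hash2 << 5) + hash2) ^ c;
            }

            return hash1 + (hash2 * 1566083941);
        }
    }
}
```

This also makes the embedded-zero behaviour easy to see: Compute64BitRelease("ab\0cd") and Compute64BitRelease("ab") return the same value, because everything after the zero is ignored.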

C# User class. GetHashCode implementation

The contract for GetHashCode requires (emphasis mine):

The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method.

So basically, you should compute it based on all the used fields in Equals, even though they're mutable. However, the documentation also notes:

If you do choose to override GetHashCode for a mutable reference type, your documentation should make it clear that users of your type should not modify object values while the object is stored in a hash table.

If only some of your properties were mutable, you could potentially override GetHashCode to compute it based only on the immutable ones - but in this case everything is mutable, so you'd basically end up returning a constant, making it awful to be in a hash-based collection.

So I'd suggest one of three options:

  • Use the mutable fields, and document it carefully.
  • Abandon overriding equality/hashing operations.
  • Abandon it being mutable.
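As a rough sketch of the third option, an immutable version of such a class could look like this. The Name and Email properties are assumptions, since the original User class isn't shown here.

```csharp
using System;

// Hypothetical immutable User; the property names are assumptions.
public sealed class User : IEquatable<User>
{
    public string Name { get; }
    public string Email { get; }

    public User(string name, string email)
    {
        Name = name ?? throw new ArgumentNullException(nameof(name));
        Email = email ?? throw new ArgumentNullException(nameof(email));
    }

    public bool Equals(User other)
    {
        return other != null && Name == other.Name && Email == other.Email;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as User);
    }

    // Uses exactly the fields that Equals compares. Because they can never
    // change, the hash stays valid while the object sits in a hash table.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + Name.GetHashCode();
            hash = hash * 23 + Email.GetHashCode();
            return hash;
        }
    }
}
```

A "change" then means creating a new User and replacing the old one in the collection, which keeps the hash table consistent.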

How can I identify a bad implementation of GetHashCode?

GetHashCode should match the concept of "equal" for your classes/environment (in addition to being constant while in a container and fast).

In normal cases, "equal" means comparing all fields of the corresponding objects (value type comparison). In this case, a simple implementation that somehow merges the hash codes of all fields will suffice.

My understanding is that in NHibernate's case "equal" is significantly more tricky, and as a result you see a complicated implementation. I believe it is mainly due to the fact that some object properties may not yet be available; in such a case, comparing the "identity" of the object is enough.

C# - String.GetHashCode() - don't use as unique identifier

You're right that equal hash codes don't guarantee equal values.

You're wrong in thinking that that quote means otherwise.

This quote is specifically in the context of implementing a hash code calculation for a Person class containing an SSN property. Equal SSN values mean equal Person values. Different SSN values mean different Person values. (Note: this is not necessarily true in reality.)

Now, you need a hash code calculation for Person which guarantees that two equal Person instances have the same hash code, and ideally which makes it likely that two unequal Person instances have a different hash code, although the latter can never be guaranteed. Since equality is defined in terms of the SSN, that means re-using the SSN's hash code achieves that already.
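In code, that reasoning comes down to something like the following sketch. The exact shape of the Person class is an assumption; the quote only tells us that it has an SSN property and that equality is defined by it.

```csharp
using System;

// Sketch of a Person whose equality is defined purely by SSN.
public sealed class Person : IEquatable<Person>
{
    public string Name { get; }
    public string SSN { get; }   // assumed to be a non-null string

    public Person(string name, string ssn)
    {
        Name = name;
        SSN = ssn ?? throw new ArgumentNullException(nameof(ssn));
    }

    public bool Equals(Person other)
    {
        return other != null && SSN == other.SSN;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Person);
    }

    // Equal SSNs give equal hash codes, which is all the contract requires.
    // Unequal SSNs will usually, but not always, give different ones.
    public override int GetHashCode()
    {
        return SSN.GetHashCode();
    }
}
```

Note that the hash code is still only a first filter: a hash table calls Equals afterwards to confirm a match, which is why it must never be used as a unique identifier.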

Why is ValueType.GetHashCode() implemented like it is?

UPDATE: This answer was (in part) the basis of a blog article I wrote which goes into more details about the design characteristics of GetHashCode. Thanks for the interesting question!


I didn't implement it and I haven't talked to the people who did. But I can point out a few things.

(Before I go on, note that here I am specifically talking about hash codes for the purposes of balancing hash tables where the contents of the table are chosen by non-hostile users. The problems of hash codes for digital signing, redundancy checking, or ensuring good performance of a hash table when some of the users are mounting denial-of-service attacks against the table provider are beyond the scope of this discussion.)

First, as Jon correctly notes, the given algorithm does implement the required contract of GetHashCode. It might be sub-optimal for your purposes, but it is legal. All that is required is that things that compare equal have equal hash codes.

So what are the "nice to haves" in addition to that contract? A good hash code implementation should be:

1) Fast. Very fast! Remember, the whole point of the hash code in the first place is to rapidly find a relatively empty slot in a hash table. If the O(1) computation of the hash code is in practice slower than the O(n) time taken to do the lookup naively then the hash code solution is a net loss.

2) Well distributed across the space of 32 bit integers for the given distribution of inputs. The worse the distribution across the ints, the more like a naive linear lookup the hash table is going to be.

So, how would you make a hash algorithm for arbitrary value types given those two conflicting goals? Any time you spend on a complex hash algorithm that guarantees good distribution is time poorly spent.

A common suggestion is "hash all of the fields and then XOR together the resulting hash codes". But that is begging the question; XORing two 32 bit ints only gives good distribution when the inputs themselves are extremely well-distributed and not related to each other, and that is an unlikely scenario:

// (Updated example based on good comment!)
struct Control
{
    string name;
    int x;
    int y;
}

What is the likelihood that x and y are well-distributed over the entire range of 32 bit integers? Very low. Odds are much better that they are both small and close to each other, in which case xoring their hash codes together makes things worse, not better. xoring together integers that are close to each other zeros out most of the bits.
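A small sketch of that effect, using a hypothetical XorHash helper over just the two coordinates. It relies on the fact that Int32.GetHashCode() returns the value itself.

```csharp
using System;

public static class XorDemo
{
    // Naive combiner: XOR the hash codes of the two coordinates.
    // For an int, GetHashCode() is simply the value itself.
    public static int XorHash(int x, int y)
    {
        return x.GetHashCode() ^ y.GetHashCode();
    }
}
```

XorHash(100, 101) and XorHash(200, 201) both come out as 1, and any point with x == y hashes to 0, so controls laid out along a diagonal would all collide.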

Furthermore, this is O(n) in the number of fields! A value type with a lot of small fields would take a comparatively long time to compute the hash code.

Basically the situation we're in here is that the user didn't provide a hash code implementation themselves; either they don't care, or they don't expect this type to ever be used as a key in a hash table. Given that you have no semantic information whatsoever about the type, what's the best thing to do? The best thing to do is whatever is fast and gives good results most of the time.

Most of the time, two struct instances that differ will differ in most of their fields, not just one of their fields, so just picking one of them and hoping that it's the one that differs seems reasonable.

Most of the time, two struct instances that differ will have some redundancy in their fields, so combining the hash values of many fields together is likely to decrease, not increase, the entropy in the hash value, even as it consumes the time that the hash algorithm is designed to save.

Compare this with the design of anonymous types in C#. With anonymous types we do know that it is highly likely that the type is being used as a key to a table. We do know that it is highly likely that there will be redundancy across instances of anonymous types (because they are results of a cartesian product or other join). And therefore we do combine the hash codes of all of the fields into one hash code. If that gives you bad performance due to the excess number of hash codes being computed, you are free to use a custom nominal type rather than the anonymous type.
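The anonymous type behaviour is easy to check for yourself. In this sketch the property names are arbitrary; what matters is that two separately created instances with the same property values compare equal and produce the same hash code.

```csharp
using System;

public static class AnonymousTypeDemo
{
    public static bool SameKey()
    {
        var a = new { Name = "button1", X = 10, Y = 20 };
        var b = new { Name = "button1", X = 10, Y = 20 };

        // Two distinct objects, yet equal by value and with equal hash codes,
        // because Equals and GetHashCode are generated over all properties.
        return !ReferenceEquals(a, b)
            && a.Equals(b)
            && a.GetHashCode() == b.GetHashCode();
    }
}
```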


