How to Create a Hashcode in .Net (C#) for a String That Is Safe to Store in a Database

How do I create a hash code in .NET (C#) for a string that is safe to store in a database?

It depends what properties you want that hash to have. For example, you could just write something like this:

public int HashString(string text)
{
    // TODO: Determine nullity policy.

    unchecked
    {
        int hash = 23;
        foreach (char c in text)
        {
            hash = hash * 31 + c;
        }
        return hash;
    }
}

So long as you document that that is how the hash is computed, that's valid. It's in no way cryptographically secure or anything like that, but you can persist it with no problems. Two strings which are absolutely equal in the ordinal sense (i.e. with no cultural equality etc applied, exactly character-by-character the same) will produce the same hash with this code.
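
To make the "ordinal" point concrete, here is a small sketch that wraps the method above in a hypothetical OrdinalHashDemo class. The class name and the null policy (throwing ArgumentNullException) are my assumptions, not part of the answer. Two strings that render identically but differ in their UTF-16 code units hash differently.

```csharp
using System;

// Hypothetical wrapper class, used only for illustration.
public static class OrdinalHashDemo
{
    public static int HashString(string text)
    {
        // Assumed nullity policy: reject null outright.
        if (text == null) throw new ArgumentNullException(nameof(text));

        unchecked
        {
            int hash = 23;
            foreach (char c in text)
            {
                // Each UTF-16 code unit contributes its numeric value.
                hash = hash * 31 + c;
            }
            return hash;
        }
    }
}
```

For example, HashString("\u00E9") (a precomposed "é") returns 946, while HashString("e\u0301") ("e" followed by a combining acute accent) returns 26003. If you want such strings to hash the same, you would need to normalize them before hashing and document that step as part of the hash definition.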

The problems come when you rely on undocumented hashing - i.e. something which obeys GetHashCode() but is in no way guaranteed to remain the same from version to version... like string.GetHashCode().

Writing and documenting your own hash like this is a bit like saying, "This sensitive information is hashed with MD5 (or whatever)". So long as it's a well-defined hash, that's fine.

EDIT: Other answers have suggested using cryptographic hashes such as SHA-1 or MD5. I would say that until we know there's a requirement for cryptographic security rather than just stability, there's no point in going through the rigmarole of converting the string to a byte array and hashing that. Of course if the hash is meant to be used for anything security-related, an industry-standard hash is exactly what you should be reaching for. But that wasn't mentioned anywhere in the question.

Creating a hashcode for use in a database (ie not using GetHashCode)

I would encourage you to consider what the others have said: let the database do what it's good at. Creating a hash code in order to optimize lookups is an indication that the indexes on your table aren't what they should be.

That said, if you really need a hash code:

You don't say if you want a 32-bit or 64-bit hash code. This one will create a 64-bit hash code for a string. It's reasonably collision-resistant.

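// UNKNOWN_RECORD_HASH is an application-specific sentinel value (not shown here).
// Define it to match your schema, or drop the check below if you don't reserve one.
public static long ComputeHashCode(string url)
{
    // FNV-1a style loop over the string's UTF-16 code units,
    // using the 64-bit FNV prime and offset basis.
    const ulong p = 1099511628211;

    ulong hash = 14695981039346656037;

    // Overflow is intentional, so make sure it never throws in a checked build.
    unchecked
    {
        for (int i = 0; i < url.Length; ++i)
        {
            hash = (hash ^ url[i]) * p;
        }

        // Wang64 bit mixer
        hash = (~hash) + (hash << 21);
        hash = hash ^ (hash >> 24);
        hash = (hash + (hash << 3)) + (hash << 8);
        hash = hash ^ (hash >> 14);
        hash = (hash + (hash << 2)) + (hash << 4);
        hash = hash ^ (hash >> 28);
        hash = hash + (hash << 31);

        // Never return the reserved sentinel value.
        if (hash == (ulong)UNKNOWN_RECORD_HASH)
        {
            ++hash;
        }
        return (long)hash;
    }
}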

Note that this is a hash code, so collisions are possible, but the likelihood of one is pretty small as long as your record count stays well below a few billion. Rule of thumb: the chance of at least one collision approaches 50% once the number of items reaches roughly the square root of your hash code's range. This hash code has a range of 2^64, so if you have 2^32 (about 4.3 billion) items, your chance of a collision is somewhere around 40% to 50%.
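
The rule of thumb can be checked numerically. The sketch below uses the standard birthday-problem approximation p ≈ 1 - exp(-n(n-1) / 2N), which assumes hash values are uniformly distributed. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper for estimating collision odds.
public static class BirthdayBound
{
    // Approximate probability of at least one collision when `items` values
    // are drawn uniformly from a space of 2^bits possible hash codes.
    public static double CollisionProbability(double items, int bits)
    {
        double range = Math.Pow(2.0, bits);
        double exponent = -(items * (items - 1.0)) / (2.0 * range);
        return 1.0 - Math.Exp(exponent);
    }
}
```

Under this approximation, CollisionProbability(4294967296.0, 64) (2^32 items in a 64-bit space) comes out at roughly 0.39, and 100 million items in the same space gives a probability well under 0.1%. A 32-bit hash reaches even odds at only about 77,000 items.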

See http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=792 and http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table for more information.

Persistent hashcode for strings

There is no built in, cross version stable, way to get a hash code of a string.

You could just copy the existing GetHashCode() code, leaving out the portion that adds the build number as the seed and avoiding unsafe calls, to keep yourself safe from implementation detail changes.

Here is a fully managed version of the GetHashCode() used by the 64-bit runtime. It does not use any randomization and will return the same value for all future versions of .NET (as long as the behavior of int ^ char never changes). Note that the result is still a 32-bit int.

public static class StringExtensionMethods
{
    public static int GetStableHashCode(this string str)
    {
        unchecked
        {
            int hash1 = 5381;
            int hash2 = hash1;

            for(int i = 0; i < str.Length && str[i] != '\0'; i += 2)
            {
                hash1 = ((hash1 << 5) + hash1) ^ str[i];
                if (i == str.Length - 1 || str[i+1] == '\0')
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ str[i+1];
            }

            return hash1 + (hash2*1566083941);
        }
    }
}

Storing C# GetHashCode() in DB is Unreliable

You could always use an MD5 hash instead, which is relatively fast:

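// Requires: using System.Security.Cryptography; using System.Text;
public string GetUrlHash(string url) {

    byte[] hash;
    // MD5 is IDisposable, so release it once the hash is computed.
    using (MD5 md5 = MD5.Create()) {
        hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
    }

    // Format each byte as two uppercase hex digits.
    StringBuilder sb = new StringBuilder(hash.Length * 2);
    for (int i = 0; i < hash.Length; i++) {
        sb.Append(hash[i].ToString("X2"));
    }

    return sb.ToString();
}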

Call it like so:

Console.WriteLine(this.GetUrlHash("http://stackoverflow.com/questions/5355003/storing-c-gethashcode-in-db-is-unreliable"));

And get:

> 777BED7F83C66DAC111977067B4B4385

This should be fairly reliable from a uniqueness standpoint. MD5 is insecure nowadays for password applications, but you don't have that problem here.

The only problem is using a string like this as a primary key on a table might be problematic, performance-wise.

The other thing you could do is use the URL shortener approach: use your database's sequence generation feature, and convert the value (make sure you use the equivalent of LONG or BIGINT!) to something like Base36, which gives you a nice, concise string.
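
A minimal sketch of the Base36 conversion, assuming the sequence value arrives as a non-negative long. The Base36 class, its method names, and the lowercase alphabet are my choices for illustration, not something your database provides.

```csharp
using System;
using System.Text;

// Hypothetical helper for turning a database sequence value into a short string.
public static class Base36
{
    private const string Digits = "0123456789abcdefghijklmnopqrstuvwxyz";

    public static string Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        if (value == 0) return "0";

        var sb = new StringBuilder();
        while (value > 0)
        {
            // Prepend the least significant base-36 digit.
            sb.Insert(0, Digits[(int)(value % 36)]);
            value /= 36;
        }
        return sb.ToString();
    }

    public static long Decode(string text)
    {
        if (string.IsNullOrEmpty(text)) throw new ArgumentException("Value is empty.", nameof(text));

        long value = 0;
        foreach (char c in text)
        {
            int digit = Digits.IndexOf(char.ToLowerInvariant(c));
            if (digit < 0) throw new FormatException("Invalid Base36 character: " + c);
            // checked: fail loudly instead of wrapping around on overflow.
            value = checked(value * 36 + digit);
        }
        return value;
    }
}
```

With this sketch, Encode(35) gives "z" and Encode(36) gives "10", and even the largest BIGINT value fits in 13 characters.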

Generate a Hashcode for a string that is platform independent

I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key.
I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar.

Then just let the database index the strings for you!

Look, I have no idea how large your domain is, but you're going to get collisions very rapidly with very high likelihood if it's of any decent size at all. It's the birthday problem with a lot of people relative to the number of birthdays. You're going to have collisions, and lose any gain in speed you might think you're gaining by not just indexing the strings in the first place.

Anyway, you don't need us if you're stuck a few days away from release and you really need an invariant hash code across platform. There are really dumb, really fast implementations of hash code out there that you can use. Hell, you could come up with one yourself in the blink of an eye:

string s = "Hello, world!";
int hash = 17;
foreach(char c in s) {
    // Use the char's numeric value directly; char.GetHashCode() is itself an implementation detail.
    unchecked { hash = hash * 23 + c; }
}

Or you could use the old Bernstein hash. And on and on. Are they going to give you the performance gain you're looking for? I don't know, they weren't meant to be used for this purpose. They were meant to be used for balancing hash tables. You're not balancing a hash table. You're using the wrong concept.
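
For reference, a sketch of the Bernstein hash (often called djb2) in its classic multiply-by-33-and-add form. The class name is hypothetical, and this version works on UTF-16 code units rather than bytes, so for non-ASCII text it will not match byte-oriented implementations in other languages.

```csharp
using System;

// Hypothetical helper; classic Bernstein (djb2) hash over UTF-16 code units.
public static class BernsteinHash
{
    public static int Compute(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));

        unchecked
        {
            int hash = 5381;
            foreach (char c in s)
            {
                // hash * 33 + c, with intentional 32-bit wraparound.
                hash = hash * 33 + c;
            }
            return hash;
        }
    }
}
```

Compute("") returns the seed 5381 and Compute("a") returns 177670. Because the result depends only on the characters and fixed constants, it is the same on every platform and runtime version.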

Edit (the below was written before the question was edited with new salient information):

You can't do this, at all, theoretically, without some kind of restriction on your input space. Your problem is far more severe than String.GetHashCode differing from platform to platform.

There are a lot of instances of string. In fact, way more instances than there are instances of Int32. So, because of the pigeonhole principle, you will have collisions. You can't avoid this: your strings are pigeons and your Int32 hash codes are pigeonholes and there are too many pigeons to go in the pigeonholes without some pigeonhole getting more than one pigeon. Because of collision problems, you can't use hash codes as unique keys for strings. It doesn't work. Period.

The only way you can make your current proposed design work (using Int32 as an identifier for instances of string) is if you restrict your input space of strings to something that has a size less than or equal to the number of Int32s. Even then, you'll have difficulty coming up with an algorithm that maps your input space of strings to Int32 in a unique way.

Even if you try to increase the number of pigeonholes by using SHA-512 or whatever, you still have the possibility of collisions. I doubt you considered that possibility previously in your design; this design path is DOA. And that's not what SHA-512 is for anyway, it's not to be used for unique identification of messages. It's just to reduce the likelihood of message forgery.

To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead.

Well, then you have a tremendous amount of work ahead of you. I'm sorry you discovered this so late in the game.

I note the documentation for String.GetHashCode:

The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.

And from Object.GetHashCode:

The GetHashCode method is suitable for use in hashing algorithms and data structures such as a hash table.

Hash codes are for balancing hash tables. They are not for identifying objects. You could have caught this sooner if you had used the concept for what it is meant to be used for.

Make GetHashCode method behave the same for strings for different processes

Why not use the implementation suggested in the article you shared?

I'm copying it for reference:

static int GetDeterministicHashCode(this string str)
{
    unchecked
    {
        int hash1 = (5381 << 16) + 5381;
        int hash2 = hash1;

        for (int i = 0; i < str.Length; i += 2)
        {
            hash1 = ((hash1 << 5) + hash1) ^ str[i];
            if (i == str.Length - 1)
                break;
            hash2 = ((hash2 << 5) + hash2) ^ str[i + 1];
        }

        return hash1 + (hash2 * 1566083941);
    }
}

Efficient comparison of long text strings stored into a Database

To expand on my comment: Use the Murmur3 non-cryptographic hash algorithm. You can get it from NuGet here: https://www.nuget.org/packages/murmurhash/

- Do not use the built-in GetHashCode() because, as you surmised, it isn't safe to persist outside of your process.
- You can (but you shouldn't) use cryptographically-secure hash-functions because they're computationally expensive to calculate, and generally slow (not necessarily intentionally slow, but if SHA-256 was trivial to compute then I'd be a billionaire from finding SHA-256 hashes for Bitcoin mining).
  - Whereas hashing-functions like Murmur are designed to be fast and fairly collision-resistant.

So here's what I'd do:

Write a function that serializes your LogEntry to a reusable MemoryStream for hashing by MurmurHash (the NuGet package I linked-to does not have the ability to automatically hash any object - and even if it did, you need a rigidly-defined hashing operation - as it is, serializing in-memory is the "best" approach for now). Provided you re-use the MemoryStream this won't be expensive.
Store the hash in your database and/or cache it in-memory to reduce IO ops.

In your case:

interface ILogEventHasher
{
    Int32 Compute32BitMurmurHash( LogEvent logEvent );
}

// Register this class as a singleton service in your DI container.
// Note: the shared MemoryStream means this class is not thread-safe; synchronize calls if needed.
sealed class LogEventHasher : ILogEventHasher, IDisposable
{
    private readonly MemoryStream ms = new MemoryStream();

    public Int32 Compute32BitMurmurHash( LogEvent logEvent )
    {
        if( logEvent is null ) throw new ArgumentNullException( nameof(logEvent) );

        this.ms.Position = 0;
        this.ms.SetLength( 0 ); // This resets the length pointer, it doesn't deallocate memory.

        // leaveOpen: true, otherwise disposing the writer would close the shared MemoryStream.
        using( BinaryWriter wtr = new BinaryWriter( this.ms, Encoding.UTF8, leaveOpen: true ) )
        {
            // BinaryWriter has no DateTime overload, so write the tick count instead.
            wtr.Write( logEvent.DateTime.Ticks );
            wtr.Write( (Int32)logEvent.Level ); // Assumes Level is an enum or integer type.
            wtr.Write( logEvent.Message );
        }

        this.ms.Position = 0; // This does NOT reset the Length pointer.

        Int32 result;
        using( Murmur32 mh = MurmurHash.Create32() )
        {
            Byte[] hash = mh.ComputeHash( this.ms );
            result = BitConverter.ToInt32( hash, 0 ); // `hash` will be 4 bytes long.
        }

        // Reset stream state (this must happen before returning, or it never runs):
        this.ms.Position = 0;
        this.ms.SetLength( 0 );

        // Shrink the MemoryStream if it's grown too large:
        const Int32 TWO_MEGABYTES = 2 * 1024 * 1024;
        if( this.ms.Capacity > TWO_MEGABYTES )
        {
            this.ms.Capacity = TWO_MEGABYTES;
        }

        return result;
    }

    public void Dispose()
    {
        this.ms.Dispose();
    }
}

To filter LogEvent instances in-memory, just use a HashSet<( DateTime utc, Int32 hash )>.

I don't recommend using HashSet<Int32> (storing just the Murmur hash-codes) because a 32-bit non-cryptographically-secure hash-code doesn't give me enough confidence that a hash-code collision won't happen, but combining it with a DateTime value gives me sufficient confidence. A DateTime value consumes 64 bits, or 8 bytes, so each memoized LogEvent will require 12 bytes. Given .NET's 2GiB array/object size limit (and assuming a HashSet load-factor of 0.75), you can store up to 134,217,728 cached hash-codes in-memory. I hope that's enough!

Here's an example:

interface ILogEventFilterService
{
    Boolean AlreadyLoggedEvent( LogEvent e );
}

// Register as a singleton service.
class HashSetLogEventFilter : ILogEventFilterService
{
    // Somewhat amusingly, internally this HashSet will use GetHashCode() - rather than our own hashes, because it's storing a kind of user-level "weak-reference" to a LogEvent in the form of a ValueTuple.

    private readonly HashSet<( DateTime utc, Int32 hash )> hashes = new HashSet<( DateTime utc, Int32 hash )>();

    private readonly ILogEventHasher hasher;

    public HashSetLogEventFilter( ILogEventHasher hasher )
    {
        this.hasher = hasher ?? throw new ArgumentNullException( nameof(hasher) );
    }

    public Boolean AlreadyLoggedEvent( LogEvent e )
    {
        if( e is null ) throw new ArgumentNullException( nameof(e) );

        if( e.DateTime.Kind != DateTimeKind.Utc )
        {
            throw new ArgumentException( message: "DateTime value must be in UTC.", paramName: nameof(e) );
        }

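        Int32 murmurHash = this.hasher.Compute32BitMurmurHash( e );

        var t = ( utc: e.DateTime, hash: murmurHash );

        // Add returns false when the tuple was already present, i.e. the event was already logged.
        return this.hashes.Add( t ) == false;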
    }
}

If you want to do it in the database directly, then define a custom user-defined-table-type for a table-valued-parameter for a stored-procedure that runs a MERGE statement of this form:

CREATE TABLE dbo.LogEvents (
    Utc        datetime2(7)   NOT NULL,
    MurmurHash int            NOT NULL,
    LogLevel   int            NOT NULL,
    Message    nvarchar(4000) NOT NULL
);

MERGE INTO dbo.LogEvents AS tgt WITH ( HOLDLOCK ) -- Always MERGE with HOLDLOCK!!!!!
USING @tvp AS src ON src.Utc = tgt.Utc AND src.MurmurHash = tgt.MurmurHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT(     Utc,     MurmurHash,     LogLevel,     Message )
    VALUES( src.Utc, src.MurmurHash, src.LogLevel, src.Message )
;

Is the .NET string hash function portable?

From MSDN:

The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.

So no, you cannot assume that the value produced by GetHashCode is stable. This isn't just theoretical, either - we've seen the value change in the past.

If you want a stable hash, you'll have to generate it yourself.
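
As one way of generating it yourself, here is a sketch of 32-bit FNV-1a computed over the string's UTF-8 bytes. Working on bytes rather than chars keeps the definition independent of .NET's internal string representation. The class and method names are hypothetical, and FNV-1a is my choice of a simple, publicly specified algorithm; any well-defined hash would serve.

```csharp
using System;
using System.Text;

// Hypothetical helper; 32-bit FNV-1a over the UTF-8 encoding of the string.
public static class StableStringHash
{
    private const uint OffsetBasis = 2166136261;
    private const uint Prime = 16777619;

    public static uint Fnv1a32(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        unchecked
        {
            uint hash = OffsetBasis;
            foreach (byte b in Encoding.UTF8.GetBytes(text))
            {
                hash ^= b;      // FNV-1a: xor the byte in first...
                hash *= Prime;  // ...then multiply by the FNV prime.
            }
            return hash;
        }
    }
}
```

Fnv1a32("") returns the offset basis 0x811C9DC5 and Fnv1a32("a") returns 0xE40C292C. If your database column is a signed int, store unchecked((int)hash) and document that conversion alongside the algorithm.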