How to Generate a Hashcode from a Byte Array in C#

C# Create a hash for a byte array or image

There are plenty of hash providers in .NET that create cryptographic hashes, which satisfies your condition that the hashes be unique (for most purposes, collision-proof). They are all extremely fast, and the hashing definitely won't be the bottleneck in your app unless you're doing it a trillion times over.

Personally I like SHA1:

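// Requires: using System.Linq; using System.Security.Cryptography;
// As an extension method, this must be declared inside a static class.
public static string GetHashSHA1(this byte[] data)
{
    // SHA1.Create() replaces SHA1CryptoServiceProvider, which is marked obsolete in recent .NET versions.
    using (var sha1 = SHA1.Create())
    {
        // Format each digest byte as two uppercase hex digits.
        return string.Concat(sha1.ComputeHash(data).Select(x => x.ToString("X2")));
    }
}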

Even when people say one method might be slower than another, it's all in relative terms. A program dealing with images definitely won't notice the microsecond process of generating a hashsum.

And regarding collisions, for most purposes this is also irrelevant. Even "obsolete" methods like MD5 are still highly useful in most situations. I'd only recommend avoiding them when the security of your system relies on preventing collisions.
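As a rough illustration of the point that equal content produces equal hashes, here is a minimal sketch. It swaps SHA-1 for SHA-256 and uses BitConverter for the hex formatting; the names ContentHashDemo and Sha256Hex are made up for this example and are not part of the answer above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ContentHashDemo
{
    // Returns the SHA-256 digest of the data as an uppercase hex string.
    public static string Sha256Hex(byte[] data)
    {
        using (var sha256 = SHA256.Create())
        {
            byte[] digest = sha256.ComputeHash(data);
            // BitConverter.ToString yields "BA-78-16-..."; drop the dashes.
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    public static void Main()
    {
        byte[] first = Encoding.ASCII.GetBytes("abc");
        byte[] second = Encoding.ASCII.GetBytes("abc");
        byte[] other = Encoding.ASCII.GetBytes("abd");

        // Two distinct arrays with the same content hash to the same string.
        Console.WriteLine(Sha256Hex(first) == Sha256Hex(second)); // True
        // A one-byte change gives a different digest.
        Console.WriteLine(Sha256Hex(first) == Sha256Hex(other));  // False
    }
}
```

For image data, the same method would be called on the bytes read from the file or stream.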

GetHashCode() on byte[] array

Like other non-primitive built-in types that don't override it, an array just returns something arbitrary: a value tied to the object instance rather than to what it holds. It definitely doesn't try to hash the contents of the array. See this answer.
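A small sketch of what this means in practice. The class name ArrayHashCodeDemo is only for illustration; the calls used (RuntimeHelpers.GetHashCode and StructuralComparisons.StructuralEqualityComparer) are standard framework members.

```csharp
using System;
using System.Collections;
using System.Runtime.CompilerServices;

public static class ArrayHashCodeDemo
{
    public static void Main()
    {
        byte[] a = { 1, 2, 3 };
        byte[] b = { 1, 2, 3 };

        // Arrays do not override Equals, so this is a reference comparison.
        Console.WriteLine(a.Equals(b)); // False

        // Arrays do not override GetHashCode either, so the result is the
        // default per-object hash code, unrelated to the contents.
        Console.WriteLine(a.GetHashCode() == RuntimeHelpers.GetHashCode(a)); // True

        // A content-based comparison has to be requested explicitly.
        Console.WriteLine(StructuralComparisons.StructuralEqualityComparer.Equals(a, b)); // True
    }
}
```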

Suitable hash code methods for an array of bytes?

Any of the built-in hashing functions should do; depending on how much you care about collisions these are your options (from most collisions to least):

  • MD5
  • SHA1
  • SHA256
  • SHA384
  • SHA512

They are as simple to use as:

using var sha1 = SHA1.Create(); // C# 8+; disposes the algorithm at the end of the scope
var hash = sha1.ComputeHash(data);

Bonus Marks: If you don't care about security (which I don't think you do given that you are getting the hashes for images) you might want to look into Murmur hash, which is designed for content hashing and not secure hashing (and is thus much faster). It isn't, however, in the framework so you will have to find an implementation (and you should probably go for Murmur3).

Edit: If you are looking for a hash code for a byte[] array, the algorithm is entirely up to you; it usually consists of bit shifting (by primes) and XORing. E.g.

public class ByteArrayEqualityComparer : IEqualityComparer<byte[]>
{
    public static readonly ByteArrayEqualityComparer Default = new ByteArrayEqualityComparer();
    private ByteArrayEqualityComparer() { }

    public bool Equals(byte[] x, byte[] y)
    {
        if (x == null && y == null)
            return true;
        if (x == null || y == null)
            return false;
        if (x.Length != y.Length)
            return false;
        for (var i = 0; i < x.Length; i++)
            if (x[i] != y[i])
                return false;
        return true;
    }

    public int GetHashCode(byte[] obj)
    {
        if (obj == null || obj.Length == 0)
            return 0;
        // Use uint so that >> is a logical shift; on a signed int it would
        // sign-extend and the expression would no longer be a rotation.
        uint hashCode = 0;
        for (var i = 0; i < obj.Length; i++)
            // Rotate left by 3 bits, then XOR in the new value. The outer
            // parentheses matter: in C#, ^ binds tighter than |.
            hashCode = ((hashCode << 3) | (hashCode >> 29)) ^ obj[i];
        return unchecked((int)hashCode);
    }
}
// ...
var hc = ByteArrayEqualityComparer.Default.GetHashCode(data);

EDIT: If you want to validate that the value hasn't changed (that is, detect accidental corruption rather than deliberate tampering), you should use CRC32.
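CRC32 is not part of the classic base class library, so here is a minimal table-driven sketch of the common CRC-32 variant (reflected polynomial 0xEDB88320, as used by zip and PNG). The class name Crc32Sketch is made up for this example; treat it as an illustration rather than a tuned implementation.

```csharp
using System;
using System.Text;

public static class Crc32Sketch
{
    // Lookup table: the CRC of every possible single byte value.
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint n = 0; n < 256; n++)
        {
            uint c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[n] = c;
        }
        return table;
    }

    // Computes the CRC-32 checksum of the whole array.
    public static uint Compute(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    public static void Main()
    {
        // "123456789" is the standard check input for CRC-32.
        uint crc = Compute(Encoding.ASCII.GetBytes("123456789"));
        Console.WriteLine(crc.ToString("X8")); // CBF43926
    }
}
```

Storing the checksum next to the data and recomputing it later is enough to detect accidental changes; it offers no protection against someone who alters the data on purpose.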

In C#, is it possible to get a hash from a byte array that is filename safe?

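// Here 'data' is a string and 'Hasher' is presumably a HashAlgorithm instance, e.g. SHA1.Create().
var originalBytes = Encoding.ASCII.GetBytes(data);
var hashedBytes = Hasher.ComputeHash(originalBytes);

var builder = new StringBuilder();
foreach (Byte hashed in hashedBytes)
    builder.AppendFormat("{0:x2}", hashed);

return builder.ToString();

This is basically the equivalent of what git does: the hash is rendered as lowercase hex, which contains only the characters 0-9 and a-f and is therefore safe to use in a file name.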
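The snippet above can be wrapped into a complete method along these lines. This is only a sketch: it assumes SHA-1 as the hash, and the names FileNameHash and ToFileNameSafeHash are invented for the example.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class FileNameHash
{
    // Hashes the bytes with SHA-1 and renders the digest as lowercase hex.
    public static string ToFileNameSafeHash(byte[] data)
    {
        using (var sha1 = SHA1.Create())
        {
            var builder = new StringBuilder();
            foreach (byte b in sha1.ComputeHash(data))
                builder.Append(b.ToString("x2"));
            return builder.ToString();
        }
    }

    public static void Main()
    {
        string name = ToFileNameSafeHash(Encoding.ASCII.GetBytes("abc"));
        Console.WriteLine(name); // a9993e364706816aba3e25717850c26c9cd0d89d

        // None of the characters are rejected by the current platform.
        Console.WriteLine(name.IndexOfAny(Path.GetInvalidFileNameChars()) < 0); // True
    }
}
```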

Problems to write a byte array of a signed hash with a private key in a text file and get it back as the same format

The signature is binary data, so you could store it as raw bytes (in a file with any extension, .bin for example):

// Requires: import java.io.FileOutputStream; import java.nio.file.Files; import java.nio.file.Paths;
public static byte[] readFileBytes(String filepath) throws Exception {
    return Files.readAllBytes(Paths.get(filepath));
}

public static void writeFileBytes(String filepath, byte[] input) throws Exception {
    try (FileOutputStream fos = new FileOutputStream(filepath)) {
        fos.write(input);
    }
}

If you really want to store this binary data in a text file, then you have to encode it safely using Base64 or hex:

// Requires: import java.util.Base64;
public static String fromBytesToBase64(byte[] dataBytes) {
    return Base64.getEncoder().encodeToString(dataBytes);
}

public static byte[] fromBase64ToBytes(String base64String) {
    return Base64.getDecoder().decode(base64String);
}

The flow will be like:

byte[] signature = getTheSignatureAfterSignOperation();
String signatureInBase64 = fromBytesToBase64(signature);
saveToTextfile(signatureInBase64, "anyFile.txt");

// Other side
String signatureBase64 = readTextFile("anyFile.txt");
byte[] originalSignature = fromBase64ToBytes(signatureBase64);
doVerification(originalSignature);
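The answer above is written in Java. Since the rest of this page deals with C#, here is a sketch of the same Base64 round trip using Convert and File. The class name SignatureTextFile, its Save and Load methods, and the sample bytes are all made up for the illustration; the bytes are not a real signature.

```csharp
using System;
using System.IO;
using System.Linq;

public static class SignatureTextFile
{
    // Encodes the raw bytes as Base64 and writes them to a text file.
    public static void Save(string path, byte[] signature)
    {
        File.WriteAllText(path, Convert.ToBase64String(signature));
    }

    // Reads the Base64 text back and decodes it to the original bytes.
    public static byte[] Load(string path)
    {
        return Convert.FromBase64String(File.ReadAllText(path));
    }

    public static void Main()
    {
        // Stand-in bytes, including values that are not printable text.
        byte[] signature = { 0x00, 0xFF, 0x10, 0x80, 0x7F };
        string path = Path.GetTempFileName();

        Save(path, signature);
        byte[] restored = Load(path);
        Console.WriteLine(restored.SequenceEqual(signature)); // True

        File.Delete(path);
    }
}
```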

Hashing an array in C#

To compute a hash code using the elements of an array, you can cast the array to IStructuralEquatable and then call the GetHashCode(IEqualityComparer) method, passing a comparer for the type of elements in the array.

(The cast is necessary because the Array class implements the method explicitly.)

For example, if your object has an int array, then you can implement GetHashCode like this:

public override int GetHashCode()
{
    return ((IStructuralEquatable)this.array).GetHashCode(EqualityComparer<int>.Default);
}

In case you're curious, here's how the Array class implements the GetHashCode method (from the Reference Source):

internal static int CombineHashCodes(int h1, int h2) {
    return (((h1 << 5) + h1) ^ h2);
}

int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) {
    if (comparer == null)
        throw new ArgumentNullException("comparer");
    Contract.EndContractBlock();

    int ret = 0;

    for (int i = (this.Length >= 8 ? this.Length - 8 : 0); i < this.Length; i++) {
        ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(i)));
    }

    return ret;
}

As you can see, the current implementation only uses the last eight elements of the array.
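To show the structural hash code in context, here is a sketch of a small class that overrides both Equals and GetHashCode through IStructuralEquatable. The class IntArrayHolder and its members are hypothetical names for this example only.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public sealed class IntArrayHolder
{
    private readonly int[] array;

    public IntArrayHolder(params int[] values)
    {
        array = values;
    }

    // Equal when the other holder's array has the same elements in the same order.
    public override bool Equals(object obj)
    {
        var other = obj as IntArrayHolder;
        return other != null &&
               ((IStructuralEquatable)array).Equals(other.array, EqualityComparer<int>.Default);
    }

    // Hash code derived from the array elements rather than the array reference.
    public override int GetHashCode()
    {
        return ((IStructuralEquatable)array).GetHashCode(EqualityComparer<int>.Default);
    }
}

public static class IntArrayHolderDemo
{
    public static void Main()
    {
        var first = new IntArrayHolder(1, 2, 3);
        var second = new IntArrayHolder(1, 2, 3);

        Console.WriteLine(first.Equals(second));                        // True
        Console.WriteLine(first.GetHashCode() == second.GetHashCode()); // True

        // Because both overrides agree, a HashSet treats them as one entry.
        var set = new HashSet<IntArrayHolder> { first, second };
        Console.WriteLine(set.Count); // 1
    }
}
```

Because only the trailing elements feed the hash, long arrays that differ only near the start can be expected to share a hash code; Equals still tells them apart, but lookups get slower.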

HashSet for byte arrays

Construct a HashSet with an IEqualityComparer<byte[]>. You don't want to use an interface here. While byte[] does in fact implement interfaces such as IEnumerable<byte>, IList<byte>, etc., use of them is a bad idea due to the weightiness involved. You don't use the fact that string implements IEnumerable<char> much at all so don't for byte[] either.

public class bytearraycomparer : IEqualityComparer<byte[]> {
    public bool Equals(byte[] a, byte[] b)
    {
        // Note: null arrays are not handled here.
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }
    public int GetHashCode(byte[] a)
    {
        uint b = 0;
        for (int i = 0; i < a.Length; i++)
            // Rotate left by 23 bits, then XOR in the next byte.
            b = ((b << 23) | (b >> 9)) ^ a[i];
        return unchecked((int)b);
    }
}

void test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };

    HashSet<byte[]> set = new HashSet<byte[]>(new bytearraycomparer());
    set.Add(b1);
    set.Add(b2);
    // Shows "1": both arrays count as the same element.
    Text = set.Count.ToString();
}

https://msdn.microsoft.com/en-us/library/bb359100(v=vs.110).aspx

If you were to use the answers in the proposed duplicate question, you would end up with one function call and one array bounds check per byte processed. You don't want that. If expressed in the simplest way, as above, the JIT compiler will inline the fetches, notice that the bounds checks cannot fail (arrays can't be resized), and omit them. Only one function call for the entire array. Yay.

Lists tend to have only a few elements compared to a byte array, so a dirt-simple hash function such as foreach (var item in list) hashcode = hashcode * 5 + item.GetHashCode(); is often good enough for them. If you use that kind of hash function for byte arrays, you will have problems. The multiply-by-a-small-odd-number trick ends up being rather biased too quickly for comfort here. My particular hash function given here is probably not optimal, but we have run tests on this family and it performs quite well with three million entries. The multiply-by-odd approach got into trouble too quickly because it produced numerous collisions between inputs that were only two bytes long or differed in only two bytes. If you avoid the degenerate numbers, this family will have no collisions in two bytes, and most of its members have no collisions in three bytes.
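The two-byte claim is easy to check by brute force. The sketch below hashes every possible two-byte array with both schemes and counts the distinct results; TwoByteCollisionDemo and its method names are invented for this illustration, and the rotate amount of 23 is simply the one used in the comparer shown earlier in this answer.

```csharp
using System;
using System.Collections.Generic;

public static class TwoByteCollisionDemo
{
    // The "multiply by a small odd number" scheme, applied to raw byte values.
    public static int MultiplyHash(byte[] data)
    {
        int hash = 0;
        foreach (byte b in data)
            hash = unchecked(hash * 5 + b);
        return hash;
    }

    // Rotate left by 23 bits and XOR in each byte.
    public static int RotateHash(byte[] data)
    {
        uint hash = 0;
        foreach (byte b in data)
            hash = ((hash << 23) | (hash >> 9)) ^ b;
        return unchecked((int)hash);
    }

    // Hashes all 65536 two-byte arrays and returns how many distinct codes result.
    public static int DistinctTwoByteHashes(Func<byte[], int> hash)
    {
        var seen = new HashSet<int>();
        for (int hi = 0; hi < 256; hi++)
            for (int lo = 0; lo < 256; lo++)
                seen.Add(hash(new[] { (byte)hi, (byte)lo }));
        return seen.Count;
    }

    public static void Main()
    {
        // For example, {0, 5} and {1, 0} both give 5 under the multiply scheme.
        Console.WriteLine(DistinctTwoByteHashes(MultiplyHash)); // 1531
        Console.WriteLine(DistinctTwoByteHashes(RotateHash));   // 65536
    }
}
```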

Considering actual use cases: by far the two most likely things here are byte strings and actual files being checked for sameness. In either case, taking a hash code of only the first few bytes is most likely a bad idea. String's hash code uses the whole string, so byte strings should do the same, and most files being checked for duplication don't have a unique prefix in the first few bytes. For N entries, if you have hash collisions for the square root of N, you might as well have walked the entire array when generating the hash code, neglecting the fact that compares are slower than hashes.

How to compare two huge byte[] arrays using Object.GetHashCode() in c#?

GetHashCode is not defined in a content-based way for array types: arrays inherit the default Object.GetHashCode(), so you have to implement your own hash algorithm.

The value you see is actually based on the underlying reference, so two arrays with identical contents will generally have different hash codes unless they are the same reference.

For integral types 32-bits or less, the hash code is equal to the value as converted to a 32-bit integer. For the 64 bit integral type, Int64, the upper 32 bits are XORed with the lower 32 bits (there's a shift in there also) for the hash code.

So when it comes to trying to compare two arrays 'quickly', you have to do it yourself.

You can use cheap logic checks first: the lengths are equal, the arrays start and end with the same byte value, and so on. Then you have a choice: either read byte by byte and compare values (or read 4 or 8 bytes at a time and use BitConverter to convert blocks of bytes to Int32 or Int64, giving a single pair of values that might be quicker to check for equality), or use a general-purpose hash function to get a good guess of equality.
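The block-wise comparison described above can be sketched as follows. ByteArrayCompare and AreEqual are hypothetical names, and whether this actually beats a plain byte loop on a given runtime is something to measure rather than assume.

```csharp
using System;

public static class ByteArrayCompare
{
    public static bool AreEqual(byte[] x, byte[] y)
    {
        // Cheap checks first: same reference, nulls, and length.
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;

        int i = 0;
        // Compare eight bytes per step while a full block remains.
        for (; i + 8 <= x.Length; i += 8)
            if (BitConverter.ToInt64(x, i) != BitConverter.ToInt64(y, i))
                return false;

        // Compare the remaining tail (fewer than eight bytes) one by one.
        for (; i < x.Length; i++)
            if (x[i] != y[i])
                return false;

        return true;
    }

    public static void Main()
    {
        var a = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        var b = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        var c = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 11 };

        Console.WriteLine(AreEqual(a, b)); // True
        Console.WriteLine(AreEqual(a, c)); // False
    }
}
```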

For this purpose you can use an MD5 hash, which is very quick to compute; see "How do I generate a hashcode from a byte array in C#?".

Getting two identical hash values from such a hash function does not guarantee equality, but in general if you are comparing arrays of bytes within the same data 'space' you shouldn't get a collision. By that I mean that, in general, examples of different data of the same type should nearly always produce different hashes. There's a lot more around the net on this than I am qualified to explain.


