GetHashCode() on byte[] Array

GetHashCode() on byte[] array

Like other non-primitive built-in types, it just returns something arbitrary - a value tied to the object's identity (its reference), not its contents. It definitely doesn't try to hash the contents of the array. See this answer.
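
A quick sketch of what that means in practice:

var a = new byte[] { 1, 2, 3 };
var b = new byte[] { 1, 2, 3 };
Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // almost certainly False - identity-based, not content-based
Console.WriteLine(a.GetHashCode() == a.GetHashCode()); // True - stable for the same instance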

How to compare two huge byte[] arrays using Object.GetHashCode() in c#?

GetHashCode is not overridden for array types (you get the reference-based Object.GetHashCode) - you have to implement your own hash algorithm.

The value you see is actually based on the underlying reference and so two identical arrays will always have different hash codes, unless they are the same reference.

For integral types 32-bits or less, the hash code is equal to the value as converted to a 32-bit integer. For the 64 bit integral type, Int64, the upper 32 bits are XORed with the lower 32 bits (there's a shift in there also) for the hash code.
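
A small sketch of that behavior (the printed values assume the implementation just described):

Console.WriteLine(42.GetHashCode());   // 42 - the value itself
Console.WriteLine((-7).GetHashCode()); // -7
long big = 0x0000000100000002L;
Console.WriteLine(big.GetHashCode());                      // 3
Console.WriteLine(unchecked((int)big) ^ (int)(big >> 32)); // 3 - lower 32 bits XORed with the upper 32 bits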

So when it comes to trying to compare two arrays 'quickly', you have to do it yourself.

You can use cheap logic checks first - the lengths are equal, the arrays start and end with the same byte values, and so on. Then you have a choice: either compare byte by byte (or read 4 or 8 bytes at a time and use BitConverter to turn blocks of bytes into Int32 or Int64 values, which can be quicker to check for equality), or use a general-purpose hash function to get a good guess of equality.
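
For example, a minimal sketch of that idea (the name FastEquals is just illustrative):

static bool FastEquals(byte[] x, byte[] y)
{
    if (ReferenceEquals(x, y)) return true;
    if (x == null || y == null || x.Length != y.Length) return false;
    // Cheap logic checks: first and last bytes.
    if (x.Length > 0 && (x[0] != y[0] || x[x.Length - 1] != y[y.Length - 1])) return false;

    int i = 0;
    // Compare 8 bytes at a time via BitConverter...
    for (; i + 8 <= x.Length; i += 8)
        if (BitConverter.ToInt64(x, i) != BitConverter.ToInt64(y, i)) return false;
    // ...then finish the tail byte by byte.
    for (; i < x.Length; i++)
        if (x[i] != y[i]) return false;
    return true;
}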

For this purpose you can use an MD5 hash - it's very quick to output a hash with MD5: How do I generate a hashcode from a byte array in C#?.
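
For example (a sketch; ProbablyEqual is an illustrative name, and a matching digest should still be treated as "probably equal" rather than proof):

using System.Linq;
using System.Security.Cryptography;

static bool ProbablyEqual(byte[] a, byte[] b)
{
    using (var md5 = MD5.Create())
    {
        // Identical digests => almost certainly identical content.
        return md5.ComputeHash(a).SequenceEqual(md5.ComputeHash(b));
    }
}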

Getting two identical hash values from such a hash function does not guarantee equality, but in general if you are comparing arrays of bytes within the same data 'space' you shouldn't get a collision. By that I mean that, in general, examples of different data of the same type should nearly always produce different hashes. There's a lot more around the net on this than I am qualified to explain.

Suitable hash code methods for an array of bytes?

Any of the built-in hashing functions should do; depending on how much you care about collisions these are your options (from most collisions to least):

  • MD5
  • SHA1
  • SHA256
  • SHA384
  • SHA512

They are as simple to use as:

var hash = SHA1.Create().ComputeHash(data);

Bonus Marks: If you don't care about security (which I don't think you do given that you are getting the hashes for images) you might want to look into Murmur hash, which is designed for content hashing and not secure hashing (and is thus much faster). It isn't, however, in the framework so you will have to find an implementation (and you should probably go for Murmur3).

Edit: If you are looking for a HASHCODE for a byte[] array, it's entirely up to you; it usually consists of bit shifting/rotating (or multiplying by primes) and XORing. E.g.

public class ByteArrayEqualityComparer : IEqualityComparer<byte[]>
{
    public static readonly ByteArrayEqualityComparer Default = new ByteArrayEqualityComparer();
    private ByteArrayEqualityComparer() { }

    public bool Equals(byte[] x, byte[] y)
    {
        if (x == null && y == null)
            return true;
        if (x == null || y == null)
            return false;
        if (x.Length != y.Length)
            return false;
        for (var i = 0; i < x.Length; i++)
            if (x[i] != y[i])
                return false;
        return true;
    }

    public int GetHashCode(byte[] obj)
    {
        if (obj == null || obj.Length == 0)
            return 0;
        var hashCode = 0;
        for (var i = 0; i < obj.Length; i++)
            // Rotate left by 3 bits (unsigned shift for the wrap-around) and XOR in the new byte.
            hashCode = ((hashCode << 3) | (int)((uint)hashCode >> 29)) ^ obj[i];
        return hashCode;
    }
}
// ...
var hc = ByteArrayEqualityComparer.Default.GetHashCode(data);
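
The comparer also lets collections key on array contents rather than references, e.g. (a hypothetical usage sketch):

var cache = new Dictionary<byte[], string>(ByteArrayEqualityComparer.Default);
cache[new byte[] { 1, 2, 3 }] = "first";
Console.WriteLine(cache[new byte[] { 1, 2, 3 }]); // "first" - looked up by content, not by reference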

EDIT: If you want to validate that the value hasn't changed you should use CRC32.
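
CRC32 isn't built into the classic .NET Framework class library (newer .NET offers one in the System.IO.Hashing NuGet package), so here is a minimal, unoptimized bit-by-bit sketch of the standard IEEE CRC-32 for change detection:

static uint Crc32(byte[] data)
{
    uint crc = 0xFFFFFFFF;
    foreach (byte b in data)
    {
        crc ^= b;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc; // e.g. Crc32(Encoding.ASCII.GetBytes("123456789")) == 0xCBF43926
}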

Hashing an array in c#

To compute a hash code using the elements of an array, you can cast the array to IStructuralEquatable and then call the GetHashCode(IEqualityComparer) method, passing a comparer for the type of elements in the array.

(The cast is necessary because the Array class implements the method explicitly.)

For example, if your object has an int array, then you can implement GetHashCode like this:

public override int GetHashCode()
{
    return ((IStructuralEquatable)this.array).GetHashCode(EqualityComparer<int>.Default);
}
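
The same approach works for a byte[] field (the field name bytes below is just illustrative):

public override int GetHashCode()
{
    return ((IStructuralEquatable)this.bytes).GetHashCode(EqualityComparer<byte>.Default);
}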

In case you're curious, here's how the Array class implements the GetHashCode method (from the Reference Source):

internal static int CombineHashCodes(int h1, int h2) {
    return (((h1 << 5) + h1) ^ h2);
}

int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) {
    if (comparer == null)
        throw new ArgumentNullException("comparer");
    Contract.EndContractBlock();

    int ret = 0;

    for (int i = (this.Length >= 8 ? this.Length - 8 : 0); i < this.Length; i++) {
        ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(i)));
    }

    return ret;
}

As you can see, the current implementation only uses the last eight elements of the array.
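
A quick sketch of the consequence (assuming the usual System.Collections and System.Collections.Generic usings): two arrays that differ only before their last eight elements produce the same structural hash.

var a = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
var b = new int[] { 99, 98, 3, 4, 5, 6, 7, 8, 9, 10 };
int ha = ((IStructuralEquatable)a).GetHashCode(EqualityComparer<int>.Default);
int hb = ((IStructuralEquatable)b).GetHashCode(EqualityComparer<int>.Default);
Console.WriteLine(ha == hb); // True - only the last 8 elements feed the hash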

deepHashCode with byte array

First off, no need for "Deep". It's a primitive. You don't need Deep.

Just use Arrays.hashCode(byte[] yourArray).

Deep implies delving into the Objects contained within the array. Given that you are dealing with a primitive, you just need to use the primitive value itself in the calculation. That's why none of the Deep methods revolve around primitives.

Array[Byte] hashCode() returns different values every time in Scala

It is because the hashCode of an Array doesn't depend on its values (because it is mutable); rather it is unique per instance, based on its memory address. So executing the code twice creates two different arrays at different memory addresses, which therefore have different hash codes.

The solution is to use an immutable structure, like a List.

"ABC".getBytes().toList.hashCode()
// res: Int = 1984571950

"ABC".getBytes().toList.hashCode()
// res: Int = 1984571950

HashSet for byte arrays

Construct the HashSet with an IEqualityComparer<byte[]>. You don't want to use an interface type as the element type here: while byte[] does in fact implement interfaces such as IEnumerable<byte>, IList<byte>, etc., using them is a bad idea due to the overhead involved. You don't use the fact that string implements IEnumerable<char> much at all, so don't for byte[] either.

public class bytearraycomparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] a, byte[] b)
    {
        // Content comparison: same length, same bytes.
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    public int GetHashCode(byte[] a)
    {
        // Rotate left by 23 bits and XOR in each byte; uint keeps the shifts unsigned.
        uint b = 0;
        for (int i = 0; i < a.Length; i++)
            b = ((b << 23) | (b >> 9)) ^ a[i];
        return unchecked((int)b);
    }
}

void test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };

    HashSet<byte[]> set = new HashSet<byte[]>(new bytearraycomparer());
    set.Add(b1);
    set.Add(b2);
    Text = set.Count.ToString(); // shows "1" - the two arrays compare equal by content
}

https://msdn.microsoft.com/en-us/library/bb359100(v=vs.110).aspx

If you were to use the answers in the proposed duplicate question, you would end up with one function call and one array bounds check per byte processed. You don't want that. Expressed in the simplest way, as above, the jitter will inline the fetches, notice that the bounds checks cannot fail (arrays can't be resized), and omit them. Only one function call for the entire array. Yay.

Lists tend to have only a few elements compared to a byte array, so a dirt-simple hash function such as foreach (var item in list) hashcode = hashcode * 5 + item.GetHashCode(); is often fine there. If you use that kind of hash function for byte arrays you will have problems: the multiply-by-a-small-odd-number trick becomes rather biased too quickly for comfort here. The particular hash function given above is probably not optimal, but we have run tests on this family and it performs quite well with three million entries. The multiply-by-odd approach got into trouble too quickly, producing numerous collisions between inputs that were only two bytes long/different. If you avoid the degenerate numbers, this family will have no collisions in two bytes, and most of its members have no collisions in three bytes.

Considering actual use cases: by far the two most likely things here are byte strings and actual files being checked for sameness. In either case, taking a hash code of just the first few bytes is most likely a bad idea. String's hash code uses the whole string, so byte strings should do the same, and most files being checked for duplication don't have a unique prefix in the first few bytes. For N entries, if you get hash collisions for around the square root of N of them, you might as well have walked the entire array when generating the hash code, neglecting the fact that compares are slower than hashes.

Why extra var2 byte array is used in hashCode method of StringLatin1 utility class?

The code you have posted is not the real code; it's decompiled code, which may vary from decompiler to decompiler, and therefore you cannot rely on it.


