Getting hash of a list of strings regardless of order
There are various approaches here under two main categories, each typically with its own benefits and disadvantages in terms of effectiveness and performance. It is probably best to choose the simplest algorithm that works for your application and only use the more complex variants if the situation requires it.
Note that these examples use EqualityComparer<T>.Default since that will deal with null elements cleanly. You could do better than zero for null if desired. If T is constrained to struct it is also unnecessary. You can hoist the EqualityComparer<T>.Default lookup out of the function if so desired.
Commutative Operations
If you use operations on the hashcodes of the individual entries which are commutative then this will lead to the same end result regardless of order.
There are several obvious options on numbers:
XOR
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = hash ^ EqualityComparer<T>.Default.GetHashCode(element);
    }
    return hash;
}
One downside of that is that the hash for { "x", "x" } is the same as the hash for { "y", "y" }. If that's not a problem for your situation though, it's probably the simplest solution.
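As a language-neutral illustration of both properties, here is a minimal Python sketch. The function name `xor_combine` and the three hash codes are made up for the example; they stand in for whatever the element hash function returns.

```python
from functools import reduce
from operator import xor


def xor_combine(hash_codes):
    """XOR-fold a sequence of integer hash codes; the order does not matter."""
    return reduce(xor, hash_codes, 0)


# Hypothetical 32-bit element hash codes, chosen only for illustration.
x_hash, y_hash, z_hash = 0x1234ABCD, 0x0BADF00D, 0x00C0FFEE

# Reordering the elements leaves the result unchanged.
assert xor_combine([x_hash, y_hash, z_hash]) == xor_combine([z_hash, x_hash, y_hash])

# The weakness: a value XORed with itself is 0, so duplicated pairs cancel out.
assert xor_combine([x_hash, x_hash]) == xor_combine([y_hash, y_hash]) == 0
```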
Addition
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = unchecked(hash +
            EqualityComparer<T>.Default.GetHashCode(element));
    }
    return hash;
}
Overflow is fine here, hence the explicit unchecked context.
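To make the wrap-around concrete, here is a minimal Python sketch. Python integers never overflow, so the helper `to_int32` (a name introduced here, not part of the original answer) emulates 32-bit unchecked arithmetic by hand; the hash codes are made-up values.

```python
def to_int32(value):
    """Wrap a Python int to a signed 32-bit value, emulating unchecked int math."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def sum_combine(hash_codes):
    """Add integer hash codes with 32-bit wrap-around; the order does not matter."""
    total = 0
    for code in hash_codes:
        total = to_int32(total + code)
    return total


# Made-up element hash codes, large enough to overflow 32 bits when added.
codes = [2_000_000_000, 1_500_000_000, 7]

# The sum wraps around instead of failing, and any ordering gives the same result.
assert sum_combine(codes) == sum_combine(list(reversed(codes)))
assert sum_combine([0x7FFFFFFF, 1]) == -0x80000000
```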
There are still some nasty cases (e.g. {1, -1} and {2, -2}), but it's more likely to be okay, particularly with strings. In the case of lists that may contain such integers, you could always implement a custom hashing function (perhaps one that takes the index of recurrence of the specific value as a parameter and returns a unique hash code accordingly).
Here is an example of such an algorithm. It rotates each hash code by the number of times its value has already occurred, so repeated values contribute differently on each recurrence, and it does so in a fairly efficient manner. This can also improve the distribution of the hash codes generated (see the article linked at the end for some explanation). Note that values occurring only once are not rotated, so sets such as {1, -1} and {2, -2} still collide. A mathematical/statistical analysis of exactly how this algorithm produces "better" hash codes would be quite advanced, but testing it across a large range of input values and plotting the results should verify it well enough.
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    // Stores the number of occurrences so far of each value.
    // Note: Dictionary<T, int> does not accept null keys.
    var valueCounts = new Dictionary<T, int>();
    foreach (T element in source)
    {
        int curHash = EqualityComparer<T>.Default.GetHashCode(element);
        // bitOffset is the number of earlier occurrences of this value;
        // TryGetValue sets it to 0 if the value has not been seen yet.
        int bitOffset;
        valueCounts.TryGetValue(element, out bitOffset);
        valueCounts[element] = bitOffset + 1;
        // The current hash code is rotated one bit further left on each
        // successive recurrence of a certain value to widen the distribution.
        // The rotation is done on uint so the right shift does not sign-extend.
        // 37 is an arbitrary low prime number that helps the
        // algorithm to smooth out the distribution.
        unchecked
        {
            uint bits = (uint)curHash;
            uint rotated = (bits << bitOffset) | (bits >> (32 - bitOffset));
            hash += (int)rotated * 37;
        }
    }
    return hash;
}
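The idea can be sketched in Python to see what it does and does not distinguish. This is a simplified illustration, not a port of the code above: it works on unsigned 32-bit values, counts occurrences per hash code rather than per element, and the names `rotl32` and `rotating_sum_hash` are made up for the example.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def rotl32(value, count):
    """Rotate a 32-bit value left by count bits (count is taken modulo 32)."""
    value &= MASK32
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32


def rotating_sum_hash(hash_codes):
    """Sum hash codes, rotating each one by its number of earlier occurrences."""
    seen = Counter()
    total = 0
    for code in hash_codes:
        offset = seen[code]  # how many times this code has appeared before
        seen[code] += 1
        total = (total + rotl32(code, offset) * 37) & MASK32
    return total


# Order still does not matter, even with repeated values.
assert rotating_sum_hash([3, 3, 5]) == rotating_sum_hash([5, 3, 3]) == 518

# Repeated values no longer cancel the way they do with plain XOR.
assert rotating_sum_hash([3, 3]) != rotating_sum_hash([5, 5])

# Values that occur only once are not rotated, so these still collide.
assert rotating_sum_hash([1, -1]) == rotating_sum_hash([2, -2]) == 0
```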
Multiplication
Multiplication has few if any benefits over addition: for small numbers and a mix of positive and negative numbers it may lead to a better distribution of hash bits. On the negative side, "1" becomes a useless entry contributing nothing, and any zero element results in a zero.
You can special-case zero so that it does not cause this major flaw.
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 17;
    foreach (T element in source)
    {
        int h = EqualityComparer<T>.Default.GetHashCode(element);
        if (h != 0)
            hash = unchecked(hash * h);
    }
    return hash;
}
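The trade-offs described above can be checked with a small Python sketch. The name `product_combine` is hypothetical, the result is kept as an unsigned 32-bit value for simplicity, and the inputs are made-up hash codes.

```python
def product_combine(hash_codes, seed=17):
    """Multiply non-zero hash codes together with 32-bit wrap-around."""
    result = seed
    for code in hash_codes:
        if code != 0:  # skip zero so one element cannot wipe out the hash
            result = (result * code) & 0xFFFFFFFF
    return result


# Order does not matter.
assert product_combine([2, 3, 5]) == product_combine([5, 2, 3]) == 510

# Skipped zeros and hash codes equal to 1 contribute nothing,
# so these three different collections all hash the same.
assert product_combine([2, 3]) == product_combine([2, 3, 0]) == product_combine([2, 3, 1])
```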
Order first
The other core approach is to enforce some ordering first, then use any hash combination function you like. The ordering itself is immaterial so long as it is consistent.
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
    {
        // f is any function/code you like returning int
        hash = f(hash, element);
    }
    return hash;
}
This has some significant benefits in that the combining operations possible in f can have significantly better hashing properties (distribution of bits for example), but this comes at significantly higher cost. The sort is O(n log n) and the required copy of the collection is a memory allocation you can't avoid given the desire to avoid modifying the original. GetHashCode implementations should normally avoid allocations entirely. One possible implementation of f would be similar to that given in the last example under the Addition section (e.g. any constant number of bit shifts left followed by a multiplication by a prime; you could even use successive primes on each iteration at no extra cost, since they only need be generated once).
That said, if you were dealing with cases where you could calculate and cache the hash and amortize the cost over many calls to GetHashCode, this approach may yield superior behaviour. Also the latter approach is even more flexible since it can avoid the need to use the GetHashCode on the elements if it knows their type and instead use per-byte operations on them to yield even better hash distribution. Such an approach would likely be of use only in cases where the performance was identified as being a significant bottleneck.
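A minimal Python sketch of the order-first approach follows. The combining step (multiply by 31, then add) is just one common non-commutative choice picked for illustration, and `ordered_combine` is a hypothetical name. Python's built-in `hash` is used for the elements; for strings its value differs between processes unless `PYTHONHASHSEED` is fixed, so results are only comparable within one run.

```python
def ordered_combine(items):
    """Sort a copy of the items, then fold them with an order-sensitive combiner."""
    result = 17
    for item in sorted(items):  # sorted() allocates a new list: the cost noted above
        result = (result * 31 + hash(item)) & 0xFFFFFFFF
    return result


# The sort removes the dependence on input order...
assert ordered_combine([2, 1]) == ordered_combine([1, 2]) == 16370
assert ordered_combine(["b", "a", "c"]) == ordered_combine(["c", "b", "a"])

# ...while duplicated pairs no longer cancel as they do with XOR.
assert ordered_combine([1, 1]) != ordered_combine([2, 2])
```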
Finally, if you want a reasonably comprehensive and fairly non-mathematical overview of the subject of hash codes and their effectiveness in general, these blog posts would be worthwhile reads, in particular the Implementing a simple hashing algorithm (pt II) post.
Hash function for collection of items that disregards ordering
You could use a frozenset instead of a tuple:
>>> hash(frozenset([1, 2, 'a', 'b']))
1190978740469805404
>>>
>>> hash(frozenset([1, 'a', 2, 'b']))
1190978740469805404
>>>
>>> hash(frozenset(['a', 2, 'b', 1]))
1190978740469805404
However, the removal of duplicates from the iterable presents a subtle problem:
>>> hash(frozenset([1,2,1])) == hash(frozenset([1,2,2]))
True
You can fix this by creating a counter from the iterable using collections.Counter, and calling frozenset on the counter's items, thus preserving the count of each item from the original iterable:
>>> from collections import Counter
>>>
>>> hash(frozenset(Counter([1,2,1]).items()))
-307001354391131208
>>> hash(frozenset(Counter([1,1,2]).items()))
-307001354391131208
>>>
>>> hash(frozenset(Counter([1,2,1]).items())) == hash(frozenset(Counter([1,2,2]).items()))
False
Hashing data regardless of order
For cryptographic hashes, no. The very purpose of a cryptographic hash like MD5 is that even very similar inputs yield totally different outputs. This is to prevent giving hints to people trying to crack the hash.
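As suggested in the comments, you'd have to sort the contents of your string first, so that inputs differing only in order become exactly the same before hashing. This increases the risk of collisions, since every reordering of the same contents now maps to the same hash; if security is your purpose, then just don't do this.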
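If you do go down the sorting route, it might look like the following Python sketch. It assumes the data is a collection of strings whose order should be ignored, uses SHA-256 from `hashlib` as a stand-in for whichever hash you have in mind, and adds a length prefix per item (an addition of this sketch) so that ["a", "b"] and ["ab"] do not feed identical bytes to the hash.

```python
import hashlib


def order_independent_digest(strings):
    """Hash a collection of strings so that their order does not affect the result."""
    digest = hashlib.sha256()
    for text in sorted(strings):
        data = text.encode("utf-8")
        # The length prefix keeps item boundaries unambiguous.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


assert order_independent_digest(["b", "a"]) == order_independent_digest(["a", "b"])
assert order_independent_digest(["a", "b"]) != order_independent_digest(["ab"])
```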
Suggestions regarding a string hashing function which ignores ordering of characters
Since a * b * c is equivalent to a * c * b, you can multiply together the characters instead of adding them.
This should also be a lot faster than having to sort all the characters within each string before hashing them.
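Here is a rough Python sketch of that idea, under two assumptions that are not part of the answer above: the product is kept to 32 bits, and each character is mapped to the odd number 2 * ord(ch) + 1 before multiplying, because even factors would push zeros into the low bits and a NUL character would zero the whole product. The name `char_product_hash` is made up.

```python
def char_product_hash(text):
    """Multiply per-character factors with 32-bit wrap-around; order is ignored."""
    result = 1
    for ch in text:
        # Odd factors only (an assumption of this sketch, see above).
        result = (result * (2 * ord(ch) + 1)) & 0xFFFFFFFF
    return result


# Any permutation of the same characters gives the same value.
assert char_product_hash("abc") == char_product_hash("cab") == 7644585

# "ad" and "bc" have the same character sum (197) but different products.
assert char_product_hash("ad") != char_product_hash("bc")
```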
Order-independent Hash Algorithm
The JDK itself proposes the following solution to this problem. The contract of the java.util.Set interface states:
Returns the hash code value for this set. The hash code of a set is defined to be the sum of the hash codes of the elements in the set, where the hash code of a null element is defined to be zero. This ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode() for any two sets s1 and s2, as required by the general contract of Object.hashCode().
An alternative to using the sum of the entries' hash codes would be to use, for example, the ^ (XOR) operator.
The Scala language uses an ordering-invariant version of the MurmurHash algorithm (cf. the private scala.util.hashing.MurmurHash3 class) to implement the hashCode (or ##) method of its immutable sets and similar collections.
How to identify if two List<string> are equal regardless of the order?
You can use SequenceEqual with additional ordering:
return list1.OrderBy(x => x).SequenceEqual(list2.OrderBy(x => x));
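For comparison, the same sort-then-compare idea in Python is sketched below; `equal_ignoring_order` is a hypothetical helper name. Because both sides are sorted as whole lists, duplicates are taken into account.

```python
def equal_ignoring_order(first, second):
    """Return True if both iterables hold the same items with the same counts."""
    return sorted(first) == sorted(second)


assert equal_ignoring_order(["a", "b", "a"], ["a", "a", "b"])
assert not equal_ignoring_order(["a", "b"], ["a", "b", "b"])
```

If the items are hashable but not sortable, comparing `collections.Counter(first) == collections.Counter(second)` is an alternative with the same semantics.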
How to generate a unique hash for a collection of objects independent of their order
For optimal performance I would try to avoid iterating the whole collection every time GetHashCode is called. The purpose of GetHashCode is to improve performance to a point better than evaluating every element. So I might try maintaining the hash code when elements in the list are changed, like this.
class Program
{
    static void Main(string[] args)
    {
        MyClassList l = new MyClassList() { new MyClass() { Type = "Bob", Id = 1 }, new MyClass() { Type = "Jones", Id = 2 } };
        MyClassList l2 = new MyClassList() { new MyClass() { Type = "Jones", Id = 2 }, new MyClass() { Type = "Bob", Id = 1 } };
        MyClassList l3 = new MyClassList() { new MyClass() { Type = "Jones", Id = 2 } };
        Console.WriteLine("{0} {1} {2}", l.GetHashCode(), l2.GetHashCode(), l3.GetHashCode());
        l3.Add(new MyClass() { Type = "Bob", Id = 1 });
        Console.WriteLine("{0}", l3.GetHashCode());
    }
}
public class MyClass
{
    public string Type { get; set; }
    public int Id { get; set; }
    public override int GetHashCode()
    {
        return (Type.GetHashCode() % 0x8000) | (int)((uint)Id.GetHashCode() & 0xFFFF0000);
    }
}
public class MyClassList : IList<MyClass>
{
    List<MyClass> internalList;
    int hashCode = 0;

    public MyClassList()
    {
        internalList = new List<MyClass>();
    }

    private void IncludeInHash(MyClass item)
    {
        hashCode ^= item.GetHashCode();
    }

    // XOR is its own inverse, so excluding an item is the same operation as including it.
    private void ExcludeFromHash(MyClass item)
    {
        IncludeInHash(item);
    }

    public override int GetHashCode()
    {
        return hashCode;
    }

    public int IndexOf(MyClass item)
    {
        return internalList.IndexOf(item);
    }

    public void Insert(int index, MyClass item)
    {
        internalList.Insert(index, item);
        // Make sure Insert is successful (doesn't throw an exception) before affecting the hash
        IncludeInHash(item);
    }

    public void RemoveAt(int index)
    {
        MyClass reduce = internalList[index];
        internalList.RemoveAt(index);
        // Make sure RemoveAt is successful before affecting the hash
        ExcludeFromHash(reduce);
    }

    public MyClass this[int index]
    {
        get
        {
            return internalList[index];
        }
        set
        {
            MyClass reduce = internalList[index];
            internalList[index] = value;
            // Make sure these happen atomically; don't allow exceptions to prevent these from being accurate.
            ExcludeFromHash(reduce);
            IncludeInHash(value);
        }
    }

    public void Add(MyClass item)
    {
        internalList.Add(item);
        IncludeInHash(item);
    }

    public void Clear()
    {
        internalList.Clear();
        hashCode = 0;
    }

    public bool Contains(MyClass item)
    {
        return internalList.Contains(item);
    }

    public void CopyTo(MyClass[] array, int arrayIndex)
    {
        internalList.CopyTo(array, arrayIndex);
    }

    public int Count
    {
        get { return internalList.Count; }
    }

    public bool IsReadOnly
    {
        get { return false; }
    }

    public bool Remove(MyClass item)
    {
        if (internalList.Remove(item))
        {
            ExcludeFromHash(item);
            return true;
        }
        else
            return false;
    }

    public IEnumerator<MyClass> GetEnumerator()
    {
        return internalList.AsReadOnly().GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
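The core of this technique, keeping a running XOR up to date as items come and go, fits in a few lines of Python. The class below is a hypothetical, stripped-down illustration rather than a port of the list above: it only supports append and remove, and it exposes the value through a regular method instead of `__hash__`, since mutable Python objects are normally left unhashable.

```python
class HashTrackingList:
    """List wrapper that maintains an order-independent XOR hash incrementally."""

    def __init__(self, items=()):
        self._items = []
        self._hash = 0
        for item in items:
            self.append(item)

    def append(self, item):
        self._items.append(item)
        self._hash ^= hash(item)

    def remove(self, item):
        # list.remove raises ValueError if the item is absent,
        # so the hash is only touched after a successful removal.
        self._items.remove(item)
        self._hash ^= hash(item)  # XOR is its own inverse

    def order_independent_hash(self):
        """Return the cached hash in O(1), without iterating the items."""
        return self._hash


first = HashTrackingList(["Bob", "Jones"])
second = HashTrackingList(["Jones", "Bob"])
assert first.order_independent_hash() == second.order_independent_hash()

second.remove("Bob")
assert second.order_independent_hash() == HashTrackingList(["Jones"]).order_independent_hash()
```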