How Does Hashset Compare Elements for Equality

How does HashSet compare elements for equality?

It uses an IEqualityComparer<T> (EqualityComparer<T>.Default unless you specify a different one on construction).

When you add an element to the set, it will find the hash code using IEqualityComparer<T>.GetHashCode, and store both the hash code and the element (after checking whether the element is already in the set, of course).

To look an element up, it will first use the IEqualityComparer<T>.GetHashCode to find the hash code, then for all elements with the same hash code, it will use IEqualityComparer<T>.Equals to compare for actual equality.

That means you have two options:

  • Pass a custom IEqualityComparer<T> into the constructor. This is the best option if you can't modify the T itself, or if you want a non-default equality relation (e.g. "all users with a negative user ID are considered equal"). This is almost never implemented on the type itself (i.e. Foo doesn't implement IEqualityComparer<Foo>) but in a separate type which is only used for comparisons.
  • Implement equality in the type itself, by overriding GetHashCode and Equals(object). Ideally, implement IEquatable<T> in the type as well, particularly if it's a value type. These methods will be called by the default equality comparer.

Note how none of this is in terms of an ordered comparison - which makes sense, as there are certainly situations where you can easily specify equality but not a total ordering. This is all the same as Dictionary<TKey, TValue>, basically.

If you want a set which uses ordering instead of just equality comparisons, you should use SortedSet<T> from .NET 4 - which allows you to specify an IComparer<T> instead of an IEqualityComparer<T>. This will use IComparer<T>.Compare - which will delegate to IComparable<T>.CompareTo or IComparable.CompareTo if you're using Comparer<T>.Default.

Checking equality with a HashSet of objects

You need to create a comparer for the sets:

var setComparer = HashSet<Location>.CreateSetComparer();
return other.Variable.Equals(this.Variable) && setComparer.Equals(this.LocationList, other.LocationList);

Comparing objects in HashSet

Overriding the hashCode() method alone in your Star class will not work you will have to override the equals() method.

See the following code where we don't override the equals() method:

class Star {
int x, y;

public Star(int x, int y) {
this.x = x;
this.y = y;
}

@Override
public int hashCode() {
return Objects.hash(x, y);
}
}

public class Main {
public static void main(String[] args) {
Star s1 = new Star(0, 0);
Star s3 = new Star(0, 0);
Star s2 = new Star(31, -31*31);
Set<Star> set = new HashSet<>();
set.add(s1);
set.add(s2);
System.out.println(set.size());
}
}

This will print 3 (instead of 2 what you might expect).

The reason for this is that the add method of java.util.Set compares 2 objects based on equals() method and not based on hashCode() method.

In the above code for the Star class, if you add the equals() method the output will be 2 now. For your reference the you can override the equals() method as follows:

@Override
public boolean equals(Object startObject) {
if (this == startObject) return true;
if (startObject == null || getClass() != startObject.getClass()) return false;
Star star = (Star) startObject;
return x == star.x &&
y == star.y;
}

So why do you need to add hashCode()?

  1. As you are using HashSet the add method behind the scene will call the equals() method and will also call hashCode() to decide the bucket in which the new object should be put. To maintain the contract of hashCode() and equals() Both should be overridden.
  2. When ever you override equals(), it is recommended to override hashCode() also. (Vice-versa is also true). See this link for details.

Contract for hashCode() and equals(): If for two objects say o1 and o2, o1.equals(o2) is true then hash of o1 and o2 should be same.

Make sure that you understand this properly, from the above statement it is not implied that if the hash of 2 objects are same, then o1.equals(o2) should return true. It can be possible that for 2 objects, their hash is same but the o1.equals(o2) returns false

See here, what Object's hashCode() method guarantees.

See this link to get more detailed information about this topic.

How do I compare two HashSets on one property that both share?

You could just hash the names from the second array to use in a Linq filter to create the final HashSet

var excludeName = new HashSet<string>(people2.Select(x => x.Name));
var hash3 = new HasSet<Person>(people1.Where(x => !exludeName.Contains(x.Name));

This can be especially useful if that list of values to exclude is very large as it will make the entire process run in linear time.

Or here's how you can set up the HashSets with IEqualityComparer<T>.

public class PersonByNameComparer : IEqualityComparer<Peron>
{
public bool Equals(Person p1, Persion p2)
{
return p1.Name == p2.Name;
}

public int GetHashCode(Person p)
{
return p.Name.GetHashCode();
}
}

Note: This means that the HashSets cannot contain two items with the same Name even if the Id is different. But it also means it cannot contain to different objects with the same values like your current setup.

And then uses it like this.

var comparer = new PersonByNameComparer();

var hash1 = new HashSet<Person>(people1, comparer);
var hash2 = new HashSet<Person>(people2, comparer);

// Note that ExceptWith will mutate the existing hash.
hash1.ExceptWith(hash2);

// Alternatively you can create the new hash like this
var hash3 = new HashSet<Persion>(hash1.Where(p => !hash2.Contains(p)));

How do you determine if two HashSets are equal (by value, not by reference)?

Look at the method SetEquals.

my_hashset.SetEquals(other);

Why does HashSet allow equal items if hashcodes are different?

I think what you're really trying to ask is:

"Why does a HashSet add objects with inequal hash codes even if they claim to be equal?"

The distinction between my question and the question you posted is that you're assuming this behavior is a bug, and therefore you're getting grief for coming at it from that perspective. I think the other posters have done a thoroughly sufficient job of explaining why this is not a bug, however they have not addressed the underlying question.

I will try to do so here; I would suggest rephrasing your question to remove the accusations of poor documentation / bugs in Java so you can more directly explore why you're running into the behavior you're seeing.


The equals() documentations states (emphasis added):

Note that it is generally necessary to override the hashCode method whenever this method is overridden, so as to maintain the general contract for the hashCode method, which states that equal objects must have equal hash codes.

The contract between equals() and hashCode() isn't just an annoying quirk in the Java specification. It provides some very valuable benefits in terms of algorithm optimization. By being able to assume that a.equals(b) implies a.hashCode() == b.hashCode() we can do some basic equivalence tests without needing to call equals() directly. In particular, the invariant above can be turned around - a.hashCode() != b.hashCode() implies a.equals(b) will be false.

If you look at the code for HashMap (which HashSet uses internally), you'll notice an inner static class Entry, defined like so:

static class Entry<K,V> implements Map.Entry<K,V> {
final K key;
V value;
Entry<K,V> next;
int hash;
...
}

HashMap stores the key's hash code along with the key and value. Because a hash code is expected to not change over the time a key is stored in the map (see Map's documentation, "The behavior of a map is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is a key in the map.") it is safe for HashMap to cache this value. By doing so, it only needs to call hashCode() once for each key in the map, as opposed to every time the key is inspected.

Now lets look at the implementation of put(), where we see these cached hashes being taken advantage of, along with the invariant above:

public V put(K key, V value) {
...
int hash = hash(key);
int i = indexFor(hash, table.length);
for (Entry<K,V> e = table[i]; e != null; e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
// Replace existing element and return
}
}
// Insert new element
}

In particular, notice that the conditional only ever calls key.equals(k) if the hash codes are equal and the key isn't the exact same object, due to short-circuit evaluation. By the contract of these methods, it should be safe for HashMap to skip this call. If your objects are incorrectly implemented, these assumptions being made by HashMap are no longer true, and you will get back unusable results, including "duplicates" in your set.


Note that your claim "HashSet ... has an add(Object o) method, which is not inherited from another class" is not quite correct. While its parent class, AbstractSet, does not implement this method, the parent interface, Set, does specify the method's contract. The Set interface is not concerned with hashes, only equality, therefore it specifies the behavior of this method in terms of equality with (e==null ? e2==null : e.equals(e2)). As long as you follow the contracts, HashSet works as documented, but avoids actually doing wasteful work whenever possible. As soon as you break the rules however, HashSet cannot be expected to behave in any useful way.

Consider also that if you attempted to store objects in a TreeSet with an incorrectly implemented Comparator, you would similarly see nonsensical results. I documented some examples of how a TreeSet behaves when using an untrustworthy Comparator in another question: how to implement a comparator for StringBuffer class in Java for use in TreeSet?

Why does not Java HashSet.equals() check for object equality?

HashSet includes a number of optimizations that can explain all of this: first, if two objects are put into different buckets by their hash codes, or if they have different hash codes at all, they may skip the equals call. This is allowed by the contract of Object.hashCode; if two objects have different hash codes then they are not allowed to .equals to each other.

For the other case, HashSet takes advantage of the contract of .equals that specifies that if two objects are == to each other, then they must be .equals to each other. Part of HashSet's implementation here checks if the elements are ==, and if they are, it skips calling .equals.

If every method implements its contract correctly, this cannot change the semantics; HashSet will always behave exactly as if .equals were being called.

Set of sets not checking for equality

I have been pointed to this question. Although the accepted answer is helpful for checking equality of sets by hand (by means of HashSet<T>.SetEquals(HashSet<T>)), it doesn't quite help with applying this equality logic to a set of sets.

However the non-accepted answer (by Gregory Adam) gives the crucial hint as to how this can be accomplished, namely with HashSet<string>.CreateSetComparer(). Because HashSet has a constructor that accepts an equality comparer, this is the way to go:

HashSet<HashSet<string>> setOfSets = 
new HashSet<HashSet<string>>(HashSet<string>.CreateSetComparer());

This tells the outer HashSet how to "properly" (in the mathematical sense) compare objects the type of the inner HashSet (in my case HashSet<string>).

As Hans Passant and Eric Lippert have kindly pointed out, equality comparison of HashSets is comparatively expensive, even more so if it is applied to a nested HashSet, which might be one of the reasons why this hasn't been chosen as the default behavior.

The main reason however, according to Eric Lippert, for choosing reference equality is the mutability of the HashSet objects. Applied to my example: if I add two set-wise different HashSets (that is !set1.SetEquals(set2)) to my setOfSets and afterwards change set1's contents to become equal to set2, there are still two sets in setOfSets, although there should be only one set then. So this has led to a state which is inconsistent with the original requirement ("there shall not be two equal objects in a set").

The same cannot be done with strings because strings are immutable. If I add two different string objects "Foo" and "Bar" to a HashSet, there is no legal way to change them to become equal.



Related Topics



Leave a reply



Submit