Java Lambda Stream Distinct() on Arbitrary Key

Java Lambda Stream Distinct() on arbitrary key?

The distinct operation is a stateful pipeline operation; in this case it's a stateful filter. It's a bit inconvenient to create these yourself, as there's nothing built-in, but a small helper class should do the trick:

/**
* Stateful filter. T is type of stream element, K is type of extracted key.
*/
static class DistinctByKey<T,K> {
Map<K,Boolean> seen = new ConcurrentHashMap<>();
Function<T,K> keyExtractor;
public DistinctByKey(Function<T,K> ke) {
this.keyExtractor = ke;
}
public boolean filter(T t) {
return seen.putIfAbsent(keyExtractor.apply(t), Boolean.TRUE) == null;
}
}

I don't know your domain classes, but I think that, with this helper class, you could do what you want like this:

BigDecimal totalShare = orders.stream()
.filter(new DistinctByKey<Order,CompanyId>(o -> o.getCompany().getId())::filter)
.map(Order::getShare)
.reduce(BigDecimal.ZERO, BigDecimal::add);

Unfortunately the type inference couldn't get far enough inside the expression, so I had to specify explicitly the type arguments for the DistinctByKey class.

This involves more setup than the collectors approach described by Louis Wasserman, but this has the advantage that distinct items pass through immediately instead of being buffered up until the collection completes. Space should be the same, as (unavoidably) both approaches end up accumulating all distinct keys extracted from the stream elements.

UPDATE

It's possible to get rid of the K type parameter since it's not actually used for anything other than being stored in a map. So Object is sufficient.

/**
* Stateful filter. T is type of stream element.
*/
static class DistinctByKey<T> {
Map<Object,Boolean> seen = new ConcurrentHashMap<>();
Function<T,Object> keyExtractor;
public DistinctByKey(Function<T,Object> ke) {
this.keyExtractor = ke;
}
public boolean filter(T t) {
return seen.putIfAbsent(keyExtractor.apply(t), Boolean.TRUE) == null;
}
}

BigDecimal totalShare = orders.stream()
.filter(new DistinctByKey<Order>(o -> o.getCompany().getId())::filter)
.map(Order::getShare)
.reduce(BigDecimal.ZERO, BigDecimal::add);

This simplifies things a bit, but I still had to specify the type argument to the constructor. Trying to use diamond or a static factory method doesn't seem to improve things. I think the difficulty is that the compiler can't infer generic type parameters -- for a constructor or a static method call -- when either is in the instance expression of a method reference. Oh well.

(Another variation on this that would probably simplify it is to make DistinctByKey<T> implements Predicate<T> and rename the method to eval. This would remove the need to use a method reference and would probably improve type inference. However, it's unlikely to be as nice as the solution below.)

UPDATE 2

Can't stop thinking about this. Instead of a helper class, use a higher-order function. We can use captured locals to maintain state, so we don't even need a separate class! Bonus, things are simplified so type inference works!

public static <T> Predicate<T> distinctByKey(Function<? super T,Object> keyExtractor) {
Map<Object,Boolean> seen = new ConcurrentHashMap<>();
return t -> seen.putIfAbsent(keyExtractor.apply(t), Boolean.TRUE) == null;
}

BigDecimal totalShare = orders.stream()
.filter(distinctByKey(o -> o.getCompany().getId()))
.map(Order::getShare)
.reduce(BigDecimal.ZERO, BigDecimal::add);

Java 8 Distinct by property

Consider distinct to be a stateful filter. Here is a function that returns a predicate that maintains state about what it's seen previously, and that returns whether the given element was seen for the first time:

public static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
Set<Object> seen = ConcurrentHashMap.newKeySet();
return t -> seen.add(keyExtractor.apply(t));
}

Then you can write:

persons.stream().filter(distinctByKey(Person::getName))

Note that if the stream is ordered and is run in parallel, this will preserve an arbitrary element from among the duplicates, instead of the first one, as distinct() does.

(This is essentially the same as my answer to this question: Java Lambda Stream Distinct() on arbitrary key?)

Java Streams: How to do an efficient distinct and sort?

When you chain a distinct() operation after sorted(), the implementation will utilize the sorted nature of the data and avoid building an internal HashSet, which can be demonstrated by the following program

public class DistinctAndSort {
static int COMPARE, EQUALS, HASHCODE;
static class Tracker implements Comparable<Tracker> {
static int SERIAL;
int id;
Tracker() {
id=SERIAL++/2;
}
public int compareTo(Tracker o) {
COMPARE++;
return Integer.compare(id, o.id);
}
public int hashCode() {
HASHCODE++;
return id;
}
public boolean equals(Object obj) {
EQUALS++;
return super.equals(obj);
}
}
public static void main(String[] args) {
System.out.println("adjacent sorted() and distinct()");
Stream.generate(Tracker::new).limit(100)
.sorted().distinct()
.forEachOrdered(o -> {});
System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
COMPARE, EQUALS, HASHCODE);
COMPARE=EQUALS=HASHCODE=0;
System.out.println("now with intermediate operation");
Stream.generate(Tracker::new).limit(100)
.sorted().map(x -> x).distinct()
.forEachOrdered(o -> {});
System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
COMPARE, EQUALS, HASHCODE);
}
}

which will print

adjacent sorted() and distinct()
compareTo: 99, EQUALS: 99, HASHCODE: 0
now with intermediate operation
compareTo: 99, EQUALS: 100, HASHCODE: 200

The intermediate operation, as simple as map(x -> x), can’t be recognized by the Stream implementation, hence, it must assume that the elements might not be sorted in respect to the mapping function’s result.

There is no guaranty that this kind of optimization happens, however, it is reasonable to assume that the developers of the Stream implementation will not remove that optimization and even try to add more optimizations, so rolling your own implementation will prevent your code from benefiting from future optimizations.

Further, what you have created is a “stateful predicate”, which is strongly discouraged, and, of course, will break when being used with a parallel stream.

If you don’t trust the Stream API to perform this operation efficiently enough, you might be better off implementing this particular operation without the Stream API.

Equivalent to stream distinct using a custom comparator

HashingStrategy is the concept you're looking for. It's a strategy interface that allows you to define custom implementations of equals and hashcode.

public interface HashingStrategy<E>
{
int computeHashCode(E object);
boolean equals(E object1, E object2);
}

Streams don't support hashing strategies but Eclipse Collections does. It has sets and maps that support hashing strategies as well as overloads of methods like distinct() that take hashing strategies.

This would work well for Strings. For example, here's how we could get all distinct Strings ignoring case.

MutableList<String> strings = Lists.mutable.with("Hello", "world", "HELLO", "World");
assertThat(
strings.distinct(HashingStrategies.fromFunction(String::toLowerCase)),
is(equalTo(Lists.immutable.with("Hello", "world"))));

Or you can write the hashing strategy by hand to avoid garbage creation.

HashingStrategy<String> caseInsensitive = new HashingStrategy<String>()
{
@Override
public int computeHashCode(String string)
{
int hashCode = 0;
for (int i = 0; i < string.length(); i++)
{
hashCode = 31 * hashCode + Character.toLowerCase(string.charAt(i));
}
return hashCode;
}

@Override
public boolean equals(String string1, String string2)
{
return string1.equalsIgnoreCase(string2);
}
};

assertThat(
strings.distinct(caseInsensitive),
is(equalTo(Lists.immutable.with("Hello", "world"))));

This could work for Points too, but only if you can group all points within non-overlapping regions to have the same hashcode. If you're using a Comparator defined to return 0 when two Points are close enough, then you can run into transitivity problems. For example, Points A, B, and C can fall along a line with A and C both close to B but far from each other. Still, if this is a useful concept to you, we'd welcome a pull request adding ListIterable.distinct(Comparator) to the API.

Note: I am a committer for Eclipse Collections.

Java 8 Distinct by property

Consider distinct to be a stateful filter. Here is a function that returns a predicate that maintains state about what it's seen previously, and that returns whether the given element was seen for the first time:

public static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
Set<Object> seen = ConcurrentHashMap.newKeySet();
return t -> seen.add(keyExtractor.apply(t));
}

Then you can write:

persons.stream().filter(distinctByKey(Person::getName))

Note that if the stream is ordered and is run in parallel, this will preserve an arbitrary element from among the duplicates, instead of the first one, as distinct() does.

(This is essentially the same as my answer to this question: Java Lambda Stream Distinct() on arbitrary key?)

Convert a ListObject to a MapString, ListObject and distinct the duplicates according to some properties of List's element?

The following groupingBy should work.
As part of your additional requirements, we know that the equals method of RawItem objects cannot be used, which is why we are using a filter instead of the Stream distinct method (*):

    Set<Object> seen = ConcurrentHashMap.newKeySet();

Map<String, List<Item>> m =
l.stream()
.filter(item -> seen.add(item.category + ":" + item.name + ":" + item.code))
.collect(Collectors.groupingBy(item -> item.category,
Collectors.mapping(item -> new Item(item.code, item.name),
Collectors.toList())));

Note that in order to allow the method distinct to work and remove duplicates, you need to have a proper implementation of the hashCode and equals methods in the RawItem class.

As an example, the following test:

List<RawItem> list = Arrays.asList(new RawItem("a", 1, "dfg"),
new RawItem("a", 1, "dfg"),
new RawItem("a", 1, "fdgdfdfgdg"),
new RawItem("b", 1, "dfg"));

Map<String, List<Item>> map = // the above

System.out.println(map);

gives

{
a=[Item{code=1, name='dfg'}, Item{code=1, name='fdgdfdfgdg'}],
b=[Item{code=1, name='dfg'}]
}

(*) Edit:
As shared by the OP, the following posts help to design a solution based on a custom equivalence relationship with the stream API, when distinct() cannot be used:

  • Java 8 Distinct by property
  • Java Lambda Stream Distinct() on arbitrary key?
  • Java 8 how to get distinct list on more than one property

Java8 Streams - Remove Duplicates With Stream Distinct

Whenever you override equals, you also need to override the hashCode() method, which will be used in the implementation of distinct().

In this case, you could just use

@Override public int hashCode() {
return test.charAt(0);
}

...which would work just fine.

Java Streams: How to do an efficient distinct and sort?

When you chain a distinct() operation after sorted(), the implementation will utilize the sorted nature of the data and avoid building an internal HashSet, which can be demonstrated by the following program

public class DistinctAndSort {
static int COMPARE, EQUALS, HASHCODE;
static class Tracker implements Comparable<Tracker> {
static int SERIAL;
int id;
Tracker() {
id=SERIAL++/2;
}
public int compareTo(Tracker o) {
COMPARE++;
return Integer.compare(id, o.id);
}
public int hashCode() {
HASHCODE++;
return id;
}
public boolean equals(Object obj) {
EQUALS++;
return super.equals(obj);
}
}
public static void main(String[] args) {
System.out.println("adjacent sorted() and distinct()");
Stream.generate(Tracker::new).limit(100)
.sorted().distinct()
.forEachOrdered(o -> {});
System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
COMPARE, EQUALS, HASHCODE);
COMPARE=EQUALS=HASHCODE=0;
System.out.println("now with intermediate operation");
Stream.generate(Tracker::new).limit(100)
.sorted().map(x -> x).distinct()
.forEachOrdered(o -> {});
System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
COMPARE, EQUALS, HASHCODE);
}
}

which will print

adjacent sorted() and distinct()
compareTo: 99, EQUALS: 99, HASHCODE: 0
now with intermediate operation
compareTo: 99, EQUALS: 100, HASHCODE: 200

The intermediate operation, as simple as map(x -> x), can’t be recognized by the Stream implementation, hence, it must assume that the elements might not be sorted in respect to the mapping function’s result.

There is no guaranty that this kind of optimization happens, however, it is reasonable to assume that the developers of the Stream implementation will not remove that optimization and even try to add more optimizations, so rolling your own implementation will prevent your code from benefiting from future optimizations.

Further, what you have created is a “stateful predicate”, which is strongly discouraged, and, of course, will break when being used with a parallel stream.

If you don’t trust the Stream API to perform this operation efficiently enough, you might be better off implementing this particular operation without the Stream API.



Related Topics



Leave a reply



Submit