Should I Return a Collection or a Stream

The answer is, as always, "it depends". It depends on how big the returned collection will be. It depends on whether the result changes over time, and how important consistency of the returned result is. And it depends very much on how the user is likely to use the answer.

First, note that you can always get a Collection from a Stream, and vice versa:

// If API returns Collection, convert with stream()
getFoo().stream()...

// If API returns Stream, use collect()
Collection<T> c = getFooStream().collect(toList());

So the question is, which is more useful to your callers.

If your result might be infinite, there's only one choice: Stream.
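As a minimal sketch of the infinite case (the `powersOfTwo` method is a made-up API for illustration, not something from the original question), the result below could never be returned as a Collection, yet the caller can still take exactly as much of it as needed:

```java
import java.util.List;
import java.util.stream.Stream;

class InfiniteResultSketch {
    // Hypothetical API: an unbounded sequence of powers of two.
    // Nothing is computed until a terminal operation pulls elements.
    static Stream<Long> powersOfTwo() {
        return Stream.iterate(1L, n -> n * 2);
    }

    public static void main(String[] args) {
        // The caller, not the API, decides how much of the result to consume.
        List<Long> firstFive = powersOfTwo().limit(5).toList();
        System.out.println(firstFive); // [1, 2, 4, 8, 16]
    }
}
```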

If your result might be very large, you probably prefer Stream, since there may not be any value in materializing it all at once, and doing so could create significant heap pressure.

If all the caller is going to do is iterate through it (search, filter, aggregate), you should prefer Stream, since Stream has these built-in already and there's no need to materialize a collection (especially if the user might not process the whole result). This is a very common case.
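To illustrate the "might not process the whole result" point, here is a small sketch under assumed names (`numbers()` and the `produced` counter are hypothetical). It counts how many elements the source actually generates when the caller stops at the first match:

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class LazyConsumptionSketch {
    // Counts how many elements the source has actually produced.
    static final AtomicInteger produced = new AtomicInteger();

    // Hypothetical API returning a large result as a sequential Stream.
    static Stream<Integer> numbers() {
        return IntStream.rangeClosed(1, 1_000_000)
                .peek(i -> produced.incrementAndGet())
                .boxed();
    }

    public static void main(String[] args) {
        Optional<Integer> first = numbers().filter(n -> n % 7 == 0).findFirst();
        System.out.println(first.orElseThrow()); // 7
        // findFirst short-circuits, so only the first 7 values were generated.
        System.out.println(produced.get());      // 7
    }
}
```

Had the API returned a List, all one million elements would have been created before the caller could look at the first one.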

Even if you know that the user will iterate it multiple times or otherwise keep it around, you still may want to return a Stream instead, for the simple fact that whatever Collection you choose to put it in (e.g., ArrayList) may not be the form they want, and then the caller has to copy it anyway. If you return a Stream, they can do collect(toCollection(factory)) and get it in exactly the form they want.
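A short sketch of that late binding, with a hypothetical `names()` API: two callers take the same Stream-returning method and each collects into the collection type that suits them, using `Collectors.toCollection`.

```java
import java.util.ArrayDeque;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CallerChoosesCollectionSketch {
    // Hypothetical API: names in encounter order, possibly with duplicates.
    static Stream<String> names() {
        return Stream.of("Dana", "Brett", "Chris", "Brett");
    }

    public static void main(String[] args) {
        // One caller wants a sorted, duplicate-free set...
        TreeSet<String> sorted = names().collect(Collectors.toCollection(TreeSet::new));
        // ...another wants a deque that keeps encounter order and duplicates.
        ArrayDeque<String> deque = names().collect(Collectors.toCollection(ArrayDeque::new));

        System.out.println(sorted); // [Brett, Chris, Dana]
        System.out.println(deque);  // [Dana, Brett, Chris, Brett]
    }
}
```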

The above "prefer Stream" cases mostly derive from the fact that Stream is more flexible; you can late-bind to how you use it without incurring the costs and constraints of materializing it to a Collection.

The one case where you must return a Collection is when there are strong consistency requirements, and you have to produce a consistent snapshot of a moving target. Then, you will want to put the elements into a collection that will not change.

So I would say that most of the time, Stream is the right answer — it is more flexible, it doesn't impose usually-unnecessary materialization costs, and can be easily turned into the Collection of your choice if needed. But sometimes, you may have to return a Collection (say, due to strong consistency requirements), or you may want to return Collection because you know how the user will be using it and know this is the most convenient thing for them.

If you already have a suitable Collection "lying around", and it seems likely that your users would rather interact with it as a Collection, then it is a reasonable choice (though not the only one, and more brittle) to just return what you have.

Stream vs Collection as return type

In this context, the notion of "strong consistency requirement" is relative to the system or application within which the code resides. There's no specific notion of "strong consistency" that's independent of the system or application. Here's an example of "consistency" that is determined by what assertions you can make about a result. It should be clear that the semantics of these assertions are entirely application-specific.

Suppose you have some code that implements a room where people can enter and leave. You might want the relevant methods to be synchronized so that all enter and leave actions occur in some order. For example: (using Java 16)

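import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

record Person(String name) { }

public class Room {
    final Set<Person> occupants = Collections.newSetFromMap(new ConcurrentHashMap<>());

    public synchronized void enter(Person p) { occupants.add(p); }
    public synchronized void leave(Person p) { occupants.remove(p); }
    // Not synchronized: the stream is traversed lazily, after this method returns.
    public Stream<Person> occupants() { return occupants.stream(); }
}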

(Note, I'm using ConcurrentHashMap here because it doesn't throw ConcurrentModificationException if it's modified during iteration.)

Next, suppose some threads execute these methods in this order:

room.enter(new Person("Brett"));
room.enter(new Person("Chris"));
room.enter(new Person("Dana"));
room.leave(new Person("Dana"));
room.enter(new Person("Ashley"));

Now, at around the same time, suppose a caller gets a list of persons in the room by doing this:

List<Person> occupants1 = room.occupants().toList();

The result might be:

[Dana, Brett, Chris, Ashley]

How is this possible? The stream is lazily evaluated, and the elements are being pulled into a List at the same time other threads are modifying the source of the stream. In particular, it's possible for the stream to have "seen" Dana, then Dana is removed and Ashley added, and then the stream advances and encounters Ashley.

What does the stream represent, then? To find out, we have to dig into what ConcurrentHashMap says about its streams in the presence of concurrent modification. The set is built from CHM's keySet view, which says "The view's iterators and spliterators are weakly consistent." The definition of weakly consistent is in turn:

Most concurrent Collection implementations (including most Queues) also differ from the usual java.util conventions in that their Iterators and Spliterators provide weakly consistent rather than fast-fail traversal:

  • they may proceed concurrently with other operations
  • they will never throw ConcurrentModificationException
  • they are guaranteed to traverse elements as they existed upon construction exactly once, and may (but are not guaranteed to) reflect any modifications subsequent to construction.
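The difference between weakly consistent and fail-fast traversal can be seen even in a single thread. The following is only a sketch (the helper name is made up): it modifies a set while iterating over it and reports whether the iteration failed. Whether the ConcurrentHashMap-backed iteration also visits the newly added element is deliberately left unchecked, because that is exactly what "may (but are not guaranteed to)" leaves open.

```java
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class WeaklyConsistentSketch {
    // Adds an element in the middle of an iteration.
    // Returns true if the iteration failed with ConcurrentModificationException.
    static boolean failsWhenModifiedDuringIteration(Set<String> set) {
        set.add("Brett");
        set.add("Chris");
        try {
            for (String name : set) {
                set.add("Ashley"); // structural modification on the first pass
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // java.util.HashSet is fail-fast.
        System.out.println(failsWhenModifiedDuringIteration(new HashSet<>())); // true
        // A set view over ConcurrentHashMap is weakly consistent.
        System.out.println(failsWhenModifiedDuringIteration(
                Collections.newSetFromMap(new ConcurrentHashMap<>())));       // false
    }
}
```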

What does this mean for our Room application? I'd say it means that if a person appears in the stream of occupants, that person was in the room at some point. That's a pretty weak statement. Note in particular that it does not allow you to say that Dana and Ashley were in the room at the same time. It might seem that way from the contents of the List, but that would be incorrect, as a simple inspection reveals.

Now suppose we were to change the Room class to return a List instead of a Stream, and the caller were to use that instead:

// in class Room
public synchronized List<Person> occupants() { return List.copyOf(occupants); }

// in the caller
List<Person> occupants2 = room.occupants();

The result might be:

[Dana, Brett, Chris]

You can make much stronger statements about this List than about the previous one. You can say that Chris and Dana were in the room at the same time, and that at this particular point in time, Ashley was not in the room.

The List version of occupants() gives you a snapshot of the occupants of the room at a particular time. This allows you much stronger statements than the stream version, which only tells you that certain persons were in the room at some point.
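The snapshot-versus-stream difference can be shown deterministically in a single thread. This sketch does not reproduce the race described above. It uses a simplified, unsynchronized `SimpleRoom` (a hypothetical stand-in backed by a LinkedHashSet so the output order is predictable) and relies only on the stream being lazy: elements are not traversed until the terminal operation runs.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

class SnapshotVersusStreamSketch {
    record Person(String name) { }

    // Simplified single-threaded room; not the Room class from the text.
    static final class SimpleRoom {
        private final Set<Person> occupants = new LinkedHashSet<>();
        void enter(Person p) { occupants.add(p); }
        void leave(Person p) { occupants.remove(p); }
        List<Person> snapshot() { return List.copyOf(occupants); }
        Stream<Person> stream() { return occupants.stream(); }
    }

    static List<String> names(List<Person> people) {
        return people.stream().map(Person::name).toList();
    }

    static List<List<String>> demo() {
        SimpleRoom room = new SimpleRoom();
        room.enter(new Person("Brett"));
        room.enter(new Person("Chris"));
        room.enter(new Person("Dana"));

        List<Person> snapshot = room.snapshot(); // elements copied right now
        Stream<Person> lazy = room.stream();     // nothing traversed yet

        room.leave(new Person("Dana"));
        room.enter(new Person("Ashley"));

        // The snapshot still shows the earlier state; the stream sees the later one.
        return List.of(names(snapshot), names(lazy.toList()));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [[Brett, Chris, Dana], [Brett, Chris, Ashley]]
    }
}
```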

Why would you ever want an API with weaker semantics? Again, it depends on the application. If you want to send a survey to people who used the room, all you care about is whether they were ever in the room. You don't care about other things, like who else was in the room at the same time.

The API with stronger semantics is potentially more expensive. It needs to make a copy of the collection, which means allocating space and spending time copying. It needs to hold a lock while it does this, to prevent concurrent modification, and this temporarily blocks other updates from proceeding.

To summarize, the notion of "strong" or "weak" consistency is highly dependent on the context. In this case I made up an example with some associated semantics, such as "in the room at the same time" or "was in the room at some point in time." The semantics required by the application determine the strength or weakness of the consistency of the results. This in turn drives what Java mechanisms should be used, such as streams vs. collections and when to apply locks.

Is it a good idea to substitute Collection for Stream in return values?

Let me propose a simple rule:

A Stream that is passed as a method argument or returned as a method's return value must be the tail of an unterminated pipeline.

This is probably so obvious to those of us who have worked on streams that we never bothered to write it down. But it's probably not obvious to people approaching streams for the first time, so it's likely worth a discussion.

The main rule is covered in the Streams API package documentation: a stream can have at most one terminal operation. Once it's been terminated, it's illegal to add any intermediate or terminal operations.

The other rule is that stream pipelines must be linear; they cannot have branches. This isn't terribly clearly documented, but it is mentioned in the Stream class documentation about two-thirds of the way down. This means that it's illegal to add an intermediate or terminal operation to a stream if it isn't the last operation on the pipeline.

Most of the stream methods are either intermediate or terminal operations. If you attempt to use one of these on a stream that's terminated or that's not the last operation, you find out pretty quickly by getting an IllegalStateException. This does happen occasionally, but I think that once people get the idea that a pipeline has to be linear, they learn to avoid this issue, and the problem goes away. I think this is pretty easy for most people to grasp; it shouldn't require a paradigm shift.
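Both rules are easy to try out. The sketch below (helper names are made up) reuses a stream after a terminal operation, then tries to branch a pipeline. In the JDK's stream implementation both attempts are rejected with IllegalStateException:

```java
import java.util.stream.Stream;

class LinearPipelineSketch {
    // Returns true if running the action throws IllegalStateException.
    static boolean throwsIllegalState(Runnable action) {
        try {
            action.run();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    static boolean reuseAfterTerminal() {
        Stream<String> s = Stream.of("a", "b", "c");
        s.forEach(x -> { }); // terminal operation: s is now consumed
        return throwsIllegalState(() -> s.forEach(x -> { }));
    }

    static boolean branching() {
        Stream<String> s = Stream.of("a", "b", "c");
        Stream<String> upper = s.map(String::toUpperCase); // s is no longer the tail
        return throwsIllegalState(() -> s.filter(x -> !x.isEmpty())); // second branch off s
    }

    public static void main(String[] args) {
        System.out.println(reuseAfterTerminal()); // true
        System.out.println(branching());          // true
    }
}
```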

Once you understand this, it's clear that if you're going to hand a Stream instance to another piece of code -- either by passing it as an argument, or returning it to the caller -- it needs to be a stream source or the last intermediate operation in a pipeline. That is, it needs to be the tail of an unterminated pipeline.

To put it in other words: it seems to me that if an API returns a stream, the general mindset should be that all interaction with it must terminate in the immediate context. It should be forbidden to pass the stream around.

I think this is too restrictive. As long as you adhere to the rule I proposed, you should be free to pass the stream around as much as you want. Indeed, there are a bunch of use cases for getting a stream from somewhere, modifying it, and passing it along. Here are a couple examples.

1) Open a text file containing the textual representation of a POJO on each line. Call Files.lines() to get a Stream<String>. Map each line into a POJO instance, and return a Stream<POJO> to the caller. The caller might apply a filter or a sort operation and return the stream to its caller.

2) Given a Stream<POJO>, you might want to have a web interface to allow the user to provide a complex set of search criteria. (For example, consider a shopping site with lots of sorting and filtering options.) Instead of composing a big complex pipeline in code, you might have a method like the following:

Stream<POJO> applyCriteria(Stream<POJO> stream, SearchCriteria criteria)

which would take a stream, apply the search criteria by appending various filters, and possibly sort or distinct operations, and return the resulting stream to the caller.
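One way such a method might look is sketched below. The `Product` and `SearchCriteria` types and their fields are invented for illustration; the point is only that the method receives the tail of an unterminated pipeline, appends operations, and hands back the new tail.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

class ApplyCriteriaSketch {
    // Hypothetical element and criteria types.
    record Product(String name, double price) { }
    record SearchCriteria(Double maxPrice, boolean sortByPrice) { }

    // Appends only the operations the criteria ask for; never terminates the stream.
    static Stream<Product> applyCriteria(Stream<Product> stream, SearchCriteria criteria) {
        if (criteria.maxPrice() != null) {
            double maxPrice = criteria.maxPrice();
            stream = stream.filter(p -> p.price() <= maxPrice);
        }
        if (criteria.sortByPrice()) {
            stream = stream.sorted(Comparator.comparingDouble(Product::price));
        }
        return stream; // still the tail of an unterminated pipeline
    }

    static List<String> demo() {
        List<Product> products = List.of(
                new Product("desk", 120.0), new Product("lamp", 25.0), new Product("pen", 2.5));
        return applyCriteria(products.stream(), new SearchCriteria(50.0, true))
                .map(Product::name)
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [pen, lamp]
    }
}
```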

From these examples, I hope you can see that there is considerable flexibility in passing streams around, as long as what you pass around is always the tail of an unterminated pipeline.

Returning stream rather than list

I'm not saying you shouldn't return a Stream, and even less that you should never return a Stream, but doing it also has many disadvantages:

  • it doesn't tell the user of the API if the collection is ordered (List) or not (Set), or sorted (SortedSet)
  • it doesn't tell the user of the API if the collection can contain duplicates (List) or not (Set)
  • it doesn't allow the user to easily and quickly access the first or last element of the list, or even to know which size it has.
  • if the user of the API needs to do multiple passes over the collection, he's forced to copy every element into a new collection.

I would say that choosing to return a stream rather than a collection also depends on what you already have. If the collection is already materialized (think about a JPA entity having a OneToMany already materialized as a Set), I'd probably return an immutable wrapper over the collection. If, on the other hand, the collection to return is the result of a computation or transformation of another collection, returning a Stream might be a better choice.
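For the already-materialized case, a minimal sketch could look like the following. The `Order` class is hypothetical, and note that `Collections.unmodifiableSet` gives a read-only view rather than a truly immutable copy: callers cannot modify it, but they will see later changes made by the owner.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

class ImmutableWrapperSketch {
    // Hypothetical entity that already holds its items in memory.
    static final class Order {
        private final Set<String> items = new LinkedHashSet<>();

        void addItem(String item) { items.add(item); }

        // No copying: a read-only view over the existing set.
        Set<String> getItems() { return Collections.unmodifiableSet(items); }
    }

    public static void main(String[] args) {
        Order order = new Order();
        order.addItem("book");
        Set<String> view = order.getItems();
        order.addItem("pen");
        System.out.println(view); // [book, pen]: the view tracks the backing set
        try {
            view.add("laptop");
        } catch (UnsupportedOperationException e) {
            System.out.println("callers cannot modify the view");
        }
    }
}
```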

Is it safe for a method to return a Stream<T>?

Not only is it safe, it is recommended by the chief Java architect.

Especially if your data is I/O-based and thus not yet materialized in memory at the time myMethod is called, it would be highly advisable to return a Stream instead of a List. The client may need to only consume a part of it or aggregate it into some data of fixed size. Thus you have the chance to go from O(n) memory requirement to O(1).
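Here is a sketch of what that can look like with a file as the data source. The `Person` record, the comma-separated line format, and the method names are assumptions made for the example. One caveat worth stating: a stream obtained from `Files.lines` holds an open file, so the caller should close it, for instance with try-with-resources.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

class LinesAsStreamSketch {
    record Person(String name, int age) { }

    // Lines are read and parsed lazily; the caller is responsible for closing the stream.
    static Stream<Person> readPeople(Path file) {
        try {
            return Files.lines(file)
                    .map(line -> line.split(","))
                    .map(parts -> new Person(parts[0].trim(), Integer.parseInt(parts[1].trim())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Aggregates to a single number without ever holding all lines in memory.
    static long countAdults(Path file) {
        try (Stream<Person> people = readPeople(file)) {
            return people.filter(p -> p.age() >= 18).count();
        }
    }

    // Writes a small temporary file, runs the query, and cleans up.
    static long demo() {
        try {
            Path file = Files.createTempFile("people", ".csv");
            try {
                Files.write(file, List.of("Brett, 34", "Chris, 12", "Dana, 51"));
                return countAdults(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2
    }
}
```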

Note that if parallelization is also an interesting idea for your use case, you would be advised to use a custom spliterator whose splitting policy is adapted to the sequential nature of I/O data sources. In this case I can recommend a blog post of mine which presents such a spliterator.

Best type of iterator to return for a collection? Spliterator, Stream?

If the caller is only going to iterate through it (map, filter, ...), you should prefer Stream, since Stream already has these functions built in and there's no need to materialize a specific collection. Let the caller materialize the stream if they want.

Is Java 8 Stream a safe return type?

Yes, it is safe to do so. Streams do not/should not modify the underlying data structure.

A few excerpts from java.util.stream.Stream:

A sequence of elements […].

Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements […].

To preserve correct behavior, [behavioral parameters to stream operations …] must be non-interfering (they do not modify the stream source).

And from Package java.util.stream Description:

Streams differ from collections in several ways:

  • No storage. A stream is not a data structure that stores elements; instead, it conveys elements from a source […], through a pipeline of computational operations.
  • Functional in nature. An operation on a stream produces a result, but does not modify its source.

You might also see Non-interference.


[…] it would be impossible to mutate the underlying object given a stream from it.

While it would be possible to write our own implementation of java.util.stream.Stream that modified the underlying data structure, it would be an error to do so. ; )


In response to the comment by @AlexisC.:

Getting a stream from the list […] can modify its content if it contains mutable objects.

This is a fair point. If we have a stream of elements which are mutable, we can do:

myObj.stream().forEach((Foo foo) -> foo.bar = baz);
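The two observations fit together, as this small sketch shows (the `Foo` class here is a made-up stand-in with a single mutable field). Stream operations such as sorted() leave the source list untouched, while a forEach that writes to the elements changes the objects the list still refers to.

```java
import java.util.ArrayList;
import java.util.List;

class SourceVersusElementsSketch {
    // Hypothetical mutable element type.
    static final class Foo {
        String bar;
        Foo(String bar) { this.bar = bar; }
    }

    static List<String> bars(List<Foo> foos) {
        return foos.stream().map(f -> f.bar).toList();
    }

    static List<List<String>> demo() {
        List<Foo> source = new ArrayList<>(List.of(new Foo("b"), new Foo("a")));

        // sorted() produces a new sequence; the source keeps its order.
        List<String> sorted = source.stream().map(f -> f.bar).sorted().toList();
        List<String> afterSort = bars(source);

        // Mutating the elements is visible through the source list.
        source.stream().forEach(f -> f.bar = f.bar.toUpperCase());
        List<String> afterMutation = bars(source);

        return List.of(sorted, afterSort, afterMutation);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [[a, b], [b, a], [B, A]]
    }
}
```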

Collecting stream back into the same collection type

It is not possible without violating the principle on which the Java streams framework has been built. It would completely violate the idea of abstracting the stream from its physical representation.

The sequence of bulk data operations goes in a pipeline:
[Figure: Pipeline: A Sequence of Bulk Data Operations]

The stream is somewhat similar to Schrödinger's cat: it is not materialized until you call the terminal operation. The stream handling is completely abstract and detached from the original stream source.

[Figure: Pipeline as a Black Box]

If you want to work at such a low level with your original data storage, don't feel ashamed of simply avoiding streams. They are just a tool, not anything sacred. With the introduction of streams, the Good Old Collections are still as good as they were, with the added value of internal iteration: the new Iterable.forEach() method.


Added to satisfy your curiosity :)

A possible solution follows. I don't like it myself, and I have not been able to solve all the generics issues there, but it works with limitations.

The idea is to create a collector returning the same type as the input collection. However, not all collections provide a nullary constructor (with no parameters), and without it the Class.newInstance() method does not work. There is also the awkwardness of checked exceptions within lambda expressions. (It is mentioned in this nice answer here: https://stackoverflow.com/a/22919112/2886891)

public Collection<Integer> getBiggerThan(Collection<Integer> col, int value) {
    // Collection below is an example of one of the rare appropriate
    // uses of raw types. getClass returns the runtime type of col, and
    // at runtime all type parameters have been erased.
    @SuppressWarnings("rawtypes")
    final Class<? extends Collection> clazz = col.getClass();
    System.out.println("Input collection type: " + clazz);
    final Supplier<Collection<Integer>> supplier = () -> {
        try {
            return clazz.newInstance();
        }
        catch (InstantiationException | IllegalAccessException e) {
            throw new RuntimeException(
                "A checked exception caught inside lambda", e);
        }
    };
    // After all the ugly preparatory code, enjoy the clean pipeline:
    return col.stream()
        .filter(v -> v > value)
        .collect(supplier, Collection::add, Collection::addAll);
}

As you can see, it works in general, provided your original collection has a nullary constructor.

public void test() {
    final Collection<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

    final Collection<Integer> arrayList = new ArrayList<>(numbers);
    final Collection<Integer> arrayList2 = getBiggerThan(arrayList, 6);
    System.out.println(arrayList2);
    System.out.println(arrayList2.getClass());
    System.out.println();

    final Collection<Integer> set = new HashSet<>(arrayList);
    final Collection<Integer> set2 = getBiggerThan(set, 6);
    System.out.println(set2);
    System.out.println(set2.getClass());
    System.out.println();

    // This does not work as Arrays.asList() is of a type
    // java.util.Arrays$ArrayList which does not provide a nullary constructor
    final Collection<Integer> numbers2 = getBiggerThan(numbers, 6);
}
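A different approach, which is a swapped-in technique and not the reflection trick above, is to let the caller pass the collection factory explicitly. This sketch (the three-argument `getBiggerThan` overload is hypothetical) avoids reflection, raw types, and checked exceptions, at the cost of no longer deriving the result type from the input automatically:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Supplier;
import java.util.stream.Collectors;

class FactoryParameterSketch {
    // The caller states the result type by supplying a factory for it.
    static <C extends Collection<Integer>> C getBiggerThan(
            Collection<Integer> col, int value, Supplier<C> factory) {
        return col.stream()
                .filter(v -> v > value)
                .collect(Collectors.toCollection(factory));
    }

    public static void main(String[] args) {
        // Works even for Arrays.asList(), because no constructor of the input type is needed.
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

        ArrayList<Integer> list = getBiggerThan(numbers, 6, ArrayList::new);
        TreeSet<Integer> set = getBiggerThan(numbers, 6, TreeSet::new);

        System.out.println(list); // [7, 8, 9, 10]
        System.out.println(set);  // [7, 8, 9, 10]
    }
}
```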

Should I be exposing Stream<T> on my interface?

You are asking the wrong question. After all, it isn’t hard to support both, e.g.

Collection<Foo> getStuff();
default Stream<Foo> stuff() {
    return getStuff().stream();
}

so code using your interface doesn’t need an explicit stream() call, while implementors of the interface don’t need to bother with it either.

As you are always exposing Stream support, whether via Collection.stream() or explicitly, the question is whether you want to expose the Collection. While it is cheap to provide a Stream for a Collection back-end, it might turn out to be expensive to collect a Collection from a Stream.

So an interface exposing both ways suggests that they are equally usable while for an implementation not using a Collection back-end one of these methods might be way more expensive than the other.

So if you are sure that all implementations, including future ones, will always use (or have to support) a Collection, it might be useful to expose it through the API, as Collections support certain operations which Stream doesn’t. That’s especially true if you support modification of the underlying data via the exposed Collection.

Otherwise, supporting Stream access only might be the better choice. This gives implementations the freedom to have other back-ends than a Collection. However, that also implies that this API does not support Java versions prior to Java 8.
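To make the "other back-ends" point concrete, here is a sketch with an invented `SquareSource` interface that exposes only a Stream. One implementation happens to be backed by a List, the other computes its elements on demand and has no collection at all. Code written against the interface cannot tell the difference.

```java
import java.util.List;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class StreamOnlyInterfaceSketch {
    // Hypothetical interface exposing Stream access only.
    interface SquareSource {
        Stream<Integer> squares();
    }

    // Backed by an in-memory collection.
    static final class StoredSquares implements SquareSource {
        private final List<Integer> stored = List.of(1, 4, 9);
        public Stream<Integer> squares() { return stored.stream(); }
    }

    // No collection behind it: elements are computed as they are requested.
    static final class ComputedSquares implements SquareSource {
        public Stream<Integer> squares() {
            return IntStream.iterate(1, i -> i + 1).map(i -> i * i).boxed();
        }
    }

    static List<Integer> firstThree(SquareSource source) {
        return source.squares().limit(3).toList();
    }

    public static void main(String[] args) {
        System.out.println(firstThree(new StoredSquares()));   // [1, 4, 9]
        System.out.println(firstThree(new ComputedSquares())); // [1, 4, 9]
    }
}
```

If the interface also promised a `Collection<Integer>`, the computed implementation would have no sensible way to provide one.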

How to return collections' data without returning a collection itself?

Don't return a collection, return a Stream. That way it is easy for the user to know that they are getting a stream of objects, not a collection. And it's easy to change the implementation of the collection without changing the way it's used. It's trivial for the user to filter, map, reduce, collect, etc.

So:

class A {
    private List<C> cs = new ArrayList<>();

    public Stream<C> getCs() {
        return cs.stream();
    }
}

class B {
    public void processCs(A a) {
        a.getCs().filter(C::hasFooness).forEach(...);
    }
}
