Why Are Java Streams Once-Off

I have some recollections from the early design of the Streams API that might shed some light on the design rationale.

Back in 2012, we were adding lambdas to the language, and we wanted a collections-oriented or "bulk data" set of operations, programmed using lambdas, that would facilitate parallelism. The idea of lazily chaining operations together was well established by this point. We also didn't want the intermediate operations to store results.

The main issues we needed to decide were what the objects in the chain looked like in the API and how they hooked up to data sources. The sources were often collections, but we also wanted to support data coming from a file or the network, or data generated on-the-fly, e.g., from a random number generator.

The design drew on many influences from existing work. Among the more influential were Google's Guava library and the Scala collections library. (If anybody is surprised about the influence from Guava, note that Kevin Bourrillion, Guava lead developer, was on the JSR-335 Lambda expert group.) On Scala collections, we found this talk by Martin Odersky to be of particular interest: Future-Proofing Scala Collections: from Mutable to Persistent to Parallel (Stanford EE380, 2011 June 1).

Our prototype design at the time was based around Iterable. The familiar operations filter, map, and so forth were extension (default) methods on Iterable. Calling one added an operation to the chain and returned another Iterable. A terminal operation like count would call iterator() up the chain to the source, and the operations were implemented within each stage's Iterator.
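
For illustration, here is a minimal sketch of what one stage of that Iterable-based design might have looked like (a reconstruction for this answer, not the actual prototype code):

import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Each stage wraps the upstream Iterable; the work happens lazily,
// inside the Iterator returned by iterator().
static <T> Iterable<T> filter(Iterable<T> source, Predicate<? super T> p) {
    return () -> new Iterator<T>() {
        final Iterator<T> upstream = source.iterator();
        T next;
        boolean ready;

        @Override
        public boolean hasNext() {
            while (!ready && upstream.hasNext()) {
                T t = upstream.next();
                if (p.test(t)) { next = t; ready = true; }
            }
            return ready;
        }

        @Override
        public T next() {
            if (!hasNext()) throw new NoSuchElementException();
            ready = false;
            return next;
        }
    };
}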

Since these are Iterables, you can call the iterator() method more than once. What should happen then?

If the source is a collection, this mostly works fine. Collections are Iterable, and each call to iterator() produces a distinct Iterator instance that is independent of any other active instances, and each traverses the collection independently. Great.

Now what if the source is one-shot, like reading lines from a file? Maybe the first Iterator should get all the values, and the second and subsequent ones should be empty. Maybe the values should be interleaved among the Iterators. Or maybe each Iterator should get all the same values. Then, what if you have two Iterators and one gets farther ahead of the other? Somebody will have to buffer up the values in the second Iterator until they're read. Worse, what if you get one Iterator and read all the values, and only then get a second Iterator? Where do the values come from now? Is there a requirement for them all to be buffered up just in case somebody wants a second Iterator?

Clearly, allowing multiple Iterators over a one-shot source raises a lot of questions. We didn't have good answers for them. We wanted consistent, predictable behavior for what happens if you call iterator() twice. This pushed us toward disallowing multiple traversals, making the pipelines one-shot.

We also observed others bumping into these issues. In the JDK, most Iterables are collections or collection-like objects, which allow multiple traversal. It isn't specified anywhere, but there seemed to be an unwritten expectation that Iterables allow multiple traversal. A notable exception is the NIO DirectoryStream interface. Its specification includes this interesting warning:

While DirectoryStream extends Iterable, it is not a general-purpose Iterable as it supports only a single Iterator; invoking the iterator method to obtain a second or subsequent iterator throws IllegalStateException.

[bold in original]
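
Concretely, that warning means code like this (a minimal sketch) fails on the second iterator() call:

// Uses java.nio.file.{DirectoryStream, Files, Path, Paths} and java.io.IOException.
try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("."))) {
    Iterator<Path> first = dir.iterator();   // fine
    Iterator<Path> second = dir.iterator();  // throws IllegalStateException
} catch (IOException e) {
    e.printStackTrace();
}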

This seemed unusual and unpleasant enough that we didn't want to create a whole bunch of new Iterables that might be once-only. This pushed us away from using Iterable.

About this time, an article by Bruce Eckel appeared that described a spot of trouble he'd had with Scala. He'd written this code:

// Scala
val lines = fromString(data).getLines
val registrants = lines.map(Registrant)
registrants.foreach(println)
registrants.foreach(println)

It's pretty straightforward. It parses lines of text into Registrant objects and prints them out twice. Except that it actually only prints them out once. It turns out that he thought that registrants was a collection, when in fact it's an iterator. The second call to foreach encounters an empty iterator, from which all values have been exhausted, so it prints nothing.

This kind of experience convinced us that it was very important to have clearly predictable results if multiple traversal is attempted. It also highlighted the importance of distinguishing lazy, pipeline-like structures from actual collections that store data. This in turn drove the separation of the lazy pipeline operations into the new Stream interface, keeping only eager, mutative operations (such as Collection.removeIf) directly on Collections. Brian Goetz has explained the rationale for that.
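
In the final design, a second traversal fails fast instead of silently printing nothing. A small sketch (the file name here is just for illustration):

// Java 8 analog of the Scala example above; Files.lines produces a one-shot Stream.
// (Fragment: Files.lines declares IOException.)
Stream<String> registrants = Files.lines(Paths.get("registrants.txt"));
registrants.forEach(System.out::println);
registrants.forEach(System.out::println);
// java.lang.IllegalStateException: stream has already been operated upon or closed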

What about allowing multiple traversal for collection-based pipelines but disallowing it for non-collection-based pipelines? It's inconsistent, but it's sensible. If you're reading values from the network, of course you can't traverse them again. If you want to traverse them multiple times, you have to pull them into a collection explicitly.

But let's explore allowing multiple traversal from collection-based pipelines. Let's say you did this:

Iterable<?> it = source.filter(...).map(...).filter(...).map(...);
it.into(dest1);
it.into(dest2);

(The into operation is now spelled collect(toList()).)

If source is a collection, then the first into() call will create a chain of Iterators back to the source, execute the pipeline operations, and send the results into the destination. The second call to into() will create another chain of Iterators, and execute the pipeline operations again. This isn't obviously wrong but it does have the effect of performing all the filter and map operations a second time for each element. I think many programmers would have been surprised by this behavior.
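
(For what it's worth, the idiom in the released API makes that repetition explicit: wrap the pipeline in a Supplier and build a fresh Stream per traversal. A sketch, assuming source is a List<String>:)

Supplier<Stream<String>> pipeline =
    () -> source.stream().filter(s -> !s.isEmpty()).map(String::toUpperCase);
List<String> dest1 = pipeline.get().collect(Collectors.toList());  // runs the ops once
List<String> dest2 = pipeline.get().collect(Collectors.toList());  // and again, explicitly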

As I mentioned above, we had been talking to the Guava developers. One of the cool things they have is an Idea Graveyard where they describe features that they decided not to implement along with the reasons. The idea of lazy collections sounds pretty cool, but here's what they have to say about it. Consider a List.filter() operation that returns a List:

The biggest concern here is that too many operations become expensive, linear-time propositions. If you want to filter a list and get a list back, and not just a Collection or an Iterable, you can use ImmutableList.copyOf(Iterables.filter(list, predicate)), which "states up front" what it's doing and how expensive it is.

To take a specific example, what's the cost of get(0) or size() on a List? For commonly used classes like ArrayList, they're O(1). But if you call one of these on a lazily-filtered list, it has to run the filter over the backing list, and all of a sudden these operations are O(n). Worse, it has to traverse the backing list on every operation.
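
To make that cost concrete, here is a sketch of roughly what such a lazily-filtered List would have to do (my illustration, not Guava's code):

import java.util.AbstractList;
import java.util.List;
import java.util.function.Predicate;

static <T> List<T> lazyFilter(List<T> backing, Predicate<? super T> p) {
    return new AbstractList<T>() {
        @Override
        public T get(int index) {
            int seen = 0;
            for (T t : backing) {                    // re-traverses on every call
                if (p.test(t) && seen++ == index) return t;
            }
            throw new IndexOutOfBoundsException(String.valueOf(index));
        }

        @Override
        public int size() {
            int n = 0;
            for (T t : backing) if (p.test(t)) n++;  // O(n) every time
            return n;
        }
    };
}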

This seemed to us to be too much laziness. It's one thing to set up some operations and defer actual execution until you say "Go". It's another to set things up in a way that hides a potentially large amount of recomputation.

In proposing to disallow non-linear or "no-reuse" streams, Paul Sandoz described the potential consequences of allowing them as giving rise to "unexpected or confusing results." He also mentioned that parallel execution would make things even trickier. Finally, I'd add that a pipeline operation with side effects would lead to difficult and obscure bugs if the operation were unexpectedly executed multiple times, or at least a different number of times than the programmer expected. (But Java programmers don't write lambda expressions with side effects, do they? DO THEY??)

So that's the basic rationale for the Java 8 Streams API design that allows one-shot traversal and that requires a strictly linear (no branching) pipeline. It provides consistent behavior across multiple different stream sources, it clearly separates lazy from eager operations, and it provides a straightforward execution model.


With regard to IEnumerable, I am far from an expert on C# and .NET, so I would appreciate being corrected (gently) if I draw any incorrect conclusions. It does appear, however, that IEnumerable permits multiple traversal to behave differently with different sources; and it permits a branching structure of nested IEnumerable operations, which may result in some significant recomputation. While I appreciate that different systems make different tradeoffs, these are two characteristics that we sought to avoid in the design of the Java 8 Streams API.

The quicksort example given by the OP is interesting, puzzling, and, I'm sorry to say, somewhat horrifying. The call to QuickSort takes an IEnumerable and returns an IEnumerable, so no sorting is actually done until the final IEnumerable is traversed. What the call seems to do, though, is build up a tree structure of IEnumerables that reflects the partitioning that quicksort would do, without actually doing it. (This is lazy computation, after all.) If the source has N elements, the tree will be N elements wide at its widest, and it will be lg(N) levels deep.

It seems to me -- and once again, I'm not a C# or .NET expert -- that this will cause certain innocuous-looking calls, such as pivot selection via ints.First(), to be more expensive than they look. At the first level, of course, it's O(1). But consider a partition deep in the tree, at the right-hand edge. To compute the first element of this partition, the entire source has to be traversed, an O(N) operation. But since the partitions above it are lazy, they must be recomputed, requiring O(lg N) comparisons per element. So selecting the pivot would be an O(N lg N) operation, which is as expensive as an entire sort.

But we don't actually sort until we traverse the returned IEnumerable. In the standard quicksort algorithm, each level of partitioning doubles the number of partitions. Each partition is only half the size, so each level remains at O(N) complexity. The tree of partitions is O(lg N) high, so the total work is O(N lg N).

With the tree of lazy IEnumerables, at the bottom of the tree there are N partitions. Computing each partition requires a traversal of N elements, each of which requires lg(N) comparisons up the tree. To compute all the partitions at the bottom of the tree, then, requires O(N^2 lg N) comparisons.

(Is this right? I can hardly believe this. Somebody please check this for me.)

In any case, it is indeed cool that IEnumerable can be used this way to build up complicated structures of computation. But if it does increase the computational complexity as much as I think it does, it would seem that programming this way is something that should be avoided unless one is extremely careful.

Java 8 Streams: Stream closed by intermediate operation

The point is that the filter method returns a new stream. The old one is consumed as soon as an intermediate operation is chained onto it; any further use of it throws IllegalStateException.

/**
 * Returns a stream consisting of the elements of this stream that match
 * the given predicate.
 *
 * <p>This is an <a href="package-summary.html#StreamOps">intermediate
 * operation</a>.
 *
 * @param predicate a <a href="package-summary.html#NonInterference">non-interfering</a>,
 *                  <a href="package-summary.html#Statelessness">stateless</a>
 *                  predicate to apply to each element to determine if it
 *                  should be included
 * @return the new stream
 */
Stream<T> filter(Predicate<? super T> predicate);
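
To see what that means in practice, a small sketch:

Stream<String> s = Stream.of("a", "b", "c");
Stream<String> filtered = s.filter(x -> !x.equals("b"));  // s is now consumed
filtered.forEach(System.out::println);  // fine: prints a and c
s.forEach(System.out::println);         // IllegalStateException: stream has already
                                        // been operated upon or closed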

java-8 streams : Are new streams from intermediate operations returned without increase in memory?

Try reading http://www.oracle.com/technetwork/articles/java/ma14-java-se-8-streams-2177646.html and/or http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/; the concepts you're trying to figure out are explained fairly well there, I think.

Basically, what you can see is that most intermediate operations do not run all at once, one whole operation after another. Instead, elements are processed one at a time through all of the intermediate operations and then, depending on the terminal operation, either discarded or put into a collection, added to a sum, printed, etc. If the terminal operation is a collect, there will of course be some memory overhead for the new collection, but nothing is saved in the stream itself. This is also (partly) why you cannot iterate over a stream twice.
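
A quick way to see this element-at-a-time behavior is to sprinkle peek calls into a pipeline (a sketch):

List<Integer> result = Stream.of(1, 2, 3)
    .peek(x -> System.out.println("before map: " + x))
    .map(x -> x * 2)
    .peek(x -> System.out.println("after map:  " + x))
    .collect(Collectors.toList());
// Prints "before map: 1", "after map: 2", "before map: 2", ... -- the operations
// interleave per element instead of each running over the whole source first.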

There are, however, some stateful operations, such as sorted(comparator), that do need to buffer elements during processing.

Are Java streams meant only to be used for arrays? How about single elements?

You might feel better about the if-else if you use it in a more functional style rather than short-circuiting:

if (!aes.matches(request.getPassword(), dbUser.getPassword())) {
    return ResponseEntity.status(403).build();
} else {
    return logUserIn(dbUser);
}

Doing the equivalent in one statement with Stream/Optional is harder to read and less performant.

You might consider the possibility of making findByEmail return Optional<User>, which is more idiomatic for any "find" method. Then you could combine the two approaches like

return userRepo.findByEmail(request.getEmail())
    .map(dbUser -> {
        if (!aes.matches(request.getPassword(), dbUser.getPassword())) {
            return ResponseEntity.status(403).build();
        } else {
            return logUserIn(dbUser);
        }
    })
    ... // .orElse(null) / .orElseThrow(...)

Streams in Java are much slower than native for loop - why?

Explaining this entirely would take a lot of time, but what you are testing here is the "cold" start: the code runs with little or no JIT compilation, and a simple loop does not allocate objects the way a Stream solution does, so it takes a lot less time. Streams have an infrastructure to run on, and that infrastructure takes a while to "heat up" before it becomes more performant.

Those portions of the code are not equivalent either, as in one you are using Math::max and in the other a plain >. You could test this code with many more iterations and compare the results, but even so you should use a tool that is tailored for micro-benchmarks. I know of JMH (and Caliper from Google, but I trust only the first one).

With JMH you could test these methods with no JIT at all, with the C1 or C2 compiler only, or with other settings; for an example of some set-ups, see this answer.
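
For illustration, a minimal JMH benchmark comparing the two approaches might look like this (a sketch; it assumes the JMH dependency and its annotation processing are set up in your build):

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class MaxBenchmark {

    int[] data;

    @Setup
    public void setup() {
        data = new Random(42).ints(100_000).toArray();
    }

    @Benchmark
    public int forLoopMax() {
        int max = Integer.MIN_VALUE;
        for (int v : data) {
            if (v > max) max = v;  // plain comparison, no allocation
        }
        return max;
    }

    @Benchmark
    public int streamMax() {
        return Arrays.stream(data).max().getAsInt();  // exercises the stream machinery
    }
}

JMH handles warm-up iterations and forking for you, which is exactly what hand-rolled cold-start timing misses.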

Why don't Java Streams have intermediate combining operations?

Not every terminal operation will “pull all the values through the stream”. The terminal operations iterator() and spliterator() do not immediately fetch all values and allow lazy processing, including constructing a new Stream again. For the latter, it’s strongly recommended to use spliterator(), as this allows more meta information to be passed to the new stream and also implies less wrapping of objects.

E.g. your second example could be implemented as

import java.util.Spliterator;
import java.util.function.Supplier;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public static <T> Stream<T> replaceWhenEmpty(Stream<T> s, Supplier<Stream<T>> fallBack) {
    boolean parallel = s.isParallel();
    Spliterator<T> sp = s.spliterator();
    Stream.Builder<T> firstElement;
    // If the stream is known to be empty, or fetching a first element fails,
    // close it and switch to the fallback stream.
    if (sp.getExactSizeIfKnown() == 0 || !sp.tryAdvance(firstElement = Stream.builder())) {
        s.close();
        return fallBack.get();
    }
    // Otherwise, prepend the already-consumed first element to the rest.
    return Stream.concat(firstElement.build(), StreamSupport.stream(sp, parallel))
                 .onClose(s::close);
}
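
As a quick usage sketch (the values are just for illustration):

Stream<String> result = replaceWhenEmpty(Stream.<String>empty(), () -> Stream.of("fallback"));
result.forEach(System.out::println);  // prints "fallback"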

For your general question, I don’t see what a general abstraction of these examples would look like, beyond the spliterator() method that already exists. As the documentation puts it:

However, if the provided stream operations do not offer the desired functionality, the BaseStream.iterator() and BaseStream.spliterator() operations can be used to perform a controlled traversal.

In Java streams programming, why am I not able to set a certain field of all my objects to null?

The problem is that the stream pipeline has not been started; you need to put a terminal operation at the end to trigger execution. Since you are mutating the objects, you don't need to return anything. Even so, this is not an appropriate way to use streams: since Java 9, count() may skip executing the pipeline entirely when it can compute the count directly from the source, in which case the map step never runs.

public void find() {
    List<User> users = userRepository.findAll();
    users.stream()
         .map(user -> { user.setPassword(null); return user; })
         .count();  // terminal operation; forces the pipeline to run (on Java 8)
}

A better way is to simply iterate over the list, setting each password to null:

userRepository.findAll().forEach(user -> user.setPassword(null));

