How Is the Fork/Join Framework Better Than a Thread Pool

How is the fork/join framework better than a thread pool?

I think the basic misunderstanding is that the Fork/Join examples do NOT show work stealing, but only some kind of standard divide and conquer.

Work stealing would be like this: Worker B has finished his work. He is a kind one, so he looks around and sees Worker A still working very hard. He strolls over and asks: "Hey lad, I could give you a hand." A replies: "Cool, I have this task of 1000 units. So far I have finished 345, leaving 655. Could you please work on numbers 673 to 1000? I'll do 346 to 672." B says: "OK, let's start so we can go to the pub earlier."

You see - the workers must communicate with each other even after they have started the real work. This is the missing part in the examples.

The examples on the other hand show only something like "use subcontractors":

Worker A: "Dang, I have 1000 units of work. Too much for me. I'll do 500 myself and subcontract 500 to someone else." This goes on until the big task is broken down into small packets of 10 units each. These will be executed by the available workers. But if one packet is a kind of poison pill and takes considerably longer than other packets -- bad luck, the divide phase is over.

The only remaining difference between Fork/Join and splitting the task upfront is this: when splitting upfront, the work queue is full right from the start. Example: 1000 units, a threshold of 10, so the queue has 100 entries. These packets are distributed to the thread pool members.
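The upfront splitting described here could be sketched like this with a plain thread pool (a hypothetical illustration; the class and method names are mine, and summing a range stands in for real work):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: split 1000 units into packets of 10 upfront, then submit
// all 100 packets to a fixed thread pool at once. The queue is full
// from the start; no further splitting happens later.
public class UpfrontSplit {
    static long processPacket(int from, int to) {      // inclusive range
        long sum = 0;
        for (int i = from; i <= to; i++) sum += i;     // stand-in for real work
        return sum;
    }

    public static long run() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> results = new ArrayList<>();
        for (int start = 1; start <= 1000; start += 10) {  // 100 entries, queued upfront
            final int from = start, to = start + 9;
            results.add(pool.submit(() -> processPacket(from, to)));
        }
        long total = 0;
        for (Future<Long> f : results) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());   // 1 + 2 + ... + 1000 = 500500
    }
}
```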

Fork/Join is more complex and tries to keep the number of packets in the queue smaller:

  • Step 1: Put one packet containing (1...1000) into queue
  • Step 2: One worker pops the packet (1...1000) and replaces it with two packets: (1...500) and (501...1000).
  • Step 3: One worker pops packet (501...1000) and pushes (501...750) and (751...1000).
  • Step n: The stack contains these packets: (1...500), (501...750), (751...875), ... (991...1000)
  • Step n+1: Packet (991...1000) is popped and executed.
  • Step n+2: Packet (981...990) is popped and executed.
  • Step n+3: Packet (961...980) is popped and split into (961...970) and (971...980).
    ....

You see: in Fork/Join the queue is smaller (6 in the example) and the "split" and "work" phases are interleaved.
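The interleaved split/work scheme above can be sketched as a RecursiveTask (a minimal illustration; the class name, threshold, and range-summing work are my assumptions, not from the original answer):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: a packet larger than the threshold is split in two;
// small packets are summed directly. Splitting and working interleave,
// so the deque stays small instead of holding all 100 packets at once.
public class RangeSum extends RecursiveTask<Long> {
    static final int THRESHOLD = 10;
    final int from, to;   // inclusive range

    RangeSum(int from, int to) { this.from = from; this.to = to; }

    @Override
    protected Long compute() {
        if (to - from + 1 <= THRESHOLD) {            // "work" phase
            long sum = 0;
            for (int i = from; i <= to; i++) sum += i;
            return sum;
        }
        int mid = (from + to) / 2;                   // "split" phase
        RangeSum left = new RangeSum(from, mid);
        RangeSum right = new RangeSum(mid + 1, to);
        left.fork();                                 // push left half onto this worker's deque
        return right.compute() + left.join();        // work on right half, then wait for left
    }

    public static void main(String[] args) {
        long total = new ForkJoinPool().invoke(new RangeSum(1, 1000));
        System.out.println(total);   // 500500
    }
}
```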

When multiple workers are popping and pushing simultaneously the interactions are not so clear of course.

What's the benefit of using work stealing in Fork/Join rather than just an ordinary thread pool's queue?

ForkJoinPool is designed for Recursive Actions.
This might for example be a divide-and-conquer algorithm like MergeSort.
In such an algorithm, one Thread would usually wait for the children to finish.

This is where "workstealing" comes in. The work would be stolen from the waiting Thread by the ones that actually have work to do.

If you have a fixed amount of Threads that will not spawn new Threads, you should just use a normal ExecutorService ThreadPool.
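The MergeSort case mentioned above could look roughly like this (a sketch under my own assumptions — class name, threshold, and merge helper are not from the original answer):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Sketch: each task forks a child for the left half, sorts the right
// half itself, then joins — the point where a plain thread would sit
// waiting, and where work stealing keeps workers busy instead.
public class ParallelMergeSort extends RecursiveAction {
    static final int THRESHOLD = 16;
    final int[] a; final int lo, hi;   // sorts a[lo..hi)

    ParallelMergeSort(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected void compute() {
        if (hi - lo <= THRESHOLD) { Arrays.sort(a, lo, hi); return; }  // small: sort directly
        int mid = (lo + hi) >>> 1;
        ParallelMergeSort left = new ParallelMergeSort(a, lo, mid);
        left.fork();                                  // child task: left half
        new ParallelMergeSort(a, mid, hi).compute();  // this worker: right half
        left.join();                                  // "wait" — worker may run stolen tasks here
        merge(mid);
    }

    void merge(int mid) {                             // merge the two sorted halves
        int[] tmp = new int[hi - lo];
        int i = lo, j = mid, k = 0;
        while (i < mid && j < hi) tmp[k++] = a[i] <= a[j] ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        System.arraycopy(tmp, 0, a, lo, tmp.length);
    }

    public static void sort(int[] a) {
        new ForkJoinPool().invoke(new ParallelMergeSort(a, 0, a.length));
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0};
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}
```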

Java's Fork/Join vs ExecutorService - when to use which?

Fork/join allows you to easily execute divide-and-conquer jobs, which would have to be implemented manually if you wanted to execute them on an ExecutorService. In practice, ExecutorService is usually used to process many independent requests (aka transactions) concurrently, and fork/join is used when you want to accelerate one coherent job.

Difference between ForkJoinPool and a normal ExecutorService?

Although ForkJoinPool implements ExecutorService, it is conceptually different from 'normal' executors.

You can easily see the difference if your tasks spawn more tasks and wait for them to complete, e.g. by calling

executor.invoke(new Task()); // blocks this thread until new task completes

In a normal executor service, waiting for other tasks to complete will block the current thread. There are two possible outcomes: if your executor service has a fixed number of threads, it might deadlock when the last running thread waits for another task to complete; if your executor dynamically creates new threads on demand, the number of threads might explode and you end up with thousands of threads, which might cause starvation.
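The fixed-pool deadlock can be demonstrated with a single-thread pool (a hypothetical sketch; the class name and the timeout used to escape the hang are mine):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: a task submits a subtask to the same single-thread pool and
// waits for it. No thread is free to run the subtask, so the outer
// task blocks forever — we use a timeout to detect the deadlock.
public class NestedWait {
    public static boolean deadlocks() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Future<String> outer = pool.submit(() -> {
            Future<String> inner = pool.submit(() -> "done");  // queued, but no free thread
            return inner.get();                                // blocks the only thread forever
        });
        try {
            outer.get(500, TimeUnit.MILLISECONDS);
            return false;                      // would only happen with >= 2 threads
        } catch (TimeoutException expected) {
            return true;                       // the nested wait deadlocked
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(deadlocks());   // true
    }
}
```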

In contrast, the fork/join framework reuses the thread in the meantime to execute other tasks, so it won't deadlock even though the number of threads is fixed:

new MyForkJoinTask().invoke();

So if you have a problem that you can solve recursively, think of using a ForkJoinPool as you can easily implement one level of recursion as ForkJoinTask.

Just check the number of running threads in your examples.
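The fork/join counterpart of the nested wait completes even with a single worker thread (a hypothetical sketch; the class name is mine):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: the same nested wait on a ForkJoinPool with parallelism 1
// completes, because join() lets the worker execute pending tasks
// from its deque instead of blocking idle.
public class NestedJoin extends RecursiveTask<String> {
    final boolean outer;
    NestedJoin(boolean outer) { this.outer = outer; }

    @Override
    protected String compute() {
        if (!outer) return "done";
        NestedJoin inner = new NestedJoin(false);
        inner.fork();          // pushed onto this worker's deque
        return inner.join();   // the worker runs the subtask itself while "waiting"
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(1);   // a single worker thread
        System.out.println(pool.invoke(new NestedJoin(true)));
    }
}
```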

Why was the Fork/Join framework introduced when all Java threads are native threads created using OS libraries?

Fork join framework does not replace the original low level thread API; it makes it easier to use for certain classes of problems.

The original, low-level thread API works: you can use all the CPUs and all the cores on the CPUs installed on the system. If you ever try to actually write multithreaded applications, you'll quickly realize that it is hard.

The low-level thread API works well for problems where threads are largely independent and don't have to share information with each other - in other words, embarrassingly parallel problems. Many problems, however, are not like this. With the low-level API, it is very difficult to implement complex algorithms in a way that is safe (produces correct results and does not have unwanted effects like deadlock) and efficient (does not waste system resources).

The Java fork/join framework, an implementation of the fork/join model, was created as a high-level mechanism to make it easier to apply parallel computing to divide-and-conquer algorithms.

Play Framework: thread-pool-executor vs fork-join-executor

This isn't parallel code, everything inside of your Async call will run in one thread. In fact, Play! never spawns new threads in response to requests - it's event-based, there is an underlying thread pool that handles whatever work needs to be done.

The executor handles scheduling the work from Akka actors and from most Futures (not those created with Future.successful or Future.failed). In this case, each request will be a separate task that the executor has to schedule onto a thread.

The fork-join-executor replaced the thread-pool-executor because it allows work stealing, which improves efficiency. There is no difference in what can be parallelized with the two executors.

What's the advantage of a Java-5 ThreadPoolExecutor over a Java-7 ForkJoinPool?

ThreadPool (TP) and ForkJoinPool (FJ) are targeted towards different use cases. The main difference is in the number of queues employed by the different executors which decide what type of problems are better suited to either executor.

The FJ executor has n (aka parallelism level) separate concurrent queues (deques), while the TP executor has only one concurrent queue (these queues/deques may be custom implementations not following the JDK Collections API). As a result, in scenarios where you have a large number of (usually relatively short-running) tasks generated, the FJ executor will perform better, as the independent queues will minimize concurrent operations and infrequent steals will help with load balancing. In TP, due to the single queue, there will be concurrent operations every time work is dequeued; it will act as a relative bottleneck and limit performance.

In contrast, if there are relatively fewer long-running tasks the single queue in TP is no longer a bottleneck for performance. However, the n-independent queues and relatively frequent work-stealing attempts will now become a bottleneck in FJ as there can be possibly many futile attempts to steal work which add to overhead.

In addition, the work-stealing algorithm in FJ assumes that (older) tasks stolen from the deque will produce enough parallel tasks to reduce the number of steals. E.g. in quicksort or mergesort where older tasks equate to larger arrays, these tasks will generate more tasks and keep the queue non-empty and reduce the number of overall steals. If this is not the case in a given application then the frequent steal attempts again become a bottleneck. This is also noted in the javadoc for ForkJoinPool:

this class provides status check methods (for example getStealCount()) that are intended to aid in developing, tuning, and monitoring fork/join applications.
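Those status methods can be read after running a splitting workload (a sketch; the class name and trivial splitting task are mine, and steal counts are timing-dependent, so only a non-negative count can be assumed):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Sketch: run a task that splits recursively so there is something to
// steal, then read the pool's status-check methods from the javadoc.
public class StealMonitor {
    static class Burn extends RecursiveAction {
        final int n;
        Burn(int n) { this.n = n; }
        @Override protected void compute() {
            if (n <= 1) return;                          // leaf: negligible work
            invokeAll(new Burn(n / 2), new Burn(n - n / 2));
        }
    }

    public static long runAndCountSteals() {
        ForkJoinPool pool = new ForkJoinPool();
        pool.invoke(new Burn(10_000));
        return pool.getStealCount();                     // workload- and timing-dependent
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool();
        pool.invoke(new Burn(10_000));
        System.out.println("parallelism = " + pool.getParallelism());
        System.out.println("steals      = " + pool.getStealCount());
    }
}
```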

Java fork/join framework logic

When you use an ExecutorService you will decide how many threads will be in the thread pool, and there is no kind of distinction between the tasks that you schedule and the subtasks that these tasks create.

The ForkJoinPool class, instead, manages threads based on (1) available processors and (2) task demand.

In this case, the subtasks created by the active tasks are being scheduled by different methods than the external tasks.

We typically have one fork-join pool for an entire application (unlike using the ExecutorService where it is typical to have more than 1 in any non-trivial application) and there is no need for shutdown.

I haven't reviewed the internals to give you a more low level explanation but if you see here there is a presentation and a benchmark showing measurements displaying the parallelism that is promised.

Update:
This framework addresses a specific kind of problem (ExecutorService works better for tasks that have a mix of CPU and I/O activity).

The basic thinking here, is to use a recursive/divide and conquer approach in order to keep CPUs constantly busy. The idea is to create new tasks (forking) and suspend the current task until the new tasks complete (join) but without creating new threads and without having a shared work queue.

So the fork/join framework is implemented using work stealing, by creating a limited number of worker threads (as many as cores). Each worker thread maintains a private double-ended work queue (deque).

When forking, a worker pushes the new task onto the head of its deque. When waiting or idle, a worker pops a task off the head of its deque and executes it instead of sleeping.

If a worker's deque is empty, it steals an element off the tail of the deque of another, randomly chosen worker.
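The push/pop/steal discipline above can be sketched with a toy deque (a greatly simplified, lock-based illustration of the scheme — not the JDK's lock-free implementation; class and method names are mine):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: the owner pushes and pops at the head of its deque (LIFO);
// a thief steals from the tail (FIFO), so the two rarely contend on
// the same end.
public class WorkDeque {
    private final Deque<Runnable> tasks = new ArrayDeque<>();

    synchronized void pushOwn(Runnable t)  { tasks.addFirst(t); }    // fork: owner pushes at head
    synchronized Runnable popOwn()         { return tasks.pollFirst(); } // owner pops from head
    synchronized Runnable steal()          { return tasks.pollLast(); }  // thief steals from tail

    public static void main(String[] args) {
        WorkDeque q = new WorkDeque();
        Runnable older = () -> {}, newer = () -> {};
        q.pushOwn(older);
        q.pushOwn(newer);
        System.out.println(q.steal() == older);   // thief gets the oldest task
        System.out.println(q.popOwn() == newer);  // owner gets the newest task
    }
}
```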

I would recommend reading Data Parallelism in Java and also doing some benchmarks yourself to be convinced. Theory is good only up to a point; after that, do your own measurements to see whether there is a significant performance edge or not.


