Parallel.ForEach Slower Than ForEach

Parallel.ForEach Slower than ForEach

Suppose you have a task to perform. Let's say you're a math teacher and you have twenty papers to grade. It takes you two minutes to grade a paper, so it's going to take you about forty minutes.

Now let's suppose that you decide to hire some assistants to help you grade papers. It takes you an hour to locate four assistants. You each take four papers and you are all done in eight minutes. You've traded 40 minutes of work for 68 total minutes of work including the extra hour to find the assistants, so this isn't a savings. The overhead of finding the assistants is larger than the cost of doing the work yourself.

Now suppose you have twenty thousand papers to grade, so it is going to take you about 40000 minutes. Now if you spend an hour finding assistants, that's a win. You each take 4000 papers and are done in a total of 8060 minutes instead of 40000 minutes, a savings of almost a factor of 5. The overhead of finding the assistants is basically irrelevant.

Parallelization is not free. The cost of splitting up work amongst different threads needs to be tiny compared to the amount of work done per thread.
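As a rough illustration, here is a minimal benchmark sketch (a hypothetical example: the element count and the Thread.SpinWait intensity are arbitrary made-up values) comparing a plain foreach against Parallel.ForEach with a trivial body and with a heavier one:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class OverheadDemo
{
    static void Measure(string label, Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        Console.WriteLine("{0}: {1} ms", label, sw.ElapsedMilliseconds);
    }

    static void Main()
    {
        int[] items = Enumerable.Range(0, 1_000_000).ToArray();

        // Trivial body: the per-item work is dwarfed by the cost of handing
        // items out to worker threads, so Parallel.ForEach tends to lose.
        Measure("foreach, trivial", () => { foreach (var i in items) { var _ = i + 1; } });
        Measure("Parallel, trivial", () => Parallel.ForEach(items, i => { var _ = i + 1; }));

        // Heavy body: the per-item work dominates, so the handout overhead
        // is amortized and Parallel.ForEach tends to win.
        Measure("foreach, heavy", () => { foreach (var i in items) Thread.SpinWait(200); });
        Measure("Parallel, heavy", () => Parallel.ForEach(items, i => Thread.SpinWait(200)));
    }
}

On a typical multi-core machine the trivial parallel case tends to be the slowest of the four, while the heavy parallel case is the clear winner: the paper-grading tradeoff in miniature.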

Further reading:

Amdahl's law

Gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.

Gustafson's law

Gives the theoretical speedup in latency of the execution of a task at fixed execution time that can be expected of a system whose resources are improved.
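For reference, the standard closed forms of both laws, where p is the fraction of the work that benefits from parallelization and s is the speedup of that fraction (for example, the number of workers):

S_{\text{Amdahl}}(s) = \frac{1}{(1 - p) + p/s} \qquad\qquad S_{\text{Gustafson}}(s) = (1 - p) + s \cdot p

In the grading analogy, the hour spent finding assistants acts like the part that does not shrink as workers are added, which is why the speedup only approaches the worker count once the workload is large enough.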

Parallel.ForEach slower than normal foreach

The slowdown is a performance ramification of having a very small delegate body: the per-item invocation overhead dominates the actual work.

We can achieve better performance by using partitioning, so that each invocation of the body delegate performs work on a large volume of data:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static int FindLargestParallelRange(int[] arr)
{
    object locker = new object();
    int largest = arr[0];
    Parallel.ForEach(
        Partitioner.Create(0, arr.Length),   // hand out index ranges, not single items
        () => arr[0],                        // localInit: per-thread starting candidate
        (range, loop, subtotal) =>           // body: scan one range, no shared state
        {
            for (int i = range.Item1; i < range.Item2; i++)
                if (arr[i] > subtotal)
                    subtotal = arr[i];
            return subtotal;
        },
        finalResult =>                       // localFinally: merge once per thread
        {
            lock (locker)
                if (largest < finalResult)
                    largest = finalResult;
        });
    return largest;
}

Pay attention to synchronizing the localFinally delegate. Also note the need to initialize localInit properly: () => arr[0] instead of () => 0, since an array containing only negative numbers would otherwise wrongly report 0 as the largest element.

Partitioning with PLINQ:

using System.Collections.Concurrent;
using System.Linq;

static int FindLargestPlinqRange(int[] arr)
{
    return Partitioner.Create(0, arr.Length)
        .AsParallel()
        .Select(range =>
        {
            // Largest element within this range only.
            int largest = arr[0];
            for (int i = range.Item1; i < range.Item2; i++)
                if (arr[i] > largest)
                    largest = arr[i];
            return largest;
        })
        .Max(); // merge the per-range maxima
}
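A minimal usage sketch (the sample data is made up, and deliberately all negative to show why the accumulators start at arr[0]):

using System;
using System.Linq;

int[] data = Enumerable.Range(0, 1_000_000)
    .Select(i => (i % 997) - 5000) // every element is negative
    .ToArray();

Console.WriteLine(FindLargestParallelRange(data)); // prints -4004, the true maximum
Console.WriteLine(FindLargestPlinqRange(data));    // same result via PLINQ

Had the accumulators been initialized with 0, both methods would have wrongly reported 0 here.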

I highly recommend the free book Patterns of Parallel Programming by Stephen Toub.

Parallel.ForEach performance

The reason is most likely the way Parallel.ForEach processes items by default. If you use it on an array or on something else that implements IList (so that the total length and an indexer are available), it will split the whole workload into batches, and a separate thread will then process each batch. That means that if the batches have different "sizes" (by size I mean the time needed to process them), the "small" batches will complete faster.

For example, let's look at this code:

var delays = Enumerable.Repeat(100, 24).Concat(Enumerable.Repeat(2000, 4)).ToArray();
Parallel.ForEach(delays, new ParallelOptions { MaxDegreeOfParallelism = 4 }, d =>
{
    Thread.Sleep(d);
    Console.WriteLine("Done with " + d);
});

If you run it, you will see that all the "100" (fast) items are processed quickly and in parallel. However, all the "2000" (slow) items are processed at the end, one by one, without any parallelism at all. That's because all the "slow" items sit in the same batch. The workload was split into 4 batches (MaxDegreeOfParallelism = 4), and the first 3 contain only fast items, so they complete quickly. The last batch holds all the slow items, and the thread dedicated to that batch processes them one by one.

You can "fix" that for your situation either by ensuring that items are distributed evenly (so that the "slow" items are not all grouped together in the source collection), or, for example, with a custom partitioner:

var delays = Enumerable.Repeat(100, 24).Concat(Enumerable.Repeat(2000, 4)).ToArray();
var partitioner = Partitioner.Create(delays, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, new ParallelOptions { MaxDegreeOfParallelism = 4 }, d =>
{
    Thread.Sleep(d);
    Console.WriteLine("Done with " + d);
});

NoBuffering ensures that items are taken one at a time, which avoids the problem.

Using another means to parallelize your work (such as SemaphoreSlim or BlockingCollection) is also an option.
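For instance, here is a minimal sketch using SemaphoreSlim with plain tasks (the delay values are borrowed from the snippet above, and Task.Delay stands in for the real work), which hands out items one at a time regardless of where the slow ones sit in the collection:

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var delays = Enumerable.Repeat(100, 24).Concat(Enumerable.Repeat(2000, 4)).ToArray();
var throttle = new SemaphoreSlim(4); // at most 4 items in flight, like MaxDegreeOfParallelism = 4

var tasks = delays.Select(async d =>
{
    await throttle.WaitAsync();
    try
    {
        await Task.Delay(d); // stand-in for the real work
        Console.WriteLine("Done with " + d);
    }
    finally
    {
        throttle.Release();
    }
}).ToArray();

await Task.WhenAll(tasks);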

Why is my parallel for loop much slower than a for loop?

The reason for the performance deficit is, as Theodor Zoulias points out, probably that the loop body is not thread-safe. This can cause the numbers to take arbitrary values and may cause totally different calculations to be performed.

To fix this you need to make each parallel iteration independent. As far as I can tell, this is rather easy to do in your example, since you only need to ensure that all modified values are local:

using System;
using System.Numerics;
using System.Threading.Tasks;

static void Main(string[] args)
{
    BigInteger startvalue = new BigInteger(Math.Pow(2, 68));
    Console.WriteLine(startvalue);
    Console.ReadKey();
    for (int i = 1; i > -1; i++)
    {
        Parallel.For(0, 1000000, z =>
        {
            // currentValue is local to this iteration, so nothing is shared
            // between the parallel iterations.
            var currentValue = startvalue + z;
            while (currentValue != 4)
            {
                if (currentValue % 2 == 0)
                {
                    currentValue = currentValue / 2;
                }
                else
                {
                    currentValue = (currentValue * 3) + 1;
                }
            }
        });
        startvalue += 1000000; // advance to the next block of 1,000,000 numbers
        Console.WriteLine(" {0} times done", i * 1000000);
    }
}

Another possibility is that the work inside the parallel loop is on average quite small, making the threading overhead significant. There is also the issue of work balancing: Parallel.For does its work in "chunks" to reduce this threading overhead, and tries to adapt the size of those chunks. If the amount of work per iteration is highly variable, this adaptation will probably result in inefficiencies.
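One knob worth knowing here: the Partitioner.Create(fromInclusive, toExclusive, rangeSize) overload lets you force small fixed-size chunks. A minimal sketch, assuming a hypothetical DoVariableWork whose cost varies wildly per index:

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Small ranges (here 64 indices each, an arbitrary choice) keep the load
// balanced when the per-index cost varies, at the price of more handout overhead.
Parallel.ForEach(Partitioner.Create(0, 1_000_000, 64), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        DoVariableWork(i); // hypothetical per-index workload
});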

Why is ForEach-Object faster than ForEach-Object -Parallel?

TLDR:

There are 3 reasons:

  1. To take full advantage of ForEach-Object -Parallel, the processing time of the script block needs to be significantly larger than the time needed to set up the thread and its environment.
  2. Import-Module introduces overhead in every new PowerShell instance that gets spun up.
  3. Both of these factors are individually small, but multiply them by 1000 or more iterations and they become big.

ForEach-Object -Parallel runs very differently than a normal ForEach-Object.

First, a normal ForEach-Object runs inside your current PowerShell thread, with access to all your variables, everything already loaded in memory, and the pipeline. This is fine for 98% of the jobs we run, where 1-second execution times are OK. For the other 2%, where we have a process that is super CPU-intensive, maxes out a single CPU core, and runs forever, or where we need to wait for responses (e.g. API requests) while other work could be getting done, -Parallel is what we need to look at.

The idea behind parallel execution is to take advantage of your brand new AMD Ryzen™ Threadripper™ 3990X with 64 cores/128 threads and split your process into separate "Jobs" that can run across multiple CPU cores and multiple threads at the same time. This could increase your speed by orders of magnitude, e.g. potentially 128 times faster.

To achieve this, ForEach-Object -Parallel creates a new "Job" for each script block you execute, and starts spreading the Jobs across CPU Cores for execution. This is great when you have long running CPU bound processes, but when you have very short and small Jobs you hit the crux of Parallel execution, where the setup takes more time than the actual execution. ForEach-Object -Parallel has to completely set up your environment for each "Job" you run, e.g. it has to spin up multiple new threads and multiple new PowerShell instances for every Job to run in.

To illustrate the amount of setup time needed: if we write "Hello World" once on the current thread, it takes about 1 millisecond:

PS C:\> Measure-Command { Write-Host "Hello World" }
Hello World

Seconds : 0
Milliseconds : 1
TotalMilliseconds : 1.9798

To run a single "Hello World" in parallel takes 26 milliseconds:

PS C:\> Measure-Command { 1 | ForEach-Object -Parallel { Write-Host "Hello World" } }
Hello World

Seconds : 0
Milliseconds : 26
TotalMilliseconds : 26.052

That means it spent about 25 ms spinning up a new thread and setting up the environment, and 1 ms doing the actual work.

To write it 100 times on the currently running thread takes about 83ms:

PS C:\> Measure-Command { 1..100 | ForEach-Object { Write-Host "Hello World" } }
Hello World
...
Hello World
Hello World

Seconds : 0
Milliseconds : 83
TotalMilliseconds : 83.1846

Running it with -Parallel and -ThrottleLimit 5 takes 294 ms:

PS C:\> Measure-Command { 1..100 | ForEach-Object -Parallel { Write-Host "Hello World" }  -ThrottleLimit 5 }
Hello World
...
Hello World
Hello World

Seconds : 0
Milliseconds : 294
TotalMilliseconds : 294.3205

This goes to show how running in Parallel can be bad for tiny individual operations. But on the flip side, if you have something that takes 1 second to run, you can start to see how it works better:

e.g. run 5 processes that take 1 second each. First on a single thread:

PS C:\> Measure-Command { 1..5 | ForEach-Object { Start-Sleep -Seconds 1 } }

Seconds : 5
Milliseconds : 46
TotalSeconds : 5.046348
TotalMilliseconds : 5046.348

As expected, it takes just over 5 seconds. Now, in Parallel:

PS C:\> Measure-Command { 1..5 | ForEach-Object -Parallel { Start-Sleep -Seconds 1 } -ThrottleLimit 5 }

Seconds : 1
Milliseconds : 73
TotalSeconds : 1.0732423
TotalMilliseconds : 1073.2423

It completes in just over a second. If the processing time takes significantly more time than the setup time, then -Parallel is useful.

Also, in your case, not only do you have the extra setup time, but loading a module (needed to set up each new environment) adds significantly more time to the ForEach-Object -Parallel version.

For example, let's import the AzureAD module inside our ForEach-Object script block 5 times:

PS C:\> Measure-Command { 1..5 | ForEach-Object { Import-Module AzureAD } }

Seconds : 0
Milliseconds : 18
TotalSeconds : 0.0185406
TotalMilliseconds : 18.5406

And now with ForEach-Object -Parallel:

PS C:\> Measure-Command { 1..5 | ForEach-Object -Parallel { Import-Module AzureAD } -ThrottleLimit 5 }

Seconds : 0
Milliseconds : 125
TotalSeconds : 0.1256923
TotalMilliseconds : 125.6923

We can see a significant difference: the parallel version has to load the module 5 times, once per runspace, whereas the sequential version loads it a single time inside the thread, notices on later iterations that it is already loaded, and does not reload it.

Should I always use Parallel.ForEach, because more threads MUST speed up everything?

No, it doesn't make sense for every foreach. Some reasons:

  • Your code may not actually be parallelizable (for example, if you're using the "results so far" for the next iteration and the order is important)
  • If you're aggregating (e.g. summing values) there are ways of using Parallel.ForEach for this (see the sketch after this list), but you shouldn't just do it blindly
  • If your work will complete very fast anyway, there's no benefit, and it may well slow things down
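On the aggregation point, a minimal sketch of the thread-local pattern for a parallel sum (the same localInit/body/localFinally shape as FindLargestParallelRange earlier):

using System.Threading.Tasks;

static long ParallelSum(int[] arr)
{
    long total = 0;
    object locker = new object();
    Parallel.ForEach(
        arr,
        () => 0L,                                   // localInit: per-thread running sum
        (item, state, subtotal) => subtotal + item, // body: touches no shared state
        subtotal =>                                 // localFinally: merge once per thread
        {
            lock (locker)
                total += subtotal;
        });
    return total;
}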

Basically, nothing in threading should be done blindly. Think about where it actually makes sense to parallelize. Oh, and measure the impact to make sure the benefit is worth the added complexity. (Parallel code is harder to work with for things like debugging.) TPL is great, but it's no free lunch.


