OpenMP: What Is the Benefit of Nesting Parallelizations?

OpenMP: What is the benefit of nesting parallelizations?

(1) Nested parallelism in OpenMP:
http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html

You need to turn on nested parallelism by setting OMP_NESTED or calling omp_set_nested(), because many implementations turn this feature off by default, and some implementations do not support nested parallelism fully. If it is turned on, then whenever the program encounters a nested parallel for, OpenMP will create as many threads as defined in OMP_NUM_THREADS. So, with 2-level parallelism, the total number of threads would be N^2, where N = OMP_NUM_THREADS.
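
For illustration (a minimal sketch, not taken from the linked documentation), the following enables nesting at runtime and reports the team size at the inner level; with OMP_NUM_THREADS=4 it ends up with roughly 4 x 4 = 16 threads:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_nested(1);  /* equivalent to OMP_NESTED=true; off by default in many runtimes */

    #pragma omp parallel          /* outer team: OMP_NUM_THREADS threads */
    {
        #pragma omp parallel      /* each outer thread forks its own inner team */
        {
            #pragma omp single
            printf("nesting level %d: inner team of %d threads\n",
                   omp_get_level(), omp_get_num_threads());
        }
    }
    return 0;
}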

Such nested parallelism will cause oversubscription (i.e., more busy threads than cores), which may degrade the speedup. In an extreme case, where nested parallelism is invoked recursively, the thread count can explode (e.g., thousands of threads), and the computer just wastes time on context switching. In that case, you can control the number of threads dynamically by calling omp_set_dynamic.
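
A sketch of the relevant knobs (how aggressively omp_set_dynamic shrinks team sizes is implementation-defined; omp_set_max_active_levels is an additional cap not mentioned above):

#include <omp.h>

void setup_thread_limits(void) {
    omp_set_dynamic(1);            /* let the runtime adjust team sizes under load;
                                      the exact behavior is implementation-defined */
    omp_set_max_active_levels(2);  /* cap the nesting depth so recursive nesting
                                      cannot keep multiplying the thread count */
}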

(2) An example of matrix-vector multiplication: the code would look like:

// Input:  A(N by M), B(M by 1)
// Output: C(N by 1)
for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
        C[i] += A[i][j] * B[j];

In general, parallelizing inner loops when parallelizing the outer loop is possible is a bad idea, because of the fork/join overhead of threads. (Although many OpenMP implementations pre-create threads in a pool, there is still some cost to dispatch work to the threads and to execute the implicit barrier at the end of a parallel for.)
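
For reference, a version of the example above that forks only once, at the outer loop, might look like this (a sketch; it assumes C has been zero-initialized beforehand):

// Fork/join once: each thread gets a block of rows, and each C[i] is
// written by exactly one thread, so no extra synchronization is needed.
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
        C[i] += A[i][j] * B[j];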

Your concern is the case where N < the number of CPUs. Right: in this case the speedup is limited by N, and enabling nested parallelism will definitely have benefits.

However, the code would then cause oversubscription if N is sufficiently large. I can think of the following solutions:

  • Changing the loop structure so that only a single-level loop exists. (This looks doable.)
  • Specializing the code: if N is small, use nested parallelism; otherwise don't (see the sketch after this list).
  • Nested parallelism with omp_set_dynamic. But make sure you understand how omp_set_dynamic controls the number of threads and their activity; implementations may vary.
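
A sketch of the second option (do_work is a hypothetical per-element function, and nesting is assumed to be enabled as described above):

// Hypothetical specialization: only pay for nested parallelism when the outer
// trip count alone cannot keep all cores busy.
if (N < omp_get_num_procs()) {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        #pragma omp parallel for        // nested level; requires nesting enabled
        for (int j = 0; j < M; ++j)
            do_work(i, j);              // hypothetical per-element work
    }
} else {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
            do_work(i, j);
}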

How does OpenMP handle nested loops?

The lines you have written will parallelize only the outer loop. To parallelize both you need to add a collapse clause:

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < M; j++)
    {
        // do task(i, j)
    }
}

You may want to check the OpenMP 3.1 specification (sec. 2.5.1) for more details.

OpenMP - Nested for-loop becomes faster when having parallel before outer loop. Why?

Let's first consider what your code is doing. Essentially, your code is transforming a matrix (2D array) where the values of each row depend on the previous row, but the values of the columns are independent of other columns. Let me choose a simpler example of this:

for(int i=1; i<n; i++) {
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

One way to parallelize this is to swap the loops, like this:

Method 1:

#pragma omp parallel for
for(int j=0; j<n; j++) {
    for(int i=1; i<n; i++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

With this method, each thread runs all n-1 iterations of i in the inner loop but only n/nthreads iterations of j. This effectively processes strips of columns in parallel. However, this method is highly cache-unfriendly.

Another possibility is to parallelize only the inner loop.

Method 2:

for(int i=1; i<n; i++) {
    #pragma omp parallel for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

This essentially processes the columns within a single row in parallel but the rows sequentially. The iterations of i are executed only by the master thread.

Another way to process the columns in parallel but each row sequentially is:

Method 3:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In this method, like method 1, each thread runs over all n-1 iterations of i. However, this method has an implicit barrier after the inner loop, which causes each thread to pause until all threads have finished a row, making the rows effectively sequential, as in method 2.

The best solution is one which processes strips of columns in parallel like method 1 but is still cache friendly. This can be achieved using the nowait clause.

Method 4:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In my tests the nowait clause does not make much difference. This is probably because the load is even (which is why static scheduling is ideal in this case). If the load were less even, nowait would probably make more of a difference.

Here are the times in seconds for n=3000 on my four-core IVB system with GCC 4.9.2:

method 1: 3.00
method 2: 0.26
method 3: 0.21
method 4: 0.21

This test is probably memory-bandwidth bound, so I could have chosen a better case using more computation, but nevertheless the differences are significant enough. In order to remove bias due to creating the thread pool, I ran one of the methods first without timing it.

It's clear from the timings how cache-unfriendly method 1 is. It's also clear that method 3 is faster than method 2 and that nowait has little effect in this case.

Since method 2 and method 3 both process the columns in a row in parallel but the rows sequentially, one might expect their timings to be the same. So why do they differ? Let me make some observations:

  1. Due to the thread pool, the threads are not created and destroyed for each iteration of the outer loop of method 2, so it's not clear to me what the extra overhead is. Note that the OpenMP specification says nothing about a thread pool; this is something each implementation provides.

  2. The only other difference between method 2 and method 3 is that in method 2 only the master thread processes i, whereas in method 3 each thread processes a private i. But this seems too trivial to explain the significant difference between the methods, because the implicit barrier in method 3 causes the threads to sync anyway, and processing i is just an increment and a conditional test.

  3. The fact that method 3 is no slower than method 4, which processes whole strips of columns in parallel, suggests that the extra overhead in method 2 lies entirely in leaving and entering a parallel region for each iteration of i.

So my conclusion is that explaining why method 2 is so much slower than method 3 requires looking into the implementation of the thread pool. For GCC, which uses pthreads, this could probably be explained by building a toy model of a thread pool, but I don't have enough experience with that yet.
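
One rough way to isolate that overhead (not from the original answer, just a sketch) is to time an empty parallel region entered repeatedly, as in method 2, against a single region that executes a trivial worksharing loop repeatedly, as in method 3:

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int iters = 10000;

    /* Pattern of method 2: enter and leave a parallel region every iteration. */
    double t0 = omp_get_wtime();
    for (int i = 0; i < iters; i++) {
        #pragma omp parallel
        { /* empty body: measures region entry/exit (fork/join) cost only */ }
    }
    double t1 = omp_get_wtime();

    /* Pattern of method 3: one region, a worksharing loop (with its implicit
       barrier) every iteration. */
    double t2 = omp_get_wtime();
    #pragma omp parallel
    for (int i = 0; i < iters; i++) {
        #pragma omp for
        for (int j = 0; j < 1; j++) { /* empty */ }
    }
    double t3 = omp_get_wtime();

    printf("region entry/exit: %g s/iter, worksharing + barrier: %g s/iter\n",
           (t1 - t0) / iters, (t3 - t2) / iters);
    return 0;
}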

Nested loop in OpenMP performance issue

And the final step - parallelization of the innermost loop only (assuming that this will be the fastest case according to the previous results) (2 cores)

But (surprise!) the elapsed time is:

about 11 s

It is not a surprise at all. Parallel regions perform implicit barriers and may even create and join threads (some libraries use thread pools to reduce the cost of thread creation).

In the end, opening parallel regions is expensive, so you should do it as few times as possible. The threads will run the outer loops in parallel, at the same time, but will divide the iteration space once they reach the omp for block, so the result should still be correct (make your program check this if you are unsure).
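
The question's code isn't reproduced in this answer, but the pattern being described (a parallel region opened around only the innermost loop) presumably looked something like the following sketch, which reuses the names (arr, brr, crr, sum1, sum2, N) from the final version shown further below:

// Presumed slow pattern: a parallel region is opened and joined for every row
// of every outer iteration, so its fork/join cost is paid 10000*N times.
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for reduction(+:sum1,sum2)
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}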

For performance testing, you should always run your experiments with compiler optimizations turned on, as they have a heavy impact on the behavior of the application (you should not make assumptions about performance based on unoptimized programs, because their problems may already be addressed by the optimizer).

When using a single parallel region that contains all the loops, the execution time is halved in my setup (from 9.536 s with 2 threads down to 4.757 s).

The omp for block still applies an implicit barrier, which is not needed in your example. Adding the nowait clause reduces the execution time by another half: 2.120 s.

From this point, you can now try to explore the other options.

Parallelizing the middle loop reduces the execution time to only 0.732 s thanks to much better usage of the memory hierarchy and vectorization; the L1 miss ratio dropped from ~29% to ~0.3%.

Using collapse on the two innermost loops made no real difference with two threads (strong scaling should be checked).

Using other directives such as omp simd does not improve performance in this case, as the compiler is already sure it can vectorize the innermost loop safely. The resulting code, with the middle loop parallelized and nowait applied, looks like this:

#pragma omp parallel reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    #pragma omp for schedule(static) nowait
    for (int i=0; i<N; i++) {
        for (int j=0; j<N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}

Note: L1 miss ratio computed using perf:

$  perf stat -e cache-references,cache-misses -r 3 ./test

Performance problems using OpenMP in nested loops

The actual problem was not related to OpenMP directly.

As the system has two CPUs (sockets), half of the threads were spawned on one CPU and the other half on the other. Therefore there was no shared L3 cache between them. Combined with the fact that the algorithm doesn't scale well, this led to a performance decrease, especially when using 2-4 threads.

The solution was to use thread pinning, for example via the Linux tool taskset.
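
One way to check where the threads actually end up (a sketch, assuming a GNU/Linux system; taskset or the OMP_PROC_BIND/OMP_PLACES environment variables would do the actual pinning):

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>   /* sched_getcpu(), glibc */
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        /* Report which core each OpenMP thread runs on, so you can verify
           that all threads landed on the same socket after pinning. */
        printf("OpenMP thread %d is running on CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}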


