OpenMP Nested Parallel for Loops vs Inner Parallel For

OpenMP nested parallel for loops vs inner parallel for

If your compiler supports OpenMP 3.0, you can use the collapse clause:

#pragma omp parallel for schedule(dynamic,1) collapse(2)
for (int x = 0; x < x_max; ++x) {
    for (int y = 0; y < y_max; ++y) {
        // parallelize this code here
    }
    // IMPORTANT: no code in here
}

If it doesn't (e.g. only OpenMP 2.5 is supported), there is a simple workaround:

#pragma omp parallel for schedule(dynamic,1)
for (int xy = 0; xy < x_max*y_max; ++xy) {
    int x = xy / y_max;
    int y = xy % y_max;
    // parallelize this code here
}

You can enable nested parallelism with omp_set_nested(1); your nested omp parallel for code will then work, but that is probably not the best idea.
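If you do want to try it, here is a minimal sketch of the nested variant (x_max and y_max as above); note that omp_set_nested() has been deprecated since OpenMP 5.0 in favour of omp_set_max_active_levels():

omp_set_nested(1);                        // allow a second level of parallel regions
#pragma omp parallel for schedule(dynamic,1)
for (int x = 0; x < x_max; ++x) {
    // every outer thread opens its own inner team here, so the total
    // thread count can grow to outer_threads * inner_threads
    #pragma omp parallel for
    for (int y = 0; y < y_max; ++y) {
        // parallelize this code here
    }
}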

By the way, why the dynamic scheduling? Is every loop iteration evaluated in non-constant time?
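If every (x, y) iteration takes about the same time, the default schedule (static on most implementations) is usually the better fit, since it avoids the per-chunk bookkeeping that dynamic scheduling does; a sketch:

#pragma omp parallel for collapse(2)
for (int x = 0; x < x_max; ++x) {
    for (int y = 0; y < y_max; ++y) {
        // uniform-cost work: the (default) static schedule hands each thread
        // one contiguous block of the collapsed iteration space up front
    }
}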

Is there a difference between nested parallelism and collapsed for loops?

Yes, there is a huge difference. Use collapse (the clause is spelled collapse, not collapsed), and do not use nested parallelism.

Nested parallelism means that there are independent teams of threads working on the different levels of worksharing. You can run into all sorts of trouble: either oversubscribing the CPU cores with too many threads, or leaving cores idle because some threads sit in a team that has no work at the moment. It is rather hard to get decent performance out of nested parallelism, which is why you usually have to enable it explicitly.

Collapsing loops, on the other hand, means that the loops are joined at the work-sharing level. This allows a single team of threads (usually with as many threads as available CPU cores) to work efficiently through the iterations of the combined loop.

OpenMP - Nested for-loop becomes faster when having parallel before outer loop. Why?

Let's first consider what your code is doing. Essentially your code is transforming a matrix (2D array) where the values of each row depend on the previous row but the values of the columns are independent of the other columns. Let me choose a simpler example of this:

for(int i=1; i<n; i++) {
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

One way to parallelize this is to swap the loops like this

Method 1:

#pragma omp parallel for
for(int j=0; j<n; j++) {
    for(int i=1; i<n; i++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

With this method each thread runs all n-1 iterations of i in the inner loop but only n/nthreads iterations of j. This effectively processes strips of columns in parallel. However, this method is highly cache unfriendly: each thread walks down a column, so consecutive accesses to a are separated by a stride of n elements and touch a different cache line almost every time.

Another possibility is to only parallelize the inner loop.

Method 2:

for(int i=1; i<n; i++) {
    #pragma omp parallel for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

This essentially processes the columns within a single row in parallel but the rows sequentially. The i loop is only run by the master thread.

Another way to process the columns in parallel but each row sequentially is:

Method 3:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In this method, like method 1, each thread runs over all n-1 iterations of i. However, this method has an implicit barrier after the inner loop which causes each thread to pause until all threads have finished a row, making this method sequential over rows like method 2.

The best solution is one which processes strips of columns in parallel like method 1 but is still cache friendly. This can be achieved using the nowait clause.

Method 4:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In my tests the nowait clause does not make much difference. This is probably because the load is even (which is why static scheduling is ideal in this case). If the load were less even, nowait would probably make more of a difference.
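For clarity, the schedule can also be spelled out; a sketch of method 4 with an explicit static schedule:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for schedule(static) nowait
    for(int j=0; j<n; j++) {
        // with the same static schedule in every pass, each thread is
        // assigned the same contiguous block of j for every i, so the
        // value a[(i-1)*n+j] it reads was written by the same thread
        a[i*n+j] += a[(i-1)*n+j];
    }
}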

Here are the times in seconds for n=3000 on my four-core IVB system with GCC 4.9.2:

method 1: 3.00
method 2: 0.26
method 3: 0.21
method 4: 0.21

This test is probably memory-bandwidth bound, so I could have chosen a better case with more computation, but the differences are nevertheless significant. To remove the bias from creating the thread pool, I ran one of the methods once without timing it.
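For reference, a minimal sketch of this kind of timing harness (omp_get_wtime() is the standard OpenMP timer; run_method() is a hypothetical wrapper around any of the four variants above):

#include <omp.h>
#include <cstdio>

run_method(a, n);              // untimed warm-up run so the thread pool already exists

double t0 = omp_get_wtime();
run_method(a, n);              // timed run
double t1 = omp_get_wtime();
printf("time: %.2f s\n", t1 - t0);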

It's clear from the timing how un-cache friendly method 1 is. It's also clear method 3 is faster than method 2 and that nowait has little effect in this case.

Since method 2 and method 3 both process the columns of a row in parallel but the rows sequentially, one might expect their timings to be the same. So why do they differ? Let me make some observations:

  1. Thanks to the thread pool, the threads are not created and destroyed for each iteration of the outer loop of method 2, so it's not clear to me what the extra overhead is. Note that OpenMP says nothing about a thread pool; it is something each implementation provides.

  2. The only other difference between method 3 and method 2 is that in method 2 only the master thread processes i, whereas in method 3 each thread processes a private i. But this seems too trivial to explain the significant difference between the methods, because the implicit barrier in method 3 makes the threads synchronize anyway, and processing i amounts to an increment and a conditional test.

  3. The fact that method 3 is no slower than method 4, which processes whole strips of columns in parallel, suggests that the extra overhead in method 2 comes entirely from leaving and entering a parallel region for each iteration of i.

So my conclusion is that explaining why method 2 is so much slower than method 3 requires looking into the implementation of the thread pool. For GCC, which uses pthreads, this could probably be explained by building a toy model of a thread pool, but I don't have enough experience with that yet.

C++ OpenMP: nested loops where the inner iterator depends on the outer one

The collapse clause should not be used when the bounds of one loop depend on the index of the other loop. See Understanding the collapse clause in openmp.

In your case you are running over the lower triangle of a matrix (excluding the diagonal) because of symmetry. This cuts the number of iterations roughly in half. If you want to fuse/collapse the double loop you can do it by hand like this (see the end of this answer for more details).

for(size_t k=0; k<n*(n-1)/2; k++) {
    // map the linear index k back to an (i,j) pair with j > i,
    // covering the strict triangle exactly once
    size_t i = k/n, j = k%n;
    if(j<=i) i = n - i - 2, j = n - j - 1;
    double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
    D[i][j] = d;
    D[j][i] = d;
}
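For comparison, the plain triangular double loop that the fused version replaces would look roughly like this (a sketch, assuming the same v and D as in the question):

for(size_t i=0; i<n; i++) {
    for(size_t j=i+1; j<n; j++) {          // inner bound depends on i
        double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
        D[i][j] = d;
        D[j][i] = d;
    }
}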

I think most people assume that collapsing a loop is going to give better performance, but this is often not the case. In my experience, most of the time there is no difference in performance, in some cases it's much worse due to cache issues, and in a few cases it's better. You have to test it yourself.

As to why your code was twice as slow with the collapse clause, I can only guess that, since the effect is unspecified for the inner loop, your OpenMP implementation ran j over [0,n), i.e. the full matrix rather than half of it.

OpenMP On-Demand Nested Parallelism

Your description of the problem sounds like OpenMP tasking would be a much better choice. Your code would then look like this:

void processAllJobs()
{
    #pragma omp parallel master
    for(int i = 0; i < n; ++i)
    {
        #pragma omp task
        processJob(i);
    }
}
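Note that the combined parallel master construct used above requires OpenMP 5.0 (it was renamed parallel masked in 5.1). If your compiler does not support it, a sketch of the same structure using the older single construct:

void processAllJobs()
{
    #pragma omp parallel
    #pragma omp single
    for(int i = 0; i < n; ++i)
    {
        #pragma omp task
        processJob(i);
    }
}

With single, one thread creates all the tasks while the rest of the team starts executing them right away.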

Then the processing of the job would look like this:

void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
        #pragma omp taskloop // add grainsize() clause, if Process() is very short
        for(std::size_t j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}

That way you get natural load balancing (assuming that you have enough tasks) without having to rely on nested parallelism.

Pragma omp parallel for inside inner loop is not correctly ignored in nested loop

I'm not sure I understand what it is that you expected, but the result you get is perfectly expected.

By default, OpenMP nested parallelism is disabled, meaning that any nested parallel region will create as many one-thread teams as there are threads from the outer level encountering it.

In your case, your outermost parallel region creates a team of 8 threads. Each of these reaches the innermost parallel region and creates a second-level team of one thread. Each of these second-level threads is ranked 0 within its own team, hence the 0s you see printed.
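The question's code is not reproduced here, but judging from the output further down it has roughly this shape (a sketch; loop bounds and the message format are inferred from that output):

#pragma omp parallel for
for (int i = 0; i < 10; ++i)
{
    printf("Calling %d from %d\n", i, omp_get_thread_num());
    // nested region: with nesting disabled, each inner team has exactly
    // one thread, whose rank within its own team is always 0
    #pragma omp parallel for
    for (int j = 0; j < 4; ++j)
        printf("Calling %d from %d\n", j, omp_get_thread_num());
}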

With the very same code from the question, compiled with g++ 9.3.0, and the two environment variables OMP_NUM_THREADS and OMP_NESTED set, I get the following:

OMP_NUM_THREADS="2,3" OMP_NESTED=true ./a.out 
Calling 0 from 0
Calling 5 from 1
Calling 0 from 0
Calling 1 from 0
Calling 2 from 1
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Calling 3 from 2
Calling 2 from 1
Calling 6 from 1
Calling 1 from 0
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Calling 2 from 1
Calling 2 from 0
Calling 0 from 0
Calling 1 from 0
Calling 2 from 1
Calling 3 from 2
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Calling 2 from 1
Calling 3 from 0
Calling 7 from 1
Calling 0 from 0
Calling 3 from 2
Calling 2 from 1
Calling 3 from 2
Calling 0 from 0
Calling 1 from 0
Calling 1 from 0
Calling 2 from 1
Calling 4 from 0
Calling 8 from 1
Calling 0 from 0
Calling 3 from 2
Calling 2 from 1
Calling 2 from 1
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Calling 1 from 0
Calling 9 from 1
Calling 2 from 1
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2

Maybe that corresponds better to what you expected to see?


