How Does OpenMP Handle Nested Loops

How does OpenMP handle nested loops?

The lines you have written will parallelize only the outer loop. To parallelize both you need to add a collapse clause:

#pragma omp parallel for collapse(2)
    for (int i=0;i<N;i++)
    { 
      for (int j=0;j<M;j++)
      {
       //do task(i,j)//
      }
    }

You may want to check the OpenMP 3.1 specification (section 2.5.1) for more details.
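To see that collapse(2) still visits every (i,j) pair exactly once, here is a minimal sketch. The function name count_cells_visited_once and the visits array are made up for illustration; the increment stands in for task(i,j).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: runs the collapsed loop nest over an n x m grid and
   returns how many cells were visited exactly once (-1 if allocation fails). */
int count_cells_visited_once(int n, int m)
{
    assert(n > 0 && m > 0);
    int *visits = calloc((size_t)n * (size_t)m, sizeof *visits);
    if (!visits) return -1;

    /* collapse(2) fuses the i and j loops into a single iteration space of
       n*m iterations, which is then divided among the threads. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            visits[i * m + j] += 1;   /* stands in for task(i, j) */
        }
    }

    int once = 0;
    for (int k = 0; k < n * m; k++)
        once += (visits[k] == 1);
    free(visits);
    return once;
}
```

Calling count_cells_visited_once(3, 5) should return 15 whether or not the file is compiled with OpenMP enabled, since each cell is written by exactly one iteration.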

OpenMP with nested loops

Yes, this will create one parallel region in which every thread iterates over the outer t loop, and the iterations of the i loops are split up among the threads.

Note that a #pragma omp for has an implicit barrier at the end of it, so there is no need for you to add an explicit barrier of your own. This implicit barrier can be removed using the nowait clause (i.e. #pragma omp for nowait).
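Since the original code is not shown here, the following is only a sketch of that structure under assumed names (ring_sum_steps, a, b, steps are all hypothetical). Every thread runs the t loop, each #pragma omp for splits its i loop, and the implicit barriers keep the two phases of a step from overlapping.

```c
#include <assert.h>

/* Hypothetical example: performs `steps` sweeps over a[0..n-1], using b as
   scratch space. Each sweep replaces a[i] with a[i] + a[(i+1) % n]. */
void ring_sum_steps(double *a, double *b, int n, int steps)
{
    assert(n > 0 && steps >= 0);

    #pragma omp parallel
    for (int t = 0; t < steps; t++) {      /* executed by every thread */
        #pragma omp for
        for (int i = 0; i < n; i++)        /* iterations split among threads */
            b[i] = a[i] + a[(i + 1) % n];
        /* implicit barrier: all reads of a finish before a is overwritten */

        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = b[i];
        /* implicit barrier: a is complete before the next step reads it */
    }
}
```

Because the first loop reads a neighbour's element, the barrier after the second loop matters here, so nowait would not be safe on it in this particular sketch.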

OpenMP - Nested for-loop becomes faster when having parallel before outer loop. Why?

Let's first consider what your code is doing. Essentially your code is transforming a matrix (2D array) where the values of the rows depend on the previous row but the values of the columns are independent of other columns. Let me choose a simpler example of this:

for(int i=1; i<n; i++) {
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

One way to parallelize this is to swap the loops like this

Method 1:

#pragma omp parallel for
for(int j=0; j<n; j++) {
    for(int i=1; i<n; i++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

With this method each thread runs all n-1 iterations of the inner loop over i but only n/nthreads iterations of j. This effectively processes strips of columns in parallel. However, this method is highly cache unfriendly, because the inner loop now steps through memory with a stride of n elements.

Another possibility is to only parallelize the inner loop.

Method 2:

for(int i=1; i<n; i++) {
    #pragma omp parallel for 
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

This essentially processes the columns in a single row in parallel but each row sequentially. The values of i are only run by the master thread.

Another way to process the columns in parallel but each row sequentially is:

Method 3:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In this method, like method 1, each thread runs over all n-1 iterations of i. However, this method has an implicit barrier after the inner loop, which causes each thread to pause until all threads have finished a row, making this method sequential for each row like method 2.

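The best solution is one that processes strips of columns in parallel like method 1 but is still cache friendly. This can be achieved using the nowait clause: with a static schedule each thread is assigned the same columns in every row, so it never has to wait for the other threads.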

Method 4:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for schedule(static) nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In my tests the nowait clause does not make much difference. This is probably because the load is even (which is why static scheduling is ideal in this case). If the load were less even, nowait would probably make more of a difference.
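Dropping the barrier is only safe if each thread keeps the same columns in every row. Here is a small sketch that checks the nowait version against a serial reference. The function names are made up for this example, and schedule(static) is written out explicitly as an assumption, so that the column assignment does not depend on the implementation's default schedule.

```c
#include <assert.h>
#include <stdlib.h>

/* Serial reference: every row accumulates the row above it. */
static void add_rows_serial(double *a, int n)
{
    for (int i = 1; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] += a[(i - 1) * n + j];
}

/* Same computation with the barrier removed. schedule(static) gives each
   thread the same j range in every row, so a thread only reads values that
   it wrote itself in the previous row. */
static void add_rows_nowait(double *a, int n)
{
    #pragma omp parallel
    for (int i = 1; i < n; i++) {
        #pragma omp for schedule(static) nowait
        for (int j = 0; j < n; j++)
            a[i * n + j] += a[(i - 1) * n + j];
    }
}

/* Hypothetical checker: returns 1 if both versions agree on an n x n
   matrix filled with small integers, 0 otherwise. */
int nowait_matches_serial(int n)
{
    assert(n > 0);
    size_t len = (size_t)n * (size_t)n;
    double *ref = malloc(len * sizeof *ref);
    double *par = malloc(len * sizeof *par);
    if (!ref || !par) { free(ref); free(par); return 0; }

    for (size_t k = 0; k < len; k++)
        ref[k] = par[k] = (double)(k % 7);

    add_rows_serial(ref, n);
    add_rows_nowait(par, n);

    int same = 1;
    for (size_t k = 0; k < len; k++)
        if (ref[k] != par[k]) same = 0;

    free(ref);
    free(par);
    return same;
}
```

The values are small integers stored as doubles, so the comparison with != is exact rather than subject to rounding.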

Here are the times in seconds for n=3000 on my four-core IVB (Ivy Bridge) system with GCC 4.9.2:

method 1: 3.00
method 2: 0.26 
method 3: 0.21
method 4: 0.21

This test is probably memory bandwidth bound so I could have chosen a better case using more computation but nevertheless the differences are significant enough. In order to remove a bias due to creating the thread pool I ran one of the methods without timing it first.
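The warm-up idea can be sketched like this. It is not the harness I used for the numbers above; now_seconds, sweep_method3 and time_after_warmup are names invented for the example, and the clock() fallback (CPU time, not wall time) is only there so the code still builds without OpenMP.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds when OpenMP is enabled, CPU seconds otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Method 3 from above, wrapped in a function so it can be timed. */
void sweep_method3(double *a, int n)
{
    #pragma omp parallel
    for (int i = 1; i < n; i++) {
        #pragma omp for
        for (int j = 0; j < n; j++)
            a[i * n + j] += a[(i - 1) * n + j];
    }
}

/* Runs `method` once untimed so the thread pool already exists, then
   returns the elapsed seconds of a second, timed run. */
double time_after_warmup(void (*method)(double *, int), double *a, int n)
{
    assert(method != NULL && a != NULL && n > 0);
    method(a, n);                    /* warm-up run, not timed */
    double t0 = now_seconds();
    method(a, n);                    /* timed run */
    return now_seconds() - t0;
}
```

Note that the method is applied twice to the same array, so the contents of a after the call are not the result of a single sweep.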

It's clear from the timing how cache unfriendly method 1 is. It's also clear that method 3 is faster than method 2 and that nowait has little effect in this case.

Since method 2 and method 3 both process the columns in a row in parallel but the rows sequentially, one might expect their timing to be the same. So why do they differ? Let me make some observations:

Because of the thread pool, the threads are not created and destroyed for each iteration of the outer loop of method 2, so it's not clear to me what the extra overhead is. Note that OpenMP says nothing about a thread pool. This is something that each compiler implements.
The only other difference between method 3 and method 2 is that in method 2 only the master thread processes i whereas in method 3 each thread processes a private i. But this seems too trivial to me to explain the significant difference between the methods, because the implicit barrier in method 3 causes them to sync anyway and processing i is a matter of an increment and a conditional test.
The fact that method 3 is no slower than method 4, which processes whole strips of columns in parallel, says the extra overhead in method 2 is all in leaving and entering a parallel region for each iteration of i.

So my conclusion is that to explain why method 2 is so much slower than method 3 requires looking into the implementation of the thread pool. For GCC which uses pthreads, this could probably be explained by creating a toy model of a thread pool but I don't have enough experience with that yet.

C++ OpenMP: nested loops where the inner iterator depends on the outer one

The collapse clause should not be used when the iteration bounds of one loop depend on the index of another loop. See Understanding the collapse clause in openmp.

In your case you are running over the lower triangle of a matrix (excluding the diagonal) because of symmetry. This cuts the number of iterations roughly in half. If you want to fuse/collapse the double loop you can do it by hand like this (see the end of this answer for more details):

for(size_t k=0; k<n*(n-1)/2; k++) {
    size_t i = k/n, j = k%n;
    if(j<=i) i = n - i - 2, j = n - j - 1;
    double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
    D[i][j] = d;
    D[j][i] = d;
}
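If you want to convince yourself that the fused index k really hits every off-diagonal pair once, a small checker like the following can help. The function name and the seen array are hypothetical; the index arithmetic is copied from the loop above. The pairs come out with i < j, which by symmetry covers the same work as the lower triangle.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical checker: returns 1 if k = 0 .. n*(n-1)/2 - 1 maps onto every
   pair (i, j) with i < j exactly once for an n x n matrix, 0 otherwise. */
int fused_index_covers_triangle(size_t n)
{
    assert(n >= 2);
    unsigned char *seen = calloc(n * n, 1);
    if (!seen) return 0;

    int ok = 1;
    for (size_t k = 0; k < n * (n - 1) / 2; k++) {
        /* same mapping as in the fused loop */
        size_t i = k / n, j = k % n;
        if (j <= i) { i = n - i - 2; j = n - j - 1; }

        if (i >= j || j >= n) { ok = 0; break; }   /* must stay off-diagonal */
        if (seen[i * n + j]) { ok = 0; break; }    /* must not repeat a pair */
        seen[i * n + j] = 1;
    }

    free(seen);
    return ok;
}
```

The loop runs n*(n-1)/2 times and there are exactly n*(n-1)/2 pairs with i < j, so "no repeats and never on or below the diagonal" is enough to show that every pair is covered.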

I think most people assume that collapsing a loop is going to give better performance but this is often not the case. In my experience most of the time there is no difference in performance but in some cases it's much worse due to cache issues. In a few cases it's better. You have to test yourself.

As to why your code was twice as slow with the collapse clause, I can only guess: since the effect is unspecified for the inner loop, your OpenMP implementation may have run j over [0,n), i.e. the full matrix rather than half the matrix.

What happens to cycle loop nested in openmp do loop?

Yes, you can use cycle in loop1. That loop will be executed by each thread independently and the conditional jumps within the context of each thread do not pose any problem.

Just don't forget to make j private! Only i will be made private automatically.

By the way, OpenMP allows use of cycle even for the parallel do; the document states that

Only an iteration of the innermost associated loop may be curtailed by a CYCLE statement.

So in your case you could use cycle even in the outer loop, which is the only loop associated with your parallel do (i.e., there aren't several collapse-d loops).

Nested loop OpenMP Parallellizing, private or public index?

Declaring j private has no effect, because j is already private by default: it is scoped within the i for loop, so each thread gets its own j.

#pragma omp parallel for private(j)
for (int i = 0, ...) {
     for (int j = 0, ...) { }}

Declaring j shared should likewise have no effect, as stated above, since the scope of j is local to each iteration of i. If you expand the scope of j to global, however, you will encounter a race condition.
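The two situations can be put side by side in a short sketch. The function names are invented for the example, and the reduction on total is only there so the result can be checked. In the second function j is declared before the parallel region, which is the case where private(j) is actually needed.

```c
#include <assert.h>

/* j is declared inside the loop nest, so every thread has its own copy
   and no data-sharing clause is required for it. */
long sum_with_local_j(int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++)
            total += 1;
    }
    return total;
}

/* j is declared outside the parallel region, so it would be shared by
   default; private(j) gives each thread its own copy and avoids the race. */
long sum_with_outer_j(int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    long total = 0;
    int j;
    #pragma omp parallel for private(j) reduction(+:total)
    for (int i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++)
            total += 1;
    }
    return total;
}
```

Both functions should return rows*cols. Removing private(j) from the second one leaves all threads sharing one inner counter, and the result is no longer reliable.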
