How Does the Omp Ordered Clause Work

How does the omp ordered clause work?

The ordered clause works like this: different threads execute concurrently until they encounter the ordered region, which is then executed sequentially in the same order as it would get executed in a serial loop. This still allows for some degree of concurrency, especially if the code section outside the ordered region has substantial run time.
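
As a minimal sketch of that pattern (heavy_work and its cost are placeholders of my own, not from any question here), the part of the body outside the ordered region runs concurrently on all threads, while the ordered region itself runs in iteration order:

#include <stdio.h>
#include <omp.h>

// Placeholder for the expensive, parallelisable part of an iteration.
static double heavy_work(int i) { return i * 2.0; }

int main(void) {
    #pragma omp parallel for ordered schedule(static,1)
    for (int i = 0; i < 8; i++) {
        double r = heavy_work(i);  // executed concurrently by the team
        #pragma omp ordered
        printf("iteration %d -> %f\n", i, r);  // executed in iteration order
    }
    return 0;
}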

There is no particular reason to use a dynamic schedule instead of a static schedule with a small chunk size. It all depends on the structure of the code. Since ordered introduces a dependency between the threads, when used with schedule(static) and the default chunk size, the second thread has to wait for the first one to finish all its iterations, then the third thread has to wait for the second one to finish its iterations (and hence for the first one too), and so on. One can easily visualise it with 3 threads and 9 iterations (3 per thread):

tid  Iterations  Timeline
 0   0,1,2       ==o==o==o
 1   3,4,5       ==.......o==o==o
 2   6,7,8       ==..............o==o==o

= shows that the thread is executing code in parallel. o is when the thread is executing the ordered region. . is the thread being idle, waiting for its turn to execute the ordered region. With schedule(static,1) the following would happen:

tid  Iterations  Timeline
 0   0,3,6       ==o==o==o
 1   1,4,7       ==.o==o==o
 2   2,5,8       ==..o==o==o

I believe the difference between the two cases is more than obvious. With schedule(dynamic) the pictures above would become more or less random, as the list of iterations assigned to each thread is non-deterministic. It would also add extra overhead. Dynamic scheduling is only useful if the amount of computation differs from iteration to iteration and the computation takes much longer than the added scheduling overhead.
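
The effect of the two schedules can be measured directly. The following small experiment of my own simulates the "=" and "o" parts of the diagrams with sleeps; with 3 threads, schedule(static) should take roughly the serialised time from the first diagram, while schedule(static,1) should come close to the second:

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel for ordered schedule(static)  // compare with schedule(static,1)
    for (int i = 0; i < 9; i++) {
        usleep(100000);  // the "=" part: 0.1 s of parallelisable work
        #pragma omp ordered
        usleep(50000);   // the "o" part: 0.05 s of ordered work
    }
    printf("elapsed: %.2f s\n", omp_get_wtime() - t0);
    return 0;
}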

Don't worry about the lowest numbered iteration. It is usually handed to the first thread in the team that becomes ready to execute code.

OpenMP ordered clause

The OpenMP ordered clause is supposed to be used in two distinct stages:

  1. As a clause of the for directive, to indicate some possible ordering of iterations; and then
  2. As a standalone directive to indicate which parts of the statements inside the loop are to be kept ordered.

So in essence, your example should be:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    int n;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp for ordered
        for (n = 0; n < 10; n++) {
            #pragma omp ordered
            printf("n = %d\n", n);
        }
    }
    return 0;
}

Which gives:

$ g++ -fopenmp order.cc
$ ./a.out
n = 0
n = 1
n = 2
n = 3
n = 4
n = 5
n = 6
n = 7
n = 8
n = 9

However, two remarks here:

  1. This parallel loop is actually fully sequentialised, as there is nothing outside of the ordered directive.
  2. Moreover, if you wish to mitigate the impact of the serialisation induced by the ordered directive, you should play with the schedule clause. A good candidate for an "effective" scheduling in the case of an ordered loop would be schedule(static,1), for example.

And finally, Hristo Iliev explains it all better than I'll ever be able to, in the first answer above.

OpenMP parallel for with ordered and critical directives


I am failing to figure out why s contains the i indices (first number) in a strict sequence despite the absence of the ordered directives and why printf("i: %d tid: %d\n", i, tid) shows them in a different order?

With static scheduling there is a fixed mapping between loop iteration and thread that executes it, which is why no matter how many times you run the program, if the number of threads is kept the same, s[i] will always be set to "i:same_thread_id". Printing s[] takes place in a sequential loop outside the parallel region, hence the output is ordered. I would be more surprised if that loop were to print things out of order. As for the printf() calls within the parallel region, you have schedule(static,1), which means each iteration gets executed by a different thread, and those run in arbitrary order.
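
A hypothetical reconstruction of the code being described (the array size, the number of threads and the format strings are my own guesses based on the question):

#include <stdio.h>
#include <omp.h>

#define N 4  // assume as many iterations as threads

int main(void) {
    char s[N][32];
    omp_set_num_threads(N);

    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < N; i++) {
        int tid = omp_get_thread_num();
        sprintf(s[i], "%d:%d", i, tid);     // fixed i -> tid mapping
        printf("i: %d tid: %d\n", i, tid);  // appears in arbitrary order
    }

    // Sequential loop outside the parallel region: output is always ordered.
    for (int i = 0; i < N; i++)
        printf("s[%d] = %s\n", i, s[i]);

    return 0;
}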

Adding ordered to the omp parallel for clause doesn't seem to change anything unless omp ordered is put inside the loop body.

That is exactly how ordered works. There are the ordered clause and the ordered region. The clause modifies the behaviour of the for worksharing construct and enables ordered execution of the denoted region inside. There are additional synchronisation requirements for ordered execution to work properly that aren't needed otherwise, which is why the clause exists. The region exists so that only a part of the loop body has to run in order; having the entire loop body run in order is meaningless, as it would be no different from sequential (non-parallel) loop execution. See the first answer above for more details.

This is totally bewildering since in my understanding the critical section must guarantee that sprintf and printf statements are executed by the same thread without any interruptions.

Critical sections guarantee that no two threads execute the same code region simultaneously. They in no way enforce the order of the encountering threads. Since no two threads access the same element of s[], having a critical section in 3. changes nothing. It serialises the loop execution since no two threads can execute the body at the same time, but it doesn't make the loop run in sequential order.
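
A minimal sketch of the distinction (my own example): the critical section below prevents two threads from printing at the same time, but the order of the lines still varies from run to run.

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < 8; i++) {
        #pragma omp critical
        {
            // Only one thread at a time executes this block, but which
            // iteration reaches it first is not determined.
            printf("i = %d (thread %d)\n", i, omp_get_thread_num());
        }
    }
    return 0;
}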

Does an OpenMP ordered for always assign parts of the loop to threads in order, too?

No, the code in its current state is not portable. It will work only if the default loop schedule is static, that is, the iteration space is divided into count / #threads contiguous chunks and then assigned to the threads in the order of their thread ID with a guaranteed mapping between chunk and thread ID. But the OpenMP specification does not prescribe any default schedule and leaves it to the implementation to pick one. Many implementations use static, but that is not guaranteed to always be the case.

If you add schedule(static) to all loop constructs, then the combination of the ordered clause and the ordered construct within each loop body will ensure that thread 0 receives the first chunk of iterations and is also the first one to execute the ordered construct. For loops that run over the number of threads, the chunk size will be one, i.e. each thread will execute exactly one iteration, and the order of the iterations of the parallel loop will match that of a sequential loop. The 1:1 mapping of iteration number to thread ID done by the static schedule will then ensure the behaviour you are aiming for.

Note that if the first loop, where you initialise the thread-local PRNGs, is in a different parallel region, you must ensure that both parallel regions execute with the same number of threads, e.g., by disabling dynamic team sizing (omp_set_dynamic(0);) or by explicit application of the num_threads clause.
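
A sketch of what that could look like; the per-thread PRNG state and the seeding scheme are placeholders of mine, not the code from the question:

#include <omp.h>

#define MAX_THREADS 64
static unsigned int rng_state[MAX_THREADS];  // hypothetical per-thread PRNG state

void init_and_use(void) {
    omp_set_dynamic(0);  // forbid the runtime from changing the team size

    #pragma omp parallel num_threads(4)
    {
        // First region: seed the PRNG belonging to this thread.
        rng_state[omp_get_thread_num()] = 42u + (unsigned)omp_get_thread_num();
    }

    #pragma omp parallel num_threads(4)
    {
        // Second region: same team size, so the same thread IDs exist and
        // each thread finds the state it initialised in the first region.
        unsigned int *state = &rng_state[omp_get_thread_num()];
        (void)state;  // ... draw random numbers from *state here ...
    }
}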

As to the significance of the ordered clause + construct, it does not influence the assignment of iterations to threads, but it synchronises the threads and makes sure that the physical execution order will match the logical one. A statically scheduled loop without an ordered clause will still assign iteration 0 to thread 0, but there will be no guarantee that some other thread won't execute its loop body ahead of thread 0. Also, any code in the loop body outside of the ordered construct is still allowed to execute concurrently and out of order; see the first answer above for a more detailed explanation.

Difference between omp ordered and omp critical

omp critical is for mutual exclusion; omp ordered refers to a specific loop and ensures that the region executes sequentially in the order of loop iterations. Therefore omp ordered is stronger than omp critical, but it also only makes sense within a loop.

omp ordered has some other clauses, such as simd to enforce the use of a single SIMD lane only. You can also specify dependencies manually with the depend clause.
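
As a sketch of the depend form (an OpenMP 4.5 feature, so it assumes a reasonably recent compiler, e.g. GCC 6 or later), a loop-carried dependence of depth one can be expressed like this:

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    static double a[N] = {1.0};

    // ordered(1) declares one level of loop-carried dependence; the
    // depend clauses then describe it explicitly (a "doacross" loop).
    #pragma omp parallel for ordered(1)
    for (int i = 1; i < N; i++) {
        #pragma omp ordered depend(sink: i-1)  // wait until iteration i-1 is done
        a[i] = a[i-1] * 0.5 + i;
        #pragma omp ordered depend(source)     // signal that iteration i is done
    }

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}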

Note: Both omp critical and omp ordered regions have an implicit memory flush at the entry and the exit.

openmp ordering critical sections

Your attempt at ordering threads with #pragma omp critical is totally incorrect. There can be just one thread in a critical section at any time, and the order in which the threads arrive at the critical section is not determined. So in your code it can happen that e.g. thread #2 enters the critical section first and never leaves it, waiting for thread #1 to complete, while thread #1 and the rest are waiting at #pragma omp critical. And even if some threads, e.g. thread #0, are lucky enough to complete the critical section in the right order, they will wait on the implicit barrier at the end of the parallel region. In other words, a deadlock is almost guaranteed in this code.

I suggest you do something much simpler and natural to order your threads, namely an ordered section. It should look like this:

#include <assert.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();

        // Computations done by each thread

        #pragma omp for ordered schedule(static,1)
        for (int t = 0; t < omp_get_num_threads(); ++t) {
            assert(t == ID);
            #pragma omp ordered
            {
                // Do the stuff you want to be in order
            }
        }
    }
    return 0;
}

So you create a parallel loop with the number of iterations equal to the number of threads in the region. The schedule(static,1) clause makes it explicit that the iterations are assigned one per thread in the order of thread IDs, and the ordered clause allows ordered sections to be used inside the loop. Now in the body of the loop you put an ordered section (the block following #pragma omp ordered), and it will be executed in the order of iterations, which is also the order of thread IDs (as ensured by the assertion).

For more information, you may look at this question: How does the omp ordered clause work?

Two openmp ordered blocks with no resulting parallelization

This program is non-conforming to the OpenMP standard. Specifically, the problem is that you have more than one ordered region and every iteration of your loop will execute both of them. The OpenMP 4.0 standard has this to say (2.12.8, Restrictions, line 16, p 139):

During execution of an iteration of a loop or a loop nest within a loop region, a thread must not execute more than one ordered region that binds to the same loop region.

If you have more than one ordered region, you must have conditional code paths such that only one of them can be executed for any loop iteration, as in the sketch below.
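
For example, a conforming variant with two ordered regions could look like this (a sketch of my own); each iteration executes exactly one of them:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel for ordered
    for (int i = 0; i < 10; i++) {
        if (i % 2 == 0) {
            #pragma omp ordered
            printf("even: %d\n", i);
        } else {
            #pragma omp ordered
            printf("odd: %d\n", i);
        }
    }
    return 0;
}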


It is also worth noting that the position of your ordered region seems to have performance implications. Testing with gfortran 5.2, it appears that everything after the ordered region is executed in order for each loop iteration, so having the ordered block at the beginning of the loop leads to serial performance, while having it at the end of the loop avoids this, as the code before the block is parallelised. Testing with ifort 15 is not as dramatic, but I would still recommend structuring your code so that the ordered block occurs after any code that needs parallelisation in a loop iteration, rather than before it.

Is there a way to cancel from inside ordered clause?

No. It is not possible to cancel an ordered loop.

A loop construct that is canceled must not have an ordered clause.

(cf. 2.14.1 of the OpenMP standard)

One workaround to emulate cancellation is to add a skip at the beginning of the loop body, e.g.:

#pragma omp parallel for ordered schedule(dynamic) shared(prime_counter)
for (candidate = start; candidate <= end; candidate += 2) {
    if (prime_counter >= max_primes) {
        continue;  // emulate cancellation by skipping the rest of the body
    }
    if (is_prime(candidate)) {
However, that is not yet a thread safe access to prime_counter. In order to avoid race conditions, you must do something along the lines of:

int local_prime_counter;
#pragma omp atomic read
local_prime_counter = prime_counter;
if (local_prime_counter >= max_primes)
    continue;

...

#pragma omp atomic update
prime_counter++;
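
Putting the two fragments together, a complete sketch could look like the following. is_prime, start, end and max_primes are assumptions modelled on the question, and since the counter check at the top is only advisory, slightly more than max_primes primes may be printed:

#include <stdio.h>
#include <omp.h>

// Naive stand-in for the question's primality test.
static int is_prime(int n) {
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

int main(void) {
    const int start = 3, end = 100000, max_primes = 100;
    int prime_counter = 0;

    #pragma omp parallel for ordered schedule(dynamic) shared(prime_counter)
    for (int candidate = start; candidate <= end; candidate += 2) {
        int local_prime_counter;
        #pragma omp atomic read
        local_prime_counter = prime_counter;
        if (local_prime_counter >= max_primes)
            continue;  // emulate cancellation: skip the rest of the body

        if (is_prime(candidate)) {
            #pragma omp ordered
            {
                printf("%d\n", candidate);  // printed in iteration order
                #pragma omp atomic update
                prime_counter++;
            }
        }
    }
    return 0;
}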

P.S. I'm not quite 100% sure if it is standard conforming to have a conditional ordered construct.


