How does the omp ordered clause work?
The ordered clause works like this: different threads execute concurrently until they encounter the ordered region, which is then executed sequentially in the same order as it would get executed in a serial loop. This still allows for some degree of concurrency, especially if the code section outside the ordered region has substantial run time.
There is no particular reason to use the dynamic schedule instead of the static schedule with a small chunk size; it all depends on the structure of the code. Since ordered introduces a dependency between the threads, when used with schedule(static) and the default chunk size, the second thread would have to wait for the first one to finish all its iterations, then the third thread would have to wait for the second one to finish its iterations (and hence for the first one too), and so on. One can easily visualise it with 3 threads and 9 iterations (3 per thread):
tid  List of      Timeline
     iterations
 0   0,1,2        ==o==o==o
 1   3,4,5        ==.......o==o==o
 2   6,7,8        ==..............o==o==o
= shows that the thread is executing code in parallel, o is when the thread is executing the ordered region, and . is the thread sitting idle, waiting for its turn to execute the ordered region. With schedule(static,1) the following would happen:
tid  List of      Timeline
     iterations
 0   0,3,6        ==o==o==o
 1   1,4,7        ==.o==o==o
 2   2,5,8        ==..o==o==o
I believe the difference in both cases is more than obvious. With schedule(dynamic) the pictures above would become more or less random, as the list of iterations assigned to each thread is non-deterministic. It would also add additional overhead. Dynamic scheduling is only useful when the amount of computation differs from iteration to iteration and the computation takes much more time than the added overhead of dynamic scheduling.
Don't worry about the lowest numbered iteration. It is usually handed to the first thread in the team to become ready to execute code.
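To make the timelines above concrete, here is a minimal sketch (mine, not from the original answer): the usleep() call stands in for the "==" work outside the ordered region, and with schedule(static,1) the iterations still print strictly in order 0..8 while the work portions overlap across threads:

#include <stdio.h>
#include <unistd.h>   // usleep() is POSIX; it simulates real per-iteration work
#include <omp.h>

int main() {
    #pragma omp parallel num_threads(3)
    {
        #pragma omp for ordered schedule(static,1)
        for (int i = 0; i < 9; i++) {
            usleep(100000);                     // the "==" part: runs concurrently
            #pragma omp ordered
            printf("iteration %d done by thread %d\n",
                   i, omp_get_thread_num());    // the "o" part: strictly 0..8
        }
    }
    return 0;
}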
OpenMP ordered clause
The OpenMP ordered clause is supposed to be used in two distinct stages:
- As a clause of the for directive, to indicate some possible ordering of iterations; and then
- As a standalone directive, to indicate which parts of the statements inside the loop are to be kept ordered.
So in essence, your example should be:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    int n;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp for ordered     // the clause: this loop contains ordered regions
        for (n = 0; n < 10; n++)
            #pragma omp ordered     // the directive: this statement runs in iteration order
            printf("n = %d\n", n);
    }
    return 0;
}
Which gives:
$ g++ -fopenmp order.cc
$ ./a.out
n = 0
n = 1
n = 2
n = 3
n = 4
n = 5
n = 6
n = 7
n = 8
n = 9
However, two remarks here:
- This parallel loop is actually fully sequentialised, as there is nothing outside of the ordered directive.
- Moreover, if you wish to mitigate the impact of the serialisation induced by the ordered directive, you should play with the schedule clause. A good candidate for an "effective" scheduling in the case of an ordered loop would be schedule( static, 1 ), for example.
And finally, Hristo Iliev explains it all better than I'll ever be able to here
OpenMP parallel for with ordered and critical directives
I am failing to figure out why s contains the i indices (first number) in a strict sequence despite the absence of the ordered directives, and why printf("i: %d tid: %d\n", i, tid) shows them in a different order?
With static scheduling there is a fixed mapping between a loop iteration and the thread that executes it, which is why, no matter how many times you run the program, if the number of threads is kept the same, s[i] will always be set to "i:same_thread_id". Printing s[] takes place in a sequential loop outside the parallel region, hence the output is ordered; I would be more surprised if that loop were to print things out of order. As for the printf() calls within the parallel region, you have schedule(static,1), which means each iteration gets executed by a different thread, and those run in arbitrary order.
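For reference, here is a minimal reconstruction of the kind of program under discussion (an assumption pieced together from the quoted fragments; the array size and format string are mine):

#include <stdio.h>
#include <omp.h>

int main() {
    char s[8][16];                          // one slot per iteration

    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < 8; i++) {
        int tid = omp_get_thread_num();
        sprintf(s[i], "%d:%d", i, tid);     // fixed iteration-to-thread mapping
        printf("i: %d tid: %d\n", i, tid);  // appears in arbitrary order
    }

    for (int i = 0; i < 8; i++)             // sequential loop: ordered output
        printf("%s\n", s[i]);
    return 0;
}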
Adding ordered to the omp parallel for clause doesn't seem to change anything unless omp ordered is put inside the loop body.
That is exactly how ordered works. There are the ordered clause and the ordered region. The clause modifies the behaviour of the for worksharing construct and enables ordered execution of the denoted region inside. There are additional synchronisation requirements for ordered execution to work properly that aren't needed otherwise, which is why the clause exists. Also, the region exists so that only a part of the loop body can run in order: having the entire loop body run in order is meaningless, as it is no different from sequential (non-parallel) loop execution. See this answer of mine for more details.
This is totally bewildering, since in my understanding the critical section must guarantee that the sprintf and printf statements are executed by the same thread without any interruptions.
Critical sections guarantee that no two threads execute the same code region simultaneously; they in no way enforce the order of the encountering threads. Since no two threads access the same element of s[], having a critical section in case 3 changes nothing. It serialises the loop execution, since no two threads can execute the body at the same time, but it doesn't make the loop run in sequential order.
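A tiny sketch of that point (illustrative, not the asker's code): the critical section keeps each output line intact, but the order of the i values stays arbitrary:

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < 8; i++) {
        #pragma omp critical
        printf("i = %d (thread %d)\n", i, omp_get_thread_num());
        // one thread at a time, but i may come out as 3, 0, 5, ...
    }
    return 0;
}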
Does an OpenMP ordered for always assign parts of the loop to threads in order, too?
No, the code in its current state is not portable. It will work only if the default loop schedule is static, that is, if the iteration space is divided into count / #threads contiguous chunks which are then assigned to the threads in the order of their thread IDs, with a guaranteed mapping between chunk and thread ID. But the OpenMP specification does not prescribe any default schedule and leaves it to the implementation to pick one. Many implementations use static, but that is not guaranteed to always be the case.
If you add schedule(static) to all loop constructs, then the combination of the ordered clause and the ordered construct within each loop body will ensure that thread 0 receives the first chunk of iterations and is also the first one to execute the ordered construct. For the loops that run over the number of threads, the chunk size will be one, i.e. each thread will execute exactly one iteration and the order of the iterations of the parallel loop will match those of a sequential loop. The 1:1 mapping of iteration number to thread ID done by the static schedule will then ensure the behaviour you are aiming for.
Note that if the first loop, where you initialise the thread-local PRNGs, is in a different parallel region, you must ensure that both parallel regions execute with the same number of threads, e.g., by disabling dynamic team sizing (omp_set_dynamic(0);) or by explicit application of the num_threads clause.
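For instance, a hedged sketch of pinning both regions to the same team size (the seeds array and the rand_r() seeding scheme are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>   // rand_r() is POSIX
#include <omp.h>

#define NTHREADS 4
unsigned int seeds[NTHREADS];               // one PRNG state per thread ID

int main() {
    omp_set_dynamic(0);                     // forbid the runtime from shrinking the team

    #pragma omp parallel num_threads(NTHREADS)
    seeds[omp_get_thread_num()] = 42u + omp_get_thread_num();   // init thread-local PRNGs

    #pragma omp parallel num_threads(NTHREADS)                  // guaranteed same team size
    {
        int tid = omp_get_thread_num();
        printf("thread %d draws %d\n", tid, rand_r(&seeds[tid]));
    }
    return 0;
}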
As to the significance of the ordered clause + construct: it does not influence the assignment of iterations to threads, but it synchronises the threads and makes sure that the physical execution order will match the logical one. A statically scheduled loop without an ordered clause will still assign iteration 0 to thread 0, but there will be no guarantee that some other thread won't execute its loop body ahead of thread 0. Also, any code in the loop body outside of the ordered construct is still allowed to execute concurrently and out of order - see here for a more detailed explanation.
Difference between omp ordered and omp critical
omp critical is for mutual exclusion, while omp ordered refers to a specific loop and ensures that the region executes sequentially in the order of the loop iterations. Therefore omp ordered is stronger than omp critical, but it also only makes sense within a loop.
omp ordered has some other clauses, such as simd, to enforce the use of a single SIMD lane only. You can also specify dependencies manually with the depend clause.
Note: both omp critical and omp ordered regions have an implicit memory flush at the entry and the exit.
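A short contrasting sketch (mine, not from the answer): both loops serialise the printf() calls, but only the second one fixes their order:

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        #pragma omp critical
        printf("critical: %d\n", i);   // mutually exclusive, arbitrary order
    }

    #pragma omp parallel for ordered
    for (int i = 0; i < 4; i++) {
        #pragma omp ordered
        printf("ordered:  %d\n", i);   // always 0, 1, 2, 3
    }
    return 0;
}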
openmp ordering critical sections
Your attempt at ordering threads with #pragma omp critical is totally incorrect. There can be just one thread in a critical section at any time, and the order in which the threads arrive at the critical section is not determined. So in your code it can happen that, e.g., thread #2 enters the critical section first and never leaves it, waiting for thread #1 to complete, while thread #1 and the rest are waiting at #pragma omp critical. And even if some threads, e.g. thread #0, are lucky enough to complete the critical section in the right order, they will wait on an implicit barrier at the end of the parallel region. In other words, deadlock is almost guaranteed in this code.
I suggest you do something much simpler and more natural to order your threads, namely an ordered section. It should look like this:
#include <assert.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        // Computations done by each thread

        #pragma omp for ordered schedule(static,1)
        for (int t = 0; t < omp_get_num_threads(); ++t) {
            assert(t == ID);   // iteration t is executed by thread t
            #pragma omp ordered
            {
                // Do the stuff you want to be in order
            }
        }
    }
    return 0;
}
So you create a parallel loop with the number of iterations equal to the number of threads in the region. The schedule(static,1) clause makes it explicit that the iterations are assigned one per thread, in the order of the thread IDs, and the ordered clause allows ordered sections to be used inside the loop. Now in the body of the loop you put an ordered section (the block following #pragma omp ordered), and it will be executed in the order of the iterations, which is also the order of the thread IDs (as ensured by the assertion).
For more information, you may look at this question: How does the omp ordered clause work?
Two openmp ordered blocks with no resulting parallelization
This program is non-conforming to the OpenMP standard. Specifically, the problem is that you have more than one ordered region and every iteration of your loop will execute both of them. The OpenMP 4.0 standard has this to say (2.12.8, Restrictions, line 16, p. 139):

During execution of an iteration of a loop or a loop nest within a loop region, a thread must not execute more than one ordered region that binds to the same loop region.
If you have more than one ordered region, you must have conditional code paths such that only one of them can be executed for any loop iteration.
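Such a conforming arrangement could look like the following sketch (an assumed example; note that the P.S. in the last answer below voices some doubt about conditional ordered constructs):

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel for ordered
    for (int i = 0; i < 8; i++) {
        if (i % 2 == 0) {
            #pragma omp ordered
            printf("even branch: %d\n", i);   // the branch guarantees at most one
        } else {                              // ordered region per iteration
            #pragma omp ordered
            printf("odd branch:  %d\n", i);
        }
    }
    return 0;
}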
It is also worth noting that the position of your ordered region seems to have performance implications. Testing with gfortran 5.2, it appears everything after the ordered region is executed in order for each loop iteration, so having the ordered block at the beginning of the loop leads to serial performance, while having the ordered block at the end of the loop does not have this implication, as the code before the block is parallelized. Testing with ifort 15 is not as dramatic, but I would still recommend structuring your code so that your ordered block occurs after any code that needs parallelization in a loop iteration, rather than before.
Is there a way to cancel from inside ordered clause?
No. It is not possible to cancel an ordered loop.
A loop construct that is canceled must not have an ordered clause.
(cf. 2.14.1 of the OpenMP standard)
One of the workarounds to emulate cancellation is to add a skip at the beginning of the loop, e.g.:
#pragma omp parallel for ordered schedule(dynamic) shared(prime_counter)
for (candidate = start; candidate <= end; candidate += 2) {
    if (prime_counter >= max_primes) {
        continue;          // emulated cancellation: skip the remaining work
    }
    if (is_prime(candidate)) {
        prime_counter++;   // not yet thread safe -- see below
    }
}
However, that is not yet a thread-safe access to prime_counter. In order to avoid race conditions, you must do something along the lines of:
int local_prime_counter;
#pragma omp atomic read
local_prime_counter = prime_counter;   // take a consistent snapshot
if (local_prime_counter >= max_primes)
    ...
#pragma omp atomic update
prime_counter++;
P.S. I'm not quite 100% sure if it is standard conforming to have a conditional ordered construct.
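Putting the pieces together, the complete workaround might look like the following fragment (is_prime(), start, end and max_primes are carried over from the snippets above; the ordered printing of each prime is an assumption):

#pragma omp parallel for ordered schedule(dynamic) shared(prime_counter)
for (candidate = start; candidate <= end; candidate += 2) {
    int local_prime_counter;
    #pragma omp atomic read
    local_prime_counter = prime_counter;      // consistent snapshot
    if (local_prime_counter >= max_primes)
        continue;                             // emulated cancellation

    if (is_prime(candidate)) {
        #pragma omp ordered                   // conditional ordered -- see the P.S. above
        printf("%d\n", candidate);            // assumed: report primes in iteration order
        #pragma omp atomic update
        prime_counter++;
    }
}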