What's the Difference Between "Static" and "Dynamic" Schedule in OpenMP

What's the difference between static and dynamic schedule in OpenMP?

Others have since answered most of the question, but I would like to point to some specific cases where a particular scheduling type is better suited than the others. The schedule clause controls how loop iterations are divided among threads, and choosing the right schedule can have a great impact on the speed of the application.

static schedule means that iteration blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing with static scheduling is that the OpenMP runtime guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then in the second loop the same thread can access the same memory location faster, since it will reside on the same NUMA node.
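As a quick illustration (a minimal sketch, not part of the original answer), the following program prints the iteration-to-thread mapping of two statically scheduled loops; because both loops use schedule(static,1) with the same iteration count and thread count, both printouts show the identical mapping:

#include <stdio.h>
#include <omp.h>

int main (void)
{
    #pragma omp parallel num_threads(4)
    {
        /* First loop: record which thread runs which iteration. */
        #pragma omp for schedule(static,1)
        for (int i = 0; i < 8; i++)
            printf("[1] iter %d -> tid %d\n", i, omp_get_thread_num());

        /* Second loop: same schedule, same iteration count, so OpenMP
           guarantees the same iteration-to-thread assignment as above. */
        #pragma omp for schedule(static,1)
        for (int i = 0; i < 8; i++)
            printf("[2] iter %d -> tid %d\n", i, omp_get_thread_num());
    }
    return 0;
}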

Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with a 4-core CPU in each socket. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:

|             | core 0 | thread 0 |
| socket 0    | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
|             | core 3 | thread 3 |

|             | core 4 | thread 4 |
| socket 1    | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
|             | core 7 | thread 7 |

Each core can access memory from each NUMA node, but remote access is slower (1.5x - 1.9x slower on Intel) than local node access. You run something like this:

char *a = (char *)malloc(8*4096);

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
    memset(&a[i*4096], 0, 4096);

4096 bytes is the standard size of one memory page on Linux on x86 if huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call only reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous, but only in virtual memory. In physical memory, half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is because different parts are zeroed by different threads, those threads reside on different cores, and the first-touch NUMA policy means that memory pages are allocated on the NUMA node where the thread that first "touched" a page resides.

|             | core 0 | thread 0 | a[0]     ... a[4095]
| socket 0    | core 1 | thread 1 | a[4096]  ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192]  ... a[12287]
|             | core 3 | thread 3 | a[12288] ... a[16383]

|             | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1    | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
|             | core 7 | thread 7 | a[28672] ... a[32767]

Now let's run another loop like this:

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
    memset(&a[i*4096], 1, 4096);

Each thread will access the already mapped physical memory, with the same thread-to-memory-region mapping as during the first loop. It means that threads will only access memory located in their local memory blocks, which will be fast.

Now imagine that another scheduling scheme is used for the second loop: schedule(static,2). This will "chop" the iteration space into blocks of two iterations each, four blocks in total. In code, the only change is the chunk size in the schedule clause (a minimal sketch, reusing the array a from above):
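/* Only 8/2 = 4 chunks exist, so threads 4..7 receive no work at all. */
#pragma omp parallel for schedule(static,2) num_threads(8)
for (int i = 0; i < 8; i++)
    memset(&a[i*4096], 1, 4096);

What will happen is that we will have the following thread-to-memory-location mapping (through the iteration number):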

|             | core 0 | thread 0 | a[0]     ... a[8191]  <- OK, same memory node
| socket 0    | core 1 | thread 1 | a[8192]  ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
|             | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory

|             | core 4 | thread 4 | <idle>
| socket 1    | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
|             | core 7 | thread 7 | <idle>

Two bad things happen here:

  • threads 4 to 7 remain idle and half of the compute capability is lost;
  • threads 2 and 3 access non-local memory and it will take them about twice as much time to finish, during which threads 0 and 1 will remain idle.

So one of the advantages of using static scheduling is that it improves locality of memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.

dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely will) produce completely different "iteration space" -> "threads" mappings, as one can easily verify:

$ cat dyn.c
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i;

    #pragma omp parallel num_threads(8)
    {
        #pragma omp for schedule(dynamic,1)
        for (i = 0; i < 8; i++)
            printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());

        #pragma omp for schedule(dynamic,1)
        for (i = 0; i < 8; i++)
            printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
    }

    return 0;
}

$ icc -openmp -o dyn.x dyn.c

$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4

(same behaviour is observed when gcc is used instead)

If the sample code from the static section were run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs. This fact is often overlooked, and hence suboptimal performance is achieved.

There is another reason to choose between static and dynamic scheduling: workload balancing. If the time to complete each iteration differs vastly from the mean, high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first, and hence for 2/3 of the compute time the first thread will be idle. Dynamic schedule introduces some additional overhead but in that particular case will lead to a much better workload distribution. A special kind of dynamic scheduling is guided, where smaller and smaller iteration blocks are given to each thread as the work progresses.
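Here is a minimal sketch of that situation (the timing harness is assumed, not taken from the original): iteration i performs work proportional to i+1, so with two threads schedule(static) gives the second thread roughly three times the work of the first, while schedule(dynamic,1) evens the load out at the cost of some scheduling overhead:

#include <stdio.h>
#include <omp.h>

/* Hypothetical kernel: work grows linearly with n. */
void work(long n) {
    volatile long sum = 0;   /* volatile keeps the loop from being optimized away */
    for (long w = 0; w < n; w++) sum += w;
}

int main (void)
{
    double t0 = omp_get_wtime();
    /* Swap in schedule(static) here to observe the imbalance. */
    #pragma omp parallel for schedule(dynamic,1) num_threads(2)
    for (int i = 0; i < 32; i++)
        work((long)(i + 1) * 10000000L);
    printf("elapsed: %f s\n", omp_get_wtime() - t0);
    return 0;
}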

Since precompiled code could be run on various platforms, it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling the type is taken from the content of the environment variable OMP_SCHEDULE. This makes it possible to test different scheduling types without recompiling the application and also lets the end user fine-tune for his or her platform.
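For example (a sketch; process(), N, and the binary name app.x are placeholders), the loop is marked with schedule(runtime) and the actual policy is picked at launch time:

#pragma omp parallel for schedule(runtime)
for (int i = 0; i < N; i++)
    process(i);   /* hypothetical per-iteration work */

$ OMP_SCHEDULE="dynamic,4" ./app.x
$ OMP_SCHEDULE="static,16" ./app.x
$ OMP_SCHEDULE="guided" ./app.x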

Difference between static and dynamic schedule in OpenMP in C

The problem is that this code is not OpenMP compliant and non-compliant programs have "unspecified" behavior. If you look at the OpenMP API V3.0 spec, section 2.5.1 Loop Construct, under the description it states:

The iteration count for each associated loop is computed before entry to the outermost loop. If execution of any associated loop changes any of the values used to compute any of the iteration counts, then the behavior is unspecified.

The big difference between the static and dynamic schedule types is that with static the assignment of chunks to threads is determined up front (it can even be computed at compile time), while with dynamic it is decided at run-time, which requires more locking.

OpenMP Dynamic vs Guided Scheduling


What affects the runtime of guided scheduling?

There are three effects to consider:

1. Load balance

The whole point of dynamic/guided scheduling is to improve the distribution of work in the case where loop iterations do not all contain the same amount of work. Fundamentally:

  • schedule(dynamic, 1) provides optimal load balancing
  • dynamic, k will always have the same or better load balancing than guided, k

The standard mandates that the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to k.

The GCC OpenMP implementation takes this literally, ignoring the proportionality. For example, with 4 threads and k=1, it will divide 32 iterations as 8, 6, 5, 4, 3, 2, 1, 1, 1, 1. Now IMHO this is really stupid: it results in a bad load imbalance if the first 1/n iterations contain more than 1/n of the work.
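The sequence is easy to reproduce with a short simulation of the rule as inferred from the observed numbers (next chunk = ceil(remaining / nthreads), never below k; the actual libgomp code may differ in details). It also shows why guided produces far fewer chunks than dynamic: 10 chunks here versus 32 for dynamic, 1.

#include <stdio.h>

int main (void)
{
    long remaining = 32;
    const long nthreads = 4, k = 1;
    while (remaining > 0) {
        /* Ceiling division of the unassigned iterations by the team size. */
        long chunk = (remaining + nthreads - 1) / nthreads;
        if (chunk < k) chunk = k;
        if (chunk > remaining) chunk = remaining;
        printf("%ld ", chunk);   /* prints: 8 6 5 4 3 2 1 1 1 1 */
        remaining -= chunk;
    }
    printf("\n");
    return 0;
}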

Specific examples? Why is it slower than dynamic in some cases?

OK, let's take a look at a trivial example where the inner work decreases with the loop iteration:

#include <omp.h>

void work(long ww) {
    volatile long sum = 0;
    for (long w = 0; w < ww; w++) sum += w;
}

int main() {
    const long max = 32, factor = 10000000l;
    #pragma omp parallel for schedule(guided, 1)
    for (int i = 0; i < max; i++) {
        work((max - i) * factor);
    }
}
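To reproduce the measurement, the example can be built with any OpenMP-capable compiler; the commands below are the usual ones, assumed rather than quoted from the original (the footnote mentions GCC 6.3.1, and guided.c is a hypothetical file name):

$ gcc -fopenmp -O2 -o guided.x guided.c
$ OMP_NUM_THREADS=4 ./guided.x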

The execution looks like this[1]:

[Image: example execution timeline]

As you can see, guided does really badly here. guided will do much better for different kinds of work distributions. It is also possible to implement guided differently. The implementation in clang (which IIRC stems from Intel) is much more sophisticated. I truly do not understand the idea behind GCC's naive implementation. In my eyes it effectively defeats the purpose of dynamic load balancing if you give 1/n of the work to the first thread.

2. Overhead

Now each dynamic chunk has some performance impact due to accessing shared state. The overhead of guided will be slightly higher per chunk than dynamic, as there is a bit more computation to do. However, guided, k will have fewer total dynamic chunks than dynamic, k.

The overhead will also depend on the implementation, e.g. whether it uses atomics or locks for protecting the shared state.

3. Cache- and NUMA-effects

Let's say you write to a vector of integers in your loop iteration. If every second iteration were executed by a different thread, every second element of the vector would be written by a different core. That's really bad because by doing so, the threads compete for cache lines that contain neighbouring elements (false sharing). If you have small chunk sizes and/or chunk sizes that don't align nicely to caches, you get bad performance at the "edges" of chunks. This is why you usually prefer nice, large (2^n) chunk sizes. guided can give you larger chunk sizes on average, but not 2^n (or k*m).
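A minimal sketch of the false-sharing case (array name and sizes assumed): with schedule(static,1) neighbouring elements of v are written by different threads, so several threads' writes land in the same 64-byte cache line, while schedule(static,16) keeps each full cache line within a single thread:

#include <stdio.h>

#define N 1024
int v[N];

int main (void)
{
    /* Chunk size 1 interleaves threads element by element: adjacent ints
       (4 bytes each) share 64-byte cache lines, causing false sharing.
       Changing this to schedule(static,16) aligns chunks to cache lines. */
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < N; i++)
        v[i] = i;
    printf("%d\n", v[N - 1]);
    return 0;
}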

This answer (which you already referenced) discusses at length the disadvantages of dynamic/guided scheduling in terms of NUMA, but this also applies to locality/caches.

Don't guess, measure

Given the various factors and the difficulty of predicting specifics, I can only recommend measuring your specific application, on your specific system, in your specific configuration, with your specific compiler. Unfortunately there is no perfect performance portability. I would personally argue that this is particularly true for guided.

When would I favor guided over dynamic or vice-versa?

If you have specific knowledge about the overhead / work-per-iteration, I would say that dynamic, k gives you the most stable results by choosing a good k. In particular, you do not depend so much on how clever the implementation is.

On the other hand, guided may be a good first guess, with a reasonable overhead / load balancing ratio, at least for a clever implementation. Be particularly careful with guided if you know that later iteration times are shorter.

Keep in mind, there is also schedule(auto), which gives complete control to the compiler/runtime, and schedule(runtime), which allows you to select the scheduling policy during runtime.

Once this has been explained, do the sources above support your explanation? Do they contradict at all?

Take the sources, including this answer, with a grain of salt. Neither the chart you posted nor my timeline picture are scientifically accurate numbers. There is a huge variance in the results and there are no error bars; they would probably be all over the place with these very few data points. Also, the chart combines the multiple effects I mentioned without disclosing the Work code.

[From Intel docs]

By default the chunk size is approximately loop_count/number_of_threads.

This contradicts my observation that icc handles my small example much better.

[1]: Using GCC 6.3.1, Score-P / Vampir for visualization, two alternating work functions for coloring.

Is dynamic scheduling better or static scheduling (Parallel Programming)?

Your results are not revealing enough to show a strong difference between the dynamic and static scheduling approaches. I find measuring speedup more appropriate in your context, where you want to see the behaviour of your parallel scalability. You can also use different metrics, such as weak and strong scaling.

You hardly reach a speedup of 2 with either schedule using the coarse-grained approach. This is not enough to conclude anything. Moreover, you cannot analyze the results from your fine-grained implementation, since it shows no parallel gain (which can be explained by the poor workload each thread receives). Get good parallel scalability first.

Generally, I choose static or dynamic scheduling depending on the type of computation I am working on:

  • Static scheduling where the computation workload is regular (the same for each thread), such as basic image convolution or naive matrix computation. For instance, using static scheduling for a Gaussian filter should be the best option.

  • Dynamic scheduling where the computation workload is irregular, such as the Mandelbrot set (see the sketch after this list). The way dynamic works is a little more complex (chunks are not precomputed as in static scheduling), hence some overhead might appear.

  • Guided scheduling is quite similar to dynamic scheduling but starts with a large chunk size that decreases over time.
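As a sketch of the irregular case (not from the original answers), a Mandelbrot-style loop is a natural fit for dynamic scheduling because the per-pixel iteration count varies enormously between rows:

#include <complex.h>
#include <stdio.h>

#define W 1024
#define H 1024

int depth[W * H];

int main (void)
{
    /* Pixels inside the set hit the iteration cap while outside pixels
       escape quickly, so rows are handed out to threads dynamically. */
    #pragma omp parallel for schedule(dynamic,1)
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            double complex c = (x - W/2) * 4.0/W + I * (y - H/2) * 4.0/H;
            double complex z = 0;
            int it = 0;
            while (cabs(z) < 2.0 && it < 1000) { z = z*z + c; it++; }
            depth[y*W + x] = it;
        }
    }
    printf("%d\n", depth[0]);
    return 0;
}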

In your case, your n-body simulation involves quite regular work, so static scheduling should be more appropriate. Achieving good parallel scalability is sometimes empirical and depends on your context.

I recommend that you first let OpenMP choose the best scheduling and chunk size for you, and then experiment if needed.

Why dynamic scheduling is faster than static scheduling in OpenMP

It's really hard to assess the performance of any piece of code, especially such a short one. There are many factors that contribute, starting with your specific machine. How many times did you run the code? How long was each run? You need good statistics to be able to tell definitively that something is faster or slower than something else. Even then, the same might not hold for someone else. What processor are you using, how many cores does it have, and how many threads are you scheduling?

A lot can also depend on your choice of chunk size. Perhaps the choice you're making is not optimal. If you want to find out more, perform a sweep over various chunk values and benchmark the performance (a sketch of such a sweep follows below). That being said, the probable reason for your observation is that your system is being utilized differently during the two tests. Dynamic scheduling adapts better to runtime conditions.
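A sketch of such a sweep (the kernel here is a hypothetical stand-in for whatever your loop body does): the chunk size in a schedule clause may be a runtime expression, so a single binary can benchmark a whole range of values:

#include <stdio.h>
#include <omp.h>

/* Hypothetical stand-in for the real per-iteration work. */
static void kernel(int i) {
    volatile double x = 0;
    for (int j = 0; j < (i & 0xFF); j++) x += j;
}

int main (void)
{
    const int n = 1 << 20;
    for (int chunk = 1; chunk <= 4096; chunk *= 2) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for schedule(dynamic, chunk)
        for (int i = 0; i < n; i++)
            kernel(i);
        printf("chunk %5d: %f s\n", chunk, omp_get_wtime() - t0);
    }
    return 0;
}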

OpenMP: scheduling performance

There are three main schedule kinds defined by OpenMP: static, dynamic and guided.

  1. Static: splits the iterations evenly between threads beforehand (the mapping is fixed before the loop starts);
  2. Dynamic: distributes the chunks to threads as they finish, at run-time;
  3. Guided: as dynamic, but the chunk size decreases with each successive allocation;

The default scheduling is implementation defined (not specified in the standard). So, for your question, depending on the compiler, it may change nothing (if the default is static). Here is what happens if it does change something:

  • Static scheduling is best for regular tasks, meaning that each iteration of the loop takes the same time. It lowers the overhead of synchronizing task distribution.
  • Dynamic scheduling is best for irregular tasks, where your iterations may have different execution times. This is useful because one thread can process multiple small tasks while another processes fewer, longer ones.
  • Guided scheduling improves the global load balance by diminishing the granularity of your chunks over time. It is more efficient to distribute small tasks at the end of your parallel for, to diminish the finish-time differences among your threads.

Is OpenMP dynamic scheduling the same as LPT scheduling when tasks are sorted by processing time?

Firstly, OpenMP schedule clauses apply to loops, not tasks, so it is confusing to talk about tasks in this context, since OpenMP also has tasks (and even taskloop). For the avoidance of confusion, I will call the loop-scheduled entity a "chunk", since it is a contiguous chunk of iterations.

Assuming you want to discuss the schedule clause on loops, then

  1. There is a fundamental difference between a scheduling algorithm like LPT, which assumes prior knowledge of chunk execution time, and any of the algorithms permitted by OpenMP, which do not require such knowledge.
  2. As of OpenMP 4.5 there is a schedule modifier (monotonic or nonmonotonic) which can be applied to the dynamic schedule (as well as others), and that affects the sequences the schedule can generate.
  3. In OpenMP 5.0 the default, undecorated schedule(dynamic) is equivalent to schedule(nonmonotonic:dynamic), which allows sequences that would not be possible with schedule(monotonic:dynamic) and would likely break your mapping (though you can use schedule(monotonic:dynamic), of course!)
  4. Since all of the OpenMP scheduling behaviour is described in terms of the execution state of the machine, it is certainly possible that a sequence will be produced that is not the one you would expect, since the machine state represents ground truth, reflecting issues like interference from other machine load, whereas a scheduling scheme like LPT is based on assumed prior knowledge of execution time which may not be reflected in reality.

You can see a discussion of schedule(nonmonotonic:dynamic) at https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Cownie.pdf
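Syntactically, the modifier goes in front of the schedule kind (a sketch; n and process() are placeholders, and OpenMP 4.5 or later is required):

/* Chunks are handed to each thread in increasing iteration order; without
   the modifier, OpenMP 5.0 treats dynamic as nonmonotonic by default,
   which permits techniques such as work stealing. */
#pragma omp parallel for schedule(monotonic: dynamic, 1)
for (int i = 0; i < n; i++)
    process(i);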

The use of OpenMP dynamic scheduling and its impact on performance - how to find detail

schedule(dynamic, 64) tells OpenMP not to assume that each inner-loop iteration takes the same time, IIRC.

So static scheduling breaks the work into ranges of u values, but one of your threads must be ending up with more total work than the rest (or taking longer for some reason), delaying the total time for all threads to be finished.


It was running on a compute server with four AMD Opterons. The machine was mostly idle at the time. The only difference between the two is the use of scheduling. I left out the time because it is inclusive of a preprocessing step that both incur, but it still differs by 10 seconds.

In that case, the 1.4 vs. 1.7 CPUs utilized overall could be explained by much more utilization during the 4 / 14 sec you're talking about. You could approximate the results for the interesting part by making your program exit right before the parallel part, and profiling that. Subtract those counts from your total counts to get a very rough approximation.


IDK why the work is unbalanced; it's your code + data; G.getAdjList(u) could be taking more time, or more likely in_edge_counts[u] is larger on average for some threads.


Differences in locality for conntrib[in_edge[j]] could make a lot of difference, causing cache misses for scattered reads or cache hits when different in_edge elements were close to previous values.

When you have different cores competing for memory + last-level cache bandwidth, delay in servicing a request for a cache-line of in_edge values would be worse than a delayed response for a line of conntrib[], because the CPU needs in_edge[] data before it knows which cache lines it needs for more conntrib data. So in_edge cache misses reduce memory parallelism, maybe making that thread get less of a share of memory bandwidth.

IDK how much these effects would balance out or not. More likely your data is just not evenly distributed.


Too bad AMD doesn't have efficient AVX2 gathers, because your data is perfect for vpgatherdd ymm. Not that it would help with fairness or anything, just (on Skylake) give a probably-minor speedup vs. scalar gather.

OpenMP parallel for with static schedule

When you use static scheduling, the OpenMP implementation has to ensure that all iterations are computed by some thread if the number of threads does not evenly divide the number of iterations.

From a load-balancing perspective the implementation will try to allocate roughly the same number of iterations to each thread and to avoid one thread receiving all the remaining iterations in excess of the even division. So, in your example with N=11 and four threads, the remainder will be 3 and the first three threads 0..2 will get one extra iteration, instead of assigning all 3 extra iterations to the last thread.
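This is easy to observe directly (a sketch; the split shown is the one the answer describes and what common implementations such as GCC's libgomp produce): printing the assignment for N=11 and four threads shows threads 0..2 taking three iterations each and thread 3 taking two:

#include <stdio.h>
#include <omp.h>

int main (void)
{
    /* 11 = 4*2 + 3, so the 3 leftover iterations go one each to
       threads 0, 1 and 2: the split is 3, 3, 3, 2. */
    #pragma omp parallel for schedule(static) num_threads(4)
    for (int i = 0; i < 11; i++)
        printf("iter %2d -> tid %d\n", i, omp_get_thread_num());
    return 0;
}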


