How to Use Locks in OpenMP

How to use a lock in OpenMP?

You want the omp_set_lock/omp_unset_lock functions:
https://hpc.llnl.gov/tuts/openMP/#OMP_SET_LOCK

Basically:

omp_lock_t writelock;

omp_init_lock(&writelock);

#pragma omp parallel for
for ( i = 0; i < x; i++ )
{
    // some stuff
    omp_set_lock(&writelock);
    // one thread at a time stuff
    omp_unset_lock(&writelock);
    // some stuff
}

omp_destroy_lock(&writelock);

Most locking routines, such as POSIX semaphores and System V semaphores, work on that sort of logic, although the specific API calls differ.
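
For comparison, here is a minimal sketch of the same acquire/release pattern with a Pthreads mutex; the worker function and its x parameter are placeholders, not code from the question:

#include <pthread.h>

pthread_mutex_t writelock = PTHREAD_MUTEX_INITIALIZER;

void worker(int x)
{
    for (int i = 0; i < x; i++) {
        /* some stuff */
        pthread_mutex_lock(&writelock);
        /* one thread at a time stuff */
        pthread_mutex_unlock(&writelock);
        /* some stuff */
    }
}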

How to use locks in sections the correct way in OpenMP?

From my top comments:

Your lock does not guarantee ordering of your threads--only atomic access to stdout.

So, for the latter, (e.g.) you'll always get "whole line" output:

Hello\n
World\n
Bye\n

And, not:

HeWoByllo\ne\nrld\n

But, with that you can get:

World\n
Bye\n
Hello\n

Also, does default(shared) mean that p is shared? I think p needs to be private [or needs to be set inside the critical section].

Further, the 3rd section doesn't do any locking.

I just ran the program, and got Th7 on all messages, so [AFAICT], p does need to be private.


One way to solve this is with a "ticket lock" [*].

Edit: After rereading your question, the emphasis on two locks may indicate an alternate solution, which I added in the UPDATE below.

Because we can't predict what p will be, each section needs a sequence number.

For example, Hello needs 1, World needs 2, and Bye needs 3.

We need to add a shared variable that specifies which section may go next. Each section must loop on that variable [under lock] until it matches its assigned order/sequence number and increment it on the way out to allow the thread that is next in the sequence to run.

[*] A "ticket lock" is modelled on (e.g.) a bakery where you "take a number" when entering and buy stuff when the sign says: "Now serving ..." See: https://en.wikipedia.org/wiki/Ticket_lock

In general, one of the nice things about a ticket lock is that it guarantees progress and balanced access. In practice, the number of retries is relatively small/infrequent. This is particularly true if it's implemented using stdatomic.h primitives.
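
To illustrate that last point (this is my sketch, not code from the answer), a minimal ticket lock built on stdatomic.h; the names ticket_lock, next_ticket, and now_serving are mine:

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;    // next ticket to hand out
    atomic_uint now_serving;    // ticket currently allowed in
} ticket_lock;

void ticket_acquire(ticket_lock *t)
{
    // take a number ...
    unsigned my = atomic_fetch_add(&t->next_ticket, 1);
    // ... and spin until the sign says "now serving" our number
    while (atomic_load(&t->now_serving) != my)
        ;
}

void ticket_release(ticket_lock *t)
{
    // let the next ticket holder in
    atomic_fetch_add(&t->now_serving, 1);
}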

Anyway, here's what I came up with:

#include <omp.h>
#include <stdio.h>

int main()
{
    int p;
    int owner = 1;              // shared "now serving" sequence number
    int more = 1;

    omp_lock_t lock;
    omp_init_lock(&lock);

    #pragma omp parallel sections default(shared) private(p) private(more)
    {

        #pragma omp section
        {
            more = 1;
            while (more) {
                p = omp_get_thread_num();
                omp_set_lock(&lock);

                if (owner == 1) {               // our turn?
                    printf("Th%d: Hello\n", p);
                    more = 0;
                    owner += 1;                 // pass the turn along
                }

                omp_unset_lock(&lock);
            }
        }

        #pragma omp section
        {
            more = 1;
            while (more) {
                p = omp_get_thread_num();
                omp_set_lock(&lock);

                if (owner == 2) {
                    printf("Th%d: World\n", p);
                    more = 0;
                    owner += 1;
                }

                omp_unset_lock(&lock);
            }
        }

        #pragma omp section
        {
            more = 1;
            while (more) {
                p = omp_get_thread_num();
                omp_set_lock(&lock);

                if (owner == 3) {
                    printf("Th%d: Bye\n", p);
                    more = 0;
                    owner += 1;
                }

                omp_unset_lock(&lock);
            }
        }
    }

    omp_destroy_lock(&lock);

    return 0;
}

Here's the program output. The p values can vary from run to run, but now, the ordering is always the same:

Th1: Hello
Th7: World
Th4: Bye

UPDATE:

The above solution is general [and I've used it a number of times in real production code], but it may not fit the requirement that you use two locks.

What I can think of is to have two locks that are initialized and locked by the main thread before entering any critical sections.

The first section (Hello) needs no lock and just goes first. The other two sections (World and Bye) will first lock on lock2 and lock3 respectively.

When Hello finishes, it unlocks lock2.

This allows World to get the lock and run. When finished, it unlocks both lock2 and lock3.

The unlock of lock3 allows Bye to get that lock. It prints and then unlocks lock3.

Here's what I came up with for that:

#include <omp.h>
#include <stdio.h>

int main()
{
    int p;

    omp_lock_t lock2;
    omp_init_lock(&lock2);
    omp_set_lock(&lock2);       // start locked; Hello releases it

    omp_lock_t lock3;
    omp_init_lock(&lock3);
    omp_set_lock(&lock3);       // start locked; World releases it

    #pragma omp parallel sections default(shared) private(p)
    {
        #pragma omp section
        {
            p = omp_get_thread_num();

            printf("Th%d: Hello\n", p);

            omp_unset_lock(&lock2);
        }

        #pragma omp section
        {
            p = omp_get_thread_num();
            omp_set_lock(&lock2);

            printf("Th%d: World\n", p);

            omp_unset_lock(&lock2);
            omp_unset_lock(&lock3);
        }

        #pragma omp section
        {
            p = omp_get_thread_num();
            omp_set_lock(&lock3);

            printf("Th%d: Bye\n", p);

            omp_unset_lock(&lock3);
        }
    }

    omp_destroy_lock(&lock2);
    omp_destroy_lock(&lock3);

    return 0;
}

locks in OpenMP

In the OpenMP specification (1.4.1, "The structure of the OpenMP memory model") you can read:

The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.

This practically means that (as with any relaxed memory model) threads are guaranteed to have the same, consistent view of the values of shared variables only at well-defined points. In between such points, the temporary view may differ across threads.

In your code you handled the problem of simultaneous writes to the same variable, but there is no guarantee that another thread reads the correct value of the variable without additional measures.

You have three options (note that each of these solutions not only handles simultaneous reads/writes but also provides a consistent view of the values of shared variables):

  1. If your variable is of scalar type, the best solution is to use atomic operations. This is the fastest option, as atomic operations are typically supported by the hardware.
#pragma omp parallel
{
    ...
    #pragma omp atomic read
    tmp = variant;
    ...
    #pragma omp atomic write
    variant = new_value;
}

  2. Use the critical construct. This solution can be used if your variable is of a complex type (such as a class) whose read/write cannot be performed atomically. Note that it is much less efficient (slower) than an atomic operation.
#pragma omp parallel
{
    ...
    #pragma omp critical
    tmp = variant;
    ...
    #pragma omp critical
    variant = new_value;
}

  3. Use locks for each read/write of your variable. Your code is OK for writes, but you have to use the lock for reads as well (see the sketch below). It requires the most coding, but in practice the result is the same as using the critical construct. Note that OpenMP implementations typically use locks to implement critical constructs.
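
For the third option, a minimal sketch in the same style as the snippets above (lock, variant, and new_value are illustrative names; the lock is assumed to be initialized once with omp_init_lock):

#pragma omp parallel
{
    ...
    omp_set_lock(&lock);    /* guard the read, not just the write */
    tmp = variant;
    omp_unset_lock(&lock);
    ...
    omp_set_lock(&lock);
    variant = new_value;
    omp_unset_lock(&lock);
}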

How to use lock in OpenMP

DISCLAIMER: I don't have experience with the lock mechanism and took a look at the standard to learn how this works. I might be wrong...


First, some problems with your code: this code won't compile with a recent version of gfortran. You have to move the function walltime into the contains section of your program, and you should USE omp_lib, which defines all necessary functions (and remove the resulting duplicate definitions). Additionally, you have to define your lock in the standard way:

integer(kind=OMP_LOCK_KIND), dimension(n) :: lck

Now to your question: The call to OMP_INIT_LOCK initializes your lck array to unlocked state. All threads will get a copy of this variable. Then the parallel section is started.

In the first loop, the array is initialized to something similar to a Hilbert matrix, and each lock is set.
The second block is only executed by the first thread, which releases the first lock. Still nothing interesting. The following loop is entered by all threads, and all threads wait for the k-th lock, because omp_set_lock blocks until the lock is acquired. The following omp_unset_lock lets the other threads follow. Because the 1st lock has already been released, all threads will immediately enter the inner loop, and finally one of the threads will release the next lock. By the time this thread releases the lock, the other threads might already be waiting for it.

In principle, this algorithm provides some form of synchronization to make sure that the data required by the (k+1)-th iteration has already been calculated when entering it.

OpenMP: Replace critical section with locks

Careless mistake: the lock should be defined (and initialized) outside the parallel block.
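
A minimal sketch of the correct placement (the lock name is hypothetical):

omp_lock_t lock;
omp_init_lock(&lock);       /* declare and initialize before the parallel block */

#pragma omp parallel
{
    omp_set_lock(&lock);
    /* one thread at a time */
    omp_unset_lock(&lock);
}

omp_destroy_lock(&lock);    /* destroy after the parallel block */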

Locking in OMP regions

The idiomatic mutex solution is to use OpenMP locks:

omp_set_lock(&taillock);
tail->next_record = r;
tail = r;
omp_unset_lock(&taillock);

and somewhere:

omp_lock_t taillock;
omp_init_lock(&taillock);

...

omp_destroy_lock(&taillock);

The simple OpenMP solution:

Record *start_timing(char *name) {
    Record *r = create_record(name);
    #pragma omp critical
    {
        tail->next_record = r;
        tail = r;
    }
    return r;
}

That creates an implicit global lock shared by all unnamed critical constructs. For some detailed discussions, see the answers to this question.
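
If you need independent critical sections, OpenMP also lets you name them; each name maps to its own lock, so differently named regions do not exclude each other. A sketch reusing the list update from above (the name list_update is mine):

#pragma omp critical(list_update)
{
    tail->next_record = r;
    tail = r;
}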

For practical purposes, using Pthread locks will also work, at least for scenarios where OpenMP is based on Pthreads.

A word of warning

Using locks in performance measurement code is dangerous. So is memory allocation, which also often implies taking locks. This means that start_timing has a significant cost, and performance will get even worse with more threads. That doesn't even consider the cache invalidation from having one thread allocate a chunk of memory (the record) and another thread then modify it (via the tail pointer).

Now that may be fine if the sections you measure take seconds, but it will cause great overhead and perturbation when your sections are only hundreds of cycles.

To create a scalable performance tracing facility, you must pre-allocate thread-local memory in larger chunks and have each thread write only to its local part.
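
A hedged sketch of that idea (all names and sizes are hypothetical): each thread gets its own pre-allocated slab, so the hot path involves no locks and no allocation:

#include <omp.h>
#include <stdlib.h>

#define MAX_EVENTS 4096     /* per-thread capacity; size to taste */

typedef struct {
    double timestamp;
    int id;
} Event;

typedef struct {
    Event *events;          /* pre-allocated slab owned by one thread */
    int count;
    char pad[48];           /* rough padding so counters don't share a cache line */
} ThreadBuffer;

static ThreadBuffer *buffers;

void tracing_init(void)     /* call once, before the parallel region */
{
    int n = omp_get_max_threads();
    buffers = malloc(n * sizeof *buffers);
    for (int t = 0; t < n; t++) {
        buffers[t].events = malloc(MAX_EVENTS * sizeof(Event));
        buffers[t].count = 0;
    }
}

void record_event(double ts, int id)    /* hot path: lock-free, allocation-free */
{
    ThreadBuffer *b = &buffers[omp_get_thread_num()];
    if (b->count < MAX_EVENTS) {
        b->events[b->count] = (Event){ ts, id };
        b->count++;
    }
}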

You can also choose to use an existing measurement infrastructure, such as Score-P.

Overhead & perturbation

First, distinguish between the two (linked) concepts. Overhead is extra time you spend, while perturbation refers to the impact on what you measure (i.e. you now measure something different from what happens without the measurement). Overhead is undesirable in large quantities, but perturbation is much worse.

Yes, you can avoid some of the perturbation by pausing the timer during your expensive measurement runtime (the overhead remains). However, in a multi-threaded context this is still very problematic.

  • Slowing down progress in one thread may lead to other threads waiting for it, e.g. during an implicit barrier. How do you attribute the waiting time of that thread and of others that follow transitively?
  • Memory allocation is usually locked, so if you allocate memory during measurement, you will slow down other threads that depend on memory allocation. You could try to mitigate this with memory pools, but I'd avoid the linked list in the first place.

OpenMP - Locks - Deadlock

It's quite simple. Deadlock happens if the first thread locks locka and the second thread locks lockb. What happens next?

The first thread wants to lock lockb. However, lockb is held by the second thread, so the first thread blocks and waits for lockb to be unlocked.

Similarly, the second thread wants to lock locka. However, locka is held by the first thread, so the second thread blocks and waits for locka to be unlocked.

Both threads are therefore blocked, and both wait for an unlock that never happens, since the unlocking code is below the line where the threads block.


If, in the second section, locka is locked first, the situation is different. Both threads first try to lock locka. However, only one of them can succeed; this is guaranteed by the OpenMP library. Therefore, only one thread (A) locks locka, and the other thread (B) blocks and waits until locka is unlocked.

Thread A then executes its code, since it is not blocked. When it finally unlocks locka, thread B is unblocked.


In this scenario, deadlock means that both threads are blocked and each waits for the other to do something to unblock it. In the first case exactly this can happen; in the second case it cannot.
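
A minimal sketch of the pattern being discussed (locka and lockb are assumed to be initialized simple locks):

#pragma omp parallel sections
{
    #pragma omp section
    {
        omp_set_lock(&locka);
        omp_set_lock(&lockb);   /* may block while the other section holds lockb */
        /* ... */
        omp_unset_lock(&lockb);
        omp_unset_lock(&locka);
    }

    #pragma omp section
    {
        omp_set_lock(&lockb);   /* deadlock risk: locks taken in the opposite order */
        omp_set_lock(&locka);   /* taking locka first here removes the deadlock */
        /* ... */
        omp_unset_lock(&locka);
        omp_unset_lock(&lockb);
    }
}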

How to use lock array indices using OpenMP?

The locks you've found are data structures that ensure only one thread at a time executes the protected code. If you'd like to protect the accesses to the matrix, you could go for a single lock for the whole matrix, one lock per cell, or something in between, such as one lock per row or column (see the sketch below).
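
As an illustration of the in-between option, a C sketch with one lock per row (NROWS, rowlock, and the updates are mine, loosely mirroring the force updates in the Fortran code below):

omp_lock_t rowlock[NROWS];              /* NROWS is a placeholder */

for (int i = 0; i < NROWS; i++)
    omp_init_lock(&rowlock[i]);

/* inside the parallel loop, to update row i: */
omp_set_lock(&rowlock[i]);
a[i][0] += rx * ff;
a[i][1] += ry * ff;
a[i][2] += rz * ff;
omp_unset_lock(&rowlock[i]);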

I would suggest using the omp atomic construct for each of the critical assignments you have. The benefit of atomic in contrast to omp critical and locks is that atomic can be implemented by the compiler/runtime using atomic processor instructions, which increases performance because it reduces the number of instructions executed by a single processor (see openMP, atomic vs critical? and http://openmp.org/forum/viewtopic.php?f=3&t=1549). Another benefit of the omp atomic construct is that your code compiles seamlessly when OpenMP is not enabled at compile time, which is not the case if you use locks.

This could be a variation of your code:

!$omp parallel
!$omp do private(rx,ry,rz,rsqd,r2i,r6i,virij,ff,i,j) schedule(dynamic)
do i = 1, m-1
   do j = i+1, m
      rsqd = 0.0
      rx = x(i,1) - x(j,1)
      ry = x(i,2) - x(j,2)
      rz = x(i,3) - x(j,3)
      rsqd = rsqd + rx*rx + ry*ry + rz*rz
      ! calculation of another variable ff
      !$omp atomic
      a(i,1) = a(i,1) + rx*ff
      !$omp atomic
      a(i,2) = a(i,2) + ry*ff
      !$omp atomic
      a(i,3) = a(i,3) + rz*ff
      !$omp atomic
      a(j,1) = a(j,1) - rx*ff
      !$omp atomic
      a(j,2) = a(j,2) - ry*ff
      !$omp atomic
      a(j,3) = a(j,3) - rz*ff
   end do
end do
!$omp end do
!$omp end parallel

EDIT: Since you're interested in obtaining further performance from this, I'd suggest using a performance analysis tool (such as VTune [1], Paraver [2], TAU [3], Scalasca [4], or even gprof). That being said, the small impact you mention when parallelizing this code may indicate that this loop represents a small fraction of the total execution time. Performance analysis tools would help you identify the most time-consuming routines.

If this portion of code is representative, it may be that the cost of creating the parallel region and waiting at its end, and/or the serialization due to the critical/atomic regions, limits the speedup you obtain. For the first, you might consider providing a minimum chunk to the dynamic schedule (e.g. dynamic, 4). That would help in the last iterations, in which i is close to m and thus the do j loop does not represent a huge amount of work. You can also consider the guided schedule. Its behavior is somewhat similar to dynamic, but the chunk size decreases dynamically, i.e. chunks start large and shrink as the remaining iterations run out.
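
For illustration, here is what those clauses look like (shown in C; the Fortran form takes the same clauses on !$omp do):

/* dynamic with a minimum chunk of 4 iterations */
#pragma omp for schedule(dynamic, 4)
for (int i = 0; i < n; i++) { /* ... */ }

/* guided: chunks start large and shrink as iterations are handed out */
#pragma omp for schedule(guided)
for (int i = 0; i < n; i++) { /* ... */ }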

[1] https://software.intel.com/en-us/intel-vtune-amplifier-xe

[2] http://www.bsc.es/computer-sciences/performance-tools/paraver

[3] http://www.cs.uoregon.edu/research/tau/home.php

[4] http://www.scalasca.org

How to acquire and release the locks in OpenMP

The order in which readers/writers acquire the lock is not defined. In your "bad" case, you simply don't see the "Enter new data:" prompt because a reader thread is spamming the console while trying to acquire the lock.

The first step is to get rid of the output in the test_lock loop. Then just replace each test loop with a single omp_set_lock.
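
A hedged sketch of that change (the lock name and the printed message are placeholders):

/* before: busy-wait that spams the console */
while (!omp_test_lock(&lock)) {
    printf("waiting for lock...\n");
}

/* after: block quietly until the lock is free */
omp_set_lock(&lock);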

But that still doesn't guarantee the order of reading/writing. If you want to implement a real producer/consumer, you should take care of that with a proper queue. Note that OpenMP isn't particularly well suited to implementing producer/consumer programs, so you won't get a lot of help from it.


