How to Safely Use OpenMP with C++11

Can I safely use OpenMP with C++11?

Walter, I believe I not only told you the current state of things in that other discussion, but also provided you with information directly from the source (i.e. from my colleague who is part of the OpenMP language committee).

OpenMP was designed as a lightweight data-parallel addition to FORTRAN and C, later extended to C++ idioms (e.g. parallel loops over random-access iterators) and to task parallelism with the introduction of explicit tasks. It is meant to be portable across as many platforms as possible and to provide essentially the same functionality in all three languages. Its execution model is quite simple - a single-threaded application forks teams of threads in parallel regions, runs some computational tasks inside and then joins the teams back into serial execution. Each thread from a parallel team can later fork its own team if nested parallelism is enabled.
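
For illustration, here is a minimal fork/join sketch (the output format and thread counts are just placeholders, and nested parallelism has to be enabled, e.g. via OMP_NESTED, for the inner team to get more than one thread):

#include <omp.h>
#include <stdio.h>

int main() {
    printf("serial part: one initial thread\n");

    #pragma omp parallel num_threads(4)        // fork a team of threads
    {
        printf("outer team: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());

        #pragma omp parallel num_threads(2)    // each member may fork its own team
        printf("  inner team of outer thread %d: member %d\n",
               omp_get_ancestor_thread_num(1), omp_get_thread_num());
    }                                          // implicit join: back to serial execution

    printf("serial again after the join\n");
    return 0;
}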

Since the main usage of OpenMP is in High Performance Computing (after all, its directive and execution model were borrowed from High Performance Fortran), the main goal of any OpenMP implementation is efficiency, not interoperability with other threading paradigms. On some platforms an efficient implementation can only be achieved if the OpenMP run-time is the only one in control of the process's threads. There are also certain aspects of OpenMP that might not play well with other threading constructs, for example the limit on the number of threads set by OMP_THREAD_LIMIT when two or more concurrent parallel regions are forked.
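
To make the last point concrete, this is the kind of situation meant here - two top-level parallel regions running concurrently, e.g. entered from plain C++11 threads (a sketch only; how OMP_THREAD_LIMIT is accounted for in such a case is precisely the sort of detail left to the implementation):

#include <omp.h>
#include <stdio.h>
#include <thread>

void worker(int id) {
    // A top-level OpenMP parallel region, entered from a non-OpenMP thread.
    #pragma omp parallel
    {
        #pragma omp single
        printf("region %d runs with %d threads (thread limit %d)\n",
               id, omp_get_num_threads(), omp_get_thread_limit());
    }
}

int main() {
    std::thread a(worker, 1), b(worker, 2);    // two concurrent regions
    a.join();
    b.join();
}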

The OpenMP standard itself does not strictly forbid using other threading paradigms, but neither does it standardise interoperability with them, so supporting such functionality is up to the implementers. This means that some implementations might provide safe concurrent execution of top-level OpenMP regions while others might not. The x86 implementers pledge to support it, maybe because most of them are also proponents of other execution models (e.g. Intel with Cilk and TBB, GCC with C++11, etc.) and x86 is usually considered an "experimental" platform (other vendors are usually much more conservative).

OpenMP 4.0 is also not going further than ISO/IEC 14882:1998 for the C++ features it employs (the SC12 draft is here). The standard now includes things like portable thread affinity, which definitely does not play well with other threading paradigms that might provide their own binding mechanisms clashing with those of OpenMP. Once again, the OpenMP language is targeted at HPC (data- and task-parallel scientific and engineering applications), while the C++11 constructs are targeted at general purpose computing. If you want fancy C++11 concurrent stuff, use C++11 only; if you really need to mix it with OpenMP and want to stay portable, stick to the C++98 subset of language features.

I'm particularly interested in the situation where I first call some code using OpenMP and then some other code using C++11 concurrency on the same data structures.

There is no obvious reason why what you want should not be possible, but it is up to your OpenMP compiler and run-time. There are free and commercial libraries that use OpenMP for parallel execution (MKL, for example), but there are always warnings (although sometimes hidden deep in their user manuals) about possible incompatibility with multithreaded code, describing what is possible and when. As always, this is outside the scope of the OpenMP standard and hence YMMV.
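
The phased pattern from the question would look roughly like this (a sketch; whether the two run-times coexist cleanly in one process is, as said above, up to the implementation):

#include <cstddef>
#include <thread>
#include <vector>

int main() {
    std::vector<double> data(1 << 20, 1.0);

    // Phase 1: OpenMP. The parallel region is fully joined before the loop returns.
    #pragma omp parallel for
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] *= 2.0;

    // Phase 2: C++11 threads touching the same data, started only after the
    // OpenMP region has ended, so the two never run concurrently.
    std::thread t([&data] {
        for (double &x : data)
            x += 1.0;
    });
    t.join();
    return 0;
}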

Mixing C++11 atomics and OpenMP

Update:

OpenMP 5.0 defines its interactions with C++11 and later versions. Among other things, it says that using the following features may result in unspecified behavior:

  • Data-dependency ordering: atomics and memory model
  • Additions to the standard library
  • C++11 library

So clearly, mixing C++11 atomics and OpenMP 5.0 may result in unspecified behavior. At least the standard itself promises that "future versions of the OpenMP specification are expected to address [these] features".

Old discussion:

Interestingly, the OpenMP 4.5 standard (2.13.6) has a rather vague reference to C++11 atomics, or more specifically to std::memory_order:

The intent is that, when the analogous operation exists in C++11 or C11, a sequentially consistent atomic construct has the same semantics as a memory_order_seq_cst atomic operation in C++11/C11. Similarly, a non-sequentially consistent atomic construct has the same semantics as a memory_order_relaxed atomic operation in C++11/C11.

Unfortunately this is only a note; there is nothing that defines that they play nicely together. In particular, even the latest OpenMP 5.0 preview still refers to C++98 as the only normative reference for C++. So technically, OpenMP doesn't even support C++11 itself.
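
To make the intent of that note concrete, the correspondence it describes would look like this (a sketch; the seq_cst clause requires OpenMP 4.0 or later):

#include <atomic>

int plain_counter = 0;
std::atomic<int> cxx_counter{0};

void openmp_atomics() {
    // Non-sequentially consistent atomic construct: per the note, intended to
    // match a memory_order_relaxed operation.
    #pragma omp atomic update
    plain_counter += 1;

    // Sequentially consistent atomic construct (seq_cst clause): intended to
    // match a memory_order_seq_cst operation.
    #pragma omp atomic update seq_cst
    plain_counter += 1;
}

void cxx11_atomics() {
    // The C++11 counterparts the note refers to.
    cxx_counter.fetch_add(1, std::memory_order_relaxed);
    cxx_counter.fetch_add(1, std::memory_order_seq_cst);
}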

That aside, it will probably work most of the time in practice. I would agree that using std::atomic together with OpenMP has less potential for trouble than C++11 threading does. But if there is any trouble, it may not be as obvious. The worst case would be an atomic that doesn't operate atomically, even though I have serious trouble imagining a realistic scenario in which this might happen. At the end of the day, it may not be worth it and the safest thing is to stick with pure OpenMP or pure C++11 threads/atomics.

Maybe Hristo has something to say about this; in the meantime check out this answer for a more general discussion. While a bit dated, I'm afraid it still holds.

Using OpenMP with C++11 on Mac OS

Updated Answer

Since my original answer below, the situation has improved and you can easily use OpenMP with the clang++ compiler - hurraaaay!

To do that, first use homebrew to install libomp:

brew install libomp

Then when using clang++, use these flags:

clang++ -Xpreprocessor -fopenmp main.cpp -o main -lomp 
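
A minimal main.cpp to check that the setup works could look like this (if your Homebrew prefix is not /usr/local, e.g. on Apple Silicon, you may additionally need -I and -L flags pointing at it):

// main.cpp - prints one line per thread if OpenMP is active
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel
    std::printf("hello from thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads());
    return 0;
}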

Original Answer

If you want to compile C++11 OpenMP code on OSX, the easiest way is to use gcc which you can install via homebrew.

First, check the available options:

brew options gcc

Sample Output

--with-all-languages
    Enable all compilers and languages, except Ada
--with-java
    Build the gcj compiler
--with-jit
    Build the jit compiler
--with-nls
    Build with native language support (localization)
--without-fortran
    Build without the gfortran compiler
--without-multilib
    Build without multilib support
--HEAD
    Install HEAD version

So, I suspect you want:

brew install gcc --without-multilib --without-fortran

Once you have got it installed, you need to make sure you are using the homebrew version rather than the one Apple supplies. You need to know that homebrew installs everything in /usr/local/bin and that the C++ compiler is g++-6. So, you either need to compile with:

/usr/local/bin/g++-6 -std=c++11 -fopenmp main.cpp -o main

or, set up your PATH in your login profile:

export PATH=/usr/local/bin:$PATH

then you can just do:

g++-6 -std=c++11 -fopenmp ...

Note that if you choose the second option above (i.e. the export PATH=... option), you will either need to run the export command once in your current session to activate it, or log out and log back in, since your profile commands are only executed on login.

AFAIK, there is no need to explicitly install libiomp - not sure why you did that.

Using OpenMP with C++11 range-based for loops?

OpenMP 5.0 adds the following line on page 99, which makes a lot of range-based for loops OK!

2.12.1.3 A range-based for loop with random access iterator has a canonical loop form.

Source: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf
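
So, with a compiler that implements this part of OpenMP 5.0, something like the following is allowed (a sketch; std::vector iterators are random access):

#include <vector>

int main() {
    std::vector<double> v(1000, 1.0);

    // Range-based loop over random access iterators: canonical loop form in OpenMP 5.0.
    #pragma omp parallel for
    for (double &x : v)
        x *= 2.0;
    return 0;
}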

Use OpenMP in C++11 to find the maximum of the calculated values

The best way to handle it is to define a custom reduction operation, as shown in Gilles' answer. If your compiler only supports OpenMP 3.1 or earlier (custom reduction operations were introduced in OpenMP 4.0), then the proper solution is to perform a local reduction in each thread and then sequentially combine the per-thread results (a sketch of the OpenMP 4.0 custom reduction follows the example below):

int i_max = -1;
double max_calc_value = -DBL_MAX; // lowest finite double, from <cfloat>

#pragma omp parallel
{
    int my_i_max = -1;
    double my_value = -DBL_MAX;

    #pragma omp for
    for (int i = 20; i < 1000; i++) {
        double this_value = my_slow_function(large_double_vector_array, param1*i, .., param5+i);
        if (this_value > my_value) {
            my_value = this_value;
            my_i_max = i;
        }
    }

    #pragma omp critical
    {
        if (my_value > max_calc_value) {
            max_calc_value = my_value;
            i_max = my_i_max;
        }
    }
}

This minimises the synchronisation overhead from the critical construct and in a simplified way shows how the reduction clause is actually implemented.
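
For completeness, the custom reduction mentioned above could look like this with OpenMP 4.0 or later (a sketch; slow_function is just a stand-in for the expensive per-iteration computation in the question):

#include <cfloat>
#include <cmath>
#include <cstdio>

// Value/index pair so the reduction can also track where the maximum occurred.
struct MaxLoc { double value; int index; };

// Combine two partial results by keeping the larger value; initialise every
// private copy with the identity element.
#pragma omp declare reduction(maxloc : MaxLoc : \
        omp_out = (omp_in.value > omp_out.value ? omp_in : omp_out)) \
        initializer(omp_priv = MaxLoc{-DBL_MAX, -1})

static double slow_function(int i) { return std::sin(i) * i; }  // stand-in

int main() {
    MaxLoc best{-DBL_MAX, -1};

    #pragma omp parallel for reduction(maxloc : best)
    for (int i = 20; i < 1000; i++) {
        double v = slow_function(i);
        if (v > best.value) best = MaxLoc{v, i};
    }

    std::printf("maximum %f found at i = %d\n", best.value, best.index);
    return 0;
}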

How to properly use OpenMP?

You are on the right track - this is actually quite simple.

  • You missed k in your private clause - this would lead to issues, as it is shared by default when defined outside. Instead of explicitly choosing the data sharing for each variable, the best way is to declare variables as locally as possible (e.g. for (int ...)); this is almost always what you want and it is easier to reason about. a, b, c come from the outside and are implicitly shared; the loop variables are declared inside and implicitly private.

  • Fortunately there is no need for the #pragma omp atomic. Each thread works on a different i, so no two threads could ever try to update the same c[i][j]. Removing the atomic will greatly improve performance. If you ever do need atomic, also consider reduction as an alternative.

  • If you want to print omp_get_num_threads, you should do it outside of the loop, but inside the parallel region. In your case this means you have to split omp parallel for into an omp parallel and an omp for. Use omp single to make sure only one thread outputs (see the sketch after this list).
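
Putting these points together, a corrected version could look roughly like this (a sketch with a made-up size; a, b, c stand for the matrices from the question):

#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
    const int N = 512;                        // hypothetical matrix size
    std::vector<std::vector<double>> a(N, std::vector<double>(N, 1.0)),
                                     b(N, std::vector<double>(N, 2.0)),
                                     c(N, std::vector<double>(N, 0.0));

    #pragma omp parallel                      // split "parallel for" so we can print once
    {
        #pragma omp single
        std::printf("running with %d threads\n", omp_get_num_threads());

        #pragma omp for                       // i, j, k are declared locally: implicitly private
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];   // each thread owns its rows i: no atomic needed
    }

    std::printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}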

Be aware that getting very good performance out of a matrix multiplication is much more complicated and beyond the scope of this question.

Edit:

For nested loops it is generally better to parallelize the outermost loop, if possible - i.e. if no data dependency prevents it. There are cases where the outermost loop does not yield enough parallelism; in those cases you would rather use collapse(2) to parallelize the outer two loops. Do not use (parallel) for twice unless you absolutely know what you are doing, because parallelizing the middle loop yields more, smaller pieces of work, which increases the relative overhead.
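
For instance (a sketch with made-up sizes, where the outer loop alone is too small to feed all threads):

#include <vector>

int main() {
    const int N = 4, M = 100000;              // hypothetical: tiny outer loop, large inner loop
    std::vector<std::vector<double>> a(N, std::vector<double>(M, 1.0)),
                                     c(N, std::vector<double>(M, 0.0));

    // collapse(2) fuses i and j into one iteration space of N*M chunks,
    // so there is enough parallel work even though N alone is tiny.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            c[i][j] = 2.0 * a[i][j];
    return 0;
}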

In your specific case one can safely assume TAM_MATRIZ >> n_threads0, which means the outermost loop has enough parallel work for all threads to be used efficiently.

To reiterate the data-sharing rules for a normal parallel region:

  • Variables defined inside the lexical scope of a parallel region (and parallel loop variables) are implicitly private. Those are the variables your threads work on. If a variable is only used within a lexical scope, always1 define it in the narrowest possible lexical scope.
  • Variables defined outside of the lexical scope are implicitly shared. These are typically the input/output of the parallel region, so they have to be shared. Make sure to avoid data races.

If you follow this, there is almost never a need to explicitly define the private/shared data-sharing attributes2.

0 Otherwise it wouldn't even make sense to use OpenMP here.

1 Exceptions apply for non-trivial C++ types with expensive ctors.

2 reduction / firstprivate are useful when specified explicitly.

Use openMP only when an argument is passed to the program

I think what you are looking for can be solved using a CPU dispatcher technique.

For benchmarking OpenMP code vs. non-OpenMP code you can create different object files from the same source code like this

//foo.c
#ifdef _OPENMP
double foo_omp() {
#else
double foo() {
#endif
    double sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for(int i=0; i<1000000000; i++) sum += i%10;
    return sum;
}

Compile like this

gcc -O3 -c foo.c
gcc -O3 -fopenmp -c foo.c -o foo_omp.o

This creates two object files foo.o and foo_omp.o. Then you can call one of these functions like this

//bar.c
#include <stdio.h>

double foo();
double foo_omp();
double (*fp)();

int main(int argc, char *argv[]) {
    if(argc>1) {
        fp = foo_omp;
    }
    else {
        fp = foo;
    }
    double sum = fp();
    printf("sum %e\n", sum);
}

Compile and link like this

gcc -O3 -fopenmp bar.c foo.o foo_omp.o

Then I time the code like this

time ./a.out -omp
time ./a.out

and the first case takes about 0.4 s and the second case about 1.2 s on my system with 4 cores/8 hardware threads.


Here is a solution which only needs a single source file

#include <stdio.h>

typedef double foo_type();

foo_type foo, foo_omp, *fp;

#ifdef _OPENMP
#define FUNCNAME foo_omp
#else
#define FUNCNAME foo
#endif

double FUNCNAME () {
    double sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for(int i=0; i<1000000000; i++) sum += i%10;
    return sum;
}

#ifdef _OPENMP
int main(int argc, char *argv[]) {
    if(argc>1) {
        fp = foo_omp;
    }
    else {
        fp = foo;
    }
    double sum = fp();
    printf("sum %e\n", sum);
}
#endif

Compile like this

gcc -O3 -c foo.c
gcc -O3 -fopenmp foo.c foo.o

Proper way to use OpenMP in find all divisors of big number

Since it is your university work, I will give you hints rather than a solution (code). The problem with the OpenMP code is that numbers is shared, and write operations to containers and container adapters from more than one thread are not required by the C++ standard to be thread safe. So you have to add an OpenMP directive to protect it. Alternatively, you can create a user-defined reduction in OpenMP.
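
As a generic sketch of the first hint only (deliberately not the assignment's algorithm; the filter condition is a placeholder): each thread fills its own local container, and only the final merge into the shared one is protected.

#include <vector>

int main() {
    std::vector<int> shared_results;          // written by all threads -> must be protected

    #pragma omp parallel
    {
        std::vector<int> local;               // per-thread buffer, no sharing

        #pragma omp for nowait
        for (int i = 0; i < 1000000; ++i)
            if (i % 7 == 0)                   // placeholder condition, not the real test
                local.push_back(i);

        #pragma omp critical                  // only the merge is serialised
        shared_results.insert(shared_results.end(), local.begin(), local.end());
    }
    return 0;
}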

Why are you surprised that the 1st algorithm is so slow? That is the expected behaviour (it is not related to OpenMP, just to the algorithm).
PS: I have changed the 1st algorithm and measured the runtimes: a modified 1st algorithm using OpenMP (4 cores + hyperthreading) takes 1.4 s, your second algorithm 15.5 s.

EDIT: More hints:

  • How to deal with data race in OpenMP?
  • Regarding different algorithms: All divisors of a number using its prime factorization, Find divisors of any number

