Mixing C++11 atomics and OpenMP

Update:

OpenMP 5.0 defines its interaction with C++11 and later standards. Among other things, it says that using the following features may result in unspecified behavior:

  • Data-dependency ordering: atomics and memory model
  • Additions to the standard library
  • C++11 library

So clearly, mixing C++11 atomics and OpenMP 5.0 may result in unspecified behavior. At least the standard itself promises that "future versions of the OpenMP specification are expected to address [these] features".

Old discussion:

Interestingly, the OpenMP 4.5 standard (2.13.6) has a rather vague reference to C++11 atomics, or more specifically to std::memory_order:

The intent is that, when the analogous operation exists in C++11 or
C11, a sequentially consistent atomic construct has the same semantics
as a memory_order_seq_cst atomic operation in C++11/C11. Similarly, a
non-sequentially consistent atomic construct has the same semantics as
a memory_order_relaxed atomic operation in C++11/C11.

Unfortunately this is only a note; there is nothing normative that says they play nicely together. In particular, even the latest OpenMP 5.0 preview still refers to C++98 as the only normative reference for C++. So technically, OpenMP doesn't even support C++11 itself.

That aside, it will probably work most of the time in practice. I would agree that using std::atomic together with OpenMP has less potential for trouble than C++11 threading does. But if there is any trouble, it may not be obvious. The worst case would be an atomic that doesn't actually operate atomically, even though I have serious trouble imagining a realistic scenario where this might happen. At the end of the day, it may not be worth it, and the safest thing is to stick with pure OpenMP or pure C++11 threads/atomics.
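
As a concrete illustration, here is a minimal sketch of the kind of mixing being discussed (formally unspecified per the above, but widely reported to work on mainstream toolchains):

#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int> hits{0};

    // A C++11 atomic updated from inside an OpenMP worksharing loop.
    // Nothing in OpenMP up to 5.0 guarantees this combination.
    #pragma omp parallel for
    for (int i = 0; i < 1000000; ++i)
        hits.fetch_add(1, std::memory_order_relaxed);

    std::printf("%d\n", hits.load()); // 1000000 wherever this combination works
}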

Maybe Hristo has something to say about this; in the meantime check out this answer for a more general discussion. While a bit dated, I'm afraid it still holds.

Replacing #pragma omp atomic with C++ atomics

You can create a std::vector<std::atomic<double>> but you cannot change its size.
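
For example (a small sketch): the sized constructor only has to default-construct elements, while any resizing operation would have to copy or move the atomics, which std::atomic forbids:

#include <atomic>
#include <vector>

std::vector<std::atomic<double>> values(1024); // OK: elements value-initialized
// values.resize(2048);   // error: std::atomic is neither copyable nor movable
// values.push_back({});  // error, for the same reason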

The first thing I'd do is get gsl::span or write my own variant. Then gsl::span<std::atomic<double>> is a better model for values than std::vector<std::atomic<double>>.

Once we have done that, simply remove the #pragma omp atomic and your code is atomic in C++20, where std::atomic<double> supports +=. In C++17 and earlier you have to implement += manually:

double old = values[i];
while (!values[i].compare_exchange_weak(old, old + value))
{}

Live example.
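
For reference, a self-contained sketch of both variants (an illustration, not the exact source behind the assembly below):

#include <atomic>

// Pre-C++20: implement += as a CAS loop. compare_exchange_weak reloads
// `old` with the currently stored value on failure, so each retry uses
// fresh data.
void atomic_add(std::atomic<double>& target, double value) {
    double old = target.load();
    while (!target.compare_exchange_weak(old, old + value)) {}
}

// C++20: std::atomic<double> gained fetch_add, so += just works.
void atomic_add_cpp20(std::atomic<double>& target, double value) {
    target += value;
}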

Clang 5 generates:

omp_atomic_add(std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<std::atomic<double>, std::allocator<std::atomic<double> > >&, unsigned long, unsigned long, double): # @omp_atomic_add(std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<std::atomic<double>, std::allocator<std::atomic<double> > >&, unsigned long, unsigned long, double)
mov rax, qword ptr [rdi]
mov rdi, qword ptr [rax + 8*rcx]
mov rax, qword ptr [rax + 8*rcx + 8]
cmp rdi, rax
jge .LBB0_6
mov rcx, qword ptr [rsi]
.LBB0_2: # =>This Inner Loop Header: Depth=1
cmp qword ptr [rcx + 8*rdi], r8
je .LBB0_3
inc rdi
cmp rdi, rax
jl .LBB0_2
jmp .LBB0_6
.LBB0_3:
mov rax, qword ptr [rdx]
mov rax, qword ptr [rax + 8*rdi]
.LBB0_4: # =>This Inner Loop Header: Depth=1
mov rcx, qword ptr [rdx]
movq xmm1, rax
addsd xmm1, xmm0
movq rsi, xmm1
lock
cmpxchg qword ptr [rcx + 8*rdi], rsi
jne .LBB0_4
.LBB0_6:
ret

which looks identical at a casual glance.

There is a proposal for atomic_view that lets you manipulate a non-atomic value through an atomic view. In general, C++ only lets you operate atomically on atomic data.
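
That proposal eventually landed in C++20 as std::atomic_ref, which permits exactly this kind of access (a brief sketch):

#include <atomic>
#include <cstddef>
#include <vector>

// C++20: atomically update an element of a plain, non-atomic vector.
void add_via_view(std::vector<double>& values, std::size_t i, double v) {
    std::atomic_ref<double> ref(values[i]);
    ref.fetch_add(v); // atomic read-modify-write on ordinary storage
}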

Can std::atomic be safely used with OpenMP

Officially, no. In practice, probably.

Section 1.7 (page 32) of the OpenMP 5.0 specification says:

While future versions of the OpenMP specification are expected to address the following features, currently their use may result in unspecified behavior.

  • Concurrency
  • Additions to the standard library
  • C++11 library

However, depending on the implementation of the OpenMP runtime you use, it might be alright. In fact, the LLVM OpenMP runtime even uses std::atomic to implement some of the OpenMP specification.

The safest option though is to stick with using only what OpenMP provides. Anything you can do using std::atomic you should also be able to achieve using only OpenMP.
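
For instance, a relaxed std::atomic counter has a direct pure-OpenMP counterpart (sketch):

#include <cstdio>

int main() {
    int counter = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000000; ++i) {
        // Roughly equivalent to counter.fetch_add(1, std::memory_order_relaxed);
        // per the OpenMP note quoted earlier, a plain atomic construct behaves
        // like a relaxed C++11 atomic operation.
        #pragma omp atomic
        counter += 1;
    }

    std::printf("%d\n", counter);
}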

Atomic access to non-atomic memory location in C++11 and OpenMP?

As far as I understand the respective standards, OpenMP has more restrictions on usage than C++11, which allows it to be portable without using special types. For example, OpenMP 4.5 says:

If the storage location designated by x is not size-aligned (that is, if the byte alignment of x is not a multiple of the size of x), then the behavior of the atomic region is implementation defined.

On the other hand, if C++11 code uses std::atomic<int>, then the compiler will guarantee the appropriate alignment. In both cases, alignment is required, but OpenMP and C++11 differ in who is responsible for ensuring it.
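
To make that split of responsibility concrete (a sketch):

#include <atomic>

// OpenMP: the programmer must arrange size-alignment; a misaligned x makes
// the atomic region implementation defined, so force it explicitly if in doubt.
alignas(sizeof(int)) int x = 0;
// ... x is now a valid target for #pragma omp atomic by construction.

// C++11: the type system carries the alignment; the compiler lays out the
// atomic (padding and aligning as needed) without user intervention.
std::atomic<int> y{0};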

Generally, there are philosophical differences between OpenMP and C++, but it's hard to enumerate all of them. The C++ folks are thinking about portability to everything, whereas OpenMP is targeted at HPC.

Can I safely use OpenMP with C++11?

Walter, I believe I not only told you the current state of things in that other discussion, but also provided you with information directly from the source (i.e. from my colleague who is part of the OpenMP language committee).

OpenMP was designed as a lightweight data-parallel addition to FORTRAN and C, later extended to C++ idioms (e.g. parallel loops over random-access iterators) and to task parallelism with the introduction of explicit tasks. It is meant to be as portable across as many platforms as possible and to provide essentially the same functionality in all three languages. Its execution model is quite simple - a single-threaded application forks teams of threads in parallel regions, runs some computational tasks inside and then joins the teams back into serial execution. Each thread from a parallel team can later fork its own team if nested parallelism is enabled.

Since the main usage of OpenMP is in High Performance Computing (after all, its directive and execution model was borrowed from High Performance Fortran), the main goal of any OpenMP implementation is efficiency and not interoperability with other threading paradigms. On some platforms efficient implementation could only be achieved if the OpenMP run-time is the only one in control of the process threads. Also there are certain aspects of OpenMP that might not play well with other threading constructs, for example the limit on the number of threads set by OMP_THREAD_LIMIT when forking two or more concurrent parallel regions.

Since the OpenMP standard itself does not strictly forbid using other threading paradigms, but neither standardises the interoperability with them, supporting such functionality is up to the implementers. This means that some implementations might provide safe concurrent execution of top-level OpenMP regions, while some might not. The x86 implementers pledge to support it, maybe because most of them are also proponents of other execution models (e.g. Intel with Cilk and TBB, GCC with C++11, etc.) and x86 is usually considered an "experimental" platform (other vendors are usually much more conservative).

OpenMP 4.0 also goes no further than ISO/IEC 14882:1998 for the C++ features it employs (the SC12 draft is here). The standard now includes things like portable thread affinity - this definitely does not play well with other threading paradigms, which might provide their own binding mechanisms that clash with those of OpenMP. Once again, the OpenMP language is targeted at HPC (data and task parallel scientific and engineering applications), while the C++11 constructs are targeted at general purpose computing. If you want fancy C++11 concurrent stuff, use C++11 only, or if you really need to mix it with OpenMP, then stick to the C++98 subset of language features if you want to stay portable.

I'm particularly interested in the situation where I first call some code using OpenMP and then some other code using C++11 concurrency on the same data structures.

There are no obvious reasons why what you want should not be possible, but it is up to your OpenMP compiler and run-time. There are free and commercial libraries that use OpenMP for parallel execution (for example MKL), but they always carry warnings (although sometimes hidden deep in their user manuals) about possible incompatibility with multithreaded code, with information on what is possible and when. As always, this is outside the scope of the OpenMP standard and hence YMMV.
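
A sketch of that phased pattern (produce and consume are placeholders; whether the hand-off is safe depends on your OpenMP compiler and runtime, as discussed above):

#include <cstddef>
#include <thread>
#include <vector>

double produce(long i);                     // placeholder
void consume(const std::vector<double>&);   // placeholder

void phased(std::size_t n) {
    std::vector<double> data(n);

    // Phase 1: OpenMP. The implicit barrier and flush at the end of the
    // parallel region make the writes visible to the thread that continues.
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        data[i] = produce(i);

    // Phase 2: C++11 threading on the same data, strictly afterwards.
    std::thread t([&data] { consume(data); });
    t.join();
}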

How does OpenMP use the atomic instruction inside reduction clause?

How does OpenMP use the atomic instruction inside a reduction? Doesn't
it rely on atomic at all?

Since the OpenMP standard does not specify how the reduction clause should (or not) be implemented (e.g., based on atomic operations or not), its implementation may vary depending on each concrete implementation of the OpenMP standard.

For instance, is the variable sum in the code below accumulated with
an atomic + operator?

Nonetheless, from the OpenMP standard, one can read the following:

The reduction clause can be used to perform some forms of recurrence
calculations (...) in parallel. For parallel and work-sharing
constructs, a private copy of each list item is created, one for each
implicit task, as if the private clause had been used. (...) The
private copy is then initialized as specified above. At the end of the
region for which the reduction clause was specified, the original list
item is updated by combining its original value with the final value
of each of the private copies, using the combiner of the specified
reduction-identifier.

So based on that, one can infer that the variables used in the reduction clause will be private and, consequently, will not be updated atomically. Notwithstanding, even if that were not the case, it would still be unlikely that a concrete implementation of the OpenMP standard would rely on atomic operations (for the instruction sum += v[i];), since (in this case) that would not be the most efficient strategy. For more information on why that is the case, check the following SO threads:

  1. Why my parallel code using openMP atomic takes a longer time than serial code?;
  2. Why should I use a reduction rather than an atomic variable?.

Very informally, a more efficient approach than using atomic would be for each thread to have their own copy of the variable sum, and at the end of the parallel region, each thread would save its copy into a resource shared among threads -- now, depending on how the reduction is implemented, atomic operations might be used to update that shared resource. That resource would then be picked up by the master thread that would reduce its content and update the original sum variable, accordingly.
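
In code, that informal description looks roughly like this (a sketch of the idea, not of how any particular runtime actually implements reduction; real implementations may combine partial results differently, e.g. in a tree):

#include <vector>

double reduce_sum(const std::vector<double>& v) {
    double sum = 0.0;

    #pragma omp parallel
    {
        double local = 0.0;                  // per-thread private copy
        #pragma omp for nowait
        for (long i = 0; i < (long)v.size(); ++i)
            local += v[i];

        #pragma omp atomic                   // combine into the shared result
        sum += local;
    }

    return sum;
}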

More formally from OpenMP Reductions Under the Hood:

After having revisited parallel reductions in detail you might still
have some open questions about how OpenMP actually transforms your
sequential code into parallel code. In particular, you might wonder
how OpenMP detects the portion in the body of the loop that performs
the reduction. As an example, this or a similar code fragment can
often be found in code samples:

#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
    x -= some_value;

You could also use - as reduction operator (which is actually
redundant to +). But how does OpenMP isolate the
update step x -= some_value? The discomforting answer is that OpenMP
does not detect the update at all! The compiler treats the body of the
for-loop like this:

#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
    x = some_expression_involving_x_or_not(x);

As a result, the modification of x could also be hidden behind an opaque function call.
This is a comprehensible decision from the point of view of a compiler
developer. Unfortunately, this means that you have to ensure that all
updates of x are compatible with the operation defined in the
reduction clause.

The overall execution flow of a reduction can be summarized as
follows:

  1. Spawn a team of threads and determine the set of iterations that each thread j has to perform.
  2. Each thread declares a privatized variant of the reduction variable x initialized with the neutral element e of the corresponding
    monoid.
  3. All threads perform their iterations no matter whether or how they involve an update of the privatized variable.
  4. The result is computed as sequential reduction over the (local) partial results and the global variable x. Finally, the result is
    written back to x.

OpenMP and shared_ptr

Your analysis is quite correct. First, take a look at this question about OpenMP and std::atomic. Note that std::shared_ptr isn't necessarily implemented using atomics. This also applies to the shared control block, which is modified during copy operations. There are a couple of cases:

  • Calling get / operator-> / operator* with one shared_ptr per thread, each pointing to the same object, while performing only read-only operations on the target object. This is as safe as it gets given the specification gap between C++11 and OpenMP. No control-block operations are performed. I would argue that this is no different from using a raw pointer (see the sketch after this list).
  • Calling get / operator-> / operator* on one shared shared_ptr from multiple threads. This is still similarly safe.
  • Copying / deleting thread-local shared_ptrs that point to different objects across multiple threads. This should still be as safe as it gets, as there is no shared data.
  • Copying / deleting thread-local shared_ptrs that point to the same object from multiple threads. Now we know this involves the shared control block, but it is safe according to the C++ standard. The argument for std::atomic / OpenMP applies: it is practically safe but not very well defined.
  • Modifying (reset) a thread-shared shared_ptr from multiple threads. This is unsafe. atomic<shared_ptr> can be used here, but then the same argument applies.
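
As an illustration of the first, safest case above (Data and use are placeholders):

#include <memory>

struct Data { /* read-only payload */ };
void use(const Data&);  // placeholder

void read_only_sharing() {
    auto global = std::make_shared<const Data>();

    #pragma omp parallel
    {
        // Per-thread copy: this touches the shared control block, which the
        // C++ standard makes thread-safe; the pointee itself is only read.
        std::shared_ptr<const Data> local = global;
        use(*local);
    }
}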

I would make one more distinction. If you consider using std::atomic with OpenMP there is the alternative of using the OpenMP-idiomatic pragma omp atomic - there is no OpenMP equivalent to shared_ptr. So short of implementing your own shared_ptr on top of omp atomic you don't really have a choice.

atomic operations on a variable which is in OpenMP reduction clause

A simple + reduction won't work here because the two integers are not summed independently of each other, but since OpenMP 4.0 you can declare your own reductions. All you need to do is abstract the two parts of the counter in a class (or struct) and define a function that sums such objects. In the example below, an overloaded compound assignment operator (+=) is used:

#include <limits>
#include <iostream>
#include <omp.h>

using namespace std;

const long int MAX = std::numeric_limits<int>::max();
const long int K = MAX + 20L;

class large_count {
    int count, hit;
public:
    large_count() : count(0), hit(0) {}

    // Prefix increment operator
    large_count& operator++() {
        hit++;
        if (hit == MAX) {
            hit = 0;
            count++;
        }
        return *this;
    }

    // Compound assignment operator
    large_count& operator+=(const large_count& other) {
        count += other.count;
        long int sum_hit = (long)hit + other.hit;
        if (sum_hit >= MAX) {
            count++;
            hit = sum_hit - MAX;
        }
        else
            hit = sum_hit;
        return *this;
    }

    long total() const { return hit + count * MAX; }
};

#pragma omp declare reduction (large_sum : large_count : omp_out += omp_in)

int main() {
    large_count cnt;
    double t = -omp_get_wtime();
    #pragma omp parallel for reduction(large_sum : cnt)
    for (long int i = 0; i < K; i++)
        ++cnt;
    t += omp_get_wtime();
    cout << (cnt.total() == K ? "YES" : "NO") << endl;
    cout << t << " s" << endl;
}

The custom reduction is declared using:

#pragma omp declare reduction (large_sum : large_count : omp_out += omp_in)

There are three parts of the declaration:

  • large_sum - this is the name given to the custom reduction operation
  • large_count - this is the type that the reduction operates on
  • omp_out += omp_in - this is the combiner expression. omp_out and omp_in are special pseudo-variables provided by the OpenMP runtime. They are both of type large_count. The combiner expression has to combine the two values and update the value of omp_out.

Sample output:

$ g++ --version
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
...
$ g++ -std=c++11 -fopenmp -o cnt cnt.cc
$ OMP_NUM_THREADS=1 ./cnt
YES
9.39628 s
$ OMP_NUM_THREADS=3 ./cnt
YES
3.79765 s
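
For comparison, here is the same three-part declaration pattern, plus the optional initializer clause, applied to a different (hypothetical) reduction, a maximum over doubles:

#include <limits>
#include <vector>

// Name: dmax; type: double; combiner: keep the larger of the two values.
// The initializer clause sets each private copy to the identity element.
#pragma omp declare reduction (dmax : double : \
        omp_out = omp_in > omp_out ? omp_in : omp_out) \
    initializer(omp_priv = std::numeric_limits<double>::lowest())

double parallel_max(const std::vector<double>& v) {
    double m = std::numeric_limits<double>::lowest();
    #pragma omp parallel for reduction(dmax : m)
    for (long i = 0; i < (long)v.size(); ++i)
        m = v[i] > m ? v[i] : m;
    return m;
}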

