Is There Any Compiler Barrier Which Is Equal to asm("" ::: "memory") in C++11

Is there any compiler barrier which is equal to asm("" ::: "memory") in C++11?

re: your edit:

But I do not want to use atomic variable.

Why not? If it's for performance reasons, use them with memory_order_relaxed and atomic_signal_fence(mo_whatever) to block compiler reordering without any runtime overhead other than the compiler barrier potentially blocking some compile-time optimizations, depending on the surrounding code.

If it's for some other reason, then maybe atomic_signal_fence will give you code that happens to work on your target platform. I suspect that most implementations of it do order non-atomic<> loads and stores in practice, at least as an implementation detail, and probably effectively required if there are accesses to atomic<> variables. So it might help in practice to avoid some actual consequences of any data-race Undefined Behaviour which would still exist. (e.g. as part of a SeqLock implementation where for efficiency you want to use non-atomic reads / writes of the shared data so the compiler can use SIMD vector copies, for example.)

See Who's afraid of a big bad optimizing compiler? on LWN for some details about the badness you can run into (like invented loads) if you only use compiler barriers to force reloads of non-atomic variables, instead of using something with read-exactly-once semantics. (In that article, they're talking about Linux kernel code so they're using volatile for hand-rolled load/store atomics. But in general don't do that: When to use volatile with multi threading? - pretty much never)



Sufficient for what?

Regardless of any barriers, if two threads run this function at the same time, your program has Undefined Behaviour because of concurrent access to non-atomic<> variables. So the only way this code can be useful is if you're talking about synchronizing with a signal handler that runs in the same thread.

That would also be consistent with asking for a "compiler barrier", to only prevent reordering at compile time, because out-of-order execution and memory reordering always preserve the behaviour of a single thread. So you never need extra barrier instructions to make sure you see your own operations in program order, you just need to stop the compiler reordering stuff at compile time. See Jeff Preshing's post: Memory Ordering at Compile Time

This is what atomic_signal_fence is for. You can use it with any std::memory_order, just like thread_fence, to get different strengths of barrier and only prevent the optimizations you need to prevent.



... atomic_thread_fence(memory_order_acq_rel) did not generate any compiler barrier at all!

Totally wrong, in several ways.

atomic_thread_fence is a compiler barrier plus whatever run-time barriers are necessary to restrict reordering in the order our loads/stores become visible to other threads.

I'm guessing you mean it didn't emit any barrier instructions when you looked at the asm output for x86. Instructions like x86's MFENCE are not "compiler barriers", they're run-time memory barriers and prevent even StoreLoad reordering at run-time. (That's the only reordering that x86 allows. SFENCE and LFENCE are only needed when using weakly-ordered (NT) stores, like MOVNTPS (_mm_stream_ps).)

On a weakly-ordered ISA like ARM, thread_fence(mo_acq_rel) isn't free, and compiles to an instruction. gcc5.4 uses dmb ish. (See it on the Godbolt compiler explorer).

A compiler barrier just prevents reordering at compile time, without necessarily preventing run-time reordering. So even on ARM, atomic_signal_fence(mo_seq_cst) compiles to no instructions.

A weak enough barrier allows the compiler to do the store to B ahead of the store to A if it wants, but gcc happens to decide to still do them in source order even with thread_fence(mo_acquire) (which shouldn't order stores with other stores).

So this example doesn't really test whether something is a compiler barrier or not.


Strange compiler behaviour from gcc, with an example that behaves differently when given a compiler barrier:

See this source+asm on Godbolt.

#include <atomic>
using namespace std;
int A,B;

void foo() {
    A = 0;
    atomic_thread_fence(memory_order_release);
    B = 1;
    //asm volatile(""::: "memory");
    //atomic_signal_fence(memory_order_release);
    atomic_thread_fence(memory_order_release);
    A = 2;
}

This compiles with clang the way you'd expect: the thread_fence is a StoreStore barrier, so the A=0 has to happen before B=1, and can't be merged with the A=2.

    # clang3.9 -O3
    mov dword ptr [rip + A], 0
    mov dword ptr [rip + B], 1
    mov dword ptr [rip + A], 2
    ret

But with gcc, the barrier has no effect, and only the final store to A is present in the asm output.

    # gcc6.2 -O3
    mov DWORD PTR B[rip], 1
    mov DWORD PTR A[rip], 2
    ret

But with atomic_signal_fence(memory_order_release), gcc's output matches clang. So atomic_signal_fence(mo_release) is having the barrier effect we expect, but atomic_thread_fence with anything weaker than seq_cst isn't acting as a compiler barrier at all.

One theory here is that gcc knows that it's officially Undefined Behaviour for multiple threads to write to non-atomic<> variables. This doesn't hold much water, because atomic_thread_fence should still work if used to synchronize with a signal handler, it's just stronger than necessary.

BTW, with atomic_thread_fence(memory_order_seq_cst), we get the expected

    # gcc6.2 -O3, with a mo_seq_cst barrier
    mov DWORD PTR A[rip], 0
    mov DWORD PTR B[rip], 1
    mfence
    mov DWORD PTR A[rip], 2
    ret

We get this even with only one barrier, which would still allow the A=0 and A=2 stores to happen one after the other, so the compiler is allowed to merge them across a barrier. (Observers failing to see separate A=0 and A=2 values is a possible ordering, so the compiler can decide that's what always happens). Current compilers don't usually do this kind of optimization, though. See discussion at the end of my answer on Can num++ be atomic for 'int num'?.

Implementations for asm("nop") in Windows

I mean I don't want to add a library just to force the compiler to add a NOP.

... in a way that is independent of compiler settings (such as optimization settings) and in a way that works with all Visual C++ versions (and maybe even other compilers):

No chance: a compiler is free to generate whatever code it likes, as long as the assembly has the behaviour the C code describes.

And because the NOP instruction does not change the behavior of the program, the compiler is free to add it or to leave it out.

Even if you found a way to force the compiler to generate a NOP: One update of the compiler or a Windows update modifying some file and the compiler might not generate the NOP instruction any longer.

I can use inline asm to do this for x86 but I would like it to be portable.

As I wrote above, any way to force the compiler to write a NOP would only work on a certain compiler version for a certain CPU.

Using inline assembly or __nop() you might cover all compilers of a certain manufacturer (for example: all GNU C compilers or all variants of Visual C++ etc...).

Another question would be: Do you explicitly need the "official" NOP instruction or can you live with any instruction that does nothing?

If you could live with any instruction doing (nearly) nothing, reading a global or static volatile variable could be a replacement for NOP:

static volatile char dummy;
...
else
{
    (void)dummy;
}

This should force the compiler to add a MOV instruction reading the variable dummy.

Background:

If you wrote a device driver, you could link the variable dummy to some location where reading the variable has "side-effects". Example: reading a variable located in VGA video memory can influence the screen content!

Using the volatile keyword you do not only tell the compiler that the value of the variable may change at any time, but also that reading the variable may have such effects.

This means that the compiler has to assume that not reading the variable causes the program not to work correctly. It cannot optimize away the (actually unnecessary) MOV instruction reading the variable.

Purpose of _Compiler_barrier() on 32bit read

_Load_seq_cst_4 is an inline function. The compiler barrier is there to block reordering with later code in the calling function this inlines into.

For example, consider reading a SeqLock. (Over-simplified from this actual implementation).

#include <atomic>
using namespace std;
atomic<unsigned> sequence;
atomic<long> value;

long seqlock_read() {
    while (true) { // retry-loop
        unsigned seq1 = sequence;
        long tmpval = value;
        unsigned seq2 = sequence;

        if (seq1 == seq2 && (seq1 & 1) == 0)
            return tmpval;
        // else: writer was modifying it, retry
    }
}

If we didn't block compile-time reordering, the compiler could merge both reads of sequence into a single access, perhaps like this:

long tmpval = value;
unsigned seq1 = sequence;
unsigned seq2 = sequence;

This would defeat the locking mechanism (where the writer increments sequence once before modifying the data, then again when it's done). Readers are entirely lockless, but it's not a "lock-free" algo because if the writer gets stuck mid-update, the readers can't read anything.

The barrier within each load function blocks reordering with other things after inlining.

(The C++11 memory model is very weak, but the x86 memory model is strong, only allowing StoreLoad reordering. Blocking compile-time reordering with later loads/stores is sufficient to give you an acquire / sequential-consistency load at runtime. x86: Are memory barriers needed here?)


BTW, a better example might be something where some non-atomic variables are read/written after seeing a certain value in an atomic flag. MSVC probably already avoids reordering or merging of atomic accesses, and in the seqlock the data being protected also has to be atomic.

Why don't compilers merge redundant std::atomic writes?

understanding GCC inline asm function

so we have to use an "=r" output operand to let the assembler auto-select a register for our variable, am I correct?

Yes, but it's the compiler that does register allocation. It just fills in the %[operand] in the asm template string as a text substitution and feeds that to the assembler.

Alternatively, you could hard-code a specific register in the asm template string, and use a register-asm local variable to make sure an "=r" constraint picked it. Or use an "=m" memory output operand and str a result into it, and declare a clobber on any registers you used. But those alternatives are obviously terrible compared to just telling the compiler about how your block of asm can produce an output.


I don't understand why the comment says the return statement doesn't run:

/* This return will not be reached but is necessary to prevent compiler
   warnings. */
return ulOriginalBASEPRI;

Raising the basepri (ARM docs) to a higher number might allow an interrupt handler to run right away, before later instructions, but if that exception ever returns, execution will eventually reach the C outside the asm statement. That's the whole point of saving the old basepri into a register and having an output operand for it, I assume.

(I had been assuming that "raise" meant higher number = more interrupts allowed. But Ross comments that it will never allow more interrupts; they're "raising the bar" = lower number = fewer interrupts allowed.)

If execution really never comes out the end of your asm, you should tell the compiler about it. There is asm goto, but that needs a list of possible branch targets. The GCC manual says:

GCC assumes that asm execution falls through to the next statement (if this is not the case, consider using the __builtin_unreachable() intrinsic after the asm statement).

Failing to do this might lead to the compiler planning to do something after the asm, and then it never happening even though in the source it's before the asm.


It might be a good idea to use a "memory" clobber to make sure the compiler has memory contents in sync with the C abstract machine. (At least for variables other than locals, which an interrupt handler might access). This is usually desirable around asm barrier instructions like dsb, but it seems here we maybe don't care about being an SMP memory barrier, just about consistent execution after changing basepri? I don't understand why that's necessary, but if you do then worth considering one way or another whether compile-time reordering of memory access around the asm statement is or isn't a problem.

You'd use a third colon-separated section in the asm statement (after the inputs) : "memory"

Without that, compilers might decide to do an assignment after this asm instead of before, leaving a value just in registers.

// actual C source
global_var = 1;
uint32_t oldpri = ulPortRaiseBASEPRI();
global_var = 2;

could optimize (via dead-store elimination) into asm that worked like this

// possible asm
global_var = 2;
uint32_t oldpri = ulPortRaiseBASEPRI();
// or global_var = 2; here *instead* of before the asm

difference in mfence and asm volatile ("" ::: "memory")

Well, a memory barrier is only needed on architectures that have weak memory ordering. x86 and x64 don't have weak memory ordering: on x86/x64 all stores have release semantics and all loads have acquire semantics. So, you should only really need asm volatile ("" : : : "memory").

For a good overview of both Intel and AMD, as well as references to the relevant manufacturer specs, see http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

Generally things like "volatile" are used on a per-field basis where loads and stores to that field are natively atomic. Where loads and stores to a field are already atomic (i.e. the "operation" in question is a load or a store to a single field and thus the entire operation is atomic) the volatile field modifier or memory barriers are not needed on x86/x64. Portable code notwithstanding.

When it comes to "operations" that are not atomic--e.g. loads or stores to a field that is larger than a native word, or loads or stores to multiple fields within an "operation"--a means by which the operation can be viewed as atomic is required regardless of CPU architecture. Generally this is done by means of a synchronization primitive like a mutex. Mutexes (the ones I've used) include memory barriers to avoid issues like processor reordering, so you don't have to add extra memory barrier instructions. I generally consider not using synchronization primitives a premature optimization; but, the nature of premature optimization is, of course, 97% of the time :)

Where you don't use a synchronization primitive and you're dealing with a multi-field invariant, memory barriers that ensure the processor does not reorder stores and loads to different memory locations are important.

Now, about not issuing an mfence instruction in the asm volatile, but instead putting "memory" in the clobber list. From what I've been able to read:

If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory.

When they say "GCC" and don't mention anything about the CPU, this means it applies only to the compiler. The lack of "mfence" means there is no CPU memory barrier. You can verify this by disassembling the resulting binary: if no "mfence" instruction is issued (depending on the target platform), then it's clear the CPU is not being told to issue a memory fence.

Depending on the platform you're on and what you're trying to do, there may be something "better" or more clear... portability notwithstanding.

memcpy for volatile arrays in gcc C on x86?

memcpy_volatile is not expected to be atomic. ... What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. ...

Ok, that makes the problem solvable, you're just "publishing" the memcpy stores via release/acquire synchronization.

The buffers don't need to be volatile, then, except as one way to ensure compile-time ordering before some other volatile store. Because volatile operations are only guaranteed ordered (at compile time) wrt. other volatile operations. Since it's not being concurrently accessed while you're storing, the possible gotchas in Who's afraid of a big bad optimizing compiler? aren't a factor.


To hack this into your hand-rolled atomics with volatile, use GNU C asm("" ::: "memory") as a compiler memory barrier to block compile-time reordering between the release-store and the memcpy.

volatile uint8_t *shared_var;

memcpy((char*)dest, (const char*)src, len);
asm("" ::: "memory");
shared_var = dest; // release-store

But really you're just making it inconvenient for yourself by avoiding C11 stdatomic.h for atomic_store_explicit(&shared_var, dest, memory_order_release) or GNU C __atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE), which are ordered wrt. non-atomic accesses like a memcpy. Using a memory_order other than the default seq_cst will let it compile with no overhead for x86, to the same asm you get from volatile.

The compiler knows x86's memory ordering rules, and will take advantage of them by not using any extra barriers except for seq_cst stores. (Atomic RMWs on x86 are always full barriers, but you can't do those using volatile.)

Avoid RMW operations like x++ if you don't actually need atomicity for the whole operation; volatile x++ is more like atomic_store_explicit(&x, 1+atomic_load_explicit(&x, memory_order_acquire), memory_order_release); which is a big pain to type, but often you'd want to load into a tmp variable anyway.

If you're willing to use GNU C features like asm("" ::: "memory"), you can use its __atomic built-ins instead, without even having to change your variable declarations like you would for stdatomic.h.

volatile uint8_t *shared_var;

memcpy((char*)dest, (const char*)src, len);
// a release-store is ordered after all previous stuff in this thread
__atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE);

As a bonus, doing it this way makes your code portable to non-x86 ISAs, e.g. AArch64 where it could compile the release-store to stlr. (And no separate barrier could be that efficient.)

The key point is that there's no down-side to the generated asm for x86.

As in When to use volatile with multi threading? - never. Use atomic with memory_order_relaxed, or with acquire / release to get C-level guarantees equivalent to x86 hardware memory-ordering.

How many memory barriers do we need to implement a Peterson lock?

Nobody uses a Peterson lock on mainstream platforms because mutexes are available.
But assuming you cannot use those and you are writing code for an old X86 platform without access to modern primitives (no memory model, no mutexes, no atomic RMW operations), this algorithm might be considered.

Your implementation of the Peterson lock is incorrect (also after swapping the lines 'Mark as A' & 'Mark as B').

If you translate the Wikipedia pseudo code to C++, the correct implementation becomes:

typedef struct {
    volatile bool flag[2];
    volatile int victim;
} peterson_lock_t;

void peterson_lock(peterson_lock_t &lock, int id) {
    lock.flag[id] = true;
    lock.victim = 1-id;
    asm volatile ("mfence" ::: "memory"); // CPU #StoreLoad barrier
    while (lock.flag[1-id] && lock.victim == 1-id);
}

void peterson_unlock(peterson_lock_t &lock, int id) {
    asm volatile("" ::: "memory"); // compiler barrier
    lock.flag[id] = false;
}

In addition to the use of volatile on the lock variables, the mfence instruction (in peterson_lock) is necessary to prevent #StoreLoad reordering.
This shows a rare case where an algorithm requires sequential consistency; i.e. operations on the lock variables must take place in a single total order.

The use of volatile is based on non-portable (but 'almost' correct) properties of gcc/X86.
"'almost' correct" because even though a volatile store on X86 is a release operation on CPU level, the compiler can still reorder operations on volatile and non-volatile data.

For that reason, I added a compiler barrier before resetting lock.flag[id] in peterson_unlock.

But it is probably a good idea to use volatile on all data that is shared between threads using this algorithm,
because otherwise the compiler may keep loads and stores of non-volatile data in a CPU register only.

Note that with the use of volatile on shared data, the compiler barrier in peterson_unlock becomes redundant.


