Is there any compiler barrier which is equal to asm("" ::: "memory") in C++11?
re: your edit:
But I do not want to use atomic variable.
Why not? If it's for performance reasons, use them with memory_order_relaxed and atomic_signal_fence(mo_whatever) to block compiler reordering without any runtime overhead other than the compiler barrier potentially blocking some compile-time optimizations, depending on the surrounding code.
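A minimal sketch of that approach (names here are illustrative, not from the question): relaxed atomics avoid data-race UB, and on x86 the loads/stores compile to plain MOV instructions with no fence.

```cpp
#include <atomic>
#include <thread>

// Illustrative sketch: a relaxed atomic flag is a real atomic access
// (no data-race UB), but adds no barrier instructions at runtime.
std::atomic<bool> done{false};

void spin_until_done() {
    while (!done.load(std::memory_order_relaxed)) {}  // atomic, but no fence
}

int demo() {
    std::thread t([] { done.store(true, std::memory_order_relaxed); });
    spin_until_done();
    t.join();
    return done.load(std::memory_order_relaxed) ? 1 : 0;
}
```

Note that relaxed gives atomicity and eventual visibility but no ordering relative to other variables; add fences or stronger orders when ordering matters.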
If it's for some other reason, then maybe atomic_signal_fence will give you code that happens to work on your target platform. I suspect that most implementations of it do order non-atomic<> loads and stores in practice, at least as an implementation detail, and probably effectively required if there are accesses to atomic<> variables. So it might help in practice to avoid some actual consequences of any data-race Undefined Behaviour which would still exist. (e.g. as part of a SeqLock implementation where for efficiency you want to use non-atomic reads / writes of the shared data so the compiler can use SIMD vector copies, for example.)
See Who's afraid of a big bad optimizing compiler? on LWN for some details about the badness you can run into (like invented loads) if you only use compiler barriers to force reloads of non-atomic variables, instead of using something with read-exactly-once semantics. (In that article, they're talking about Linux kernel code, so they're using volatile for hand-rolled load/store atomics. But in general don't do that: When to use volatile with multi threading? - pretty much never.)
Sufficient for what?
Regardless of any barriers, if two threads run this function at the same time, your program has Undefined Behaviour because of concurrent access to non-atomic<> variables. So the only way this code can be useful is if you're talking about synchronizing with a signal handler that runs in the same thread.
That would also be consistent with asking for a "compiler barrier", to only prevent reordering at compile time, because out-of-order execution and memory reordering always preserve the behaviour of a single thread. So you never need extra barrier instructions to make sure you see your own operations in program order, you just need to stop the compiler reordering stuff at compile time. See Jeff Preshing's post: Memory Ordering at Compile Time
This is what atomic_signal_fence is for. You can use it with any std::memory_order, just like thread_fence, to get different strengths of barrier and only prevent the optimizations you need to prevent.
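A hedged sketch of the signal-handler case (variable names are made up for illustration): because the handler runs in the same thread, blocking compile-time reordering is all that's needed, and atomic_signal_fence emits no instructions.

```cpp
#include <atomic>
#include <csignal>

// Sketch: same-thread synchronization with a signal handler.
std::atomic<int> payload{0};
volatile std::sig_atomic_t observed = 0;

extern "C" void on_signal(int) {
    // Runs in the same thread that called raise(): program order is
    // preserved at runtime, so only the compiler needed restraining.
    if (payload.load(std::memory_order_relaxed) == 42)
        observed = 1;
}

int run_demo() {
    std::signal(SIGUSR1, on_signal);
    payload.store(42, std::memory_order_relaxed);
    // Compiler-only release barrier: emits no instructions, but keeps
    // the store above from sinking below the raise().
    std::atomic_signal_fence(std::memory_order_release);
    std::raise(SIGUSR1);   // handler runs synchronously in this thread
    return observed;
}
```

(SIGUSR1 assumes a POSIX platform; any signal you can raise() works the same way.)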
...
atomic_thread_fence(memory_order_acq_rel) did not generate any compiler barrier at all!
Totally wrong, in several ways.
atomic_thread_fence is a compiler barrier plus whatever run-time barriers are necessary to restrict reordering of the order in which our loads/stores become visible to other threads.
I'm guessing you mean it didn't emit any barrier instructions when you looked at the asm output for x86. Instructions like x86's MFENCE are not "compiler barriers", they're run-time memory barriers and prevent even StoreLoad reordering at run-time. (That's the only reordering that x86 allows. SFENCE and LFENCE are only needed when using weakly-ordered (NT) stores, like MOVNTPS (_mm_stream_ps).)
On a weakly-ordered ISA like ARM, thread_fence(mo_acq_rel) isn't free, and compiles to an instruction: gcc5.4 uses dmb ish. (See it on the Godbolt compiler explorer.)
A compiler barrier just prevents reordering at compile time, without necessarily preventing run-time reordering. So even on ARM, atomic_signal_fence(mo_seq_cst) compiles to no instructions.
A weak enough barrier allows the compiler to do the store to B ahead of the store to A if it wants, but gcc happens to decide to still do them in source order even with thread_fence(mo_acquire) (which shouldn't order stores with other stores).
So this example doesn't really test whether something is a compiler barrier or not.
But gcc shows strange behaviour for an example that does compile differently with a compiler barrier:
See this source+asm on Godbolt.
#include <atomic>
using namespace std;

int A, B;

void foo() {
    A = 0;
    atomic_thread_fence(memory_order_release);
    B = 1;
    //asm volatile("" ::: "memory");
    //atomic_signal_fence(memory_order_release);
    atomic_thread_fence(memory_order_release);
    A = 2;
}
This compiles with clang the way you'd expect: the thread_fence is a StoreStore barrier, so the A=0 has to happen before B=1, and can't be merged with the A=2.
# clang3.9 -O3
mov dword ptr [rip + A], 0
mov dword ptr [rip + B], 1
mov dword ptr [rip + A], 2
ret
But with gcc, the barrier has no effect, and only the final store to A is present in the asm output.
# gcc6.2 -O3
mov DWORD PTR B[rip], 1
mov DWORD PTR A[rip], 2
ret
But with atomic_signal_fence(memory_order_release), gcc's output matches clang's. So atomic_signal_fence(mo_release) is having the barrier effect we expect, but atomic_thread_fence with anything weaker than seq_cst isn't acting as a compiler barrier at all.
One theory here is that gcc knows that it's officially Undefined Behaviour for multiple threads to write to non-atomic<> variables. This doesn't hold much water, because atomic_thread_fence should still work if used to synchronize with a signal handler, it's just stronger than necessary.
BTW, with atomic_thread_fence(memory_order_seq_cst), we get the expected
# gcc6.2 -O3, with a mo_seq_cst barrier
mov DWORD PTR A[rip], 0
mov DWORD PTR B[rip], 1
mfence
mov DWORD PTR A[rip], 2
ret
We get this even with only one barrier, which would still allow the A=0 and A=2 stores to happen one after the other, so the compiler is allowed to merge them across a barrier. (Observers failing to see separate A=0 and A=2 values is a possible ordering, so the compiler can decide that's what always happens). Current compilers don't usually do this kind of optimization, though. See discussion at the end of my answer on Can num++ be atomic for 'int num'?.
Implementations for asm("nop") in Windows
I mean I don't want to add a library just to force the compiler to add a NOP.
... in a way that is independent of compiler settings (such as optimization settings) and in a way that works with all Visual C++ versions (and maybe even other compilers):
No chance: a compiler is free in how it generates code, as long as the assembly code has the behavior the C code describes.
And because the NOP instruction does not change the behavior of the program, the compiler is free to add it or to leave it out.
Even if you found a way to force the compiler to generate a NOP: one update of the compiler, or a Windows update modifying some file, and the compiler might not generate the NOP instruction any longer.
I can use inline asm to do this for x86 but I would like it to be portable.
As I wrote above, any way to force the compiler to write a NOP would only work on a certain compiler version for a certain CPU.
Using inline assembly or __nop() you might cover all compilers of a certain manufacturer (for example: all GNU C compilers, or all variants of Visual C++, etc.).
Another question would be: do you explicitly need the "official" NOP instruction, or can you live with any instruction that does nothing?
If you could live with any instruction doing (nearly) nothing, reading a global or static volatile variable could be a replacement for NOP:
static volatile char dummy;
...
else
{
    (void)dummy;
}
This should force the compiler to add a MOV instruction reading the variable dummy.
Background:
If you wrote a device driver, you could link the variable dummy to some location where reading the variable has "side-effects". Example: reading a variable located in VGA video memory can influence the screen content!
Using the volatile keyword, you not only tell the compiler that the value of the variable may change at any time, but also that reading the variable may have such effects.
This means that the compiler has to assume that not reading the variable causes the program not to work correctly. It cannot optimize away the (actually unnecessary) MOV instruction reading the variable.
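A self-contained sketch of that trick (the function name is just for illustration): both the volatile store and the volatile load count as observable behaviour, so the compiler must emit real memory accesses for them.

```cpp
// Sketch: volatile accesses cannot be optimized away, so a volatile
// read serves as a cheap "instruction that does (nearly) nothing".
static volatile char dummy;

int touch_dummy() {
    dummy = 7;     // volatile store: must be emitted as a real MOV
    return dummy;  // volatile load: must also be emitted, even at -O3
}
```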
Purpose of _Compiler_barrier() on 32-bit read
_Load_seq_cst_4 is an inline function. The compiler barrier is there to block reordering with later code in the calling function this inlines into.
For example, consider reading a SeqLock. (Over-simplified from this actual implementation).
#include <atomic>
atomic<unsigned> sequence;
atomic_long value;
long seqlock_try_read() {
    // this would normally be the body of a retry-loop
    unsigned seq1 = sequence;
    long tmpval = value;
    unsigned seq2 = sequence;
    if (seq1 == seq2 && (seq1 & 1) == 0)
        return tmpval;
    // else: writer was modifying it, we should retry the (omitted) loop
}
If we didn't block compile-time reordering, the compiler could merge both reads of sequence into a single access, perhaps like this:
    long tmpval = value;
    unsigned seq1 = sequence;
    unsigned seq2 = sequence;
This would defeat the locking mechanism (where the writer increments sequence once before modifying the data, then again when it's done). Readers are entirely lockless, but it's not a "lock-free" algorithm, because if the writer gets stuck mid-update, the readers can't read anything.
The barrier within each load function blocks reordering with other things after inlining.
(The C++11 memory model is very weak, but the x86 memory model is strong, only allowing StoreLoad reordering. Blocking compile-time reordering with later loads/stores is sufficient to give you an acquire / sequential-consistency load at runtime. x86: Are memory barriers needed here?)
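For completeness, a hedged sketch of the matching single-writer side (variable names match the reader example; this is illustrative, not the MSVC implementation): bump sequence to an odd value, modify the payload, then bump it back to even.

```cpp
#include <atomic>

// Sketch of a single-writer seqlock using explicit fences instead of
// per-function compiler barriers.
std::atomic<unsigned> sequence{0};
std::atomic<long> value{0};

void seqlock_write(long newval) {
    unsigned seq = sequence.load(std::memory_order_relaxed);
    sequence.store(seq + 1, std::memory_order_relaxed);   // odd: writer busy
    std::atomic_thread_fence(std::memory_order_release);  // payload after this
    value.store(newval, std::memory_order_relaxed);
    sequence.store(seq + 2, std::memory_order_release);   // even again: done
}

long seqlock_read_once() {   // one attempt; a real reader would loop
    unsigned seq1 = sequence.load(std::memory_order_acquire);
    long tmp = value.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    unsigned seq2 = sequence.load(std::memory_order_relaxed);
    return (seq1 == seq2 && (seq1 & 1) == 0) ? tmp : -1;  // -1 = retry
}
```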
BTW, a better example might be something where some non-atomic variables are read/written after seeing a certain value in an atomic flag. MSVC probably already avoids reordering or merging of atomic accesses, and in the seqlock the data being protected also has to be atomic.
Why don't compilers merge redundant std::atomic writes?
understanding GCC inline asm function
so we have to use an =r output operand to let the assembler auto-select a register for our variable, am I correct?
Yes, but it's the compiler that does register allocation. It just fills in the %[operand] in the asm template string as a text substitution and feeds that to the assembler.
Alternatively, you could hard-code a specific register in the asm template string, and use a register-asm local variable to make sure an "=r" constraint picked it. Or use an "=m" memory output operand and str a result into it, and declare a clobber on any registers you used. But those alternatives are obviously terrible compared to just telling the compiler how your block of asm can produce an output.
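A portable GNU C sketch of an "=r" output operand (the function name is made up): an empty template with a matching-constraint input, so the compiler picks a register and the value just passes through without any real instruction.

```cpp
// Sketch: "=r" lets the compiler choose the output register; the "0"
// matching constraint ties the input to the same register, so this
// empty asm acts only as an optimization barrier for the value.
static inline int launder_value(int x) {
    int out;
    asm("" : "=r"(out) : "0"(x));  // no instructions emitted
    return out;
}
```

This compiles with GCC and Clang on any ISA, which is why it's shown instead of an instruction-specific example.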
I don't understand why the comment says the return statement doesn't run:
/* This return will not be reached but is necessary to prevent compiler
warnings. */
return ulOriginalBASEPRI;
Raising the basepri (ARM docs) to a higher number might allow an interrupt handler to run right away, before later instructions, but if that exception ever returns, execution will eventually reach the C code outside the asm statement. That's the whole point of saving the old basepri into a register and having an output operand for it, I assume.
(I had been assuming that "raise" meant higher number = more interrupts allowed. But Ross comments that it will never allow more interrupts; they're "raising the bar" = lower number = fewer interrupts allowed.)
If execution really never comes out the end of your asm, you should tell the compiler about it. There is asm goto, but that needs a list of possible branch targets. The GCC manual says:
GCC assumes that asm execution falls through to the next statement (if this is not the case, consider using the __builtin_unreachable() intrinsic after the asm statement).
Failing to do this might lead to the compiler planning to do something after the asm, and then it never happening even though in the source it's before the asm.
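A sketch of the pattern (std::abort stands in for an asm block that never falls through; the function is hypothetical): __builtin_unreachable() tells GCC/Clang not to plan any code for the path after the non-returning call.

```cpp
#include <cstdlib>

// Sketch: marking a path as unreachable after a call that never returns.
int checked_div(int a, int b) {
    if (b == 0) {
        std::abort();              // stand-in for never-returning asm
        __builtin_unreachable();   // promise: control never reaches here
    }
    return a / b;
}
```

The intrinsic lets the compiler drop any fall-through code for that branch instead of planning work that would never execute.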
It might be a good idea to use a "memory" clobber to make sure the compiler has memory contents in sync with the C abstract machine (at least for variables other than locals, which an interrupt handler might access). This is usually desirable around asm barrier instructions like dsb, but it seems here we maybe don't care about being an SMP memory barrier, just about consistent execution after changing basepri? I don't understand why that's necessary, but if you do, then it's worth considering one way or another whether compile-time reordering of memory access around the asm statement is or isn't a problem.
You'd use a third colon-separated section in the asm statement (after the inputs): "memory". Without that, compilers might decide to do an assignment after this asm instead of before, leaving a value just in registers.
// actual C source
global_var = 1;
uint32_t oldpri = ulPortRaiseBASEPRI();
global_var = 2;
could optimize (via dead-store elimination) into asm that worked like this
// possible asm
global_var = 2;
uint32_t oldpri = ulPortRaiseBASEPRI();
// or global_var = 2; here *instead* of before the asm
difference in mfence and asm volatile("" ::: "memory")
Well, a memory barrier is only needed on architectures that have weak memory ordering. x86 and x64 don't have weak memory ordering; on x86/x64, all stores have release semantics and all loads have acquire semantics. So you should only really need asm volatile("" ::: "memory").
For a good overview of both Intel and AMD, as well as references to the relevant manufacturer specs, see http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
Generally, things like "volatile" are used on a per-field basis where loads and stores to that field are natively atomic. Where loads and stores to a field are already atomic (i.e. the "operation" in question is a load or a store to a single field, and thus the entire operation is atomic), the volatile field modifier or memory barriers are not needed on x86/x64. Portable code notwithstanding.
When it comes to "operations" that are not atomic--e.g. loads or stores to a field that is larger than a native word, or loads or stores to multiple fields within an "operation"--a means by which the operation can be viewed as atomic is required regardless of CPU architecture. Generally this is done by means of a synchronization primitive like a mutex. Mutexes (the ones I've used) include memory barriers to avoid issues like processor reordering, so you don't have to add extra memory-barrier instructions. I generally consider not using synchronization primitives a premature optimization; but the nature of premature optimization is, of course, 97% of the time :)
Where you don't use a synchronization primitive and you're dealing with a multi-field invariant, memory barriers that ensure the processor does not reorder stores and loads to different memory locations are important.
Now, in terms of not issuing an "mfence" instruction in asm volatile but using "memory" in the clobber list: from what I've been able to read,
If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory.
When they say "GCC" and don't mention anything about the CPU, this means it applies to only the compiler. The lack of "mfence" means there is no CPU memory barrier. You can verify this by disassembling the resulting binary. If no "mfence" instruction is issued (depending on the target platform) then it's clear the CPU is not being told to issue a memory fence.
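The contrast can be sketched in a few lines (function names are illustrative): the clobber-list barrier constrains only the compiler and emits zero instructions, while atomic_thread_fence(seq_cst) additionally emits a CPU fence (mfence on x86), which you can confirm by disassembling.

```cpp
#include <atomic>

// Sketch: compiler-only barrier vs. compiler + CPU barrier.
int barrier_demo() {
    int x = 1;
    asm volatile("" ::: "memory");                        // no instruction emitted
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence on x86
    return x;
}
```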
Depending on the platform you're on and what you're trying to do, there may be something "better" or more clear... portability notwithstanding.
memcpy for volatile arrays in gcc C on x86?
memcpy_volatile is not expected to be atomic. ... What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable), then the sequence (data write, pointer write) must appear in the same order to the other thread. ...
Ok, that makes the problem solvable, you're just "publishing" the memcpy stores via release/acquire synchronization.
The buffers don't need to be volatile, then, except as one way to ensure compile-time ordering before some other volatile store. Because volatile operations are only guaranteed ordered (at compile time) wrt. other volatile operations. Since it's not being concurrently accessed while you're storing, the possible gotchas in Who's afraid of a big bad optimizing compiler? aren't a factor.
To hack this into your hand-rolled atomics with volatile, use GNU C asm("" ::: "memory") as a compiler memory barrier to block compile-time reordering between the release-store and the memcpy.
volatile uint8_t *shared_var;
memcpy((char*)dest, (const char*)src, len);
asm("" ::: "memory");
shared_var = dest; // release-store
But really you're just making it inconvenient for yourself by avoiding C11 stdatomic.h for atomic_store_explicit(&shared_var, dest, memory_order_release) or GNU C __atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE), which are ordered wrt. non-atomic accesses like a memcpy. Using a memory_order other than the default seq_cst will let it compile with no overhead for x86, to the same asm you get from volatile.
The compiler knows x86's memory-ordering rules, and will take advantage of them by not using any extra barriers except for seq_cst stores. (Atomic RMWs on x86 are always full barriers, but you can't do those using volatile.)
Avoid RMW operations like x++ if you don't actually need atomicity for the whole operation; volatile x++ is more like atomic_store_explicit(&x, 1+atomic_load_explicit(&x, memory_order_acquire), memory_order_release); which is a big pain to type, but often you'd want to load into a tmp variable anyway.
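The difference can be sketched side by side (names are illustrative): a true atomic RMW vs. the separate load+store pair that volatile x++ really is, and which another thread could step between.

```cpp
#include <atomic>

// Sketch: atomic increment vs. the load/store pair hidden in volatile x++.
std::atomic<int> counter{0};

void inc_rmw() {
    counter.fetch_add(1, std::memory_order_relaxed);     // one atomic RMW
}

void inc_load_store() {
    int tmp = counter.load(std::memory_order_acquire);   // separate load...
    counter.store(tmp + 1, std::memory_order_release);   // ...then store:
    // another thread's increment between these two lines would be lost.
}
```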
If you're willing to use GNU C features like asm("" ::: "memory"), you can use its __atomic built-ins instead, without even having to change your variable declarations like you would for stdatomic.h.
volatile uint8_t *shared_var;

memcpy((char*)dest, (const char*)src, len);
// a release-store is ordered after all previous stuff in this thread
__atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE);
As a bonus, doing it this way makes your code portable to non-x86 ISAs, e.g. AArch64, where it could compile the release-store to stlr. (And no separate barrier could be that efficient.)
The key point is that there's no down-side to the generated asm for x86.
As in When to use volatile with multi threading? - never. Use atomic with memory_order_relaxed, or with acquire / release to get C-level guarantees equivalent to x86 hardware memory-ordering.
How many memory barriers do we need to implement a Peterson lock?
Nobody uses a Peterson lock on mainstream platforms because mutexes are available.
But assuming you cannot use those, and you are writing code for an old x86 platform without access to modern primitives (no memory model, no mutexes, no atomic RMW operations), this algorithm might be considered.
Your implementation of the Peterson lock is incorrect (also after swapping the lines 'Mark as A' & 'Mark as B').
If you translate the Wikipedia pseudo-code to C++, the correct implementation becomes:
typedef struct {
    volatile bool flag[2];
    volatile int victim;
} peterson_lock_t;

void peterson_lock(peterson_lock_t &lock, int id) {
    lock.flag[id] = true;
    lock.victim = 1-id;
    asm volatile ("mfence" ::: "memory"); // CPU #StoreLoad barrier
    while (lock.flag[1-id] && lock.victim == 1-id);
}

void peterson_unlock(peterson_lock_t &lock, int id) {
    asm volatile("" ::: "memory"); // compiler barrier
    lock.flag[id] = false;
}
In addition to the use of volatile on the lock variables, the mfence instruction (in peterson_lock) is necessary to prevent #StoreLoad reordering.
This shows a rare case where an algorithm requires sequential consistency; i.e. operations on the lock variables must take place in a single total order.
The use of volatile is based on non-portable (but 'almost' correct) properties of gcc/x86.
"'Almost' correct" because even though a volatile store on x86 is a release operation at the CPU level, the compiler can still reorder operations on volatile and non-volatile data.
For that reason, I added a compiler barrier before resetting lock.flag[id] in peterson_unlock.
But it is probably a good idea to use volatile on all data that is shared between threads using this algorithm, because otherwise the compiler can still keep store and load results for non-volatile data in a CPU register only.
Note that with the use of volatile on shared data, the compiler barrier in peterson_unlock becomes redundant.
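Since the algorithm requires sequential consistency, a hedged, portable sketch is to express the same Peterson lock with C++11 seq_cst atomics (the default memory order), which supply the needed #StoreLoad ordering on every ISA without inline asm. The struct and test harness below are illustrative, not from the answer above.

```cpp
#include <atomic>
#include <thread>

// Sketch: Peterson's algorithm with C++11 seq_cst atomics.
struct PetersonLock {
    std::atomic<bool> flag[2];
    std::atomic<int> victim{0};
    PetersonLock() { flag[0] = false; flag[1] = false; }

    void lock(int id) {
        flag[id].store(true);   // seq_cst by default: full ordering
        victim.store(1 - id);
        while (flag[1 - id].load() && victim.load() == 1 - id) {}
    }
    void unlock(int id) { flag[id].store(false); }
};

int run_two_threads() {
    PetersonLock lk;
    int shared = 0;   // plain int, protected by the lock
    auto work = [&](int id) {
        for (int i = 0; i < 100000; ++i) {
            lk.lock(id);
            ++shared;
            lk.unlock(id);
        }
    };
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    return shared;
}
```

On x86 the seq_cst stores compile to xchg (or mov + mfence), so this ends up with essentially the same instructions as the hand-rolled version, while staying correct on weakly-ordered ISAs too.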