Can num++ be atomic for 'int num'?

Can num++ be atomic for 'int num'?

This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about reordering. See below for some example code and asm output using fetch_add.


But first, the assembly language part of the question:

Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?

Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.

Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See @Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
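A minimal sketch (not from the original answer; the unsynchronized increments are a data race and therefore UB in C++, so this only demonstrates the breakage) that typically loses counts on real multi-core hardware:

#include <iostream>
#include <thread>

int num = 0;   // plain int: concurrent unsynchronized increments are a data race (UB)

void hammer() {
    for (int i = 0; i < 40000; ++i)
        num++;   // non-atomic load / inc / store: increments from the other thread can be lost
}

int main() {
    std::thread a(hammer), b(hammer);
    a.join();
    b.join();
    std::cout << num << '\n';   // often prints well under 80000 on a multi-core machine
}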


"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.

The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).

So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.

Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.

(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)

Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)


On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.



Back to C++:

It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:

;; Valid compiler output for num++
mov eax, [num]    ; load the old value into a register
inc eax           ; increment the register
mov [num], eax    ; store it back (another core's store in between is lost)

This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.
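For example (a hypothetical snippet, not from the original question, useful for inspecting code-gen):

int num;
int consume(int);   // hypothetical out-of-line function, so the call can't be optimized away

int increment_and_use() {
    num++;                 // compiles to load / inc / store, because ...
    return consume(num);   // ... the new value stays live in a register for this call,
                           // instead of being reloaded from memory
}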

(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as the three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit the separate load/inc/store sequence, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF).)



Don't confuse the target memory model (x86) with the C++ memory model

Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.

Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.
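A hedged sketch of that classic pattern (the variable names are mine, not from the question):

#include <atomic>

int payload;                // plain data, written before the flag is set
std::atomic<int> flag{0};

void producer() {
    payload = 42;                              // 1. write the data
    flag.store(1, std::memory_order_release);  // 2. release-store: the payload write
                                               //    can't be reordered after this
}

void consumer() {
    while (flag.load(std::memory_order_acquire) == 0) {}  // acquire pairs with the release
    int x = payload;   // safe: the acquire/release pair guarantees we see payload == 42
    (void)x;
}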

You might be expecting that this code will synchronize with other threads:

// int flag;   is just a plain global, not std::atomic<int>.
flag--;                          // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo);   // doesn't look at flag, and the compiler knows this
                                 // (assume it can see the function def); otherwise the usual
                                 // don't-break-single-threaded-code rules come into play!
flag++;

But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
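For contrast, a sketch of what that snippet would need to look like to actually constrain the compiler (still not a usable lock, since nothing waits, but the atomic RMWs can no longer be moved or optimized away):

#include <atomic>

struct Foo { int x; };
Foo foo;
std::atomic<int> flag{1};

void modify_a_data_structure(Foo* f) { f->x++; }   // stand-in body for the sketch

void locked_modify() {
    flag.fetch_sub(1, std::memory_order_acquire);   // atomic RMW: the modification can't be hoisted above this
    modify_a_data_structure(&foo);
    flag.fetch_add(1, std::memory_order_release);   // atomic RMW: the modification can't sink below this
}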

(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously, similar to volatile, but there's much more to it than that. In practice there are similarities between volatile int and std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs. Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value), so volatile atomic wouldn't change the code-gen.)

Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and do many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog post for more about how UB enables compiler optimizations.)


As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.

This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.

See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.

#include <atomic>
std::atomic<int> num;

void inc_relaxed() {
    num.fetch_add(1, std::memory_order_relaxed);
}
void inc_seq_cst() {
    num++;   // same as num.fetch_add(1, std::memory_order_seq_cst)
}

int load_num() { return num; }   // Even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
    num.store(val, std::memory_order_release);
}
void store_num_relaxed(int val){
    num.store(val, std::memory_order_relaxed);
}
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting the x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
    lock add DWORD PTR num[rip], 1   #### Even relaxed RMWs need lock. There's no way to request
                                     # a single-instruction RMW with no lock, e.g. for synchronizing
                                     # between a program and a signal handler. :/ There is
                                     # atomic_signal_fence for ordering, but nothing for RMW.
    ret
inc_seq_cst():
    lock add DWORD PTR num[rip], 1
    ret
load_num():
    mov eax, DWORD PTR num[rip]
    ret
store_num(int):
    mov DWORD PTR num[rip], edi
    mfence                           ##### seq_cst stores need an mfence
    ret
store_num_release(int):
    mov DWORD PTR num[rip], edi
    ret                              ##### Release and weaker don't.
store_num_relaxed(int):
    mov DWORD PTR num[rip], edi
    ret

Notice how MFENCE (a full barrier) is needed after a sequential-consistency store. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
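A sketch of the kind of litmus test Preshing uses (variable names are mine): with only release stores and acquire loads (plain mov on x86), both threads can observe 0, which is only possible if each store was reordered after the other thread's load by the store buffer.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_release);    // plain mov store on x86
    r1 = y.load(std::memory_order_acquire);   // can take its value before the store
                                              // commits to cache (StoreLoad reordering)
}

void thread2() {
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);
}

// Run thread1 and thread2 concurrently: r1 == 0 && r2 == 0 is possible.
// With seq_cst stores, the compiler adds mfence (or uses xchg), ruling that out.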


Re: discussion in comments on @Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:

A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.

Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.

The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do it safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).


Getting back to num++; num-=2; compiling as if it were num--:
Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).

I.e. if an ordering in which nothing becomes globally visible between these operations is compatible with the ordering requirements of the source (according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.

num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).

As discussed in P0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for the entire lifespan of the temporary).

Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.


With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)

void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
    num.fetch_add( 1, std::memory_order_relaxed);
    num.fetch_add(-1, std::memory_order_relaxed);
    num.fetch_add( 6, std::memory_order_relaxed);
    num.fetch_add(-5, std::memory_order_relaxed);
    //num.fetch_add(-1, std::memory_order_relaxed);
}

multiple_ops_relaxed(std::atomic<unsigned int>&):
    lock add DWORD PTR [rdi], 1
    lock sub DWORD PTR [rdi], 1
    lock add DWORD PTR [rdi], 6
    lock sub DWORD PTR [rdi], 5
    ret

Is incrementing an integer atomic in x86? [duplicate]

The increment-memory machine instruction on x86 is atomic only if you use it with a LOCK prefix.

x++ in C and C++ doesn't have atomic behaviour. If you do unlocked increments, two processors attempting an increment at the same time can race on reading and writing x: the second processor may read the initial value, increment it, and write it back after the first has written its result, so only one of the two increments takes effect.

C++11 offers atomic increments via std::atomic, and most vendor compilers also have an idiomatic way (intrinsics) to do an atomic increment of certain built-in integer types (typically int and long); see your compiler reference manual.

If you want to increment a "large value" (say, a multiprecision integer), you need to do so using some standard locking mechanism such as a semaphore.

Note that you need to worry about atomic reads, too. On x86, reading a 32- or 64-bit value is atomic if it is naturally aligned. That won't be true of a "large value"; again you'll need some standard lock.
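A minimal sketch of that approach using std::mutex (the type name and limb layout are mine, just to illustrate that every access must take the same lock):

#include <cstdint>
#include <mutex>

struct BigCounter {
    std::mutex m;
    uint64_t parts[4];   // stand-in for a multiprecision integer

    void increment() {
        std::lock_guard<std::mutex> lk(m);   // all readers and writers must take this lock
        for (auto& p : parts)                // carry-propagating increment across the limbs
            if (++p != 0) break;             // stop once there's no carry out
    }

    uint64_t low() {
        std::lock_guard<std::mutex> lk(m);   // even a read needs the lock for a "large value"
        return parts[0];
    }
};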

Are +=, |=, &= etc atomic? [duplicate]

You are wrong. There is no guarantee that ++ is atomic and neither is there for the compound assignment operators, or indeed for any C++ operation.

std::atomic in a union with another character

The code comments said that since a character can alias anything, we can safely operate on the atomic variable if we promise not to change the last few bits of the byte.

Those comments are wrong. char-can-alias-anything doesn't stop this from being a data race on a non-atomic variable, so it's not allowed in theory, and worse, it's actually broken when compiled by any normal compiler (like gcc, clang, or MSVC) for any normal CPU (like x86).

The unit of atomicity is the memory location, not bits within a memory location. The ISO C++11 standard defines "memory location" carefully, so adjacent elements in a char[] array or a struct are separate locations (and thus it's not a race if two threads write c[0] and c[1] without synchronization). But adjacent bit-fields in a struct are not separate memory locations, and using |= on a non-atomic char aliased to the same address as an atomic<char> is definitely the same memory location, regardless of which bits are set in the right-hand side of the |=.
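A short illustration of those rules (hypothetical struct):

struct S {
    char a;          // separate memory location: thread 1 may write 'a' ...
    char b;          // ... while thread 2 writes 'b' -- not a data race
    int f1 : 4;      // f1 and f2 are adjacent bit-fields: they share ONE memory
    int f2 : 4;      // location, so unsynchronized concurrent writes to f1 and f2
                     // ARE a data race (UB)
};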

For a program to be free of data-race UB, if a memory location is written by any thread, all other threads that access that memory location (potentially) simultaneously must do so with atomic operations. (And probably also through the exact same object: changing the middle byte of an atomic<int> by type-punning to atomic<char> isn't guaranteed to be safe either. On most implementations on hardware similar to "normal" modern CPUs, type-punning to a different atomic type might happen to still be atomic if atomic<int/char> are both lock-free, but the memory-ordering semantics might actually be broken, especially if the objects aren't perfectly overlapping.)

Also, union type-punning in general is not allowed in ISO C++. I think you actually need to pointer-cast to char*, not make unions with char. Union type-punning is allowed in ISO C99, and as a GNU extension in GNU C89 and GNU C++, and also in some other C++ implementations.


So that takes care of the theory, but does any of this work on current CPUs? No, it's totally unsafe in practice, too.

character |= 1 will (on normal computers) compile to asm that loads the whole char, modifies the temporary, then stores the value back. On x86 this can all happen within one instruction (a memory-destination or) if the compiler chooses to do that (which it won't if it also wants the value later). But even so, it's still a non-atomic RMW which can step on modifications to other bits.

Atomicity is expensive and optional for read-modify-write operations, and the only way to set some bits in a byte without affecting others is a read-modify-write on current CPUs. Compilers only emit asm that does it atomically if you specifically request it. (Unlike pure stores or pure loads, which are often naturally atomic. But always use std::atomic to get the other semantics you want...)
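Requesting the atomic version explicitly is straightforward (a sketch; on x86 this typically compiles to a single lock or instruction):

#include <atomic>

std::atomic<char> c;

void set_low_bit() {
    c.fetch_or(1, std::memory_order_relaxed);   // atomic RMW: other bits can't be stepped on
}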

Consider this sequence of events:

 thread A                       | thread B
--------------------------------|--------------------------------------------
 read tmp = c = 0b0000          |
                                | c |= 0b1100   # atomic RMW, leaving c = 0b1100
 tmp |= 1      # tmp = 0b0001   |
 store c = tmp # c = 0b0001     |

Leaving c = 0b0001, not the 0b1101 you're hoping for. I.e. the non-atomic load/store of the high bits stepped on the modification by thread B.


We get asm that can do exactly that, from compiling the source snippets from the question (on the Godbolt compiler explorer):

void t1(U &v, unsigned mask) {
    // thread using the atomic
    char old_value = v.atomic.load(std::memory_order_relaxed);
    // compare_exchange_weak uses memory_order_seq_cst by default
    while (!v.atomic.compare_exchange_weak(old_value, old_value | mask)) {}

    // v.atomic |= mask;   // would have been easier and more efficient than a CAS loop
}

t1(U&, unsigned int):
    movzx eax, BYTE PTR [rdi]         # atomic load of the old value
.L2:
    mov edx, eax
    or edx, esi                       # esi = mask (register arg)
    lock cmpxchg BYTE PTR [rdi], dl   # atomic CAS; uses AL implicitly as the expected value,
                                      # same semantics as C++11 comp_exg seq_cst
    jne .L2
    ret


void t2(U &v) {
    // thread using the character
    v.character |= 0b1;   // set the 1st bit or something
}

t2(U&):
    or BYTE PTR [rdi], 1   # NON-ATOMIC RMW of the whole byte.
    ret

It would be straightforward to write a program that ran v.character |= 1 in one thread, and an atomic v.atomic ^= 0b1100000 (or the equivalent with a CAS loop) in another thread.

If this code were safe, you'd always find that an even number of XOR operations modifying only the high bits left them zero. But you wouldn't find that, because the non-atomic or in the other thread might have stepped on an odd number of XOR operations. Or, to make the problem easier to see, use addition with 0x10 or something, so instead of a 50% chance of being right by accident, you only have a 1-in-16 chance of the upper 4 bits being right.
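A sketch of such a test program (a hypothetical harness; the union access itself is the data-race UB being demonstrated, so a correct run is not guaranteed by anything):

#include <atomic>
#include <cstdio>
#include <thread>

union U {
    std::atomic<char> atomic;
    char character;
};
U v{};   // zero-initialized

int main() {
    std::thread xor_thread([]{
        for (int i = 0; i < 1000000; ++i)   // even number of XORs: high bits should end as 0
            v.atomic.fetch_xor(0b1100000, std::memory_order_relaxed);
    });
    std::thread or_thread([]{
        for (int i = 0; i < 1000000; ++i)
            v.character |= 1;               // non-atomic RMW of the whole byte
    });
    xor_thread.join();
    or_thread.join();
    // If the aliasing were safe, the high bits would always be zero here. In practice
    // the non-atomic OR can overwrite an odd number of the atomic XOR updates.
    std::printf("final = %#x\n", (unsigned)(unsigned char)v.atomic.load());
}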

This is pretty much exactly the same problem as losing counts when one of the increment operations is non-atomic.


Will the cache have to be flushed for the thread that is trying to load the atomic integer?

No, that's not how atomicity works. The problem isn't cache; it's that unless the CPU does something special, nothing stops other CPUs from reading or writing the location between when you load the old value and when you store the updated value. You'd have the same problem on a multi-core system with no cache.

Of course, all systems do use cache, but cache is coherent so there's a hardware protocol (MESI) that stops different cores from simultaneously having conflicting values. When a store commits to L1D cache, it becomes globally visible. See Can num++ be atomic for 'int num'? for the details.

In C is i+=1; atomic? [duplicate]

The C standard does not define whether it is atomic or not.

In practice, you never write code which fails if a given operation is atomic, but you might well write code which fails if it isn't. So assume it isn't.

Updating an atomic int, what memory ordering do I have to use?

I think you need atomic_fetch_add here, or else you're not getting atomicity.
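For a simple event counter that doesn't guard any other data, a relaxed fetch_add is sufficient (a minimal sketch):

#include <atomic>

std::atomic<int> counter{0};

void on_event() {
    // Atomic RMW, so no increments are lost; memory_order_relaxed is fine
    // because the counter isn't used to order access to anything else.
    counter.fetch_add(1, std::memory_order_relaxed);
}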


