Can a Bool Read/Write Operation Be Not Atomic on X86

Can a bool read/write operation be not atomic on x86?

It all depends on what you actually mean by the word "atomic".

Do you mean "the final value will be updated in one go" (yes, on x86 that's definitely guaranteed for a byte value - and any correctly aligned value up to 64 bits at least), or "if I set this to true (or false), no other thread will read a different value after I've set it" (that's not quite such a certainty - you need a "lock" prefix to guarantee that).

When do I really need to use atomic bool instead of bool?

No type in C++ is "atomic by nature" unless it is an std::atomic*-something. That's because the standard says so.

In practice, the actual hardware instructions that are emitted to manipulate an std::atomic<bool> may (or may not) be the same as those for an ordinary bool, but being atomic is a larger concept with wider ramifications (e.g. restrictions on compiler re-ordering). Furthermore, some operations (like negation) are overloaded on the atomic operation to create a distinctly different instruction on the hardware than the native, non-atomic read-modify-write sequence of a non-atomic variable.

C++: Is the assignment of a value to a primitive data type (e.g. bool) an atomic operation?

Even though the C++ standard does not mandate so, it is not possible in practice to get a tearing effect "in the middle" of a bool on x86, i.e. change the value only partially while another thread is accessing it. It is, however, possible that a CPU is maintaining more than one copy of it, as a cache for each core for example. Hence, one thread could be "seeing" an older value after another thread finished changing it to a new one. std::atomic (and specifically in your case std::atomic<bool>) provides you with memory barriers to remedy this issue.

Is this the correct way to atomically read and write a bool?

Atomics are intrinsically non-portable and these are GCC extensions that might not exist anymore in the future and won't work on other compilers.

You should read the rest of the answer only once you understood completely the above statement.

A notable fact is that ALL machines existing have always guaranteed that access to data up to a certain size is atomic. This comes from the basic notion that data in memory and in the system bus is transfered with a certain granularity. A boolean should definitively be atomic in most machines, so:

bool ATOMIC_BOOL_READ(volatile bool* b) {
    bool v = *b;
    __sync_synchronize(); // ensure value pushed to memory
    return v;
}

void ATOMIC_BOOL_WRITE(volatile bool* b, bool v) {
    __sync_synchronize(); // read will return fresh value
   *b = v;
}

That is probably the reason why GCC does not provide simple load/store special atomic operations: they are already supposed to be atomic.

std::atomic bool execution guarantee?

It is a basic assumption that the compiler and processor ensures that the programmed operations are executed. This has nothing to do with std::atomic<>. The guarantee which std::atomic<> offers is that single operations happen atomically.

So what does that mean?

Consider two threads, A and B, which both increment the same integer variable. This operation typically involves reading the integer, adding 1 and writing the result (notice that this is only an example, the C++ standard does not say anything about how operations are broken into atomic steps and thus we cannot make any assumptions about it based on the standard).

In this case we would have the steps "read", "add one" and "write". For each thread, these steps are guaranteed to the executed in that order, but there is no guarantee how the steps are interleaved. It may be:

   B: read
   B: add1
   B: write
   A: read
   A: add1
   A: write

which results in the integer being incremented twice.

It could also be

   A: read
   A: add1
   B: read
   B: add1
   B: write
   A: write

which would result in the integer being incremented only once.

Thus this implementation would have a race condition.

To get rid of that we can use a std::atomic<int> instead of a plain int for the integer. std::atomic<int> implements the ++ operator and the guarantee which std::atomic<> provides is that incrementing in this case will happen atomically.

In the example, this means that the sequence of steps - read, add one, write - will not be interrupted by another thread. There is still no guarantee about the order of execution between the threads. Hence, we could have

   A: read
   A: add1
   A: write
   B: read
   B: add1
   B: write

   B: read
   B: add1
   B: write
   A: read
   A: add1
   A: write

but other combinations will not be possible and in both cases the integer will be incremented twice. Thus there is no race condition.

Do I have to use atomic bool for exit bool variable?

Do I have to use atomic for “exit” bool variable?

Yes.

Either use atomic<bool>, or use manual synchronization through (for instance) an std::mutex. Your program currently contains a data race, with one thread potentially reading a variable while another thread is writing it. This is Undefined Behavior.

Per Paragraph 1.10/21 of the C++11 Standard:

The execution of a program contains a data race if it contains two conflicting actions in different threads,
at least one of which is not atomic, and neither happens before the other. Any such data race results in
undefined behavior.

The definition of "conflicting" is given in Paragraph 1.10/4:

Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one
accesses or modifies the same memory location.

Why is integer assignment on a naturally aligned variable atomic on x86?

"Natural" alignment means aligned to its own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. page, cache-line, or an even narrower chunk size used for data transfers between different caches).

CPUs often do things like cache-access, or cache-line transfers between cores, in power-of-2 sized chunks, so alignment boundaries smaller than a cache line do matter. (See @BeeOnRope's comments below). See also Atomicity on x86 for more details on how CPUs implement atomic loads or stores internally, and Can num++ be atomic for 'int num'? for more about how atomic RMW operations like atomic<int>::fetch_add() / lock xadd are implemented internally.

First, this assumes that the int is updated with a single store instruction, rather than writing different bytes separately. This is part of what std::atomic guarantees, but that plain C or C++ doesn't. It will normally be the case, though. The x86-64 System V ABI doesn't forbid compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B. For example, x = a<<16 | b could compile to two separate 16-bit stores if the compiler wanted.

Data races are Undefined Behaviour in both C and C++, so compilers can and do assume that memory is not asynchronously modified. For code that is guaranteed not to break, use C11 stdatomic or C++11 std::atomic. Otherwise the compiler will just keep a value in a register instead of reloading every time your read it, like volatile but with actual guarantees and official support from the language standard.

Before C++11, atomic ops were usually done with volatile or other things, and a healthy dose of "works on compilers we care about", so C++11 was a huge step forward. Now you no longer have to care about what a compiler does for plain int; just use atomic<int>. If you find old guides talking about atomicity of int, they probably predate C++11. When to use volatile with multi threading? explains why that works in practice, and that atomic<T> with memory_order_relaxed is the modern way to get the same functionality.

std::atomic<int> shared;  // shared variable (compiler ensures alignment)

int x;           // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x;  // don't do that unless you actually need seq_cst, because MFENCE or XCHG is much slower than a simple store

Side-note: for atomic<T> larger than the CPU can do atomically (so .is_lock_free() is false), see Where is the lock for a std::atomic?. int and int64_t / uint64_t are lock-free on all the major x86 compilers, though.

Thus, we just need to talk about the behaviour of an instruction like mov [shared], eax.

TL;DR: The x86 ISA guarantees that naturally-aligned stores and loads are atomic, up to 64bits wide. So compilers can use ordinary stores/loads as long as they ensure that std::atomic<T> has natural alignment.

(But note that i386 gcc -m32 fails to do that for C11 _Atomic 64-bit types inside structs, only aligning them to 4B, so atomic_llong can be non-atomic in some cases. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4). g++ -m32 with std::atomic is fine, at least in g++5 because https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147 was fixed in 2015 by a change to the <atomic> header. That didn't change the C11 behaviour, though.)

IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".

From the "Intel® 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the x86 tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)

In x86 terminology, a "word" is two 8-bit bytes. 32 bits are a double-word, or DWORD.

###Section 8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory
operations will always be carried out atomically:

Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary (This is another way of saying "natural alignment")

That last point that I bolded is the answer to your question: This behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).

The rest of the section provides further guarantees for newer Intel CPUs: Pentium widens this guarantee to 64 bits.

The
Pentium processor (and newer processors since) guarantees that the
following additional memory operations will always be carried out
atomically:

Reading or writing a quadword aligned on a 64-bit boundary
(e.g. x87 load/store of a double, or cmpxchg8b (which was new in Pentium P5))
16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:

"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using
multiple memory accesses."

AMD's manual agrees with Intel's about aligned 64-bit and narrower loads/stores being atomic

So integer, x87, and MMX/SSE loads/stores up to 64b, even in 32-bit or 16-bit mode (e.g. movq, movsd, movhps, pinsrq, extractps, etc.) are atomic if the data is aligned. gcc -m32 uses movq xmm, [mem] to implement atomic 64-bit loads for things like std::atomic<int64_t>. Clang4.0 -m32 unfortunately uses lock cmpxchg8b bug 33109.

On some CPUs with 128b or 256b internal data paths (between execution units and L1, and between different caches), 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard or easily queryable at run-time, unfortunately for compilers implementing std::atomic<__int128> or 16B structs.

(Update: x86 vendors have decided that the AVX feature bit also indicates atomic 128-bit aligned loads/stores. Before that we only had https://rigtorp.se/isatomic/ experimental testing to verify it.)

If you want atomic 128b across all x86 systems, you must use lock cmpxchg16b (available only in 64bit mode). (And it wasn't available in the first-gen x86-64 CPUs. You need to use -mcx16 with GCC/Clang for them to emit it.)

Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 (K10) with threads running on separate sockets, connected with HyperTransport.

Intel's and AMD's manuals diverge for unaligned access to cacheable memory. The common subset for all x86 CPUs is the AMD rule. Cacheable means write-back or write-through memory regions, not uncacheable or write-combining, as set with PAT or MTRR regions. They don't mean that the cache-line has to already be hot in L1 cache.

Intel P6 and later guarantee atomicity for cacheable loads/stores up to 64 bits as long as they're within a single cache-line (64B, or 32B on very old CPUs like Pentium III).
AMD guarantees atomicity for cacheable loads/stores that fit within a single 8B-aligned chunk. That makes sense, because we know from the 16B-store test on multi-socket Opteron that HyperTransport only transfers in 8B chunks, and doesn't lock while transferring to prevent tearing. (See above). I guess lock cmpxchg16b must be handled specially.

Possibly related: AMD uses MOESI to share dirty cache-lines directly between caches in different cores, so one core can be reading from its valid copy of a cache line while updates to it are coming in from another cache.

Intel uses MESIF, which requires dirty data to propagate out to the large shared inclusive L3 cache which acts as a backstop for coherency traffic. L3 is tag-inclusive of per-core L2/L1 caches, even for lines that have to be in the Invalid state in L3 because of being M or E in a per-core L1 cache. The data path between L3 and per-core caches is only 32B wide in Haswell/Skylake, so it must buffer or something to avoid a write to L3 from one core happening between reads of two halves of a cache line, which could cause tearing at the 32B boundary.

The relevant sections of the manuals:

The P6 family processors (and newer Intel processors
since) guarantee that the following additional memory operation will
always be carried out atomically:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

AMD64 Manual 7.3.2 Access Atomicity

Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor
model, as are misaligned loads or stores of less than a quadword that
are contained entirely within a naturally-aligned quadword

Notice that AMD guarantees atomicity for any load smaller than a qword, but Intel only for power-of-2 sizes. 32-bit protected mode and 64-bit long mode can load a 48 bit m16:32 as a memory operand into cs:eip with far-call or far-jmp. (And far-call pushes stuff on the stack.) IDK if this counts as a single 48-bit access or separate 16 and 32-bit.

There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the x86 tag wiki). It's not usefully skimmable since they define some symbols to express things in their own notation, and I haven't tried to really read it. IDK if it describes the atomicity rules, or if it's only concerned with memory ordering.

Atomic Read-Modify-Write

I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).

To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic. Also note that even if cmpxchg8b without lock does a single atomic load (and optionally a store), it's not safe in general to use it as a 64b load with expected=desired. If the value in memory happens to match your expected, you'll get a non-atomic read-modify-write of that location.

The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.

(lock is implicit in xchg reg, [mem], so don't use xchg with mem to save code-size or instruction count unless performance is irrelevant. Only use it when you want the memory barrier and/or the atomic exchange, or when code-size is the only thing that matters, e.g. in a boot sector.)

See also: Can num++ be atomic for 'int num'?

Why `lock mov [mem], reg` doesn't exist for atomic unaligned stores

From the instruction reference manual (Intel x86 manual vol2), cmpxchg:

This instruction can be used with a LOCK prefix to allow the
instruction to be executed atomically. To simplify the interface to
the processor’s bus, the destination operand receives a write cycle
without regard to the result of the comparison. The destination
operand is written back if the comparison fails; otherwise, the source
operand is written into the destination. (The processor never produces
a locked read without also producing a locked write.)

This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.

The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.

Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. i.e. x86 can't do relaxed atomic RMW (without flushing the store buffer). Other ISAs can, so using .fetch_add(1, memory_order_relaxed) can be faster on non-x86.

Fun fact: Before mfence existed, a common idiom was lock add dword [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE as a stand-alone memory barrier, especially on AMD CPUs.

xchg [mem], reg is probably the most efficient way to implement a sequential-consistency store, vs. mov+mfence, on both Intel and AMD. mfence on Skylake at least blocks out-of-order execution of non-memory instructions, but xchg and other locked ops don't. Compilers other than gcc do use xchg for stores, even when they don't care about reading the old value.

Motivation for this design decision:

Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee for aligned accesses of bus-width or smaller.

For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.

Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.

Since you linked one in the question, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. asm for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.

Atomic operation propagation/visibility (atomic load vs atomic RMW load)

After some discussion, here are my findings: First, let's define what an atomic variable's latest value means: In wall-clock time, the very latest write to an atomic variable, so, from an external observer's point of view. If there are multiple simultaneous last writes (i.e., on multiple cores during the same cycle), then it doesn't really matter which one of them is chosen.

Atomic loads of any memory order have no guarantee that the latest value is read. This means that writes have to propagate before you can access them. This propagation may be out of order with respect to the order in which they were executed, as well as differing in order with respect to different observers.
```
std::atomic_int counter = 0;

void thread()
{
    // Imagine no race between read and write.
    int value = counter.load(std::memory_order_relaxed);
    counter.store(value+1, std::memory_order_relaxed);
}

for(int i = 0; i < 1000; i++)
    std::async(thread);
```
In this example, according to my understanding of the specs, even if no read-write executions were to interfere, there could still be multiple executions of thread that read the same values, so that in the end, counter would not be 1000. This is because when using normal reads, although threads are guaranteed to read modifications to the same variable in the correct order (they will not read a new value and on the next value read an older value), they are not guaranteed to read the globally latest written value to a variable.
This creates the relativity effect (as in Einstein's physics) that every thread has its own "truth", and this is exactly why we need to use sequential consistency (or acquire/release) to restore causality: If we simply use relaxed loads, then we can even have broken causality and apparent time loops, which can happen because of instruction reordering in combination with out-of-order propagation. Memory ordering will ensure that those separate realities perceived by separate threads are at least causally consistent.
Atomic read-modify-write (RMW) operations (such as exchange, compare_exchange, fetch_add,…) are guaranteed to operate on the latest value as defined above. This means that propagation of writes is forced, and results in one universal view on the memory (if all reads you make are from atomic variables using RMW operations), independent of threads. So, if you use atomic.compare_exchange_strong(value,value, std::memory_order_relaxed) or atomic.fetch_or(0, std::memory_order_relaxed), then you are guaranteed to perceive one global order of modification that encompasses all atomic variables. Note that this does not guarantee you any ordering or causality of non-RMW reads.
```
std::atomic_int counter = 0;

void thread()
{
    // Imagine no race between read and write.
    int value = counter.fetch_or(0, std::memory_order_relaxed);
    counter.store(value+1, std::memory_order_relaxed);
}

for(int i = 0; i < 1000; i++)
    std::async(thread);
```
In this example (again, under the assumption that none of the thread() executions interfere with each other), it seems to me that the spec forbids value to contain anything but the globally latest written value. So, counter would always be 1000 in the end.

Now, when to use which kind of read?

If you only need causality within each thread (there might still be different views on what happened in which order, but at least every single reader has a causally consistent view on the world), then atomic loads and acquire/release or sequential consistency suffice.

But if you also need fresh reads (so that you must never read values other than the globally (across all threads) latest value), then you should use RMW operations for reading. Those alone do not create causality for non-atomic and non-RMW reads, but all RMW reads across all threads share the exact same view on the world, which is always up to date.

So, to conclude: Use atomic loads if different world views are allowed, but if you need an objective reality, use RMWs to load.

Can a Bool Read/Write Operation Be Not Atomic on X86