Why don't compilers merge redundant std::atomic writes?

The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:

y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code

The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data-race bug, but only the garden-variety kind of bug, not the C++ Undefined Behaviour kind of data race (that's UB only with non-atomic variables). A program that merely expects to sometimes see the intermediate value is not necessarily even buggy. (See below re: progress bars.)
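For illustration, a minimal sketch of such an observer (hypothetical code; assumes y is a std::atomic<int> visible to both threads):

#include <atomic>

extern std::atomic<int> y;

void observer() {
    // May spin forever: nothing guarantees this thread runs between the
    // stores, and if the compiler coalesced them into y.store(3), the
    // value 2 is never globally visible at all.
    while (y.load(std::memory_order_relaxed) != 2) {
    }
}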

Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.

It doesn't depend on the target architecture or hardware; compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile to zero asm instructions.


So why don't compilers do this optimization?

It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.

The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
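As a sketch (hypothetical names; assumes a relaxed atomic counter polled by a UI thread), the loop in question might look like this:

#include <atomic>

std::atomic<int> progress{0};  // polled by a UI thread

void work(int total) {
    for (int i = 1; i <= total; ++i) {
        // ... one unit of work, no other atomic operations ...
        progress.store(i * 100 / total, std::memory_order_relaxed);
    }
}

A compiler permitted to coalesce atomic stores could sink all of these into a single progress.store(100, std::memory_order_relaxed) after the loop.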

There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)

Compiler-writers have correctly noticed that programmers expect an atomic store to actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) That is, coalescing would violate the principle of least surprise.

However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
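For instance (a hypothetical sketch), each iteration here copies a shared_ptr, paying an atomic ref-count increment and decrement that a compiler allowed to optimize atomics could hoist out of the loop:

#include <memory>

void use_many(const std::shared_ptr<int>& sp, int n) {
    for (int i = 0; i < n; ++i) {
        std::shared_ptr<int> local = sp;  // atomic ref-count increment
        // ... use *local ...
    }                                     // atomic ref-count decrement
}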

Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num-- would still have to be a full barrier against runtime and compile-time reordering, even if it no longer touched the memory at num.


Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:

  • http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
  • http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?

See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)


Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.

Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).

Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.

I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].

wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:

if (x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0);    // release a lock before a long-running loop
    for () {...}   // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.

Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly one store of the same value (which would then happen after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.

volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.


Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). But this reason alone is not sufficient, because N4455 says LLVM already implements, or could easily implement, several of the optimizations it mentions.

The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.

Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).

Can atomic loads be merged in the C++ memory model?

Yes, because we cannot observe the difference!

An implementation is allowed to turn your snippet into the following (pseudo-implementation).

int __loaded_foo = foo;

int x = __loaded_foo;
int y = __loaded_foo;

The reason is that there is no way for you to observe the difference between the above and two separate loads of foo, given the guarantees of sequential consistency.

Note: it is not just the compiler that can make such an optimization; the processor can simply reason that there is no way for you to observe the difference and load the value of foo once, even though the compiler might have asked it to do it twice.





Explanation

Given a thread that keeps updating foo in an incremental fashion, what you are guaranteed is that y will hold either the same value as x or a later-written one.

// thread 1 - The Writer
while (true) {
    foo += 1;
}

// thread 2 - The Reader
while (true) {
    int x = foo;
    int y = foo;

    assert(y >= x); // will never fire, unless UB (foo has reached max value)
}

Imagine the writing thread for some reason pauses its execution on every iteration (because of a context switch or another implementation-defined reason); there is no way you can prove whether that is what caused both x and y to have the same value, or whether it was a "merge optimization".


In other words, we have two potential outcomes given the code in this section:

  1. No new value is written to foo between the two reads (x == y).
  2. A new value is written to foo between the two reads (x < y).

Since either of the two can happen, an implementation is free to narrow things down and simply always execute one of them; we can in no way observe the difference.





What does the Standard say?

An implementation can make whatever changes it wants as long as we cannot observe any difference between the behavior which we expressed, and the behavior during execution.

This is covered in [intro.execution]p1:

The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.

Another section which makes it even more clear [intro.execution]p5:

A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input.

Further Reading:

  • What exactly is the "as-if rule"?





What about polling in a loop?

// initial state
std::atomic<int> foo = 0;

// thread 1
while (true) {
    if (foo)
        break;
}

// thread 2
foo = 1;

Question: Given the reasoning in the previous sections, could an implementation simply read foo once in thread 1, and then never break out of the loop even if thread 2 writes to foo?

The answer: no.

In a sequentially-consistent environment we are guaranteed that a write to foo in thread 2 will become visible in thread 1; this means that when that write has happened, thread 1 must observe this change of state.

Note: An implementation can turn two reads into a single one because we cannot observe the difference (one fence is just as effective as two), but it cannot completely disregard a read that exists by itself.

Note: The contents of this section is guaranteed by [atomics.order]p3-4.





What if I really want to prevent this form of "optimization"?

If you would like to force the implementation to actually read the value of some variable at every point where you have written it you should look into usage of volatile (note that this in no way enhances thread-safety).

But in practice compilers don't optimize atomics, and the standards group has recommended against using volatile atomic for this kind of reason until the dust settles on this issue. See

  • http://wg21.link/n4455
  • http://wg21.link/p0062
  • Why don't compilers merge redundant std::atomic writes?
  • and a duplicate of this question, Can and does the compiler optimize out two atomic loads?

Can and does the compiler optimize out two atomic loads?

Can the compiler optimize away atomic loads?
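
The run1() from the question isn't reproduced here, but it was presumably along these lines (a reconstruction; names and memory order taken from the answer below):

#include <atomic>

std::atomic<int> atomic_var;

void run1() {
    auto a = atomic_var.load(std::memory_order_relaxed);
    auto b = atomic_var.load(std::memory_order_relaxed);
    // Some code using a and b;
}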

Your implementation of run1() can be safely optimized to

void run1() {
    auto a = atomic_var.load(std::memory_order_relaxed);
    auto b = a;
    // Some code using a and b;
}

In the original program the two loads could possibly be adjacent to each other in the total order of accesses on atomic_var every time run1() is called. In that case the adjacent load() operations would return the same result.

Since that possibility cannot be excluded, the compiler is allowed to optimize away the second load(). This can be done for any memory order argument, not just for relaxed atomics.

For run2() it depends. You didn't specify /*some conditions*/. If there's something that might have a visible side effect on the atomic variable (like an opaque function call or an access to a volatile variable, etc.), then this cannot be optimized away. Otherwise it might be possible.

Does the compiler optimize out two atomic loads?

Depends on your compiler, possibly on the compiler options you passed in, and possibly on your platform. There is some debate going on about whether compilers should optimize atomics. There is N4455 No Sane Compiler Would Optimize Atomics and this video as a start on the topic.

GCC and clang don't optimize the two load() operations into one at the moment.

Are these allowed optimizations in C++?

In theory, yes, and even yield does not help.

But in practice, no, not today; it may become possible in the future.

See:

  • N4455 No Sane Compiler Would Optimize Atomics
  • P0062R1: When should compilers optimize atomics?

"Runtime optimization" may happen if modifications coalesce. I don't know if this situation may happen in practice or not. Anyway it is not much distinguishable from "no other threads manage to observe modified value before it changes back"

In fact, the optimization effect is equivalent to "no other threads manage to observe modified value before it changes back", no matter if it is compiler optimization, or runtime. Yield is not guaranteed to help, at least because it just "gives opportunity to reschedule", which an implementation might decide to ignore. And in theory there's no synchronization between yield and an atomic operation.

On the other hand what do you expect to achieve with this?

Why don't C++ compilers optimize away reads and writes to struct data members as opposed to distinct local variables?

This is because void process_value(double& ref_value); accepts its argument by reference. The compiler/optimizer must assume aliasing, i.e. that process_value can change memory reachable through ref_value, and hence the size member stored after the array.

The compiler assumes that, because the array and size are members of one and the same object array_wrapper, process_value could convert the reference to the first element (on the first invocation) into a reference to the whole object (and store it elsewhere), then access the object as unsigned char and read or replace its entire representation. So after the function returns, the state of the object must be reloaded from memory.

When size is a stand-alone object on the stack, the compiler/optimizer assumes that nothing else could possibly hold a reference/pointer to it, and caches it in a register.
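For context, the setup under discussion presumably resembles this (a reconstruction with hypothetical names matching the prose):

struct array_wrapper {
    double arr[1000];
    unsigned size;  // lives in memory right after the array
};

void process_value(double& ref_value);

void process(array_wrapper& w) {
    for (unsigned i = 0; i < w.size; ++i)  // w.size must be re-read after every call
        process_value(w.arr[i]);
}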

In Chandler Carruth's talk Optimizing the Emergent Structures of C++, he explains why optimizers have difficulty with calls to functions that accept reference/pointer arguments. Use reference/pointer function arguments only when absolutely necessary.

If you would like to change the value, the more performant option is:

double process_value(double value);

And then:

array_wrapper.arr[i] = process_value(array_wrapper.arr[i]);

This change results in optimal assembly:

.L23:
        movsd   xmm0, QWORD PTR [rbx]
        add     rbx, 8
        call    process_value2(double)
        movsd   QWORD PTR [rbx-8], xmm0
        cmp     rbx, rbp
        jne     .L23

Or:

for (double& val : arr)
    val = process_value(val);

Is gcc-c++ not optimizing atomic operations for current x86-64 processors

The reason for it is that the default use of std::atomic also implies a memory order:

std::memory_order order = std::memory_order_seq_cst

To achieve this consistency, the compiler has to tell the processor not to reorder instructions, and it does so by using the mfence instruction.

Change your

    a = assign;

to

    a.store(assign, std::memory_order_relaxed);

and your output will change from

process_two():
        mov     QWORD PTR [rsp-8], 42
        mfence
        mov     rax, QWORD PTR [rsp-8]
        ret

to

process_two():
        mov     QWORD PTR [rsp-8], 42
        mov     rax, QWORD PTR [rsp-8]
        ret

Just as you expected it to be.
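For reference, the source being compiled is presumably something like this (a reconstruction based on the asm; assign and the exact types are assumptions):

#include <atomic>
#include <cstdint>

int64_t process_two() {
    std::atomic<int64_t> a;
    int64_t assign = 42;
    a = assign;  // seq_cst store: mov + mfence on x86-64
    return a;    // seq_cst load: a plain mov suffices on x86-64
}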

C++ atomics and memory_order with RDMA

Well, there seems to be a lack of expertise on this issue at this time. Probably the newness of the std::atomic options and the general sense of uncertainty about precisely how ARM and AMD will implement relaxed consistency make it hard for people to know the answer, and speculation isn't helpful.

As I'm understanding this, the right answers seem to be:

  1. The entire problem won't be seen on Intel because of its TSO (total store order) policy. With TSO, the guard gets updated after the vector it guards, so in any total store order the guard was updated last; seeing the guard change therefore guarantees that the receiver will see the updated vector elements. Moreover, the default on AMD and ARM is likely to mimic TSO.
  2. By explicitly declaring the RDMA memory region to have relaxed_consistency, a developer is opting for a cheaper memory model, but taking on the obligation to insert a memory fence. The most obvious way to do this is to just acquire a lock before reading the guard, then release the lock after doing so. This has a cost even if no other thread contends for the lock. First, the lock operation itself requires a few clock cycles. But more broadly, locking a random mutex will have some unknown impact on caches because the hardware must assume that the lock actually is contended for, a wait may have occurred, and values could have changed under its feet.
    This will result in a cost that needs to be quantified.
  3. Equivalently, the guard can be declared to use acquire_release consistency. Seemingly, this creates a memory fence and the prior updates used to write the vector will be visible to any reader who has seen the guard value change. Again, cost needs to be quantified.
  4. Perhaps, one could do a fenced read at the top of the code block triggered by the predicate. This would get the fence out of the main predicate loop, so the costs of the fence would only be paid once, and only paid when the predicate actually is true.

We also need to tag our atomics as volatile in C++. In fact, C++ probably should notice when a std::atomic type is accessed, and treat that like access to a volatile. However, at present it isn't obvious that C++ compilers are implementing this policy.

What's the difference between T, volatile T, and std::atomic<T>?

The test seems to indicate that the sample is correct but it is not. Similar code could easily end up in production and might even run flawlessly for years.

We can start off by compiling the sample with -O3. Now, the sample hangs indefinitely. (The default is -O0, no optimization / debug-consistency, which is somewhat similar to making every variable volatile; that is why the test didn't reveal the code as unsafe.)
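The sample under test isn't reproduced above; judging by the std::atomic version shown later, it was presumably the same program with a plain int64_t (a reconstruction):

#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
    int64_t shared = 0;
    std::thread thread([&shared]() {
        shared = 42;
    });
    while (shared != 42) {  // data race on a non-atomic: UB
    }
    assert(shared == 42);
    thread.join();
    return 0;
}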

To get to the root cause, we have to inspect the generated assembly. First, the GCC 4.8.5 -O0 based x86_64 assembly corresponding to the un-optimized working binary:

// Thread B:
// shared = 42;
        movq    -8(%rbp), %rax
        movq    (%rax), %rax
        movq    $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
.L11:
        movq    -32(%rbp), %rax   # check shared every iteration
        cmpq    $42, %rax
        jne     .L11

Thread B executes a simple store of the value 42 in shared.
Thread A reads shared for each loop iteration until the comparison indicates equality.

Now, we compare that to the -O3 outcome:

// Thread B:
// shared = 42;
        movq    8(%rdi), %rax
        movq    $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
        cmpq    $42, (%rsp)   # check shared once
        je      .L87          # and skip the infinite loop or not
.L88:
        jmp     .L88          # infinite loop
.L87:

Optimizations associated with -O3 replaced the loop with a single comparison and, if not equal, an infinite loop to match the expected behavior. With GCC 10.2, the loop is optimized out entirely. (Unlike in C, infinite loops with no side effects or volatile accesses are undefined behaviour in C++.)

The problem is that the compiler and its optimizer are not aware of the implementation's concurrency implications. Consequently, the conclusion needs to be that shared cannot change in thread A - the loop is equivalent to dead code. (Or to put it another way, data races are UB, and the optimizer is allowed to assume that the program doesn't encounter UB. If you're reading a non-atomic variable, that must mean nobody else is writing it. This is what allows compilers to hoist loads out of loops, and similarly sink stores, which are very valuable optimizations for the normal case of non-shared variables.)

The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that may be volatile. While the actual meaning of volatile varies across compilers and guarantees, if any, are compiler-specific, the general consensus is that volatile prevents the compiler from optimizing volatile accesses in terms of register-based caching. This is essential for low-level code that interacts with hardware and has its place in concurrent programming, albeit with a downward trend due to the introduction of std::atomic.

With volatile int64_t shared, the generated instructions change as follows:

// Thread B:
// shared = 42;
        movq    24(%rdi), %rax
        movq    $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
.L87:
        movq    8(%rsp), %rax
        cmpq    $42, %rax
        jne     .L87

The loop cannot be eliminated anymore as it must be assumed that shared changed even though there's no evidence of that in the form of code. As a result, the sample now works with -O3.

If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant for lock-free code are what makes std::atomic essential: memory operation atomicity and memory order.

To build the case for load/store atomicity, we review the generated assembly compiled with GCC 4.8.5 -O3 -m32 (the 32-bit version) for volatile int64_t shared:

// Thread B:
// shared = 42;
        movl    4(%esp), %eax
        movl    12(%eax), %eax
        movl    $42, (%eax)
        movl    $0, 4(%eax)

// Thread A:
// while (shared != 42) {
// }
.L88:                            # do {
        movl    40(%esp), %eax
        movl    44(%esp), %edx
        xorl    $42, %eax
        movl    %eax, %ecx
        orl     %edx, %ecx
        jne     .L88             # } while (shared ^ 42 != 0);

For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, this means that another thread can see a partial result of the 64-bit memory operation, leaving room for unexpected inconsistencies that might not cause problems 100 percent of the time, but can occur at random and the probability of occurrence is heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn't affect other compilers and might not hold true for all supported platforms.

To guard against partial loads/stores in all circumstances and across all compilers and supported platforms, std::atomic can be employed. Let's review how std::atomic affects the generated assembly. The updated sample:

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
    std::atomic<int64_t> shared;
    std::thread thread([&shared]() {
        shared.store(42, std::memory_order_relaxed);
    });
    while (shared.load(std::memory_order_relaxed) != 42) {
    }
    assert(shared.load(std::memory_order_relaxed) == 42);
    thread.join();
    return 0;
}

The generated 32-bit assembly based on GCC 10.2 (-O3: https://godbolt.org/z/8sPs55nzT):

// Thread B:
// shared.store(42, std::memory_order_relaxed);
        movl    $42, %ecx
        xorl    %ebx, %ebx
        subl    $8, %esp
        movl    16(%esp), %eax
        movl    4(%eax), %eax    # function arg: pointer to shared
        movl    %ecx, (%esp)
        movl    %ebx, 4(%esp)
        movq    (%esp), %xmm0    # 8-byte reload
        movq    %xmm0, (%eax)    # 8-byte store to shared
        addl    $8, %esp

// Thread A:
// while (shared.load(std::memory_order_relaxed) != 42) {
// }
.L9:                             # do {
        movq    -16(%ebp), %xmm1 # 8-byte load from shared
        movq    %xmm1, -32(%ebp) # copy to a dummy temporary
        movl    -32(%ebp), %edx
        movl    -28(%ebp), %ecx  # and scalar reload
        movl    %edx, %eax
        movl    %ecx, %edx
        xorl    $42, %eax
        orl     %eax, %edx
        jne     .L9              # } while (shared.load() ^ 42 != 0);

To guarantee atomicity for loads and stores, the compiler emits an 8-byte SSE2 movq instruction (to/from the bottom half of a 128-bit SSE register). Additionally, the assembly shows that the loop remains intact even though volatile was removed.

By using std::atomic in the sample, it is guaranteed that

  • std::atomic loads and stores are not subject to register-based caching
  • std::atomic loads and stores do not allow partial values to be observed

The C++ standard doesn't talk about registers at all, but it does say:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

While that leaves room for interpretation, caching std::atomic loads across iterations, as triggered in our sample (without volatile or atomic), would clearly be a violation - the store might never become visible. Current compilers don't even optimize atomics within one block, like 2 accesses in the same iteration.

On x86, naturally-aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes without special instructions. That's why GCC is able to use movq.

atomic<T> with a large T may not be supported directly by hardware, in which case the compiler can fall back to using a mutex.

A large T (e.g. the size of 2 registers) on some platforms might require an atomic RMW operation (if the compiler doesn't simply fall back to locking); these are sometimes provided at a larger size than the largest efficient pure load / pure store that's guaranteed atomic (e.g. on x86-64, lock cmpxchg16b, or an ARM ldrexd/strexd retry loop). Single-instruction atomic RMWs (like x86 uses) internally involve a cache-line lock or a bus lock. For example, older versions of clang -m32 for x86 will use lock cmpxchg8b instead of movq for an 8-byte pure load or pure store.
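A quick way to check what you're getting (a sketch; the result depends on the compiler, flags such as -mcx16, and libatomic support):

#include <atomic>
#include <cstdint>
#include <iostream>

struct Pair {
    int64_t a, b;  // 16 bytes: wider than the widest plain load/store on x86-64
};

int main() {
    std::atomic<Pair> p{};
    // Prints 0 (lock-based) or 1 (e.g. via lock cmpxchg16b), target-dependent.
    std::cout << p.is_lock_free() << '\n';
}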

What's the second aspect mentioned above, and what does std::memory_order_relaxed mean?
Both the compiler and the CPU can reorder memory operations to optimize efficiency. The primary constraint on reordering is that all loads and stores must appear to have been executed in the order given by the code (program order) from the point of view of the executing thread itself. Therefore, in the case of inter-thread communication, the memory order must be taken into account to establish the required ordering despite reordering attempts. The required memory order can be specified for std::atomic loads and stores. std::memory_order_relaxed does not impose any particular order.
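As an illustration of a non-relaxed order (a minimal sketch with hypothetical names), a release/acquire pairing is what makes a plain data write visible to another thread:

#include <atomic>
#include <cstdint>

int64_t data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                     // plain store
    ready.store(true, std::memory_order_release);  // publishes everything above
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
    }
    // Here, data == 42 is guaranteed to be visible.
}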

Mutual exclusion primitives enforce a specific memory order (acquire-release order) so that memory operations stay inside the lock scope and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, with locks, all the aspects raised here are addressed simply by using the locking facility. As soon as you leave the comfort locks provide, you have to be mindful of the consequences and of the factors that affect concurrency correctness.

Being as explicit as possible about inter-thread communication is a good starting point so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) to volatile T (and, of course, T). Also, whenever possible, prefer not to roll your own lock-free code to reduce code complexity and maximize the probability of correctness.

Can std::atomic cancel out increments with decrements?

I believe it can be optimized, unless declared volatile. The reason is that for any schedule that interleaves some other thread between the increment and the decrement, there is also a valid schedule in which no thread runs in between; the as-if rule lets the implementation always pick that schedule, so no other thread ever observes the intermediate value.
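A minimal sketch of the idea (consistent with the num++; num-- discussion earlier on this page):

#include <atomic>

std::atomic<int> num{0};

void touch() {
    num.fetch_add(1, std::memory_order_seq_cst);
    num.fetch_sub(1, std::memory_order_seq_cst);
    // Under the as-if rule, a compiler could collapse this pair to nothing
    // but a full barrier: there is always a valid execution in which no
    // other thread observes the intermediate value.
}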


