Can and Does the Compiler Optimize Out Two Atomic Loads


Can the compiler optimize away atomic loads?

Your implementation of run1() can be safely optimized to

void run1() {
    auto a = atomic_var.load(std::memory_order_relaxed);
    auto b = a;
    // Some code using a and b;
}

In the original program the two loads could possibly be adjacent to each other in the total order of accesses on atomic_var every time run1() is called. In that case the adjacent load() operations would return the same result.

Since that possibility cannot be excluded, the compiler is allowed to optimize away the second load(). This can be done for any memory order argument, not just for relaxed atomics.

For run2() it depends. You didn't specify /*some conditions*/. If there is something that might have a visible side effect on the atomic variable (like an opaque function call or an access to a volatile variable), then the second load cannot be optimized away. Otherwise it might be possible.
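As a hedged illustration (the question's run2() is not shown here, so this is only an assumed shape, with atomic_var taken to be a std::atomic<int>), an opaque call between the two loads is the kind of thing that keeps both loads alive:

#include <atomic>

extern std::atomic<int> atomic_var;
void opaque_function();   // defined in another translation unit

void run2() {
    auto a = atomic_var.load(std::memory_order_relaxed);
    opaque_function();    // might store to atomic_var, so the loads can't be merged
    auto b = atomic_var.load(std::memory_order_relaxed);
    // Some code using a and b;
}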

Does the compiler optimize out two atomic loads?

That depends on your compiler, possibly on the compiler options you passed in, and possibly on your platform. There is some debate about whether compilers should optimize atomics at all. There is N4455 No Sane Compiler Would Optimize Atomics and this video as a start on the topic.

GCC and clang don't optimize the two load() operations into one at the moment.

Can atomic loads be merged in the C++ memory model?

Yes, because we cannot observe the difference!

An implementation is allowed to turn your snippet into the following (pseudo-implementation).

int __loaded_foo = foo;

int x = __loaded_foo;
int y = __loaded_foo;

The reason is that there is no way for you to observe the difference between the above, and two separate loads of foo given the guarantees of sequential-consistency.

Note: It is not just the compiler that can make such an optimization, the processor can simply reason that there is no way in which you can observe the difference and load the value of foo once — even though the compiler might have asked it to do it twice.





Explanation

Given a thread that keeps on updating foo in an incremental fashion, what you are guaranteed is that y will have either the same, or a later written value, when compared to the contents of x.

// thread 1 - The Writer
while (true) {
    foo += 1;
}

// thread 2 - The Reader
while (true) {
    int x = foo;
    int y = foo;

    assert(y >= x); // will never fire, unless UB (foo has reached max value)
}

Imagine the writing thread for some reason pauses its execution on every iteration (because of a context-switch or other implementation defined reason); there is no way in which you can prove that this is what is causing both x and y to have the same value, or if it is because of a "merge optimization".


In other words, we have two potential outcomes given the code in this section:

  1. No new value is written to foo between the two reads (x == y).
  2. A new value is written to foo between the two reads (x < y).

Since either of the two can happen, an implementation is free to narrow down the scope and simply always execute one of them; we can in no way observe the difference.





What does the Standard say?

An implementation can make whatever changes it wants as long as we cannot observe any difference between the behavior which we expressed, and the behavior during execution.

This is covered in [intro.execution]p1:

The semantic descriptions in this International Standard define a
parameterized nondeterministic abstract machine. This International
Standard places no requirement on the structure of conforming
implementations. In particular, they need not copy or emulate the
structure of the abstract machine. Rather, conforming implementations
are required to emulate (only) the observable behavior of the abstract
machine as explained below.

Another section which makes it even more clear [intro.execution]p5:

A conforming implementation executing a well-formed program shall
produce the same observable behavior as one of the possible executions
of the corresponding instance of the abstract machine with the same
program and the same input.

Further Reading:

  • What exactly is the "as-if rule"?





What about polling in a loop?

// initial state
std::atomic<int> foo = 0;

// thread 1
while (true) {
    if (foo)
        break;
}

// thread 2
foo = 1;

Question: Given the reasoning in the previous sections, could an implementation simply read foo once in thread 1, and then never break out of the loop even if thread 2 writes to foo?

The answer: no.

In a sequentially-consistent environment we are guaranteed that a write to foo in thread 2 will become visible in thread 1; this means that when that write has happened, thread 1 must observe this change of state.

Note: An implementation can turn two reads into a single one because we cannot observe the difference (one fence is just as effective as two), but it cannot completely disregard a read that exists by itself.

Note: The contents of this section are guaranteed by [atomics.order]p3-4.





What if I really want to prevent this form of "optimization"?

If you would like to force the implementation to actually read the value of some variable at every point where you have written it you should look into usage of volatile (note that this in no way enhances thread-safety).
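A minimal sketch of that approach, assuming a volatile std::atomic<int> (names are illustrative): each access written in the source is observable behaviour that must actually be performed, so the two loads below cannot be merged.

#include <atomic>

volatile std::atomic<int> vfoo{0};

int poll_twice() {
    int x = vfoo.load();   // must be performed
    int y = vfoo.load();   // must also be performed; cannot be merged with the first
    return x + y;
}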

But in practice compilers don't optimize atomics, and the standards group has recommended against using volatile atomic for this kind of reason until the dust settles on this issue. See

  • http://wg21.link/n4455
  • http://wg21.link/p0062
  • Why don't compilers merge redundant std::atomic writes?
  • and a duplicate of this question, Can and does the compiler optimize out two atomic loads?

Optimizations around atomic loads/stores in C++


Q1

Generally, yes. Any load or store that follows (in program order) an acquire load must not become visible before it.

Here is an example where it matters:

#include <atomic>
#include <thread>
#include <iostream>

std::atomic<int> x{0};
std::atomic<bool> finished{false};
int xval;
bool good;

void reader() {
    xval = x.load(std::memory_order_relaxed);
    finished.store(true, std::memory_order_release);
}

void writer() {
    good = finished.load(std::memory_order_acquire);
    x.store(42, std::memory_order_relaxed);
}

int main() {
    std::thread t1(reader);
    std::thread t2(writer);
    t1.join();
    t2.join();
    if (good) {
        std::cout << xval << std::endl;
    } else {
        std::cout << "too soon" << std::endl;
    }
    return 0;
}

Try on godbolt

This program is free of UB and must print either 0 or too soon. If the writer store of 42 to x could be reordered before the load of finished, then it would be possible that the reader load of x returns 42 and the writer load of finished returns true, in which case the program would improperly print 42.

Q2

Yes, it would be okay for a compiler to delete an atomic load whose value is never used, since there is no way for a conforming program to detect whether the load was done or not. However, current compilers generally don't do such optimizations, partly out of an abundance of caution (optimizations on atomics are hard to get right and bugs can be very subtle), and partly to support programmers writing implementation-dependent code that can detect, via non-standard features, whether the load was done.
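For concreteness, a tiny sketch of the kind of dead load meant here (the question's own snippet isn't shown, so this is only an assumed shape): the result is unused, so the as-if rule would allow dropping the load, yet compilers typically still emit it.

#include <atomic>

std::atomic<int> counter{0};

void touch() {
    (void)counter.load(std::memory_order_seq_cst);   // value never used
}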

Q3

Yes, this reordering is perfectly legal, and real-world architectures will do it. An acquire barrier is only one way.

Q4

Yes, this is also legal. If a,b are not atomic, and some other thread is reading them concurrently, then the code has a data race and is UB, so it is okay if the other thread observes the writes having happened in the wrong order (or summons nasal demons). (If they are atomic and you are doing relaxed stores, then you can't get nasal demons, but you can still observe the stores out of order; there is no happens-before relationship mandating the contrary.)
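A hedged illustration of the Q4 situation (assuming a and b are plain ints, since the original snippet isn't shown): without any synchronization, a concurrent reader of these variables races with the writer, so the implementation may legally make the stores visible in either order.

int a = 0, b = 0;

void publish() {
    a = 1;   // may become visible to another thread after the store to b
    b = 2;
}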

Optimization comparison

Your f versus g example is not really a fair comparison: in g, the load of the non-atomic variable x has no side effects and its value is not used, so the compiler omits it altogether. As mentioned above, the compiler doesn't omit the unnecessary atomic load of x in f.

As to why the compilers don't sink the first accesses to a and b past the acquire load: I believe it's simply a missed optimization. Again, most compilers deliberately don't try to do all possible legal optimizations with atomics. However, you could note that on ARM64, for instance, the load of x in f compiles to ldar, which the CPU can definitely reorder with earlier plain loads and stores.

Why don't compilers merge redundant std::atomic writes?

The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:

y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code

The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data race bug, but only the garden-variety bug kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables). A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)

Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.
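In this example, the folded result permitted by the as-if rule would simply be:

y.store(3, order); // one store of the final value; this matches a legal execution of the original three stores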

It doesn't depend on the target architecture or hardware; just as compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.


So why don't compilers do this optimization?

It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.

The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
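A sketch of that progress-bar pattern (names here are illustrative, not from the question): if the compiler sank the stores out of the loop and coalesced them, an observer polling progress would see 0 for the whole run and then jump straight to 100.

#include <atomic>

std::atomic<int> progress{0};
void do_chunk(int i);   // assumed work function

void work(int total) {
    for (int i = 0; i < total; ++i) {
        do_chunk(i);
        progress.store(i * 100 / total, std::memory_order_relaxed);  // intermediate values readers want to see
    }
    progress.store(100, std::memory_order_relaxed);
}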

There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)

Compiler-writers have correctly noticed that programmers expect that an atomic store will actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) i.e. It violates the principle of least surprise.

However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
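A sketch of the shared_ptr case (illustrative, assuming a callee that takes its argument by value): each copy does an atomic ref-count increment and the parameter's destructor a decrement, per iteration, even though the count provably never reaches zero inside the loop; a compiler allowed to optimize atomic RMWs could hoist that overhead out.

#include <memory>

void consume(std::shared_ptr<int> sp);   // assumed callee taking a copy

void pump(std::shared_ptr<int> sp, int n) {
    for (int i = 0; i < n; ++i) {
        consume(sp);   // atomic ref-count ++ and -- every iteration
    }
}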

Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier against runtime and compile-time reordering, even if it no longer touched the memory at num.


Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:

  • http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
  • http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?

See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)


Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.

Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).

Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.

I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].

wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:

if (x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0);    // release a lock before a long-running loop
    for () {...}   // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.

Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.
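Sketched in the same pseudocode style, the transformation being warned about would look like this: one store, after the if/else, and therefore after the long-running loop in the else branch.

if (x) {
    foo();
} else {
    bar();
    for () {...}   // long-running loop, no atomics or volatiles
}
y.store(0);        // single merged store, possibly long after the "lock release"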

volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.


Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). This is not sufficient, because N4455 says LLVM already implements or could easily implement several of the optimizations it mentioned.

The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.

Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).

What's the difference between T, volatile T, and std::atomic<T>?

The test seems to indicate that the sample is correct but it is not. Similar code could easily end up in production and might even run flawlessly for years.

We can start off by compiling the sample with -O3. Now, the sample hangs indefinitely. (The default is -O0: no optimization, for debug consistency, which is somewhat similar to making every variable volatile - that is why the test didn't reveal the code as unsafe.)
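The sample's source is not reproduced in this answer; reconstructed from the assembly below and from the std::atomic version shown later, it is approximately the following (treat it as an approximation of the broken original):

#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
    int64_t shared = 0;
    std::thread thread([&shared]() {
        shared = 42;
    });
    while (shared != 42) {   // unsynchronized read of a plain variable: a data race
    }
    assert(shared == 42);
    thread.join();
    return 0;
}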

To get to the root cause, we have to inspect the generated assembly. First, the GCC 4.8.5 -O0 based x86_64 assembly corresponding to the un-optimized working binary:

// Thread B:
// shared = 42;
movq -8(%rbp), %rax
movq (%rax), %rax
movq $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
.L11:
movq -32(%rbp), %rax # Check shared every iteration
cmpq $42, %rax
jne .L11

Thread B executes a simple store of the value 42 in shared.
Thread A reads shared for each loop iteration until the comparison indicates equality.

Now, we compare that to the -O3 outcome:

// Thread B:
// shared = 42;
movq 8(%rdi), %rax
movq $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
cmpq $42, (%rsp) # check shared once
je .L87 # and skip the infinite loop or not
.L88:
jmp .L88 # infinite loop
.L87:

Optimizations associated with -O3 replaced the loop with a single comparison and, if not equal, an infinite loop to match the expected behavior. With GCC 10.2, the loop is optimized out entirely (unlike in C, infinite loops with no side effects or volatile accesses are undefined behaviour in C++).

The problem is that the compiler and its optimizer are not aware of the implementation's concurrency implications. Consequently, the conclusion needs to be that shared cannot change in thread A - the loop is equivalent to dead code. (Or to put it another way, data races are UB, and the optimizer is allowed to assume that the program doesn't encounter UB. If you're reading a non-atomic variable, that must mean nobody else is writing it. This is what allows compilers to hoist loads out of loops, and similarly sink stores, which are very valuable optimizations for the normal case of non-shared variables.)

The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that may be volatile. While the actual meaning of volatile varies across compilers and guarantees, if any, are compiler-specific, the general consensus is that volatile prevents the compiler from optimizing volatile accesses in terms of register-based caching. This is essential for low-level code that interacts with hardware and has its place in concurrent programming, albeit with a downward trend due to the introduction of std::atomic.

With volatile int64_t shared, the generated instructions change as follows:

// Thread B:
// shared = 42;
movq 24(%rdi), %rax
movq $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
.L87:
movq 8(%rsp), %rax
cmpq $42, %rax
jne .L87

The loop cannot be eliminated anymore as it must be assumed that shared changed even though there's no evidence of that in the form of code. As a result, the sample now works with -O3.

If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant for lock-free code are what makes std::atomic essential: memory operation atomicity and memory order.

To build the case for load/store atomicity, we review the generated assembly compiled with GCC 4.8.5 -O3 -m32 (the 32-bit version) for volatile int64_t shared:

// Thread B:
// shared = 42;
movl 4(%esp), %eax
movl 12(%eax), %eax
movl $42, (%eax)
movl $0, 4(%eax)

// Thread A:
// while (shared != 42) {
// }
.L88: # do {
movl 40(%esp), %eax
movl 44(%esp), %edx
xorl $42, %eax
movl %eax, %ecx
orl %edx, %ecx
jne .L88 # } while(shared ^ 42 != 0);

For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, this means that another thread can see a partial result of the 64-bit memory operation, leaving room for unexpected inconsistencies that might not cause problems 100 percent of the time, but can occur at random and the probability of occurrence is heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn't affect other compilers and might not hold true for all supported platforms.

To guard against partial loads/stores in all circumstances and across all compilers and supported platforms, std::atomic can be employed. Let's review how std::atomic affects the generated assembly. The updated sample:

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
    std::atomic<int64_t> shared{0};   // initialized, so the first load doesn't read an indeterminate value
    std::thread thread([&shared]() {
        shared.store(42, std::memory_order_relaxed);
    });
    while (shared.load(std::memory_order_relaxed) != 42) {
    }
    assert(shared.load(std::memory_order_relaxed) == 42);
    thread.join();
    return 0;
}

The generated 32-bit assembly based on GCC 10.2 (-O3: https://godbolt.org/z/8sPs55nzT):

// Thread B:
// shared.store(42, std::memory_order_relaxed);
movl $42, %ecx
xorl %ebx, %ebx
subl $8, %esp
movl 16(%esp), %eax
movl 4(%eax), %eax # function arg: pointer to shared
movl %ecx, (%esp)
movl %ebx, 4(%esp)
movq (%esp), %xmm0 # 8-byte reload
movq %xmm0, (%eax) # 8-byte store to shared
addl $8, %esp

// Thread A:
// while (shared.load(std::memory_order_relaxed) != 42) {
// }
.L9: # do {
movq -16(%ebp), %xmm1 # 8-byte load from shared
movq %xmm1, -32(%ebp) # copy to a dummy temporary
movl -32(%ebp), %edx
movl -28(%ebp), %ecx # and scalar reload
movl %edx, %eax
movl %ecx, %edx
xorl $42, %eax
orl %eax, %edx
jne .L9 # } while(shared.load() ^ 42 != 0);

To guarantee atomicity for loads and stores, the compiler emits an 8-byte SSE2 movq instruction (to/from the bottom half of a 128-bit SSE register). Additionally, the assembly shows that the loop remains intact even though volatile was removed.

By using std::atomic in the sample, it is guaranteed that

  • std::atomic loads and stores are not subject to register-based caching
  • std::atomic loads and stores do not allow partial values to be observed

The C++ standard doesn't talk about registers at all, but it does say:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

While that leaves room for interpretation, caching std::atomic loads across iterations, as triggered in our sample (without volatile or atomic), would clearly be a violation - the store might never become visible. Current compilers don't even optimize atomics within one block, such as two accesses in the same iteration.

On x86, naturally-aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes without special instructions. That's why GCC is able to use movq.

atomic<T> with a large T may not be supported directly by hardware, in which case the compiler can fall back to using a mutex.

A large T (e.g. the size of 2 registers) on some platforms might require an atomic RMW operation (if the compiler doesn't simply fall back to locking), which is sometimes provided with a larger size than the largest efficient pure-load / pure-store that's guaranteed atomic (e.g. x86-64 lock cmpxchg16b, or an ARM ldrexd/strexd retry loop). Single-instruction atomic RMWs (like x86 uses) internally involve a cache line lock or a bus lock. For example, older versions of clang -m32 for x86 will use lock cmpxchg8b instead of movq for an 8-byte pure-load or pure-store.

What's the second aspect mentioned above and what does std::memory_order_relaxed mean?
Both the compiler and the CPU can reorder memory operations to optimize efficiency. The primary constraint on reordering is that all loads and stores must appear, to the thread performing them, to have been executed in the order given by the code (program order). Therefore, in the case of inter-thread communication, the memory order must be taken into account to establish the required ordering despite any reordering attempts. The required memory order can be specified for std::atomic loads and stores; std::memory_order_relaxed does not impose any particular order.
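A minimal sketch of why the memory order matters (illustrative names): data is published with a release store and consumed with an acquire load, so the plain write to payload is guaranteed to be visible once ready is observed as true; with memory_order_relaxed on both sides, that visibility guarantee would be lost.

#include <atomic>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 123;                                   // plain write
    ready.store(true, std::memory_order_release);    // publish
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { // wait for publication
    }
    // payload is guaranteed to be 123 here
}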

Mutual exclusion primitives enforce a specific memory order (acquire-release order) so that memory operations stay within the lock scope and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, all the aspects raised here are addressed simply by using the locking facility. As soon as you step outside the comfort locks provide, you have to be mindful of the consequences and the factors that affect concurrency correctness.

Being as explicit as possible about inter-thread communication is a good starting point so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) to volatile T (and, of course, T). Also, whenever possible, prefer not to roll your own lock-free code to reduce code complexity and maximize the probability of correctness.

What is the difference between load/store relaxed atomic and normal variable?

The difference is that a normal load/store is not guaranteed to be tear-free, whereas a relaxed atomic read/write is. Also, the atomic guarantees that the compiler doesn't rearrange or optimise out memory accesses, in a similar fashion to what volatile guarantees.

(Pre-C++11, volatile was an essential part of rolling your own atomics. But now it's obsolete for that purpose. It does still work in practice but is never recommended: When to use volatile with multi threading? - essentially never.)

On most platforms it just happens that the architecture provides a tear-free load/store by default (for aligned int and long) so it works out the same in asm if loads and stores don't get optimized away. See Why is integer assignment on a naturally aligned variable atomic on x86? for example. In C++ it's up to you to express how the memory should be accessed in your source code instead of relying on architecture-specific features to make the code work as intended.

If you were hand-writing in asm, your source code would already nail down when values were kept in registers vs. loaded / stored to (shared) memory. In C++, telling the compiler when it can/can't keep values private is part of why std::atomic<T> exists.

If you read one article on this topic, take a look at the Preshing one here:
https://preshing.com/20130618/atomic-vs-non-atomic-operations/

Also try this presentation from CppCon 2017:
https://www.youtube.com/watch?v=ZQFzMfHIxng


Links for further reading:

  • Read a non-atomic variable, atomically?

  • https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering

  • Causing non-atomics to tear

  • https://lwn.net/Articles/793895/

  • What is the (slight) difference on the relaxing atomic rules? which includes a link to a Herb Sutter "atomic weapons" article which is also linked here:
    https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/


Also see Peter Cordes' linked article: https://electronics.stackexchange.com/q/387181

And a related one about the Linux kernel: https://lwn.net/Articles/793253/

No tearing is only part of what you get with std::atomic<T> - you also avoid data race undefined behaviour.

Why are loads required when using atomics

This answer is mainly for C and C++ as I am not directly familiar with atomics in many other languages, but I suspect they are similar.

It's true that many actual machines work this way, in some cases.


