Atomic Double Floating Point or Sse/Avx Vector Load/Store on X86_64

Atomic double floating point or SSE/AVX vector load/store on x86_64

C++ doesn't support something like lock-free std::atomic<double>

Actually, C++11 std::atomic<double> is lock-free on typical C++ implementations, and does expose nearly everything you can do in asm for lock-free programming with float/double on x86 (e.g. load, store, and CAS are enough to implement anything: Why isn't atomic double fully implemented). Current compilers don't always compile atomic<double> efficiently, though.

C++11 std::atomic doesn't have an API for Intel's transactional-memory extensions (TSX) (for FP or integer). TSX could be a game-changer especially for FP / SIMD, since it would remove all overhead of bouncing data between xmm and integer registers. If the transaction doesn't abort, whatever you just did with double or vector loads/stores happens atomically.

Some non-x86 hardware supports atomic add for float/double, and C++ p0020 is a proposal to add fetch_add and operator+= / -= template specializations to C++'s std::atomic<float> / <double>.

Hardware with LL/SC atomics instead of x86-style memory-destination instruction, such as ARM and most other RISC CPUs, can do atomic RMW operations on double and float without a CAS, but you still have to get the data from FP to integer registers because LL/SC is usually only available for integer regs, like x86's cmpxchg. However, if the hardware arbitrates LL/SC pairs to avoid/reduce livelock, it would be significantly more efficient than with a CAS loop in very-high-contention situations. If you've designed your algorithms so contention is rare, there's maybe only a small code-size difference between an LL/add/SC retry-loop for fetch_add vs. a load + add + LL/SC CAS retry loop.

x86 natually-aligned loads and stores are atomic up to 8 bytes, even x87 or SSE. (For example movsd xmm0, [some_variable] is atomic, even in 32-bit mode). In fact, gcc uses x87 fild/fistp or SSE 8B loads/stores to implement std::atomic<int64_t> load and store in 32-bit code.

Ironically, compilers (gcc7.1, clang4.0, ICC17, MSVC CL19) do a bad job in 64-bit code (or 32-bit with SSE2 available), and bounce data through integer registers instead of just doing movsd loads/stores directly to/from xmm regs (see it on Godbolt):

#include <atomic>
std::atomic<double> ad;

void store(double x){
    ad.store(x, std::memory_order_release);
}
//  gcc7.1 -O3 -mtune=intel:
//    movq    rax, xmm0               # ALU xmm->integer
//    mov     QWORD PTR ad[rip], rax
//    ret

double load(){
    return ad.load(std::memory_order_acquire);
}
//    mov     rax, QWORD PTR ad[rip]
//    movq    xmm0, rax
//    ret

Without -mtune=intel, gcc likes to store/reload for integer->xmm. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and related bugs I reported. This is a poor choice even for -mtune=generic. AMD has high latency for movq between integer and vector regs, but it also has high latency for a store/reload. With the default -mtune=generic, load() compiles to:

//    mov     rax, QWORD PTR ad[rip]
//    mov     QWORD PTR [rsp-8], rax   # store/reload integer->xmm
//    movsd   xmm0, QWORD PTR [rsp-8]
//    ret

Moving data between xmm and integer register brings us to the next topic:

Atomic read-modify-write (like fetch_add) is another story: there is direct support for integers with stuff like lock xadd [mem], eax (see Can num++ be atomic for 'int num'? for more details). For other things, like atomic<struct> or atomic<double>, the only option on x86 is a retry loop with cmpxchg (or TSX).

Atomic compare-and-swap (CAS) is usable as a lock-free building-block for any atomic RMW operation, up to the max hardware-supported CAS width. On x86-64, that's 16 bytes with cmpxchg16b (not available on some first-gen AMD K8, so for gcc you have to use -mcx16 or -march=whatever to enable it).

gcc makes the best asm possible for exchange():

double exchange(double x) {
    return ad.exchange(x); // seq_cst
}
    movq    rax, xmm0
    xchg    rax, QWORD PTR ad[rip]
    movq    xmm0, rax
    ret
  // in 32-bit code, compiles to a cmpxchg8b retry loop


void atomic_add1() {
    // ad += 1.0;           // not supported
    // ad.fetch_or(-0.0);   // not supported
    // have to implement the CAS loop ourselves:

    double desired, expected = ad.load(std::memory_order_relaxed);
    do {
        desired = expected + 1.0;
    } while( !ad.compare_exchange_weak(expected, desired) );  // seq_cst
}

    mov     rax, QWORD PTR ad[rip]
    movsd   xmm1, QWORD PTR .LC0[rip]
    mov     QWORD PTR [rsp-8], rax    # useless store
    movq    xmm0, rax
    mov     rax, QWORD PTR [rsp-8]    # and reload
.L8:
    addsd   xmm0, xmm1
    movq    rdx, xmm0
    lock cmpxchg    QWORD PTR ad[rip], rdx
    je      .L5
    mov     QWORD PTR [rsp-8], rax
    movsd   xmm0, QWORD PTR [rsp-8]
    jmp     .L8
.L5:
    ret

compare_exchange always does a bitwise comparison, so you don't need to worry about the fact that negative zero (-0.0) compares equal to +0.0 in IEEE semantics, or that NaN is unordered. This could be an issue if you try to check that desired == expected and skip the CAS operation, though. For new enough compilers, memcmp(&expected, &desired, sizeof(double)) == 0 might be a good way to express a bitwise comparison of FP values in C++. Just make sure you avoid false positives; false negatives will just lead to an unneeded CAS.

Hardware-arbitrated lock or [mem], 1 is definitely better than having multiple threads spinning on lock cmpxchg retry loops. Every time a core gets access to the cache line but fails its cmpxchg is wasted throughput compared to integer memory-destination operations that always succeed once they get their hands on a cache line.

Some special cases for IEEE floats can be implemented with integer operations. e.g. absolute value of an atomic<double> could be done with lock and [mem], rax (where RAX has all bits except the sign bit set). Or force a float / double to be negative by ORing a 1 into the sign bit. Or toggle its sign with XOR. You could even atomically increase its magnitude by 1 ulp with lock add [mem], 1. (But only if you can be sure it wasn't infinity to start with... nextafter() is an interesting function, thanks to the very cool design of IEEE754 with biased exponents that makes carry from mantissa into exponent actually work.)

There's probably no way to express this in C++ that will let compilers do it for you on targets that use IEEE FP. So if you want it, you might have to do it yourself with type-punning to atomic<uint64_t> or something, and check that FP endianness matches integer endianness, etc. etc. (Or just do it only for x86. Most other targets have LL/SC instead of memory-destination locked operations anyway.)

can't yet support something like atomic AVX/SSE vector because it's CPU-dependent

Correct. There's no way to detect when a 128b or 256b store or load is atomic all the way through the cache-coherency system. (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490). Even a system with atomic transfers between L1D and execution units can get tearing between 8B chunks when transferring cache-lines between caches over a narrow protocol. Real example: a multi-socket Opteron K10 with HyperTransport interconnects appears to have atomic 16B loads/stores within a single socket, but threads on different sockets can observe tearing.

But if you have a shared array of aligned doubles, you should be able to use vector loads/stores on them without risk of "tearing" inside any given double.

Per-element atomicity of vector load/store and gather/scatter?

I think it's safe to assume that an aligned 32B load/store is done with non-overlapping 8B or wider loads/stores, although Intel doesn't guarantee that. For unaligned ops, it's probably not safe to assume anything.

If you need a 16B atomic load, your only option is to lock cmpxchg16b, with desired=expected. If it succeeds, it replaces the existing value with itself. If it fails, then you get the old contents. (Corner-case: this "load" faults on read-only memory, so be careful what pointers you pass to a function that does this.) Also, the performance is of course horrible compared to actual read-only loads that can leave the cache line in Shared state, and that aren't full memory barriers.

16B atomic store and RMW can both use lock cmpxchg16b the obvious way. This makes pure stores much more expensive than regular vector stores, especially if the cmpxchg16b has to retry multiple times, but atomic RMW is already expensive.

The extra instructions to move vector data to/from integer regs are not free, but also not expensive compared to lock cmpxchg16b.

# xmm0 -> rdx:rax, using SSE4
movq   rax, xmm0
pextrq rdx, xmm0, 1


# rdx:rax -> xmm0, again using SSE4
movq   xmm0, rax
pinsrq xmm0, rdx, 1

In C++11 terms:

atomic<__m128d> would be slow even for read-only or write-only operations (using cmpxchg16b), even if implemented optimally. atomic<__m256d> can't even be lock-free.

alignas(64) atomic<double> shared_buffer[1024]; would in theory still allow auto-vectorization for code that reads or writes it, only needing to movq rax, xmm0 and then xchg or cmpxchg for atomic RMW on a double. (In 32-bit mode, cmpxchg8b would work.) You would almost certainly not get good asm from a compiler for this, though!

You can atomically update a 16B object, but atomically read the 8B halves separately. (I think this is safe with respect to memory-ordering on x86: see my reasoning at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835).

However, compilers don't provide any clean way to express this. I hacked up a union type-punning thing that works for gcc/clang: How can I implement ABA counter with c++11 CAS?. But gcc7 and later won't inline cmpxchg16b, because they're re-considering whether 16B objects should really present themselves as "lock-free". (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html).

Why is the compiler dedicating a memory location for storing a redundant variable in this case?

Is there any specific reason to store that variable in stack rather than in registers?

At the end of the day, atomics exist for inter-thread communication, and you can't share a register across threads.

You might think that gcc could detect local variable atomics that are never shared with anything else and demote them to a regular variable. However:

I personally don't see what this brings to the table since you shouldn't be using atomics in these cases.
The standard appears to prohibit such an optimization anyways:

intro.races-14

The value of an atomic object M, as determined by evaluation B, shall be the value stored by some side effect A that modifies M, where B does not happen before A.

The key word here is side effect, which means that the modification of the actual memory storage is not up for debate. It HAS to happen.

As far as the revised question goes:

But in the above code, eax is invariant

It's unfortunately not. cmpxchg both reads and writes to eax, so it needs to be reassigned at each iteration of the loop.

The loop is needed because in order to perform a += 1 on an atomic float. The compiler has to keep trying until it manages to do the read-increment-write sequence fast enough that the atomic doesn't change in the meantime.

How to perform basic operations with std::atomic when the type is not Integral?

As a rule, the C++ standard library tries to provide only operations that can be implemented efficiently. For std::atomic, that means operations that can be performed lock-free in an instruction or two on "common" architectures. "Common" architectures have atomic fetch-and-add instructions for integers, but not for floating point types.

If you want to implement math operations for atomic floating point types, you'll have to do so yourself with a CAS (compare and swap) loop (Live at Coliru):

std::atomic<double> foo{0};

void add_to_foo(double bar) {
  auto current = foo.load();
  while (!foo.compare_exchange_weak(current, current + bar))
    ;
}

Why isn't atomic double fully implemented

std::atomic<double> is supported in the sense that you can create one in your program and it will work under the rules of C++11. You can perform loads and stores with it and do compare-exchange and the like.

The standard specifies that arithmetic operations (+, *, +=, &, etc.) are only provided for atomics of "integral types", so an std::atomic<double> won't have any of those operations defined.

My understanding is that, because there is little support for fetch-add or any other atomic arithmetic operations for floating point types in hardware in use today, the C++ standard doesn't provide the operators for them because they would have to be implemented inefficiently.

(edit). As an aside, std::atomic<double> in VS2015RC is lock-free.

Does x86-SSE-instructions have an automatic release-acquire order?

Here is an excerpt from Intel's Software Developers Manual, volume 3, section 8.2.2 (the edition 325384-052US of September 2014):

Reads are not reordered with other reads.

Writes are not reordered with older reads.

Writes to memory are not reordered with other writes, with the following exceptions:

writes executed with the CLFLUSH instruction;

streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and

string operations (see Section 8.2.4.1).

Reads may be reordered with older writes to different locations but not with older writes to the same location.

Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

Reads cannot pass earlier LFENCE and MFENCE instructions.

Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.

LFENCE instructions cannot pass earlier reads.

SFENCE instructions cannot pass earlier writes.

MFENCE instructions cannot pass earlier reads or writes.

The first three bullets describe the release-acquire ordering, and the exceptions are explicitly listed there. As you might see, only cacheability control instructions (MOVNT*) are in the exception list, while the rest of SSE/SSE2 and other vector instructions obey to the general memory ordering rules, and do not require use of [LSM]FENCE.

Atomic Double Floating Point or Sse/Avx Vector Load/Store on X86_64