Race Condition on X86


The problem can arise due to optimizations involving reordering of instructions. In other words, both processors can assign r1 and r2 before assigning variables x and y, if they find that this would yield better performance. This can be solved by adding a memory barrier, which would enforce the ordering constraint.
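
For illustration, here is a minimal C++ sketch of that fix (the names x, y, r1, r2 mirror the question's pseudocode; this assumes C++11 atomics are available). With the default memory_order_seq_cst, the compiler emits the necessary barrier (an mfence or a locked instruction on x86), so the store can no longer be reordered after the following load, and the r1 == 0 && r2 == 0 outcome is ruled out:

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void core1() {
    x.store(1);      // seq_cst store: includes a full barrier on x86
    r1 = y.load();   // cannot be observed before the store above
}

void core2() {
    y.store(1);
    r2 = x.load();
}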

To quote the slideshow you mentioned in your post:

Modern multicores/languages break sequential consistency.

Regarding the x86 architecture, the best resource to read is the Intel® 64 and IA-32 Architectures Software Developer's Manual (Chapter 8.2, Memory Ordering). Sections 8.2.1 and 8.2.2 describe the memory ordering implemented by Intel486, Pentium, Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors: a memory model called processor ordering, as opposed to the program ordering (strong ordering) of the older Intel386 architecture (where read and write instructions were always issued in the order they appeared in the instruction stream).

The manual describes many ordering guarantees of the processor-ordering memory model (such as Loads are not reordered with other loads, Stores are not reordered with other stores, Stores are not reordered with older loads, etc.), but it also describes the allowed reordering that causes the race condition in the OP's post:

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

On the other hand, if the original order of the instructions was switched:

shared variables
x = 0, y = 0

Core 1 Core 2
r1 = y; r2 = x;
x = 1; y = 1;

In this case, the processor guarantees that the outcome r1 = 1 and r2 = 1 is not allowed (due to the 8.2.3.3 Stores Are Not Reordered With Earlier Loads guarantee), meaning that those instructions would never be reordered in individual cores.

To compare this with different architectures, check out this article: Memory Ordering in Modern Microprocessors. You can see that Itanium (IA-64) does even more reordering than the IA-32 architecture:

Possible CPU reorderings for various architectures

Race condition when piping through x86-64 assembly program

TL;DR: If your program does two reads before cat can refill the pipe buffer, the 2nd read gets only 1 byte. That makes your program decide to exit prematurely.

That's the real bug. The other design choices that make this possible are performance problems, not correctness.


Your program stops after any short-read (one where the return value is less than the requested size), instead of waiting for EOF (read() == 0). This is a simplification that's sometimes safe for regular files, but not safe for anything else, especially not a TTY (terminal input), but also not for pipes or sockets. e.g. try running ./asmcat; it exits after you press return on one line, instead of waiting for control-D EOF.
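
A minimal C-style sketch of what the fixed loop should do (buffer size and error handling here are illustrative assumptions, not the OP's code): keep reading until read() returns 0, and treat a short read as normal rather than as a reason to stop.

#include <unistd.h>

// Copy infd to outfd until EOF; a short read just means "write what we got and read again".
void copy_fd(int infd, int outfd) {
    char buf[65536];
    for (;;) {
        ssize_t n = read(infd, buf, sizeof buf);
        if (n == 0) break;                       // EOF: the only correct stop condition
        if (n < 0) return;                       // read error (real code would inspect errno)
        for (ssize_t done = 0; done < n; ) {     // also handle short writes
            ssize_t w = write(outfd, buf + done, n - done);
            if (w <= 0) return;                  // write error
            done += w;
        }
    }
}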

Linux pipe buffers are by default only 64kiB (pipe(7) man page), 1 byte larger than the weird odd-sized buffer you're using. After cat's write fills the pipe buffer, your 65535-byte read leaves 1 byte remaining. If your program wins the race to read the pipe before cat can write again, it reads only 1 byte.
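
If you want to confirm the pipe-buffer size on your system (or enlarge it for testing), pipe(7) documents the Linux-specific fcntl operations; a quick sketch:

#include <fcntl.h>   // F_GETPIPE_SZ / F_SETPIPE_SZ are Linux-specific (glibc exposes them under _GNU_SOURCE, which g++ defines by default)
#include <cstdio>

void show_pipe_size(int fd) {
    int sz = fcntl(fd, F_GETPIPE_SZ);            // 65536 bytes by default
    std::printf("pipe buffer: %d bytes\n", sz);
    // fcntl(fd, F_SETPIPE_SZ, 1 << 20);         // optionally grow it (rounded up to a power of 2)
}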

Unfortunately, running under strace ./asmcat slows down the reads too much to observe a short-read, unless you also slow down cat or whatever other program to rate-limit the write side of your input pipe.

Rate-limited input for testing:

pv(1), the pipe-viewer, is handy for this, with its rate-limit -L option and a buffer-size limit (-B) so you can make sure its writes are smaller than 64k. (Doing a larger 64k write very infrequently might not always lead to short reads.) But if we just want short reads every time, running interactively, reading from a terminal, is even easier: strace ./asmcat

$ pv -L8K -B16K /tmp/random | strace ./orig_asmcat | wc -c
execve("./orig_asmcat", ["./orig_asmcat"], 0x7ffcd441f750 /* 55 vars */) = 0
brk(NULL) = 0x61c000
brk(0x62bfff) = 0x62bfff
read(0, "=head1 NAME\n\n=for comment Gener"..., 65535) = 819
write(1, "=head1 NAME\n\n=for comment Gener"..., 819) = 819
close(0) = 0
exit(0) = ?
+++ exited with 0 +++ # end of strace output
819 # wc output
819 B 0:00:00 [4.43KiB/s] [> ] 0% # pv's progress bar

With a bug-fixed asmcat, by contrast, we get the expected sequence of short reads and equal-sized writes. (See below for my version.)

execve("./asmcat", ["./asmcat"], 0x7ffd8c58f600 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment Gener"..., 65536) = 819
write(1, "=head1 NAME\n\n=for comment Gener"..., 819) = 819
read(0, "check if a\nnamed variable exists"..., 65536) = 819
write(1, "check if a\nnamed variable exists"..., 819) = 819


Code review

There are multiple wasted instructions, e.g. a mov that writes a register you never read again: you set EDI before a call, but then the function takes its arg in R12D instead of following the standard calling convention.

Reading argc, argv early instead of just leaving them on the stack until they're needed is similarly redundant.

.data is pointless: .set is an assemble-time constant. It doesn't matter what the current section is when you define it. You could also write it as MAX_READ_BYTES = 0xffff, more natural syntax for assemble-time constants.

You could allocate your buffer on the stack instead of with brk (it's only 64K - 1, and x86-64 Linux allows 8MiB stacks by default), in which case loading early could make sense. Or just use the BSS, e.g. .lcomm buf, 1<<16

It would be a good idea to make your buffer a power of 2, or at least a multiple of the page size (4k), for efficiency. If you use it to copy files, every read after the first one will start near the end of a page, instead of copying a whole number of 4k pages, so the kernel's copy_to_user (read) and copy_from_user (write) will be touching 17 pages of kernel memory per read/write instead of 16. The pagecache for the file data may not be in contiguous kernel addresses, so each separate 4k page takes some overhead to find, and start a separate memcpy for (rep movsb on modern CPUs with the ERMSB feature). Also for disk I/O, the kernel will have to buffer your writes back into aligned chunks of some multiple of the HW sector size and/or filesystem block size.

64KiB is clearly a good choice when reading from pipes, for the same reason this race was possible. Leaving 1 byte is obviously inefficient. Also, 64k is smaller than L2 cache sizes, so the copy to/from user-space (inside the kernel in your system calls) can re-read from L2 cache when you write again. But smaller sizes mean more system calls, and each system-call has significant overhead (especially with Meltdown and Spectre mitigation in modern kernels.)

64KiB to 128KiB is about a sweet spot for buffer size, given 256KiB L2 caches being typical. (Related: code golf: Fastest yes in the West tunes a program that just makes write system-calls, with x86-64 Linux, with profiling / benchmark results on my Skylake desktop.)

Nothing in the machine code benefits from the size fitting in a uint16_t like 0xFFFF does; either int8_t or int32_t are relevant for immediate operand sizes in 64-bit code. (Or uint32_t if you're zero-extending like mov $imm32, %edx to zero-extend into RDX.)

Don't close stdin: you run close unconditionally. Closing stdin doesn't affect the parent process's stdin, so it shouldn't be a problem in this program, but the whole point of the close seems to be to make this more like a function you could use from a larger program. So separate the copy-fd-to-stdout part from the file handling (open/close).

Use #include <asm/unistd.h> to get call numbers instead of hardcoding them. They're guaranteed stable, but it's more human readable / self-documenting to just use the named constants, and avoids any risk of copying errors. (Build with gcc -nostdlib -static asmcat.S -o asmcat; GCC runs .S files through the C preprocessor before assembling, unlike .s)

Style: I like to indent operands to a consistent column so they're not crowding mnemonics. Similarly, comments should be comfortably to the right of operands so you can scan down the column for instructions accessing any given register without getting distracted by comments on shorter instructions.

Comment content: The instruction itself already says what it does, the comment should describe the semantic meaning. (I don't need comments to remind me of calling conventions, like that system calls leave a result in RAX, but even if you do, summarizing the system call with a C version of it can be a good reminder of which arg is which. Like open(argv[1], O_RDONLY).)

I also like to remove redundant operand-size suffixes; the register sizes imply operand-size (just like Intel-syntax). Note that zeroing a 64-bit register only requires xorl; writing a 32-bit register implicitly zero-extends to 64-bit. Your code is sometimes inconsistent about whether things should be 32 or 64-bit. In my rewrite, I used 32-bit everywhere I could. (Except cmp %rax, %rdx return value from write, which seemed like a good idea to make 64-bit, although I don't think there's any real reason.)



My rewrite:

I removed the call/ret stuff, and just let it fall through into cleanup/exit instead of trying to separate it into "functions".

I also changed the buffer size to 64KiB exactly, allocated on the stack with 4k page alignment, and rearranged things to simplify and save instructions everywhere.

Also added a # TODO comment about short writes. That doesn't seem to happen for pipe writes up to 64k; Linux just blocks the write until the buffer has room, but could be a problem writing to a socket maybe? Or maybe only with a larger size, or if a signal like SIGTSTP or SIGSTOP interrupts write()

#include <asm/unistd.h>
BUFSIZE = 1<<16

.section .text
.globl _start
_start:
        pop     %rax            # argc
        pop     %rdi
        pop     %rdi            # argv[1]
        # you'd only ever want to read args this way in _start, which isn't a function

        and     $-4096, %rsp    # round RSP down to a page boundary.
        sub     $BUFSIZE, %rsp  # reserve 64K buffer aligned by 4k

        dec     %eax            # if argc == 1, then run with input fd = 0 (stdin)
        jz      .Luse_stdin

        # open argv[1]
        mov     $__NR_open, %eax
        xor     %esi, %esi      # flags: 0 means read-only.
        xor     %edx, %edx      # mode unused without O_CREAT, but zero it out for peace of mind.
        syscall                 # fd = open(argv[1], O_RDONLY)

.Luse_stdin:                    # don't use stdin as a symbol name; stdio.h / libc also has one of type FILE*
        mov     %eax, %ebx      # save FD
        mov     %rsp, %rsi      # always read and write the same buffer
        jmp     .Lentry         # start with a read then EOF-check as loop condition
                                # since we're now error-checking the write,
                                # rotating the loop maybe wasn't helpful after all
                                # and perhaps just read at the top so we can fall into it would work equally well

read_and_write:                 # do {
        # print the file
        mov     %eax, %edx      # size = read_size
        mov     $__NR_write, %eax   # syscall #1 = write.
        mov     $1, %edi        # output fd always stdout
        #mov    %rsp, %rsi      # buf, done once outside loop
        syscall                 # write(1, buf, read_size)

        cmp     %rax, %rdx      # written size should match request
        jne     cleanup         # TODO: handle short writes by calling again for the unwritten part of the buffer, e.g. add %rax, %rsi
                                # but also check for write errors.
.Lentry:
        # read the file.
        mov     $__NR_read, %eax    # xor %eax, %eax
        mov     %ebx, %edi      # input FD
        # mov   %rsp, %rsi      # done once outside loop
        mov     $BUFSIZE, %edx
        syscall                 # size = read(fd, buf, BUFSIZE)

        test    %eax, %eax
        jg      read_and_write  # }while(read_size > 0); // until EOF or error
        # any negative can be assumed to be an error, since we pass a size smaller than INT_MAX

cleanup:
        # fd might be stdin which we don't want to close.
        # just exit and let kernel take care of it, or check for fd==0
        # movl  $__NR_close, %eax
        # movl  %ebx, %edi
        # syscall               # close (fd) // return value ignored

exit:
        mov     %eax, %edi      # exit status = last syscall return value. read() = 0 means EOF, success.
        mov     $__NR_exit_group, %eax
        syscall                 # exit_group(status);

For instruction counts, perf stat --all-user ./asmcat /tmp/random > /dev/null shows it runs about 47 instructions in user-space, vs. 57 for yours. (IIRC, perf over-counts by 1, so I've subtracted that from the measured result.) And that's with more error-checking, e.g. for short writes.

This is only 84 bytes of machine code in the .text section (vs. 174 bytes for your original), and I didn't optimize for size over speed with stuff like lea 1(%rsi), %eax (after zeroing RSI) instead of mov $1, %eax. (Or with mov %eax, %edi to take advantage of __NR_write == STDOUT_FILENO.)

I mostly avoided R8..R15 because they need REX prefixes to access in the machine code.

Tests of error handling:

$ gcc -nostdlib -static asmcat.S -o asmcat            # build
$ cat /tmp/random | strace ./asmcat > /dev/full

execve("./asmcat", ["./asmcat"], 0x7ffde5e369d0 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment Gener"..., 65536) = 65536
write(1, "=head1 NAME\n\n=for comment Gener"..., 65536) = -1 ENOSPC (No space left on device)
exit_group(-28) = ?
+++ exited with 228 +++
$ strace ./asmcat <&-      # close stdin
execve("./asmcat", ["./asmcat"], 0x7ffd0f5048c0 /* 55 vars */) = 0
read(0, 0x7ffc1b3ca000, 65536) = -1 EBADF (Bad file descriptor)
exit_group(-9) = ?
+++ exited with 247 +++
$ strace ./asmcat /noexist
execve("./asmcat", ["./asmcat", "/noexist"], 0x7ffd429f1158 /* 55 vars */) = 0
open("/noexist", O_RDONLY) = -1 ENOENT (No such file or directory)
read(-2, 0x7ffd4f296000, 65536) = -1 EBADF (Bad file descriptor)
exit_group(-9) = ?
+++ exited with 247 +++

Hmm, should probably test/jl on the fd after open, if you wanted to do error handling.

Race condition and unlocked write

As you say, this is a race condition. Under C++11 it is technically a data race, and undefined behaviour. It doesn't matter that the values are the same.

If your compiler supports it (e.g. recent gcc, or gcc or MSVC with my Just::Thread library) then you can use std::atomic<some_pod_struct> to provide an atomic wrapper around your data (assuming it is a POD struct --- if it isn't then you have bigger problems). If it is small enough then the compiler will make it lock-free, and use the appropriate atomic operations. For larger structures the library will use a lock.

The problem with doing this without atomic operations or locks is visibility. Whilst there is no problem at the processor level on either x86 or ARM with writing the same data (assuming it really is byte-for-byte identical) from multiple threads/processors to the same memory, given that this is a cache, I expect you'll want to read this data rather than recalculate it if it has already been written. You'll therefore need some sort of flag to indicate done-ness. Unless you use atomic operations, locks or suitable memory barrier instructions then the "ready" flag may become visible to another processor before the data does. This will then really mess things up, as the second processor now reads an incomplete set of data.

You could write the data with non-atomic operations, and then use an atomic data type for the flag. Under C++11 this will generate suitable memory barriers and synchronization to ensure that the data is visible to any thread that sees the flag set. It's still undefined behaviour for two threads to write the data, but it may be OK in practice.
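
A minimal sketch of that flag approach, assuming the cached data is a POD struct (the names CacheEntry, entry, and ready are illustrative): the data is written with plain stores, then published with a release store that a reader pairs with an acquire load.

#include <atomic>

struct CacheEntry { int data[16]; };    // hypothetical POD payload

CacheEntry entry;                       // written non-atomically before publishing
std::atomic<bool> ready{false};

void producer() {
    for (int& d : entry.data) d = 42;                // plain, non-atomic stores
    ready.store(true, std::memory_order_release);    // publish: data is visible before the flag
}

bool try_consume(CacheEntry& out) {
    if (!ready.load(std::memory_order_acquire))
        return false;                   // not published yet: recalculate or retry later
    out = entry;                        // safe: acquire synchronizes with the release store
    return true;
}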

Alternatively, store the data in a block of heap memory allocated by each thread that does the calculation, and use a compare-and-swap operation to set an atomic pointer variable. If the compare-and-swap fails then another thread got there first, so free the data.
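
And a sketch of that compare-and-swap alternative (again with illustrative names): each thread allocates and fills its own block, only the first CAS to succeed gets published, and the loser frees its copy.

#include <atomic>

struct Result { int value; /* ... computed data ... */ };

std::atomic<Result*> published{nullptr};

void compute_and_publish() {
    Result* mine = new Result{42};      // each thread computes into its own heap block
    Result* expected = nullptr;
    if (!published.compare_exchange_strong(expected, mine)) {   // seq_cst by default
        delete mine;                    // another thread got there first; use its copy (now in `expected`)
    }
}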

If write operation happens during exclusive cache access why is there data race?

If I read the matter right, MESI is just a red herring here:

0   reg = load(&counter);   

counter has now been loaded into CPU's register.

1   reg = reg + 1;                reg = load(&counter);

First processor incremented the value, second one loaded the old one.

2   store(&counter, reg);         reg = reg + 1;

First processor stores value, second one increments its outdated one.

3                                 store(&counter, reg);

Second processor stores the result of calculation based on an outdated value.

That should be clear so far. Now, how does that change once we add MESI states?

0   reg = load(&counter);   

counter is in CPU 1 cache, marked E.

1   reg = reg + 1;                reg = load(&counter);

counter still resides in CPU 1 cache, but is loaded into CPU 2 cache, too. So both cache lines need to be marked as S.

2   store(&counter, reg);         reg = reg + 1;

Now counter gets stored back in cache. CPU 1 cache thus needs to be marked as M, while CPU 2 cache gets invalidated (marked as I).

3                                 store(&counter, reg);

As CPU 2's cache line got invalidated, it needs to be updated before the store operation can occur, which in turn requires CPU 1's modified line to be written back to memory first (of course).

But all that being done now, the value in reg still was calculated based on an outdated value and still overwrites the (now updated) value in cache...

Adding a final detail: After the operation, CPU 2 cache will be marked M and CPU 1 cache I.

How compare_exchange C++ function determines race conditions?

The cmpxchg instruction affects the ZF flag: it is set if the exchange succeeded and it is cleared otherwise.

Let's see this with an example:

#include <atomic>

std::atomic<int> a;

bool my_compare_exchange(int expected, int desired) {
    bool succeeded = a.compare_exchange_weak(expected, desired);
    return succeeded;
}

The function my_compare_exchange() is translated into the following assembly code:

my_compare_exchange:
        mov     eax, edi
        lock cmpxchg DWORD PTR a[rip], esi
        sete    al                      // <-- conditional instruction
        ret

The register al is set to 1 using sete al if the exchange succeeded (i.e., ZF was set by cmpxchg). Otherwise, it is set to zero (i.e., ZF was cleared by cmpxchg).
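
That boolean result is what you normally loop on; for example, a lock-free increment built on compare_exchange_weak (a standard idiom, not from the question) retries until the exchange succeeds:

#include <atomic>

std::atomic<int> counter{0};

void increment() {
    int expected = counter.load();
    // On failure, compare_exchange_weak reloads the current value into `expected`,
    // so the loop simply retries with fresh data. Each attempt is one lock cmpxchg,
    // and the sete-style boolean result is what the while condition tests.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
    }
}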

How can I show that volatile assignment is not atomic?

Some answers/comments suggested sleeping in the writer. This is not useful; hammering away on the cache line changing it as often as possible is what you want. (And what you get with volatile assignments and reads.) An assignment will be torn when a MESI share request for the cache line arrives at the writer core between committing two halves of a store from the store buffer to L1d cache.

If you sleep, you're waiting a long time without creating a window for that to happen. Sleeping between the halves would make tearing even easier to detect, but you can't do that unless you use separate memcpy calls to write the two halves of the 64-bit integer or something.

Tearing between reads in the reader is also possible even if writes are atomic. This may be less likely, but still happens plenty in practice. Modern x86 CPUs can execute two loads per clock cycle (Intel since Sandybridge, AMD since K8). I tested with atomic 64-bit stores but split 32-bit loads on Skylake and tearing is still frequent enough to spew lines of text in a terminal. So the CPU didn't manage to run everything in lock-step with corresponding pairs of reads always executing in the same clock cycle. So there is a window for the reader getting its cache line invalidated between a pair of loads. (However, all the pending cache-miss loads while the cache line is owned by the writer core probably complete all at once when the cache line arrives. And the total number of load buffers available is an even number in existing microarchitectures.)


As you discovered, your test values both had the same upper half of 0, so this made it impossible to observe any tearing; only the 32-bit aligned low half was ever changing, and was changing atomically because your compiler guarantees at least 4-byte alignment for uint64_t, and x86 guarantees that 4-byte aligned loads/stores are atomic.

0 and -1ULL are the obvious choices. I used the same thing in a test-case for this GCC C11 _Atomic bug for a 64-bit struct.

For your case, I'd do this. read() and write() are POSIX system call names so I picked something else.

#include <cstdint>
#include <iostream>

volatile uint64_t sharedValue = 0;  // initializer = one of the 2 values!

void writer() {
    for (;;) {
        sharedValue = 0;
        sharedValue = -1ULL;        // unrolling is vastly simpler than an if
    }
}

void reader() {
    for (;;) {
        uint64_t val = sharedValue;
        uint32_t low = val, high = val >> 32;
        if (low != high) {
            std::cout << "Tearing! Value: " << std::hex << val << '\n';
        }
    }
}

MSVC 19.24 -O2 compiles the writer to use a movlpd 64-bit store for the = 0, but two separate 32-bit stores of -1 for the = -1. (And the reader to two separate 32-bit loads.) GCC uses a total of four mov dword ptr [mem], imm32 stores in the writer, like you'd expect. (Godbolt compiler explorer)

Terminology: it's always a race condition (even with atomicity you don't know which of the two values you're going to get). With std::atomic<> you'd only have that garden-variety race condition, no Undefined Behaviour.

The question is whether you actually see tearing from the data race Undefined Behaviour on the volatile object, on a specific C++ implementation / set of compile options, for a specific platform. Data race UB is a technical term with a more specific meaning than "race condition". I changed the error message to report the one symptom we check for. Note that data-race UB on a non-volatile object can have way weirder effects, like hoisting the load or store out of loops, or even inventing extra reads leading to code that thinks one read was both true and false at the same time. (https://lwn.net/Articles/793253/)

I removed 2 redundant cout flushes: one from std::endl and one from std::flush. cout is line-buffered by default, or fully buffered if writing to a file, which is fine. And '\n' is just as portable as std::endl as far as DOS line endings are concerned; text vs. binary stream mode handles that. endl is still just \n.

I simplified your check for tearing by checking that high_half == low_half. Then the compiler just has to emit one cmp/jcc instead of two extended-precision compares to see if the value is either 0 or -1 exactly. We know that there's no plausible way for false-negatives like high = low = 0xff00ff00 to happen on x86 (or any other mainstream ISA with any sane compiler).


So I guess that on x86 there is no need to use std::atomic for 32 bit data types?

Incorrect.

Hand-rolled atomics with volatile int can't give you atomic RMW operations (without inline asm or special functions like Windows InterlockedIncrement or GNU C builtin __atomic_fetch_add), and can't give you any ordering guarantees wrt. other code. (Release / acquire semantics)
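
To make the RMW point concrete, here is a small sketch (not from the question): ++ on a volatile int compiles to a separate load, add, and store, so concurrent increments can be lost, whereas fetch_add on std::atomic is a single locked instruction.

#include <atomic>

volatile int v_count = 0;          // hand-rolled: ++v_count is load + add + store, so increments can be lost
std::atomic<int> a_count{0};

void tick() {
    // ++v_count;                  // data race; not an atomic RMW
    a_count.fetch_add(1, std::memory_order_relaxed);   // compiles to lock add / lock xadd on x86
}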

When to use volatile with multi threading? - pretty much never.

Rolling your own atomics with volatile is still possible and de-facto supported by many mainstream compilers (e.g. the Linux kernel still does that, along with inline asm). Real-world compilers do effectively define the behaviour of data races on volatile objects. But it's generally a bad idea when there's a portable and guaranteed-safe way. Just use std::atomic<T> with std::memory_order_relaxed to get asm that's just as efficient as what you could get with volatile (for the cases where volatile works), but with guarantees of safety and correctness from the ISO C++ standard.

atomic<T> also lets you ask the implementation whether a given type can be cheaply atomic or not, with C++17 std::atomic<T>::is_always_lock_free or the older member function. (In practice, C++11 implementations decided not to make some but not all instances of a given atomic type lock-free depending on alignment or whatever; instead they just give atomic<T> the required alignas if there is one. So C++17 added a per-type constant instead of only the per-object member function as the way to check lock-freedom.)
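
For example (a trivial sketch): you can assert lock-freedom at compile time with the C++17 constant, or query a particular object with the older member function.

#include <atomic>
#include <cstdint>

static_assert(std::atomic<uint32_t>::is_always_lock_free,
              "expect lock-free 32-bit atomics on mainstream targets");

std::atomic<uint64_t> x;
bool lf = x.is_lock_free();     // pre-C++17 per-object query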

std::atomic can also give cheap lock-free atomicity for types wider than a normal register. e.g. on ARM, using ARMv6 strd / ldrd to store/load a pair of registers.

On 32-bit x86, a good compiler can implement std::atomic<uint64_t> by using SSE2 movq to do atomic 64-bit loads and stores, without falling back to the non-lock_free mechanism (a table of locks). In practice, GCC and clang9 do use movq for atomic<uint64_t> load/store. clang8.0 and earlier uses lock cmpxchg8b unfortunately. MSVC uses lock cmpxchg8b in an even more inefficient way. Change the definition of sharedVariable in the Godbolt link to see it. (Or if you use one each of default seq_cst and memory_order_relaxed stores in the loop, MSVC for some reason calls a ?store@?$_Atomic_storage@_K$07@std@@QAEX_KW4memory_order@2@@Z helper function for one of them. But when both stores are the same ordering, it inlines lock cmpxchg8b with much clunkier loops than clang8.0) Note that this inefficient MSVC code-gen is for a case where volatile wasn't atomic; in cases where it is, atomic<T> with mo_relaxed compiles nicely, too.

You generally can't get that wide-atomic code-gen from volatile. Although GCC does actually use movq for your if() bool write function (see the earlier Godbolt compiler explorer link) because it can't see through the alternating or something. It also depends on what values you use. With 0 and -1 it uses separate 32-bit stores, but with 0 and 0x0f0f0f0f0f0f0f0fULL you get movq for a usable pattern. (I used this to verify that you can still get tearing from just the read side, instead of hand-writing some asm.) My simple unrolled version compiles to just use plain mov dword [mem], imm32 stores with GCC. This is a good example of there being zero guarantee of how volatile really compiles in this level of detail.

atomic<uint64_t> will also guarantee 8-byte alignment for the atomic object, even if plain uint64_t might have only been 4-byte aligned.


In ISO C++, a data race on a volatile object is still Undefined Behaviour. (Except for volatile sig_atomic_t racing with a signal handler.)

A "data race" is any time two unsynchronized accesses happen and they're not both reads. ISO C++ allows for the possibility of running on machines with hardware race detection or something; in practice no mainstream systems do that so the result is just tearing if the volatile object isn't "naturally atomic".

ISO C++ also in theory allows for running on machines that don't have coherent shared memory and require manual flushes after atomic stores, but that's not really plausible in practice. No real-world implementations are like that, AFAIK. Systems with cores that have non-coherent shared memory (like some ARM SoCs with DSP cores + microcontroller cores) don't start std::thread across those cores.

See also Why is integer assignment on a naturally aligned variable atomic on x86?

It's still UB even if you don't observe tearing in practice, although as I said real compilers do de-facto define the behaviour of volatile.


Skylake experiments to try to detect store-buffer coalescing

I wondered if store coalescing in the store buffer could maybe create an atomic 64-bit commit to L1d cache out of two separate 32-bit stores. (No useful results so far, leaving this here in case anyone's interested or wants to build on it.)

I used a GNU C __atomic builtin for the reader, so if the stores also ended up being atomic we'd see no tearing.
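
One way to write that reader load with the GNU builtin (a sketch of the shape, not the author's exact code):

#include <cstdint>

volatile uint64_t sharedValue;

uint64_t atomic_read() {
    // One 64-bit atomic load of the volatile object, so any tearing observed
    // would have to come from the store side (e.g. store-buffer commit granularity).
    return __atomic_load_n(&sharedValue, __ATOMIC_RELAXED);
}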


