Acquire/Release Versus Sequentially Consistent Memory Order


The C++11 memory ordering parameters for atomic operations specify constraints on the ordering of memory accesses. If you do a store with std::memory_order_release, and a load from another thread reads the value with std::memory_order_acquire, then subsequent read operations from the second thread will see any values stored to any memory location by the first thread prior to the store-release, or a later store to any of those memory locations.
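As a minimal sketch of that guarantee (the variable names here are illustrative, not from the answer above): an ordinary write made before the store-release is visible to a thread whose load-acquire reads the stored value.

#include <thread>
#include <atomic>
#include <assert.h>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};

void writer()
{
    payload = 42;                                  // happens before the store-release
    ready.store(true, std::memory_order_release);
}

void reader()
{
    while (!ready.load(std::memory_order_acquire))
        ;                                          // spin until the flag is seen
    assert(payload == 42);                         // guaranteed: the acquire load
}                                                  // synchronized with the release store

int main()
{
    std::thread t1{writer};
    std::thread t2{reader};
    t1.join();
    t2.join();
}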

If both the store and subsequent load are std::memory_order_seq_cst then the relationship between these two threads is the same. You need more threads to see the difference.

e.g. std::atomic<int> variables x and y, both initially 0.

Thread 1:

x.store(1,std::memory_order_release);

Thread 2:

y.store(1,std::memory_order_release);

Thread 3:

int a=x.load(std::memory_order_acquire); // x before y
int b=y.load(std::memory_order_acquire);

Thread 4:

int c=y.load(std::memory_order_acquire); // y before x
int d=x.load(std::memory_order_acquire);

As written, there is no relationship between the stores to x and y, so it is quite possible to see a==1, b==0 in thread 3, and c==1 and d==0 in thread 4.

If all the memory orderings are changed to std::memory_order_seq_cst then this enforces an ordering between the stores to x and y. Consequently, if thread 3 sees a==1 and b==0 then that means the store to x must be before the store to y, so if thread 4 sees c==1, meaning the store to y has completed, then the store to x must also have completed, so we must have d==1.

In practice, using std::memory_order_seq_cst everywhere will add overhead to loads, stores, or both, depending on your compiler and processor architecture. For example, a common technique for x86 processors is to use XCHG instructions rather than MOV instructions for std::memory_order_seq_cst stores, in order to provide the necessary ordering guarantees, whereas for std::memory_order_release a plain MOV will suffice. On systems with more relaxed memory architectures the overhead may be greater, since plain loads and stores have fewer guarantees.
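As a concrete (if compiler-dependent) illustration, the sketch below can be fed to a compiler explorer; the comments describe what gcc and clang typically emit for x86-64 today, not a guarantee:

#include <atomic>

std::atomic<int> v{0};

void store_release(int x)
{
    v.store(x, std::memory_order_release);     // typically a plain MOV on x86-64
}

void store_seq_cst(int x)
{
    v.store(x, std::memory_order_seq_cst);     // typically XCHG (or MOV + MFENCE);
}                                              // the full barrier is the extra cost

int load_v()
{
    return v.load(std::memory_order_seq_cst);  // loads, whether acquire or seq_cst,
}                                              // are typically a plain MOV on x86-64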

Memory ordering is hard. I devoted almost an entire chapter to it in my book.

Understanding `memory_order_acquire` and `memory_order_release` in C++11

Acquire and release are memory barriers.
If your program reads data after an acquire barrier, you are assured of reading data consistent in order with any preceding release by any other thread in respect of the same atomic variable. Atomic variables are guaranteed to have a total order to their reads and writes across all threads (when using memory_order_acquire and memory_order_release; weaker orderings are also provided). These barriers in effect propagate that order to any threads using that atomic variable.
You can use atomics to indicate that something has 'finished' or is 'ready', but if the consumer could not rely on 'seeing' the right 'versions' of other memory when it reads beyond that atomic variable, atomics would have limited value.

The statements about 'moving before' or 'moving after' are instructions to the optimizer that it must not reorder operations across the barrier. Optimizers are very good at reordering instructions, and even at omitting redundant reads and writes, but if they reorganise the code across the memory barriers they may unwittingly violate that order.

Your code relies on the std::string object (a) having been constructed in producer() before ptr is assigned and (b) the constructed version of that string (i.e. the version of the memory it occupies) being the one that consumer() reads.
Put simply, consumer() is going to eagerly read the string as soon as it sees ptr assigned, so it had damn well better see a valid and fully constructed object or bad times will ensue.
In that code, 'the act' of assigning ptr is how producer() 'tells' consumer() the string is 'ready'. The memory barrier exists to make sure that's what the consumer sees.
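For reference, the code under discussion is presumably close to the well-known release/acquire example from cppreference; a sketch of it, with properties (a) and (b) marked:

#include <thread>
#include <atomic>
#include <string>
#include <assert.h>

std::atomic<std::string*> ptr{nullptr};
int data = 0;

void producer()
{
    std::string* p = new std::string("Hello");  // (a) constructed before ptr is assigned
    data = 42;                                  // ordinary write, also ordered
    ptr.store(p, std::memory_order_release);    // 'the act' that tells the consumer
}

void consumer()
{
    std::string* p2;
    while (!(p2 = ptr.load(std::memory_order_acquire)))
        ;                                       // spin until ptr is seen assigned
    assert(*p2 == "Hello");                     // (b) sees the fully constructed string
    assert(data == 42);                         // and the ordinary write before it
    delete p2;
}

int main()
{
    std::thread t1{producer};
    std::thread t2{consumer};
    t1.join();
    t2.join();
}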

Conversely, if ptr were declared as an ordinary std::string * then the compiler could decide to optimize p away, assign the allocated address directly to ptr, and only then construct the object and assign the int data. That is likely a disaster for the consumer thread, which is using that assignment as the indicator that the objects the producer is preparing are ready.
To be accurate, if ptr were an ordinary pointer the consumer might never see the value assigned, or on some architectures might read a partially assigned value where only some of the bytes have been written, so that it points to a garbage memory location. However, those aspects are about ptr being atomic, not about the wider memory barriers.

What's a practical example where acquire/release memory order differs from sequential consistency?

Peterson's algorithm is an example of something that requires sequential consistency.

In the days before mutexes, the algorithm was used to give a single thread access to a protected area.
The algorithm only works with 2 threads, each managing a flag that expresses the intention to access the protected area.
If both set the flag at (about) the same time, both will back off and try again.
The real algorithm is more advanced, since it uses a 'turn' flag to manage fair access, but that is not necessary to show the difference between seq_cst and acq/rel.

Below is a ready-to-compile, simplified version of the Peterson algorithm that actually shows it is broken if anything weaker than sequential consistency is used.
Interestingly enough, this is broken even on x86, since that platform allows a store followed by a load to be reordered (StoreLoad reordering).

The problem with StoreLoad reordering is that both threads may express their intention to access the protected area by setting their me flag to true while both still read false from the him flag (since neither store has propagated to the other thread yet), and both enter the protected area. This is not possible with sequential consistency.

With gcc, I had to compile with -O3 optimizations to have the assert fire; with clang that was not necessary.
The two compilers use different approaches to implement sequential consistency.

#include <thread>
#include <atomic>
#include <assert.h>

std::atomic<bool> flag1{false};
std::atomic<bool> flag2{false};

std::atomic<int> counter{0};

// Change these to memory_order_seq_cst to fix the algorithm
static const auto store_ordering = std::memory_order_release;
static const auto load_ordering = std::memory_order_acquire;

void busy(int n)
{
    auto &me = (n==1) ? flag1 : flag2;
    auto &him = (n==1) ? flag2 : flag1;

    for (;;)
    {
        for (;;)
        {
            me.store(true, store_ordering);
            if (him.load(load_ordering) == false)
            {
                // got the 'lock'
                break;
            }

            // contention: back off and retry immediately -> busy loop
            me.store(false, store_ordering);
        }

        int tmp = counter.fetch_add(1, std::memory_order_relaxed);
        assert(tmp == 0);

        /*
         * critical area
         */

        tmp = counter.fetch_sub(1, std::memory_order_relaxed);
        assert(tmp == 1);

        me.store(false, store_ordering);
    }
}

int main()
{
    std::thread t1{busy, 1};
    std::thread t2{busy, 2};

    t1.join();
    t2.join();
}
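For reference, one way to build and run it (assuming the file is saved as peterson.cpp; -pthread is needed for std::thread with gcc and clang on Linux):

g++ -O3 -pthread peterson.cpp && ./a.out

With the acquire/release orderings above, the assert should fire sooner or later; with both constants changed to std::memory_order_seq_cst, it should loop forever.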

Acquire/Release vs Sequential Consistency in C++11?

Yes, the assert can fire.

The principal property that is not guaranteed by acquire/release is a single total order of modifications that all threads agree on. It only guarantees that the (here non-existent) previous actions of a and b are observed by c and d if they see true from the loads.

A (slightly contrived) example of this is a multi-CPU (physical socket) system that isn't fully cache-coherent. Die 1 has core A running thread a and core C running thread c. Die 2 has core B running thread b and core D running thread d. The interconnect between the two sockets has a long latency when compared to a memory operation that hits on-die cache.

a and b run at the same wall-clock time. C is on-die with A, so it can see the store to x immediately, but the interconnect delays its observation of the store to y, so it sees the old value. Similarly, D is on-die with B, so it sees the store to y but misses the store to x.

Whereas if you have sequential consistency, some co-ordination is required to enforce a total order, such as "C and D are blocked while the interconnect syncs the caches".

How do memory_order_seq_cst and memory_order_acq_rel differ?

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

#include <thread>
#include <atomic>
#include <assert.h>

std::atomic<bool> x{false};
std::atomic<bool> y{false};
std::atomic<int> z{0};

void a() { x.store(true); }                        // seq_cst is the default order
void b() { y.store(true); }
void c() { while (!x.load()); if (y.load()) z++; }
void d() { while (!y.load()); if (x.load()) z++; }

int main()
{
    std::thread ta{a}, tb{b}, tc{c}, td{d};
    ta.join(); tb.join(); tc.join(); td.join();
    assert(z != 0);   // never fires with seq_cst; can fire with acquire/release
}

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.


