Does std::mutex Create a Fence?

Does std::mutex create a fence?

Unlocking a mutex synchronizes with subsequently locking the same mutex. I don't know what options the compiler has for the implementation, but you get the same effect as a fence.

Does `std::mutex` and `std::lock` guarantee memory synchronisation in inter-processor code?

The standard makes the following guarantees about synchronization of std::mutex, in §30.4.1.2 [thread.mutex.requirements.mutex]/6-25:

The expression m.lock() shall be well-formed and have the following semantics:

Synchronization: Prior unlock() operations on the same object shall synchronize with this operation.

And, likewise,

The expression m.unlock() shall be well-formed and have the following semantics:

Synchronization: This operation synchronizes with subsequent lock operations that obtain ownership on the same object.

(Where "synchronizes with" is a specific term explained in $1.10, although it's much easier to understand by reading C++ Concurrency In Action)

Memory model ordering and visibility?

If you like to deal with fences, then a.load(memory_order_acquire) is equivalent to a.load(memory_order_relaxed) followed by atomic_thread_fence(memory_order_acquire). Similarly, a.store(x,memory_order_release) is equivalent to a call to atomic_thread_fence(memory_order_release) before a call to a.store(x,memory_order_relaxed).

memory_order_consume is a special case of memory_order_acquire, for dependent data only.

memory_order_seq_cst is special, and forms a total order across all memory_order_seq_cst operations. Mixed with the others it is the same as an acquire for a load, and a release for a store.

memory_order_acq_rel is for read-modify-write operations, and is equivalent to an acquire on the read part and a release on the write part of the RMW.
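
In code, the first two equivalences look like this (a sketch; the function names are mine, and both forms of each pair give the same ordering guarantees):

#include <atomic>

std::atomic<int> a{0};

int load_acquire_direct()
{
    return a.load(std::memory_order_acquire);
}

int load_acquire_via_fence()
{
    int v = a.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);   // fence after the relaxed load
    return v;
}

void store_release_direct(int x)
{
    a.store(x, std::memory_order_release);
}

void store_release_via_fence(int x)
{
    std::atomic_thread_fence(std::memory_order_release);   // fence before the relaxed store
    a.store(x, std::memory_order_relaxed);
}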

The use of ordering constraints on atomic operations may or may not result in actual fence instructions, depending on the hardware architecture. In some cases the compiler will generate better code if you put the ordering constraint on the atomic operation rather than using a separate fence.

On x86, loads are always acquire, and stores are always release. memory_order_seq_cst requires stronger ordering with either an MFENCE instruction or a LOCK prefixed instruction (there is an implementation choice here as to whether to make the store have the stronger ordering or the load). Consequently, standalone acquire and release fences are no-ops, but atomic_thread_fence(memory_order_seq_cst) is not (again requiring an MFENCE or LOCKed instruction).

An important effect of the ordering constraints is that they order other operations.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready(false);
int i = 0;

void thread_1()
{
    i = 42;
    ready.store(true, std::memory_order_release);
}

void thread_2()
{
    while (!ready.load(std::memory_order_acquire)) std::this_thread::yield();
    assert(i == 42);
}

thread_2 spins until it reads true from ready. Since the store to ready in thread_1 is a release and the load is an acquire, the store synchronizes-with the load; the store to i therefore happens-before the load from i in the assert, and the assert will not fire.

2) The second line in

atomicVar.store(42);
std::atomic_thread_fence(std::memory_order_seq_cst);

is indeed potentially redundant, because the store to atomicVar uses memory_order_seq_cst by default. However, if there are other non-memory_order_seq_cst atomic operations on this thread then the fence may have consequences. For example, it would act as a release fence for a subsequent a.store(x,memory_order_relaxed).
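
A sketch of that situation (flag and data are hypothetical names, not from the original question):

#include <atomic>
#include <cassert>

std::atomic<int> atomicVar{0};
std::atomic<bool> flag{false};
int data = 0;

void producer()
{
    data = 1;
    atomicVar.store(42);                                  // memory_order_seq_cst by default
    std::atomic_thread_fence(std::memory_order_seq_cst);  // redundant for atomicVar alone...
    flag.store(true, std::memory_order_relaxed);          // ...but acts as a release fence
                                                          // for this relaxed store
}

void consumer()
{
    if (flag.load(std::memory_order_acquire))
        assert(data == 1);   // ordered by the fence paired with the acquire load
}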

3) Fences and atomic operations do not work like mutexes. You can use them to build mutexes, but they do not work like them. You do not have to ever use atomic_thread_fence(memory_order_seq_cst). There is no requirement that any atomic operations are memory_order_seq_cst, and ordering on non-atomic variables can be achieved without, as in the example above.

4) No, these are not equivalent. Without the mutex lock, your snippet is a data race and undefined behaviour.

5) No, your assert cannot fire. With the default memory ordering of memory_order_seq_cst, the store and load from the atomic pointer p work like the store and load in my example above, and the stores to the array elements are guaranteed to happen-before the reads.
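
The question's snippet isn't reproduced here, but the pattern being described is roughly the following (a reconstruction, not the original code):

#include <atomic>
#include <cassert>
#include <thread>

int data[3];
std::atomic<int*> p{nullptr};

void writer()
{
    data[0] = 1; data[1] = 2; data[2] = 3;   // stores to the array elements
    p.store(data);                            // seq_cst store publishes the pointer
}

void reader()
{
    int* q;
    while (!(q = p.load()))                   // seq_cst load; spin until published
        std::this_thread::yield();
    assert(q[0] == 1 && q[1] == 2 && q[2] == 3);   // cannot fire
}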

Should combining memory fence for mutex acquire-exchange loop (or queue acquire-load loop) be done or should it be avoided?

Yes, the general idea of avoiding an acquire barrier inside the failure retry path is possibly useful, although performance in the failure case barely matters if you're just spinning. pause or yield save power. On x86, pause also improves SMT friendliness and avoids memory-order mis-speculation when leaving the loop after another core has modified the memory location you're spinning on.

But that's why CAS has separate memory_order parameters for success and failure. A relaxed failure ordering can let the compiler emit a barrier only on the leave-the-loop path.
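
For example, a lock-acquire loop built on compare_exchange_weak can request acquire ordering only on the success path (a sketch; locked and lock() are my names):

#include <atomic>

std::atomic<bool> locked{false};

void lock()
{
    bool expected = false;
    // acquire on success, relaxed on failure: the retry path needs no barrier
    while (!locked.compare_exchange_weak(expected, true,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed))
    {
        expected = false;   // compare_exchange_weak overwrote it with the current value
    }
}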

atomic_flag test_and_set doesn't have that option, though. Doing it manually potentially hurts ISAs like AArch64 that could have done an acquire RMW and avoided an explicit fence instruction (e.g. with ldarb).
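
For reference, the two variants compared in the Godbolt output below are roughly the following (my reconstruction of the source, not the exact code that was compiled):

#include <atomic>

std::atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_acquire_rmw()
{
    // Variant 1: acquire ordering on the RMW itself
    while (lock.test_and_set(std::memory_order_acquire))
        ;   // spin
}

void spin_relaxed_rmw_plus_fence()
{
    // Variant 2: relaxed RMW in the loop, acquire fence after leaving it
    while (lock.test_and_set(std::memory_order_relaxed))
        ;   // spin
    std::atomic_thread_fence(std::memory_order_acquire);
}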

Godbolt: Original loop with lock.test_and_set(std::memory_order_acquire):

# AArch64 gcc8.2 -O3
.L6: # do{
ldaxrb w0, [x19] # acquire load-exclusive
stxrb w1, w20, [x19] # relaxed store-exclusive
cbnz w1, .L6 # LL/SC failure retry
tst w0, 255
bne .L6 # }while(old value was != 0)
... no barrier after this

(And yes, it looks like a missed optimization that it's only testing the low 8 bits with tst instead of just cbnz w0, .L6)

while(relaxed RMW) + std::atomic_thread_fence(std::memory_order_acquire);

.L14:                          # do {
ldxrb w0, [x19] # relaxed load-exclusive
stxrb w1, w20, [x19] # relaxed store-exclusive
cbnz w1, .L14 # LL/SC retry
tst w0, 255
bne .L14 # }while(old value was != 0)
dmb ishld #### Acquire fence
...

It's even worse for 32-bit ARMv8, where dmb ishld isn't available or compilers don't use it: you'll get a dmb ish full barrier.


Or with -march=armv8.1-a

.L2:
swpab w20, w0, [x19] # acquire swap: RMW with acquire ordering
tst w0, 255
bne .L2
mov x2, 19
...

vs.

.L9:
swpb w20, w0, [x19] # relaxed swap: plain RMW, no ordering
tst w0, 255
bne .L9
dmb ishld # acquire barrier (load ordering)
mov x2, 19
...

How does a C++ std::mutex bind to a resource?

Given that m is a variable of type std::mutex:

Imagine this sequence:

int a;
m.lock();
b += 1;                  // b is some variable shared with other threads, protected by m
a = b;
m.unlock();
do_something_with(a);

There is an 'obvious' thing going on here:

The assignment of a from b and the increment of b are 'protected' from interference from other threads, because other threads will attempt to lock the same m and will be blocked until we call m.unlock().

And there is a more subtle thing going on.

In single-threaded code, the compiler will seek to re-order loads and stores. Without the locks, the compiler would be free to effectively re-write your code if this turned out to be more efficient on your chipset:

int a = b + 1;
// m.lock();
b = a;
// m.unlock();
do_something_with(a);

Or even:

do_something_with(++b);

However, std::mutex::lock(), unlock(), std::thread(), std::async(), std::future::get() and so on are fences. The compiler 'knows' that it may not reorder loads and stores (reads and writes) in such a way that the operation ends up on the other side of the fence from where you specified when you wrote the code.

1:
2: m.lock(); <--- This is a fence
3: b += 1; <--- So this load/store operation may not move above line 2
4: m.unlock(); <--- Nor may it be moved below this line

Imagine what would happen if this wasn't the case:

(Reordered code)

thread1: int a = b + 1;
<--- Here another thread precedes us and executes the same block of code
thread2: int a = b + 1;
thread2: m.lock();
thread2: b = a;
thread2: m.unlock();
thread1: m.lock();
thread1: b = a;
thread1: m.unlock();
thread1: do_something_with(a);
thread2: do_something_with(a);

If you follow it through, you'll see that b now has the wrong value in it, because the compiler was trying to make your code faster.

...and that's only the compiler optimisations. std::mutex, etc. also prevents the memory caches from reordering loads and stores in a more 'optimal' way, which would be fine in a single-threaded environment but disastrous in a multi-core (i.e. any modern PC or phone) system.

There is a cost for this safety, because thread A's cache must be flushed before thread B reads the same data, and flushing caches to memory is hideously slow compared to cached memory access. But c'est la vie. It's the only way to make concurrent execution safe.

This is why we prefer that, if possible, in an SMP system, each thread has its own copy of data on which to work. We want to minimise not only the time spent in a lock, but also the number of times we cross a fence.
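
As a sketch of that idea (illustrative names, not from the answer above): each thread works on its own local accumulator and only takes the lock once, at the end, so the fence is crossed once per thread instead of once per item.

#include <mutex>
#include <vector>

std::mutex m;
long long total = 0;   // shared result, protected by m

void worker(const std::vector<int>& chunk)
{
    long long local = 0;                // thread-private: no locks, no fences
    for (int v : chunk)
        local += v;

    std::lock_guard<std::mutex> lk(m);  // cross the fence once, briefly
    total += local;
}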

I could go on to talk about the std::memory_order modifiers, but that is a dark and dangerous hole, which even experts often get wrong and in which beginners have no hope of getting things right.

Are the memory barriers correct for this lock?

The usual pattern is to use test_and_set(memory_order_acquire) and clear(memory_order_release). But I suspect you know that already.
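
That usual pattern looks roughly like this (a sketch of a minimal spinlock, not your code):

#include <atomic>

class spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire))
            ;   // spin until the flag was previously clear
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};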

According to the standard, §29.8 [atomic.fences]/2:

A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

In your code, A is the fence in your unlock() function; X is the clear(); Y is the test_and_set(); and B is the fence in your lock() function. So your code meets the requirements of this section of the standard, and therefore your unlock() and lock() functions are properly synchronized.
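
Your code isn't reproduced here, but the shape being analyzed is roughly this (a reconstruction): relaxed atomic operations with standalone fences, annotated with the A/X/Y/B roles from the quoted paragraph.

#include <atomic>

std::atomic_flag flag = ATOMIC_FLAG_INIT;

void lock()
{
    while (flag.test_and_set(std::memory_order_relaxed))   // Y: eventually reads the value written by X
        ;
    std::atomic_thread_fence(std::memory_order_acquire);   // B: acquire fence, sequenced after Y
}

void unlock()
{
    std::atomic_thread_fence(std::memory_order_release);   // A: release fence, sequenced before X
    flag.clear(std::memory_order_relaxed);                 // X: the modification of M
}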


