What Is The Mutex Acquire and Release Order

What is the Mutex acquire and release order?

Once Thread 1 releases the lock, what happens next is non-deterministic. Any of the scenarios you outlined above are possible.

If your application requires a very specific order among threads, then you might want to try having the threads communicate more explicitly among themselves. In C, you can do this with a pipe().

Generally though, the performance is best if you embrace the chaos and let the scheduler choose.
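If a fixed order really is required, the explicit hand-off could look something like the sketch below. It uses a condition variable and a turn counter rather than the pipe() mentioned above; the names `run_in_turns` and `turn` are made up for illustration.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Runs `n` threads and forces them through the critical section in index
// order (0, 1, 2, ...). Returns the order in which they actually entered.
std::vector<int> run_in_turns(int n) {
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;            // hypothetical "whose turn is it" counter, guarded by m
    std::vector<int> order;  // also guarded by m

    std::vector<std::thread> workers;
    for (int id = 0; id < n; ++id) {
        workers.emplace_back([&, id] {
            std::unique_lock<std::mutex> lock(m);
            // Sleep until it is this thread's turn; the predicate form
            // also copes with spurious wakeups.
            cv.wait(lock, [&] { return turn == id; });
            order.push_back(id);
            ++turn;
            // Wake every waiter, since we do not know which one is next.
            cv.notify_all();
        });
    }
    for (auto& t : workers) t.join();
    return order;
}
```

Calling `run_in_turns(4)` yields the ids in ascending order regardless of which thread the scheduler happens to run first. The cost is that every hand-off wakes all waiters, which is part of why leaving the order to the scheduler tends to be faster.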

Release and Acquire with std::mutex

Per 30.4.1.2p11,

Synchronization: Prior unlock() operations on the same object shall synchronize with (1.10) [m.lock()].

Under 1.10p5,

[...] For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. [...]
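A small sketch of what that guarantee buys in practice: plain, non-atomic data written before unlock() is visible to whichever thread locks the same mutex afterwards. The function and variable names here are assumptions for illustration only.

```cpp
#include <mutex>
#include <thread>

// Returns the payload as seen by the reader thread.
int publish_under_mutex() {
    std::mutex m;
    int payload = 0;     // plain, non-atomic data guarded by m
    bool ready = false;  // also guarded by m
    int seen = -1;

    std::thread writer([&] {
        std::lock_guard<std::mutex> lock(m);
        payload = 42;
        ready = true;
        // lock_guard's destructor calls unlock(): the release operation.
    });

    std::thread reader([&] {
        for (;;) {
            {
                std::lock_guard<std::mutex> lock(m);  // lock(): the acquire operation
                if (ready) {
                    // This lock() came after the writer's unlock(), so the
                    // write to payload is guaranteed to be visible here.
                    seen = payload;
                    return;
                }
            }
            std::this_thread::yield();  // give the writer a chance to run
        }
    });

    writer.join();
    reader.join();
    return seen;
}
```

The reader can only observe `ready == true` after locking the mutex that the writer has already unlocked, so it reads 42 and never a stale value.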

Do mutexes guarantee ordering of acquisition? Unlocking thread takes it again while others are still waiting

Known problem. C++ mutexes are a thin layer on top of OS-provided mutexes, and OS-provided mutexes are often not fair. They make no attempt to hand the lock over in FIFO order.

The other side of the same coin is that threads are usually not pre-empted until they run out of their time slice. As a result, thread A in this scenario was likely to continue to be executed, and got the mutex right away because of that.
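You can watch this happen with a rough experiment like the one below, where two threads hammer one mutex and record who got it. This is only a sketch with invented names, and the interleaving it produces is unspecified and will differ between runs, platforms and standard libraries.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Two threads each take the same mutex `rounds` times and log their id.
// Returns the acquisition log; its interleaving is NOT specified.
std::vector<int> acquisition_log(int rounds) {
    std::mutex m;
    std::vector<int> log;  // guarded by m

    auto worker = [&](int id) {
        for (int i = 0; i < rounds; ++i) {
            std::lock_guard<std::mutex> lock(m);
            log.push_back(id);
        }  // unlock at the end of each iteration, then try to lock again at once
    };

    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return log;
}
```

The only things you can rely on are that each id appears exactly `rounds` times and that no entry is lost. On many systems the log tends to contain long runs of the same id rather than a strict alternation, which is the unfairness described above.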

Release/Acquire semantics wrt std::mutex

Now can it be said that std::mutex::lock will have acquire semantics and that std::mutex::unlock essentially has release semantics?

Yes, this is correct.

From what I understand synchronize with is not explicitly defined in the standard

Well, in theory Paragraph 1.10/8 is probably meant to give the definition of synchronizes with:

Certain library calls synchronize with other library calls performed by another thread. For example, an atomic store-release synchronizes with a load-acquire that takes its value from the store (29.3). [Note: ...]

On the other hand, this does not sound like a very formal definition. However, a better, though implicit one is indirectly given in Paragraph 1.10/10:

An evaluation A is dependency-ordered before an evaluation B if

— A performs a release operation on an atomic object M, and, in another thread, B performs a consume
operation on M and reads a value written by any side effect in the release sequence headed by A, or

— for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.

[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. —end note ]

Since the "is analogous to" relationship is most often symmetric, I would say that the above definition of "is-dependency-ordered before" indirectly provides a definition of "synchronizes with" as well - although you might correctly object that notes are non-normative; still, this seems to be the intended definition.

My intuition of the synchronizes with relationship is that it occurs between a write (atomic) operation performed by one thread that stores a certain value and the first (atomic) operation that reads that value. That operation might as well be in the same thread.

If the two operations are on different threads, then the synchronizes-with relation establishes a cross-thread ordering on operations.

In the Standard I can find this under section

30.4.1.2 Mutex types [thread.mutex.requirements.mutex]

11 Synchronization: Prior unlock() operations on the same object shall synchronize with (1.10) this operation.

To me, this seems compatible with the interpretation given above. An operation with release semantics (unlock, store) will synchronize with an operation of acquire semantics (lock, load).

however, from my understanding of acquire/release semantics, this has more to do with memory reordering. synchronize with could also be called release/acquire semantics?

Release and acquire semantics describe the nature of some operations; the synchronizes-with relationship is (indeed) a relationship which is established between operations that have acquire or release semantics, in a well-defined way.

So in a sense, synchronizes-with is a consequence of the semantics of those operations, and we use those semantics to achieve the correct ordering of instructions and constrain the possible reordering that the CPU or the compiler will perform.

Understanding `memory_order_acquire` and `memory_order_release` in C++11

Acquire and Release are Memory Barriers.
If your program reads data after an acquire barrier you are assured you will be reading data consistent in order with any preceding release by any other thread in respect of the same atomic variable. Atomic variables are guaranteed to have an absolute order (when using memory_order_acquire and memory_order_release though weaker operations are provided for) to their reads and writes across all threads. These barriers in effect propagate that order to any threads using that atomic variable.
You can use atomics to indicate that something has 'finished' or is 'ready', but without these barriers a consumer that reads beyond that atomic variable can't rely on 'seeing' the right 'versions' of other memory, and atomics would have limited value.

The statements about 'moving before' or 'moving after' are instructions to the optimizer that it shouldn't re-order operations to take place out of order. Optimizers are very good at re-ordering instructions and even omitting redundant reads/writes but if they re-organise the code across the memory barriers they may unwittingly violate that order.

Your code relies on the std::string object (a) having been constructed in producer() before ptr is assigned and (b) the constructed version of that string (i.e. the version of the memory it occupies) being the one that consumer() reads.
Put simply consumer() is going to eagerly read the string as soon as it sees ptr assigned so it damn well better see a valid and fully constructed object or bad times will ensue.
In that code 'the act' of assigning ptr is how producer() 'tells' consumer the string is 'ready'. The memory barrier exists to make sure that's what the consumer sees.
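The code under discussion is not reproduced here, so the following is only a reconstruction from the description above: a `producer()` that builds a `std::string`, sets an int `data`, and publishes the pointer through an atomic `ptr`, and a `consumer()` that waits for it. The `Shared` struct and `run_producer_consumer()` wrapper are my own additions to keep the sketch self-contained.

```cpp
#include <atomic>
#include <string>
#include <thread>

struct Shared {
    std::atomic<std::string*> ptr{nullptr};
    int data = 0;  // plain int, published via ptr
};

void producer(Shared& s) {
    std::string* p = new std::string("Hello");  // construct the object first
    s.data = 42;
    // Release store: everything written above becomes visible to a thread
    // that acquire-loads this pointer value.
    s.ptr.store(p, std::memory_order_release);
}

bool consumer(Shared& s) {
    std::string* p2;
    // Acquire load: spin until the producer's store is observed.
    while (!(p2 = s.ptr.load(std::memory_order_acquire))) {
        std::this_thread::yield();
    }
    // Both checks are guaranteed to pass once a non-null pointer was seen.
    bool ok = (*p2 == "Hello") && (s.data == 42);
    delete p2;
    return ok;
}

bool run_producer_consumer() {
    Shared s;
    bool ok = false;
    std::thread t1([&] { producer(s); });
    std::thread t2([&] { ok = consumer(s); });
    t1.join();
    t2.join();
    return ok;
}
```

Here the release store on `ptr` is the 'ready' signal, and the acquire load is what entitles the consumer to read the string and `data` without any further synchronization.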

Conversely if ptr was declared as an ordinary std::string * then the compiler could decide to optimize p away and assign the allocated address directly to ptr and only then construct the object and assign the int data. That is likely a disaster for the consumer thread which is using that assignment as the indicator that the objects producer is preparing are ready.
To be accurate, if ptr were an ordinary pointer, the consumer might never see the value assigned or, on some architectures, might read a partially assigned value in which only some of the bytes have been written, so that it points to a garbage memory location. However, those aspects concern atomicity, not the wider memory barriers.

What are the exact inter-thread reordering constraints on mutex.lock() and .unlock() in c++11 and up?

Almost a duplicate: How C++ Standard prevents deadlock in spinlock mutex with memory_order_acquire and memory_order_release? - that's using hand-rolled std::atomic spinlocks, but the same reasoning applies:

The compiler can't compile-time reorder mutex acquire and release in ways that could introduce a deadlock where the C++ abstract machine doesn't have one. That would violate the as-if rule.

It would effectively be introducing an infinite loop in a place the source doesn't have one, violating this rule:

ISO C++ current draft, section 6.9.2.3 Forward progress
18. An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

The ISO C++ standard doesn't distinguish compile-time vs. run-time reordering. In fact it doesn't say anything about reordering. It only says things about when you're guaranteed to see something because of synchronizes-with effects, and the existence of a modification order for each atomic object, and the total order of seq_cst operations. It's a misreading of the standard to take it as permission to nail things down into asm in a way that requires mutexes to be taken in a different order than source order.

Taking a mutex is essentially equivalent to an atomic RMW with memory_order_acquire on the mutex object. (And in fact the ISO C++ standard even groups them together in 6.9.2.3 :: 18 quoted above.)
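To make that equivalence concrete, here is a hand-rolled spinlock along those lines. It is a hypothetical minimal sketch (the class `SpinLock` and helper `count_with_spinlock` are invented for illustration), not how any particular std::mutex is implemented.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock, for illustration only.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Atomic read-modify-write with acquire ordering: loop until we are
        // the thread that flips the flag from clear to set.
        while (flag.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release store: publishes the critical section's writes to the
        // next thread whose test_and_set observes the cleared flag.
        flag.clear(std::memory_order_release);
    }
};

long count_with_spinlock(int threads, int increments) {
    SpinLock lock;
    long counter = 0;  // plain variable, protected only by the spinlock
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

With 4 threads and 10000 increments each, the result is 40000: the acquire RMW in `lock()` pairs with the release store in `unlock()`, so no increment is lost.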

You're allowed to see an earlier release or relaxed store or even RMW appear inside a mutex lock/unlock critical section instead of before it. But the standard requires an atomic store (or sync operation) to be visible to other threads promptly, so compile-time reordering to force it to wait until after a lock had been acquired could violate that promptness guarantee. So even a relaxed store can't compile-time / source-level reorder with a mutex.lock(), only as a run-time effect.

This same reasoning applies to mutex2.lock(). You're allowed to see reordering, but the compiler can't create a situation where the code requires that reordering to always happen, if that makes execution different from the C++ abstract machine in any important / long-term observable ways. (e.g. reordering around an unbounded wait). Creating a deadlock counts as one of those ways, whether for this reason or another. (Every sane compiler developer would agree on that, even if C++ didn't have formal language to forbid it.)

Note that mutex unlock can't block, so compile-time reordering of two unlocks isn't forbidden for that reason. (If there are no slow or potentially blocking operations in between). But mutex unlock is a "release" operation, so that's ruled out: two release stores can't reorder with each other.

And BTW, the practical mechanism for preventing compile-time reordering of mutex.lock() operations is just to make them regular function calls that the compiler doesn't know how to inline. It has to assume that functions aren't "pure", i.e. that they have side effects on global state, and thus the order might be important. That's the same mechanism that keeps operations inside the critical section: How does a mutex lock and unlock functions prevents CPU reordering?

An inlinable std::mutex written with std::atomic would end up depending on the compiler actually applying the rules about making operations visible promptly and not introducing deadlocks by reordering things at compile-time. As described in How C++ Standard prevents deadlock in spinlock mutex with memory_order_acquire and memory_order_release?

Does order of unlocking mutexes make a difference here?

I cannot for the life of me figure out how a deadlock could result from this though if the locks are always obtained in the same order wherever both are used.

In these circumstances, I don't think the order of unlocking the mutexes could be the cause of a deadlock.

Since pthread_mutex_unlock() doesn't block, both mutexes would always get unlocked regardless of the order of the two calls.

Note that if you attempt to acquire any locks between the two unlock calls, this can change the picture completely.
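As a rough illustration, here is a sketch using std::mutex rather than the pthread calls in the question; the function name and the counter are assumptions. Both threads lock in the same order but release in opposite orders, and nothing is acquired between the two unlock calls.

```cpp
#include <mutex>
#include <thread>

// Both threads lock m1 then m2; they release in opposite orders.
int count_with_two_mutexes(int rounds) {
    std::mutex m1, m2;
    int counter = 0;  // only touched while holding both mutexes

    std::thread a([&] {
        for (int i = 0; i < rounds; ++i) {
            m1.lock();
            m2.lock();
            ++counter;
            m2.unlock();  // reverse of acquisition order
            m1.unlock();
        }
    });
    std::thread b([&] {
        for (int i = 0; i < rounds; ++i) {
            m1.lock();
            m2.lock();
            ++counter;
            m1.unlock();  // same order as acquisition
            m2.unlock();
        }
    });
    a.join();
    b.join();
    return counter;
}
```

Because every thread takes m1 before m2, a thread waiting on m2 is always waiting for someone who needs no further lock to finish, so the function returns `2 * rounds` whichever way round the unlocks go.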

C++11 memory_order_acquire and memory_order_release semantics?

The spinlock mutex implementation looks okay to me. I think they got the definitions of acquire and release completely wrong.

Here is the clearest explanation of acquire/release consistency models that I am aware of: Gharachorloo; Lenoski; Laudon; Gibbons; Gupta; Hennessy: Memory consistency and event ordering in scalable shared-memory multiprocessors, Int'l Symp Comp Arch, ISCA(17):15-26, 1990, doi 10.1145/325096.325102. (The doi is behind the ACM paywall. The actual link is to a copy not behind a paywall.)

Look at Condition 3.1 in Section 3.3 and the accompanying Figure 3:

before an ordinary load or store access is allowed
to perform with respect to any other processor,
all previous acquire accesses must be performed, and
before a release access is allowed to perform with
respect to any other processor, all previous ordinary
load and store accesses must be performed, and
special accesses are [sequentially] consistent with respect
to one another.

The point is this: acquires and releases are sequentially consistent¹ (all threads globally agree on the order in which acquires and releases happened.) All threads globally agree that the stuff that happens between an acquire and a release on a specific thread happened between the acquire and release. But normal loads and stores after a release are allowed to be moved (either by hardware or the compiler) above the release, and normal loads and stores before an acquire are allowed to be moved (either by hardware or the compiler) to after the acquire.

(Footnote 1: This is true for most implementations, but an overstatement for ISO C++ in general. Reader threads are allowed to disagree about the order of 2 stores done by 2 other threads. See Acquire/release semantics with 4 threads, and this answer for details of how C++ compiled for POWER CPUs demonstrates the difference in practice with release and acquire, but not seq_cst. But most CPUs do only get data between cores via coherent cache that means a global order does exist.)

In the C++ standard (I used the link to the Jan 2012 draft) the relevant section is 1.10 (pages 11 through 14).

The definition of happens-before is intended to be modeled after Lamport; Time, Clocks, and the Ordering of Events in a Distributed System, CACM, 21(7):558-565, Jul 1978. C++ acquires correspond to Lamport's receives, C++ releases correspond to Lamport's sends. Lamport placed a total order on the sequence of events within a single thread, where C++ has to allow a partial order (see Section 1.9, Paragraphs 13-15, page 10 for the C++ definition of sequenced-before.) Still, the sequenced-before ordering is pretty much what you would expect. Statements are sequenced in the order they are given in the program. Section 1.9, paragraph 14: "Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated."

The whole point of Section 1.10 is to say that a program that is data-race-free produces the same well defined value as if the program were run on a machine with a sequentially consistent memory and no compiler reordering. If there is a data race then the program has no defined semantics at all. If there is no data race then the compiler (or machine) is permitted to reorder operations that don't contribute to the illusion of sequential consistency.

Section 1.10, Paragraph 21 (page 14) says: A program is not data-race-free if there is a pair of accesses A and B from different threads to object X, at least one of those accesses has a side effect, and neither A happens-before B, nor B happens-before A. Otherwise the program is data-race-free.

Paragraphs 6-20 give a very careful definition of the happens-before relation. The key definition is Paragraph 12:

"An evaluation A happens before an evaluation B if:

A is sequenced before B, or
A inter-thread happens before B."

So if an acquire is sequenced before (in the same thread) pretty much any other statement, then the acquire must appear to happen before that statement. (Including if that statement performs a write.)

Likewise: if pretty much any statement is sequenced before (in the same thread) a release, then that statement must appear to happen before the release. (Including if that statement just does a value computation (read).)

The reason that the compiler is allowed to move other computations from after a release to before a release (or from before an acquire to after an acquire) is because of the fact that those operations specifically do not have an inter-thread happens before relationship (because they are outside the critical section). If they race the semantics are undefined, and if they don't race (because they aren't shared) then you can't tell exactly when they happened with regard to the synchronization.

Which is a very long way of saying: cppreference.com's definitions of acquire and release are dead wrong. Your example program has no data race, and PANIC cannot occur.

Does an atomic acquire synchronize with mutex lock release?

will the atomic acquire synchronize with the mutex unlock which is a "release" operation?

No, in order for an acquire operation to synchronize-with a release operation, the acquire operation has to observe the changes of the release operation (or some change in a potential release sequence headed by that operation).

So yes, you need the atomic store inside the lock. There is no guarantee that get will "see" the latest value from put since you only use acquire/release, so there is no total order between the store and load operations. If you want that guarantee you have to use memory_order_seq_cst.

As a side-note - this implementation is most likely not lock-free, because in most library implementations atomic_load_explicit for shared_ptr is not lock-free. The problem is that you have to load the pointer and dereference that pointer to increment the ref-counter, in one atomic operation. This is not possible on most architectures, so atomic_load_explicit is usually implemented using a lock.
