C++11 Implementation of Spinlock Using <atomic>

C++11 Implementation of Spinlock using the `<atomic>` header

The problem is that compare_exchange_weak updates the unlatched variable when it fails. From the documentation of compare_exchange_weak:

Compares the contents of the atomic object's contained value with
expected:
- if true, it replaces the contained value with val (like store).
- if false, it replaces expected with the contained value.

I.e., after the first failing compare_exchange_weak, unlatched will have been updated to true, so the next loop iteration will try to compare_exchange_weak true with true. This succeeds, and you have just taken a lock that is still held by another thread.

Solution:
Make sure to set unlatched back to false before each compare_exchange_weak, e.g.:

while (!latch.compare_exchange_weak(unlatched, true, std::memory_order_acquire)) {
    unlatched = false;
}
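
Putting that together, a minimal sketch of such a spinlock (the class and member names here are illustrative, not taken from the original question) could look like this:

#include <atomic>

class spinlock
{
    std::atomic<bool> latch{false};   // false = unlocked

public:
    void lock()
    {
        bool unlatched = false;
        // compare_exchange_weak writes the observed value back into
        // 'unlatched' on failure, so it must be reset to false each time.
        while (!latch.compare_exchange_weak(unlatched, true,
                                            std::memory_order_acquire))
        {
            unlatched = false;
        }
    }

    void unlock()
    {
        latch.store(false, std::memory_order_release);
    }
};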

How do I implement a simple spinlock on std::atomic<T> so that the compiler doesn't optimize it out?

The compiler is not allowed to optimize out this_thread::yield() unless it can determine that it's free of side-effects -- which it can't, because it's not.

The atomic load operation can't be optimized away, because it includes a memory barrier and is specifically defined to pick up modifications made by other threads.
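
As an illustration (a sketch of the kind of loop being discussed, not the exact code from the question), a spin loop built on an atomic read-modify-write plus this_thread::yield() contains nothing the compiler is allowed to remove:

#include <atomic>
#include <thread>

std::atomic<bool> lock_flag{false};

void lock()
{
    // The atomic exchange is defined to observe stores made by other
    // threads, and yield() is a library call with side effects, so the
    // compiler may not hoist or delete either of them.
    while (lock_flag.exchange(true, std::memory_order_acquire))
        std::this_thread::yield();
}

void unlock()
{
    lock_flag.store(false, std::memory_order_release);
}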

Basic spin-lock mutex implementation ordering

Could someone show me, using the C++ standard notions, that the code is perfectly safe?

I initially had the same concerns as you. I think the key is understanding that operations on the std::atomic_flag variable are atomic with respect to all processors/cores. Two atomic 'test and set' operations in separate threads cannot simultaneously succeed, regardless of the memory ordering specified, since they then could not be atomic; the operation must apply to the actual variable, not a cached local copy (which is, I think, not even a concept in C++).

A full chain of reasoning:

29.7 p5 (talking about the test-and-set operation):

Effects: Atomically sets the value pointed to by object or by this to true. Memory is affected according to the value of order. These operations are atomic read-modify-write operations (1.10).
Returns: Atomically, the value of the object immediately before the effects.

1.10 p6:

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M ...

So if, as in this case, two threads attempt to lock the spinlock at the same time, one of them must be first and the other second. We now just need to show that the second one will by necessity return that the flag was already set, thus preventing that thread from entering the critical section.

Paragraph 6 goes on to say:

... If A and B are modifications of an atomic object M and A happens before (as defined below) B, then A shall precede B in the modification order of M , which is defined below. [ Note: This states that the modification orders must respect the “happens before” relationship. — end note ]

There is no "happens before" relationship between the two test-and-set operations happening in the two threads, so we cannot determine which comes first in the modification order; however, due to the first sentence in p6 (which states that there is a total ordering), one must definitely come before the other. Now, from 29.3 p12:

Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.

This shows that the test-and-set ordered second must see the value written by the test-and-set ordered first. Any acquire/release choices do not affect this.

Therefore, if two test-and-set operations are performed "simultaneously", they will be given an arbitrary order, and the second shall see the flag value that was set by the first. As such the memory order constraints specified for the test-and-set operation do not matter; they are used to control ordering of writes to other variables during the period where the spinlock is acquired.

Response to "Update 2" of the question:

So according to that clause, for an RMW operation to have the latest written value, the latest write operation should happen before the reading part of the RMW operation. Which is not the case in the question. Right?

Correct that there is no "happens before" relationship, but incorrect that an RMW operation needs such a relationship in order to be guaranteed the latest written value. The statement you list as "[atomics.order] clause 11" does not require a "happens before" relationship, just that one operation is before the other in the "modification order" for the atomic flag. Clause 8 states that there will be such an order, and that it will be a total ordering:

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M ...

... it then goes on to say that the total ordering must be consistent with any "happens before" relationships:

... If A and B are modifications of an atomic object M and A happens before (as defined below) B, then A shall precede B in the modification order of M, which is defined below.

However, in the absence of a "happens before" relationship, there is still a total ordering - it's just that this ordering has a degree of arbitrariness. That is, if there is no "happens before" relationship between A and B, then it is not specified whether A is ordered before or after B. But it must be one or the other, because there is a particular total order.

Why is memory_order_acquire needed, then?

A mutex such as a spinlock is often used to protect other, non-atomic variables and data structures. Using memory_order_acquire when locking the spinlock assures that a read from such variables will see the correct values (i.e. the values written by any other thread that previously held the spinlock). For the unlock, memory_order_release is also needed in order to allow other threads to see the written values.

The acquire/release orderings both prevent the compiler from reordering reads/writes past the acquisition/release of the lock, and ensure that any instructions needed to provide the appropriate level of cache coherency are generated.
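
As a concrete illustration (a sketch, not the questioner's code), a test_and_set-based spinlock uses acquire on lock and release on unlock so that the non-atomic data it guards is seen consistently by whichever thread holds the lock next:

#include <atomic>

std::atomic_flag flag = ATOMIC_FLAG_INIT;
int shared_counter = 0;              // non-atomic data protected by the lock

void increment()
{
    while (flag.test_and_set(std::memory_order_acquire))
        ;                            // spin until the flag was previously clear

    ++shared_counter;                // this write is ordered before the release

    flag.clear(std::memory_order_release);
}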

Further evidence:

First, this note from 29.3:

Note: Atomic operations specifying memory_order_relaxed are relaxed with respect to memory ordering. Implementations must still guarantee that any given atomic access to a particular atomic object be indivisible with respect to all other atomic accesses to that object. — end note

This is essentially saying that the memory ordering specified does not affect the atomic operation itself. The access must "be indivisible with respect to all other atomic accesses" including those from other threads. To allow two test-and-set operations to read the same value would effectively be dividing at least one of them, so that it was no longer atomic.

Also, from 1.10 paragraph 5:

In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics.

(A test-and-set falls into this latter category) and especially:

“Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.

(emphasis mine). A case where two threads both simultaneously executed an atomic test-and-set (and both performed the 'set' part) would be such a data race, so this text again indicates that this cannot happen.

1.10 p8:

Note: The specifications of the synchronization operations define when one reads the value written by another. For atomic objects, the definition is clear.

It means one thread reads the value written by another. It says that for atomic objects the definition is clear, meaning that no other synchronisation is necessary - it is enough to perform the operation on the atomic object; the effect will be seen immediately by other threads.

In particular, 1.10 p19:

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]

Note the mention of cache coherence even in the presence of relaxed loads. This clearly shows that the test-and-set can only succeed in one thread at a time: for both to succeed, either cache coherence would have to be broken or the operation would not be atomic.

C++11 Recursive Spinlock Implementation

As pointed out in the comments, compare_exchange_weak replaces the contents of the first argument when the comparison fails, so lock_is_free gets overwritten.

Further, this is not valid code, because compare_exchange_weak compares the value of the object stored in the atomic variable bitwise, as if the two were compared using std::memcmp. An object of type std::thread::id is not an integral type and has a dedicated overload for comparison. The same goes for pthread_t (compared with pthread_equal), so essentially you are relying on implementation-defined behavior.

You can confirm that by running the following code:

#include <type_traits>
#include <iostream>
#include <thread>
#include <pthread.h>
using std::cout;
using std::endl;

int main() {
    cout << std::boolalpha << std::is_integral<std::thread::id>::value << endl;
    cout << std::boolalpha << std::is_integral<decltype(pthread_self())>::value
         << endl;

    return 0;
}

Why not just overload memcpy and memcmp to work with std::thread::id in that case? Both cppreference and cplusplus.com confirm that there is no guarantee that compare_exchange_weak actually uses them. Even overloading memcpy (both globally and in the std namespace) doesn't work on my system with my standard library. I suspect it will not on most other systems (standard library and compiler combinations) either, for two reasons:

  1. compare_exchange_weak might not use memcpy or memcmp under the hood and might roll its own implementation
  2. Even though such a cast is not required, the implementation might cast the pointer to void* before passing it to memcpy or memcmp, which would pick the default overload anyway rather than the correct one

See the code below for an illustration of what I mean in the second point:

#include <iostream>
using std::cout;
using std::endl;

void func(void*) {
    cout << "void*" << endl;
}
void func(int*) {
    cout << "int*" << endl;
}

int main() {
    int a = 1;
    func(&a);                       // overload resolution picks the int* version
    func(static_cast<void*>(&a));   // the explicit cast forces the void* version

    return 0;
}

To answer your question, using compare_exchange_weak with std::thread::id isn't correct ("not portable" might be better) code.
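
If a recursive spinlock is really needed, one portable sketch (my own illustration, not the code from the question) avoids compare-exchange on std::thread::id altogether: the owner is stored in a std::atomic<std::thread::id> and only ever compared with operator==, while the actual mutual exclusion is done on a std::atomic_flag:

#include <atomic>
#include <thread>

class recursive_spinlock
{
    std::atomic_flag latch = ATOMIC_FLAG_INIT;
    std::atomic<std::thread::id> owner{std::thread::id{}};  // default id = "no owner"
    unsigned depth = 0;                                      // only touched by the owner

public:
    void lock()
    {
        const std::thread::id me = std::this_thread::get_id();
        // Only the owning thread ever stores its own id, so reading our own
        // id back (even relaxed) proves we already hold the lock.
        if (owner.load(std::memory_order_relaxed) == me) { ++depth; return; }

        while (latch.test_and_set(std::memory_order_acquire))
            ;                                                // spin
        owner.store(me, std::memory_order_relaxed);
        depth = 1;
    }

    void unlock()
    {
        if (--depth == 0)
        {
            owner.store(std::thread::id{}, std::memory_order_relaxed);
            latch.clear(std::memory_order_release);
        }
    }
};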

Writing a (spinning) thread barrier using C++11 atomics

It looks needlessly complicated. Try this simpler version (well, I haven't tested it, I just meditated on it:))) :

#include <atomic>

class spinning_barrier
{
public:
    spinning_barrier (unsigned int n) : n_ (n), nwait_ (0), step_ (0) {}

    bool wait ()
    {
        unsigned int step = step_.load ();

        if (nwait_.fetch_add (1) == n_ - 1)
        {
            /* OK, last thread to come. */
            nwait_.store (0); // XXX: maybe can use relaxed ordering here ??
            step_.fetch_add (1);
            return true;
        }
        else
        {
            /* Run in circles and scream like a little girl. */
            while (step_.load () == step)
                ;
            return false;
        }
    }

protected:
    /* Number of synchronized threads. */
    const unsigned int n_;

    /* Number of threads currently spinning. */
    std::atomic<unsigned int> nwait_;

    /* Number of barrier synchronizations completed so far,
     * it's OK to wrap. */
    std::atomic<unsigned int> step_;
};
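
A possible usage sketch (the thread count and round count are arbitrary; only spinning_barrier comes from the code above): N threads call wait() repeatedly, and exactly one of them gets true back per round.

#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    const unsigned int nthreads = 4;
    spinning_barrier barrier(nthreads);

    std::vector<std::thread> threads;
    for (unsigned int i = 0; i < nthreads; ++i)
        threads.emplace_back([&barrier, i] {
            for (int round = 0; round < 3; ++round)
            {
                // wait() returns true in exactly one thread per round.
                if (barrier.wait())
                    std::printf("round %d finished (reported by thread %u)\n", round, i);
            }
        });

    for (auto& t : threads)
        t.join();
}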

EDIT:
@Grizzy, I can't find any errors in your first (C++11) version and I've also run it for like a hundred million syncs with two threads and it completes. I've run it on a dual-socket/quad-core GNU/Linux machine though, so I'm rather inclined to suspect your option 3 - the library (or rather, its port to win32) is not mature enough.

Data race in simple spinlock using atomic<bool>

You need to read the documentation for compare_exchange_weak. The problem is that you need to re-set temp to false, since true is written to temp by the compare-exchange operation when it fails. See, e.g., the example (mis)use of compare-exchange-weak.

In that link, also read what is said about using the pause instruction to limit speculative execution of the spin loop.
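
A sketch of the corrected loop (illustrative; the _mm_pause intrinsic from <immintrin.h> assumes an x86/x86-64 target) resets the expected value on every retry and pauses between attempts:

#include <atomic>
#include <immintrin.h>   // _mm_pause, x86/x86-64 only

std::atomic<bool> locked{false};

void lock()
{
    bool expected = false;
    while (!locked.compare_exchange_weak(expected, true,
                                         std::memory_order_acquire))
    {
        expected = false;   // compare_exchange_weak overwrote it with true
        _mm_pause();        // tell the CPU this is a spin-wait loop
    }
}

void unlock()
{
    locked.store(false, std::memory_order_release);
}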

Implementing spin-lock without XCHG?

Is it possible to implement spin locking without XCHG?

Yes. For 80x86, you can use lock bts, lock cmpxchg, lock xadd, or ...

What is the fastest possible spin lock?

Possible interpretations of "fast" include:

a) Fast in the uncontended case. In this case it's not going to matter very much what you do because most of the possible operations (exchanging, adding, testing...) are cheap and the real costs are cache coherency (getting the cache line containing the lock into the "exclusive" state in the current CPU's cache, possibly including fetching it from RAM or other CPUs' caches) and serialization.

b) Fast in the contended case. In this case you really need a "test without lock; then test & set with lock" approach. The main problem with a simple spinloop (for the contended case) is that when multiple CPUs are spinning the cache line will be rapidly bouncing from one CPU's cache to the next and consuming a huge amount of inter-connect bandwidth for nothing. To prevent this, you'd have a loop that tests the lock state without modifying it so that the cache line can remain in all CPUs caches as "shared" at the same time while those CPUs are spinning.

But note that testing read-only to start with can hurt the uncontended case, resulting in more coherency traffic: first a share-request for the cache line, which will only get you MESI S state if another core has recently unlocked, and then an RFO (Read For Ownership) when you do try to take the lock. So best practice is probably to start with an RMW, and if that fails then spin read-only with pause until you see the lock available, unless profiling your code on the system you care about shows a different choice is better. (A C++ sketch of this test-and-test-and-set pattern is given after these notes.)

c) Fast to exit the spinloop (after contention) when the lock is acquired. In this case the CPU can speculatively execute many iterations of the loop, and when the lock does become acquired the CPU has to drain all of those speculatively executed iterations, which costs a little time. To prevent that you want a pause instruction to stop many iterations of the loop/s from being speculatively executed.

d) Fast for other CPUs that don't touch the lock. For some cases (hyper-threading) the core's resources are shared between logical processors; and when one logical processor is spinning it consumes resources that the other logical processor could have used to get useful work done (especially for the "spinlock speculatively executes many iterations of the loop" situation). To minimize this you need a pause in the spinloop/s (so that the spinning logical processor doesn't consume as much of the core's resources and the other logical processor in the core can get more useful work done).

e) Minimum "worst case time to acquire". With a simple lock, under contention, some CPUs or threads can be lucky and always get the lock while other CPUs/threads are very unlucky and take ages to get the lock; the "worst case time to acquire" is theoretically infinite (a CPU can spin forever). To fix that you need a fair lock - something to ensure that only the thread that has been waiting/spinning for the longest amount of time is able to acquire the lock when it is released (a ticket lock, sketched after these notes, is one simple design). Note that it's also possible to design a fair lock such that each thread spins on a different cache line, which is a different way to solve the "cache line bouncing between CPUs" problem I mentioned in "b) Fast in the contended case".

f) Minimum "worst case until lock released". This has to involve the length of the worst critical section; but in some situations may also include the cost of any number of IRQs, the cost of any number of task switches, and the time the code isn't using any CPU. It's entirely possible to have a situation where a thread acquires the lock and then the scheduler does a thread switch; then many CPUs all spin (wasting a huge amount of time) on a lock that cannot be released (because the lock holder is the only one that can release the lock and it isn't even using any CPU). The way to fix/improve this is to disable the scheduler and IRQs; which is fine in kernel code, but "likely impossible for security reasons" in normal user-space code. This is also the reason why spinlocks should probably never be used in user-space (and why user-space should probably use a mutex where the thread is put in a "blocked waiting for lock" state and not given CPU time by the scheduler until/unless the thread actually can acquire the lock).

Note that making it fast for one possible interpretation of "fast" can make it slower/worse for other interpretations of "fast". For example; the speed of the uncontended case is made worse by everything else.
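
To make (b) concrete, here is a minimal C++ sketch of the test-and-test-and-set policy described above: try an atomic RMW first, and on failure spin on plain loads with a pause until the lock looks free. The class name is mine, and the _mm_pause intrinsic from <immintrin.h> assumes an x86/x86-64 target.

#include <atomic>
#include <immintrin.h>   // _mm_pause, x86/x86-64 only

class ttas_spinlock
{
    std::atomic<bool> locked{false};

public:
    void lock()
    {
        for (;;)
        {
            // Optimistic RMW first: the cheapest path when the lock is free.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;

            // Contended: spin on plain loads so the cache line can stay in
            // the Shared state in every waiter's cache.
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
        }
    }

    void unlock()
    {
        locked.store(false, std::memory_order_release);
    }
};

And for (e), a ticket lock is one simple way to get FIFO fairness: each thread takes a ticket and waits until the "now serving" counter reaches it. This is only an illustrative sketch; all waiters still spin on the same counter, so it does not by itself address the cache-line bouncing (queue locks such as MCS, where each waiter spins on its own line, do that).

#include <atomic>

class ticket_lock
{
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};

public:
    void lock()
    {
        // Each caller draws a unique, monotonically increasing ticket.
        const unsigned ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        // Wait until it is our turn; FIFO order bounds the waiting time.
        while (now_serving.load(std::memory_order_acquire) != ticket)
            ;
    }

    void unlock()
    {
        // Only the lock holder writes now_serving, so a load/store pair is fine.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};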

Example Spinlock

This example is untested, and written in (NASM syntax) assembly.

;Input
; ebx = address of lock

;Initial optimism in the hope the lock isn't contended
spinlock_acquire:
    lock bts dword [ebx],0    ;Set the lowest bit and get its previous value in carry flag
    ;Did we actually acquire it, i.e. was it previously 0 = unlocked?
    jnc .acquired             ; Yes, done!

;Waiting (without modifying) to avoid "cache line bouncing"

.spin:
    pause                     ;Reduce resource consumption
                              ; and avoid memory order mis-speculation when the lock becomes available.
    test dword [ebx],1        ;Has the lock been released?
    jne .spin                 ; no, wait until it is released

;Try to acquire again

    lock bts dword [ebx],0    ;Set the lowest bit and get its previous value in carry flag
    ;Did we actually acquire it?
    jc .spin                  ; No, go back to waiting

.acquired:

Spin-unlock can be just mov dword [ebx], 0, not lock btr: you know you own the lock, and a plain store already has release semantics on x86. You could read the lock word first to catch double-unlock bugs.

Notes:

a) lock bts is a little slower than other possibilities; but it doesn't interfere with or depend on the other 31 bits (or 63 bits) of the lock, which means that those other bits can be used for detecting programming mistakes (e.g. store 31 bits of "thread ID that currently holds lock" in them when the lock is acquired and check them when the lock is released to auto-detect "Wrong thread releasing lock" and "Lock being released when it was never acquired" bugs) and/or used for gathering performance information (e.g. set bit 1 when there's contention so that other code can scan periodically to determine which locks are rarely contended and which locks are heavily contended). Bugs with the use of locks are often extremely insidious and hard to find (unpredictable and unreproducible "Heisenbugs" that disappear as soon as you try to find them); so I have a preference for "slower with automatic bug detection".

b) This is not a fair lock, which means it's not well suited to situations where contention is likely.

c) For memory; there's a compromise between memory consumption/cache misses, and false sharing. For rarely contended locks I like to put the lock in the same cache line/s as the data the lock protects, so that acquiring the lock means that the data the lock holder wants is already in the cache (and no subsequent cache miss occurs). For heavily contended locks this causes false sharing and should be avoided by reserving the whole cache line for the lock and nothing else (e.g. by adding 60 bytes of unused padding after the 4 bytes used by the actual lock, like in C++ alignas(64) struct { std::atomic<int> lock; }; ). Of course a spinlock like this shouldn't be used for heavily contended locks, so it's reasonable to assume that minimizing memory consumption (and not having any padding, and not caring about false sharing) makes sense.

Main purpose for such spin lock for me is to protect very tiny operations inside multiple threads, that run a dozen or two of cycles, hence 30 cycles delay is too much overhead

In that case I'd suggest trying to replace locks with atomics, block-free algorithms, and lock-free algorithms. A trivial example is tracking statistics, where you might want to do lock inc dword [number_of_chickens] instead of acquiring a lock to increase "number_of_chickens".
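
In C++ terms (the variable name comes from the example above; the rest is a sketch), that statistics case is just a relaxed atomic increment instead of a lock/increment/unlock sequence:

#include <atomic>

std::atomic<long> number_of_chickens{0};

void count_chicken()
{
    // A single atomic RMW; no lock acquisition or release is needed
    // just to maintain a statistic.
    number_of_chickens.fetch_add(1, std::memory_order_relaxed);
}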

Beyond that it's hard to say much - for one extreme, the program could be spending most of its time doing work without needing locks and the cost of locking may have almost no impact on overall performance (even though acquire/release is more expensive than the tiny critical sections); and for the other extreme the program could be spending most of its time acquiring and releasing locks. In other words, the cost of acquiring/releasing locks is somewhere between "irrelevant" and "major design flaw (using far too many locks and needing to redesign the entire program)".

C++ simple mutex using atomic_flag (code not working)

You are not joining threads a and b. As a result, they might still be running while your program is finishing its execution.

You should either add a.join() and b.join() somewhere, or probably just remove them as the assertion in your main function will fail if you keep them.

Another issue is that you need to explicitly initialize the atomic_flag instance (e.g. with ATOMIC_FLAG_INIT) in your mutex. It might not cause issues in your example because global variables are zero-initialized, but it might cause issues later.
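
A corrected sketch along those lines (the class and variable names are illustrative; the question's original code is not reproduced here) initializes the flag explicitly and joins both threads before checking the result:

#include <atomic>
#include <cassert>
#include <thread>

class my_mutex
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;   // explicit initialization

public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

my_mutex m;
int counter = 0;

void work()
{
    for (int i = 0; i < 100000; ++i)
    {
        m.lock();
        ++counter;
        m.unlock();
    }
}

int main()
{
    std::thread a(work), b(work);
    a.join();                 // make sure both threads have finished
    b.join();                 // before the result is checked
    assert(counter == 200000);
}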

Is my spin lock implementation correct and optimal?

So I'm wondering:

* Is it correct?

In the context mentioned, I would say yes.

* Is it optimal?

That's a loaded question. By reinventing the wheel you are also reinventing a lot of problems that have been solved by other implementations:

  • I'd expect a waste loop on failure where you aren't trying to access the lock word.

  • The unlock only needs to have release semantics, not a full barrier (that's why you'd use __sync_lock_release, so that you'd get st1.rel on Itanium instead of mf, or an lwsync on PowerPC, ...). If you really only care about x86 or x86_64, the types of barriers used here don't matter as much (but if you were to make the jump to Intel's Itanium for an HP-IPF port then you wouldn't want this).

  • you don't have the pause() instruction that you'd normally put before your waste loop.

  • when there is contention you want to block on something - a futex, a semop, or even a dumb sleep in desperation. If you really need the performance that this buys you then the futex suggestion is probably a good one. If you need the performance this buys you badly enough to maintain this code, you have a lot of research to do.

Note that there was a comment saying that the release barrier wasn't required. That isn't true even on x86, because the release barrier also serves as an instruction to the compiler not to shuffle other memory accesses around the "barrier" - very much like what you'd get if you used asm("" ::: "memory").

* on compare and swap

On x86, __sync_lock_test_and_set will map to an xchg instruction, which has an implied lock prefix. It is definitely the most compact generated code (especially if you use a byte for the "lock word" instead of an int), but no less correct than if you used LOCK CMPXCHG. Compare-and-swap can be used for fancier algorithms (like putting a non-zero pointer to metadata for the first "waiter" into the lock word on failure).
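
For reference, a sketch of the kind of __sync-builtin-based lock being discussed (assuming GCC or Clang; per the GCC documentation, __sync_lock_test_and_set acts as an acquire barrier and __sync_lock_release as a release barrier):

/* Spinlock built on the legacy GCC __sync builtins. */
typedef volatile unsigned char spinlock_t;     /* a byte keeps the xchg encoding compact */

static void spin_lock(spinlock_t *lock)
{
    while (__sync_lock_test_and_set(lock, 1))  /* acquire; maps to xchg on x86 */
    {
        /* Read-only waste loop (the pre-C++11 idiom) so the cache line
         * isn't bounced around by repeated locked operations. */
        while (*lock)
            ;
    }
}

static void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(lock);                 /* release; plain store on x86 */
}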


