Fastest Inline-Assembly Spinlock

Just look here:
x86 spinlock using cmpxchg

And thanks to Cory Nelson

__asm{
spin_lock:
xorl %ecx, %ecx
incl %ecx
spin_lock_retry:
xorl %eax, %eax
lock; cmpxchgl %ecx, (lock_addr)
jnz spin_lock_retry
ret

spin_unlock:
movl $0, (lock_addr)
ret
}

And another source says:
http://www.geoffchappell.com/studies/windows/km/cpu/cx8.htm

       lock    cmpxchg8b qword ptr [esi]
is replaceable with the following sequence

try:
        lock    bts dword ptr [edi],0
        jnb     acquired
wait:
        test    dword ptr [edi],1
        je      try
        pause                   ; if available
        jmp     wait

acquired:
        cmp     eax,[esi]
        jne     fail
        cmp     edx,[esi+4]
        je      exchange

fail:
        mov     eax,[esi]
        mov     edx,[esi+4]
        jmp     done

exchange:
        mov     [esi],ebx
        mov     [esi+4],ecx

done:
        mov     byte ptr [edi],0
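
In C terms, the sequence above takes a lock, does an ordinary compare and conditional store on the 64-bit value, and then releases the lock. Here is a minimal sketch of that idea with C11 atomics. The names emulated_cas64 and guard are made up for this illustration, and the result is only atomic with respect to other code that goes through the same guard.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: emulate a 64-bit compare-and-swap with a lock.
 * Returns the value that was in *dest before the call; the swap happened
 * if and only if the return value equals `expected`. */
uint64_t emulated_cas64(atomic_flag *guard, uint64_t *dest,
                        uint64_t expected, uint64_t desired)
{
    /* acquire: test-and-set returns the previous state, so loop while it was set */
    while (atomic_flag_test_and_set_explicit(guard, memory_order_acquire)) {
        /* busy-wait; the asm version waits read-only with pause here */
    }

    uint64_t old = *dest;          /* "cmp eax,[esi]" / "cmp edx,[esi+4]" */
    if (old == expected)
        *dest = desired;           /* the "exchange" path */

    /* release: corresponds to "mov byte ptr [edi],0" */
    atomic_flag_clear_explicit(guard, memory_order_release);
    return old;                    /* the "fail" path hands back the current value */
}
```

A caller checks success by comparing the return value with the expected value it passed in, the same way cmpxchg8b reports the old value in edx:eax.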

And here is a discussion about lock-free vs lock implementations:
http://newsgroups.derkeiler.com/Archive/Comp/comp.programming.threads/2011-10/msg00009.html

What is the minimum X86 assembly needed for a spinlock

Shortest would probably be:

acquire:
    lock bts dword [eax],0
    jc acquire

release:
    mov dword [eax],0

For performance, it's best to use a "test, test and set" approach, and use pause, like this:

acquire:
    lock bts dword [eax],0    ;Optimistic first attempt
    jnc l2                    ;Success if acquired
l1:
    pause
    test dword [eax],1
    jne l1                    ;Don't attempt again unless there's a chance

    lock bts dword [eax],0    ;Attempt to acquire
    jc l1                     ;Wait again if failed

l2:

release:
    mov dword [eax],0

For debugging, you can add extra data to make it easier to detect problems, like this:

acquire:
    mov ebx,[CPUnumber]
    lea ebx,[ebx+0x80000000]        ;ebx = bit 31 set + CPU number (needed on every path)
    lock bts dword [eax],31         ;Optimistic first attempt
    jnc l2                          ;Success if acquired

    cmp [eax],ebx                   ;Is the lock acquired by this CPU?
    je .bad                         ; yes, deadlock
    lock inc dword [eax+4]          ;Increase "lock contention counter"
l1:
    pause
    test dword [eax],0x80000000
    jne l1                          ;Don't attempt again unless there's a chance

    lock bts dword [eax],31         ;Attempt to acquire
    jc l1                           ;Wait again if failed

l2: mov [eax],ebx                   ;Store CPU number

release:
    mov ebx,[CPUnumber]
    lea ebx,[ebx+0x80000000]
    cmp [eax],ebx                   ;Is lock acquired, and is CPU same?
    jne .bad                        ; no, either not acquired or wrong CPU
    mov dword [eax],0
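
The same debugging idea can be sketched in C11. This is an illustration under assumptions rather than a translation: it uses compare-exchange instead of lock bts, the names debug_lock, LOCK_HELD and the caller-supplied id `self` are made up here, and `self` is assumed to fit in the low 31 bits.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_HELD 0x80000000u   /* bit 31 marks the lock as taken */

typedef struct {
    _Atomic uint32_t word;        /* 0 when free, LOCK_HELD | owner id when held */
    _Atomic uint32_t contention;  /* how often a caller had to wait */
} debug_lock;

/* Returns 0 on success, -1 if `self` already holds the lock (deadlock). */
int debug_lock_acquire(debug_lock *l, uint32_t self)
{
    uint32_t mine = LOCK_HELD | self;
    uint32_t expected = 0;

    /* optimistic first attempt */
    if (atomic_compare_exchange_strong_explicit(&l->word, &expected, mine,
            memory_order_acquire, memory_order_relaxed))
        return 0;
    if (expected == mine)
        return -1;                        /* we already own it */

    atomic_fetch_add_explicit(&l->contention, 1, memory_order_relaxed);
    for (;;) {
        /* wait read-only until the lock looks free */
        while (atomic_load_explicit(&l->word, memory_order_relaxed) != 0)
            ;                             /* a pause hint would go here on x86 */
        expected = 0;
        if (atomic_compare_exchange_weak_explicit(&l->word, &expected, mine,
                memory_order_acquire, memory_order_relaxed))
            return 0;
    }
}

/* Returns 0 on success, -1 if the lock is not held by `self`. */
int debug_lock_release(debug_lock *l, uint32_t self)
{
    if (atomic_load_explicit(&l->word, memory_order_relaxed) != (LOCK_HELD | self))
        return -1;                        /* not acquired, or wrong owner */
    atomic_store_explicit(&l->word, 0, memory_order_release);
    return 0;
}
```

Returning an error code stands in for the `.bad` label; real code might log or trap instead.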

Is there any simple way to improve performance of this spinlock function?

How about something like this (I understand this is the KeAcquireSpinLock implementation). My at&t assembly is weak unfortunately.

spin_lock:
    rep; nop
    test lockValue, 1
    jnz spin_lock
    lock bts lockValue, 0
    jc spin_lock

Locks around memory manipulation via inline assembly

I think a simple spinlock that doesn't have any of the really major / obvious performance problems on x86 is something like this. Of course a real implementation would use a system call (like Linux futex) after spinning for a while, and unlocking would have to check if it needs to notify any waiters with another system call. This is important; you don't want to spin forever wasting CPU time (and energy / heat) doing nothing. But conceptually this is the spin part of a spinlock before you take the fallback path. It's an important piece of how light-weight locking is implemented. (Only attempting to take the lock once before calling the kernel would be a valid choice, instead of spinning at all.)

Implement as much of this as you like in inline asm, or preferably using C11 stdatomic, like this semaphore implementation. The code below is NASM syntax. In GNU C inline asm, make sure you use a "memory" clobber to stop compile-time reordering of memory accesses across the lock and unlock operations.

;;; UNTESTED ;;;;;;;;
;;; TODO: **IMPORTANT** fall back to OS-supported sleep/wakeup after spinning some
;;; e.g. Linux futex
    ; first arg in rdi as per AMD64 SysV ABI (Linux / Mac / etc)

;;;;;void spin_unlock(volatile char *lock)
global spin_unlock
spin_unlock:
       ; movzx  eax, byte [rdi]  ; debug check for double-unlocking.  Expect 1
    mov   byte [rdi], 0        ; lock.store(0, std::memory_order_release)
    ret

align 16
;;;;;void spin_lock  (volatile char *lock)
global spin_lock
spin_lock:
    mov   eax, 1                 ; only need to do this the first time, otherwise we know al is non-zero
.retry:
    xchg  al, [rdi]

    test  al,al                  ; check if we actually got the lock
    jnz   .spinloop
    ret                          ; no taken branches on the fast-path

align 8
.spinloop:                    ; do {
    pause
    cmp   byte [rdi], al      ; C++11
    jne   .retry              ; if (lock.load(std::memory_order_acquire) != 1)
    jmp   .spinloop

; if not translating this to inline asm, you could put the spin loop *before* the function entry point, saving the last jmp
; but since this is probably too simplistic for real use, I'm going to leave it as-is.

A plain store has release semantics, but not sequential-consistency (which you'd get from an xchg or something). Acquire/release is enough to protect a critical section (hence the name).

If you were using a bitfield of atomic flags, you could use lock bts (test and set) for the equivalent of xchg-with-1. You can spin on bt or test. To unlock, you'd need lock btr, not just btr, because it would be a non-atomic read-modify-write of the byte, or even the containing 32-bits.
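
A rough C11 sketch of that bit-lock case, assuming bit 0 of a flag word is the lock and the other bits belong to unrelated code. The names FLAG_LOCK, bitlock_try_acquire and bitlock_release are invented for this example, and which instruction the compiler actually picks for the fetch-or is up to the compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_LOCK 0x1u   /* bit 0 is the lock; the other bits are unrelated flags */

/* Try once: atomically set the lock bit and report whether we were the one to set it. */
bool bitlock_try_acquire(atomic_uint *flags)
{
    unsigned old = atomic_fetch_or_explicit(flags, FLAG_LOCK, memory_order_acquire);
    return (old & FLAG_LOCK) == 0;
}

void bitlock_release(atomic_uint *flags)
{
    /* must be an atomic RMW so concurrent updates to the other bits are not lost */
    atomic_fetch_and_explicit(flags, ~FLAG_LOCK, memory_order_release);
}
```

A waiter that fails bitlock_try_acquire would spin on a plain atomic load of the word, testing bit 0, before trying again.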

With a byte or int sized lock like you should normally use, you don't even need a locked operation to unlock; release semantics are enough. glibc's pthread_spin_unlock does it the same as my unlock function: a simple store.

(lock bts is not necessary; xchg or lock cmpxchg are just as good for a normal lock.)
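
For comparison, here is roughly what the byte-sized lock looks like with C11 stdatomic instead of hand-written asm. It is an untested-in-anger sketch: cpu_relax is a made-up helper name, and it assumes GCC or Clang on x86 for the pause builtin (elsewhere it compiles to nothing).

```c
#include <stdatomic.h>

/* Hypothetical helper: emits pause on x86 with GCC/Clang, nothing elsewhere. */
static inline void cpu_relax(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_ia32_pause();
#endif
}

void spin_lock(atomic_char *lock)
{
    /* the first access is an atomic exchange, so the uncontended path is one RMW */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0) {
        /* lock was held: wait read-only until it looks free, then retry the exchange */
        do {
            cpu_relax();
        } while (atomic_load_explicit(lock, memory_order_relaxed) != 0);
    }
}

void spin_unlock(atomic_char *lock)
{
    /* a release store is enough; no locked instruction is needed to unlock */
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

Like the asm version, this spins indefinitely and has no fallback to an OS-assisted sleep.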

The first access should be an atomic RMW

See discussion on Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? - if the first access is read-only, the CPU might send out just a share request for that cache line. Then, if it sees the line unlocked (the hopefully-common low-contention case) it would have to send out an RFO (Read For Ownership) to actually be able to write the cache line. So that's twice as many off-core transactions.

The downside is that this will take MESI exclusive ownership of that cache line, but what really matters is that the thread owning the lock can efficiently store a 0 so we can see it unlocked. Either way, read-only or RMW, that core will lose exclusive ownership of the line and have to RFO before it can commit that unlocking store.

I think a read-only first access would just optimize for slightly less traffic between cores when multiple threads queue up to wait for a lock that's already taken. That would be a silly thing to optimize for.

(Fastest inline-assembly spinlock also tested the idea for a massively contended spinlock with multiple threads doing nothing but trying to take the lock, with poor results. That linked answer makes some incorrect claims about xchg globally locking a bus - aligned locks don't do that, just a cache lock (Can num++ be atomic for 'int num'?), and each core can be doing a separate atomic RMW on a different cache line at the same time.)

However, if that initial attempt finds it locked, we don't want to keep hammering on the cache line with atomic RMWs. That's when we fall back to read-only. 10 threads all spamming xchg for the same spinlock would keep the memory arbitration hardware pretty busy. It would likely delay the visibility of the store that unlocks (because that thread has to contend for exclusive ownership of the line), so it's directly counter-productive. It may also slow memory in general for other cores.

PAUSE is also essential, to avoid mis-speculation about memory ordering by the CPU. You exit the loop only when the memory you're reading was modified by another core. However, we don't want to pause in the un-contended case. On Skylake, PAUSE waits a lot longer, like ~100 cycles up from ~5, so you should definitely keep the spin-loop separate from the initial check for unlocked.

I'm sure Intel's and AMD's optimization manuals talk about this, see the x86 tag wiki for that and tons of other links.

Not good enough? Should I for example make use of the register keyword in C?

register is a meaningless hint in modern optimizing compilers, except in debug builds (gcc -O0).

x86 spinlock using cmpxchg

You have the right idea, but your asm is broken:

cmpxchg can't work with an immediate operand, only registers.

lock is not a valid prefix for mov. mov to an aligned address is atomic on x86, so you don't need lock anyway.

It has been some time since I've used AT&T syntax, hope I remembered everything:

spin_lock:
    xorl   %ecx, %ecx
    incl   %ecx            # newVal = 1
spin_lock_retry:
    xorl   %eax, %eax      # expected = 0
    lock; cmpxchgl %ecx, (lock_addr)
    jnz    spin_lock_retry
    ret

spin_unlock:
    movl   $0,  (lock_addr)    # atomic release-store
    ret

Note that GCC has atomic builtins, so you don't actually need to use inline asm to accomplish this:

void spin_lock(int *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1));
}

void spin_unlock(int volatile *p)
{
    asm volatile ("":::"memory"); // acts as a memory barrier.
    *p = 0;
}

As Bo says below, locked instructions incur a cost: every one you use must acquire exclusive access to the cache line and lock it down while lock cmpxchg runs, like for a normal store to that cache line but held for the duration of lock cmpxchg execution. This can delay the unlocking thread especially if multiple threads are waiting to take the lock. Even without many CPUs, it's still easy and worth it to optimize around:

void spin_lock(int volatile *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1))
    {
        // spin read-only until a cmpxchg might succeed
        while(*p) _mm_pause();  // or maybe do{}while(*p) to pause first
    }
}

The pause instruction is vital for performance on HyperThreading CPUs when you've got code that spins like this -- it lets the second thread execute while the first thread is spinning. On CPUs which don't support pause, it is treated as a nop.

pause also prevents memory-order mis-speculation when leaving the spin-loop, when it's finally time to do real work again. What is the purpose of the "PAUSE" instruction in x86?

Note that spin locks are actually rarely used: typically, one uses something like a critical section or futex. These integrate a spin lock for performance under low contention, but then fall back to an OS-assisted sleep and notify mechanism. They may also take measures to improve fairness, and lots of other things the cmpxchg / pause loop doesn't do.
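
To make the "spin a little, then ask the OS" shape concrete, here is a minimal sketch. It is not how a real mutex is built: sched_yield() is only a stand-in for a proper sleep/wake mechanism such as futex, SPIN_LIMIT is an arbitrary made-up value, and hybrid_lock / hybrid_unlock are hypothetical names.

```c
#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 100   /* hypothetical tuning value, not a recommendation */

void hybrid_lock(atomic_int *lock)
{
    /* first attempt is an atomic exchange; 0 means we took the lock */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0) {
        int spins = 0;
        /* wait read-only while the lock is held */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0) {
            if (++spins >= SPIN_LIMIT) {
                sched_yield();   /* stand-in for sleeping until notified */
                spins = 0;
            }
        }
    }
}

void hybrid_unlock(atomic_int *lock)
{
    /* a real implementation would also wake a sleeping waiter here */
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

Yielding does not guarantee that the lock holder runs next, which is one reason real implementations block in the kernel instead.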

Also note that cmpxchg is unnecessary for a simple spinlock: you can use xchg and then check whether the old value was 0 or not. Doing less work inside the locked instruction may keep the cache line pinned for less time. See Locks around memory manipulation via inline assembly for a complete asm implementation using xchg and pause (but still with no fallback to OS-assisted sleep, just spinning indefinitely.)

Implementing spin-lock without XCHG?

Is it possible to implement spin locking without XCHG?

Yes. For 80x86, you can use lock bts, lock cmpxchg, lock xadd, or ...

What is the fastest possible spin lock?

Possible interpretations of "fast" include:

a) Fast in the uncontended case. In this case it's not going to matter very much what you do because most of the possible operations (exchanging, adding, testing...) are cheap and the real costs are cache coherency (getting the cache line containing the lock into the "exclusive" state in the current CPU's cache, possibly including fetching it from RAM or other CPUs' caches) and serialization.

b) Fast in the contended case. In this case you really need a "test without lock; then test & set with lock" approach. The main problem with a simple spinloop (for the contended case) is that when multiple CPUs are spinning the cache line will be rapidly bouncing from one CPU's cache to the next and consuming a huge amount of inter-connect bandwidth for nothing. To prevent this, you'd have a loop that tests the lock state without modifying it so that the cache line can remain in all CPUs' caches as "shared" at the same time while those CPUs are spinning.

But note that testing read-only to start with can hurt the un-contended case, resulting in more coherency traffic: first a share-request for the cache line which will only get you MESI S state if another core had recently unlocked, and then an RFO (Read For Ownership) when you do try to take the lock. So best practice is probably to start with an RMW, and if that fails then spin read-only with pause until you see the lock available, unless profiling your code on the system you care about shows a different choice is better.

c) Fast to exit the spinloop (after contention) when the lock is acquired. In this case the CPU can speculatively execute many iterations of the loop, and when the lock becomes available the CPU has to discard all of those speculatively executed iterations, which costs a little time. To prevent that you want a pause instruction in the loop so that many iterations are not speculatively executed.

d) Fast for other CPUs that don't touch the lock. For some cases (hyper-threading) the core's resources are shared between logical processors; and when one logical processor is spinning it consumes resources that the other logical processor could've used to get useful work done (especially for the "spinlock speculatively executes many iterations of the loop" situation). To minimize this you need a pause in the spinloop/s (so that the spinning logical processor doesn't consume as much of the core's resources and the other logical processor in the core can get more useful work done).

e) Minimum "worst case time to acquire". With a simple lock, under contention, some CPUs or threads can be lucky and always get the lock while other CPUs/threads are very unlucky and take ages to get the lock; and the "worst case time to acquire" is theoretically infinite (a CPU can spin forever). To fix that you need a fair lock - something to ensure that only the thread that has been waiting/spinning for the longest amount of time is able to acquire the lock when it is released. Note that it's possible to design a fair lock such that each thread spins on a different cache line; which is a different way to solve the "cache line bouncing between CPUs" problem I mentioned in "b) Fast in the contended case".

f) Minimum "worst case until lock released". This has to involve the length of the worst critical section; but in some situations may also include the cost of any number of IRQs, the cost of any number of task switches and the time the code isn't using any CPU. It's entirely possible to have a situation where a thread acquires the lock then the scheduler does a thread switch; then many CPUs all spin (wasting a huge amount of time) on a lock that can not be released (because the lock holder is the only one that can release the lock and it isn't even using any CPU). The way to fix/improve this is to disable the scheduler and IRQs; which is fine in kernel code, but "likely impossible for security reasons" in normal user-space code. This is also the reason why spinlocks should probably never be used in user-space (and why user-space should probably use a mutex where the thread is put in a "blocked waiting for lock" state and not given CPU time by the scheduler until/unless the thread actually can acquire the lock).

Note that making it fast for one possible interpretation of "fast" can make it slower/worse for other interpretations of "fast". For example; the speed of the uncontended case is made worse by everything else.
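
As an illustration of the fair lock mentioned under e), here is a minimal ticket lock sketch in C11. It is one well-known way to get first-come-first-served ordering, not necessarily what any particular kernel does, and the type and function names are made up for this example. Note that in this simple form all waiters still spin on the same cache line; the "each thread spins on a different cache line" designs are more involved.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* ticket handed to the next arriving thread */
    atomic_uint now_serving;  /* ticket currently allowed to hold the lock */
} ticket_lock;

void ticket_lock_acquire(ticket_lock *l)
{
    /* take a ticket; arrival order decides who gets the lock next */
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);

    /* spin read-only until our number comes up */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        ;   /* a pause hint would go here on x86 */
}

void ticket_lock_release(ticket_lock *l)
{
    /* only the holder writes now_serving, so a load plus a release store is enough */
    unsigned next = atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
```

This trades uncontended speed for a bounded wait order, which is the kind of trade-off between interpretations of "fast" described above.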

Example Spinlock

This example is untested, and written in (NASM syntax) assembly.

;Input
; ebx = address of lock

;Initial optimism in the hope the lock isn't contended
spinlock_acquire:
    lock bts dword [ebx],0      ;Set the lowest bit and get its previous value in carry flag
                                ;Did we actually acquire it, i.e. was it previously 0 = unlocked?
    jnc .acquired               ; Yes, done!

;Waiting (without modifying) to avoid "cache line bouncing"

.spin:
    pause                       ;Reduce resource consumption
                                ; and avoid memory order mis-speculation when the lock becomes available.
    test dword [ebx],1          ;Has the lock been released?
    jne .spin                   ; no, wait until it was released

;Try to acquire again

    lock bts dword [ebx],0      ;Set the lowest bit and get its previous value in carry flag
                                ;Did we actually acquire it?
    jc .spin                    ; No, go back to waiting

.acquired:

Spin-unlock can be just mov dword [ebx], 0, not lock btr, because you know you own the lock and that has release semantics on x86. You could read it first to catch double-unlock bugs.

Notes:

a) lock bts is a little slower than other possibilities; but it doesn't interfere with or depend on the other 31 bits (or 63 bits) of the lock, which means that those other bits can be used for detecting programming mistakes (e.g. store 31 bits of "thread ID that currently holds lock" in them when the lock is acquired and check them when the lock is released to auto-detect "Wrong thread releasing lock" and "Lock being released when it was never acquired" bugs) and/or used for gathering performance information (e.g. set bit 1 when there's contention so that other code can scan periodically to determine which locks are rarely contended and which locks are heavily contended). Bugs with the use of locks are often extremely insidious and hard to find (unpredictable and unreproducible "Heisenbugs" that disappear as soon as you try to find them); so I have a preference for "slower with automatic bug detection".

b) This is not a fair lock, which means it's not well suited to situations where contention is likely.

c) For memory; there's a compromise between memory consumption/cache misses, and false sharing. For rarely contended locks I like to put the lock in the same cache line/s as the data the lock protects, so that acquiring the lock means that the data the lock holder wants is already in the cache (and no subsequent cache miss occurs). For heavily contended locks this causes false sharing and should be avoided by reserving the whole cache line for the lock and nothing else (e.g. by adding 60 bytes of unused padding after the 4 bytes used by the actual lock, like in C++ alignas(64) struct { std::atomic<int> lock; }; ). Of course a spinlock like this shouldn't be used for heavily contended locks so it's reasonable to assume that minimizing memory consumption (and not having any padding, and not caring about false sharing) makes sense.
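
The two layouts from note c) could look something like this in C11. The 64-byte line size is an assumption (typical for current x86, but check your target), and the struct names are invented for the example.

```c
#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Heavily contended case: the lock owns a whole cache line. */
typedef struct {
    alignas(CACHE_LINE) atomic_int lock;
    char pad[CACHE_LINE - sizeof(atomic_int)];   /* unused padding */
} padded_lock;

/* Rarely contended case: the lock shares a line with the data it protects. */
typedef struct {
    atomic_int lock;
    int counter;        /* data guarded by `lock` */
} colocated_counter;
```

With the first layout an array of locks never has two locks in one line; with the second, taking the lock usually brings the protected data into cache as well.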

Main purpose for such spin lock for me is to protect very tiny operations inside multiple threads, that run a dozen or two of cycles, hence 30 cycles delay is too much overhead

In that case I'd suggest trying to replace locks with atomics, block-free algorithms, and lock-free algorithms. A trivial example is tracking statistics, where you might want to do lock inc dword [number_of_chickens] instead of acquiring a lock to increase "number_of_chickens".
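
In C11 the statistics example might be written like this. It is a small sketch; count_chicken is a made-up function name, and relaxed ordering is assumed to be acceptable because the counter is only a statistic and does not guard other data.

```c
#include <stdatomic.h>

atomic_uint number_of_chickens;   /* zero-initialized at file scope */

void count_chicken(void)
{
    /* one atomic read-modify-write, no lock acquire/release around it */
    atomic_fetch_add_explicit(&number_of_chickens, 1, memory_order_relaxed);
}
```

Since the result of the fetch-add is unused, an x86 compiler can typically emit a single locked add or inc for it.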

Beyond that it's hard to say much - for one extreme, the program could be spending most of its time doing work without needing locks and the cost of locking may have almost no impact on overall performance (even though acquire/release is more expensive than the tiny critical sections); and for the other extreme the program could be spending most of its time acquiring and releasing locks. In other words, the cost of acquiring/releasing locks is somewhere between "irrelevant" and "major design flaw (using far too many locks and needing to redesign the entire program)".
