Futex Based Locking Mechanism

Pthreads' mutexes are implemented using futexes on recent versions of Linux. Pthreads is the standard C threading API on Linux, and is part of the POSIX standard, so you can easily port your program to other Unix-like systems. You should avoid using futexes directly unless you have very unusual needs, because they're very hard to use correctly; use pthreads, or a higher-level, language-specific API (which will almost certainly use pthreads itself).

How to achieve lock-free, but blocking behavior?

If you're on Linux, look into using a futex. It provides the performance of a non-locking implementation by using atomic operations rather than the kernel calls a conventional mutex would make, but if you do need to put the thread to sleep because some condition isn't true (e.g., the lock is contended), it will then make the appropriate kernel calls to block the thread and wake it back up on a future event. It's basically like a very fast semaphore.
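
To make that concrete, here is a minimal, Linux-only sketch of the pattern (not from the original answer): an event flag whose fast path is a single atomic load, with the kernel entered only to sleep or wake. glibc provides no wrapper for the futex system call, so the futex() helper below is hand-rolled and purely illustrative:

#include <atomic>
#include <climits>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// glibc has no futex() wrapper, so roll a minimal one for illustration.
static long futex(std::atomic<int> *addr, int op, int val) {
    return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                   nullptr, nullptr, 0);
}

std::atomic<int> flag{0};

void wait_until_set() {
    // Fast path: just an atomic load, no kernel involvement.
    while (flag.load(std::memory_order_acquire) == 0) {
        // Slow path: sleep in the kernel, but only if flag is still 0.
        // The kernel re-checks the value under its own lock, so a wake
        // that races with this call can't be lost.
        futex(&flag, FUTEX_WAIT_PRIVATE, 0);
    }
}

void set_and_wake() {
    flag.store(1, std::memory_order_release);
    futex(&flag, FUTEX_WAKE_PRIVATE, INT_MAX);   // wake every waiter
}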

C++20 mutex with atomic wait

A lock-free mutex is a contradiction.

You can build a lock out of lock-free building-blocks, and in fact that's the normal thing to do whether it's hand-written in asm or with std::atomic.

But the overall locking algorithm is by definition not lock-free. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). The entire point is to stop other threads from making forward progress while one thread is in the critical section, even if it unfortunately sleeps while it's there.
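
To illustrate the distinction, a minimal hedged example: every operation below is lock-free (std::atomic_flag's operations are lock-free on mainstream targets), yet the algorithm as a whole is a lock, because a thread that gets descheduled inside the critical section stalls everyone else:

#include <atomic>

class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Each individual test_and_set is lock-free, but this loop can
        // spin forever if the holder never calls unlock(): not lock-free.
        while (locked.test_and_set(std::memory_order_acquire)) {}
    }
    void unlock() { locked.clear(std::memory_order_release); }
};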

I mean that the implementation could be lock-free on the happy path (locking from the not-locked state).

std::mutex::lock() is that way too: it doesn't block if it doesn't have to! It might need to make a system call like futex(FUTEX_WAIT_PRIVATE) if there's a thread waiting for the lock. But so would an implementation that used std::atomic's notify/wait.

Perhaps you haven't understood what "lock-free" means: it never blocks, regardless of what other threads are doing. That is all. It doesn't mean "faster in the simple/easy case". For a complex algorithm (e.g. a queue), it's often faster to just give up and block if the circular buffer is full, rather than adding overhead to the simple case to allow other threads to "assist" or cancel a stuck partial operation. (Lock-free Progress Guarantees in a circular buffer queue)

There is no inherent advantage to rolling your own std::mutex out of std::atomic. The compiler-generated machine code has to do approximately the same things either way, and I'd expect the fast path to be about the same. The only difference would be in the choice of what to do in the already-locked case. (But it may be tricky to design a way to avoid calling notify when there were no waiters, something that actual std::mutex manages in the glibc / pthreads implementation on Linux.)

(I'm assuming the overhead of the library function call is negligible compared to the cost of an atomic RMW to take the mutex. Having that inline into your code is a pretty minor advantage.)
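
For reference, a hedged sketch of one way to get that notify-avoidance, modelled on the three-state design from Ulrich Drepper's paper "Futexes Are Tricky" but written against portable C++20 wait/notify rather than raw futex calls (0 = unlocked, 1 = locked with no waiters, 2 = locked with possible waiters):

#include <atomic>

class ToyMutex {
    std::atomic<unsigned> state{0};   // 0 unlocked, 1 locked, 2 locked+waiters
public:
    void lock() {
        unsigned c = 0;
        if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
            return;                           // uncontended fast path: 0 -> 1
        do {
            if (c != 2) {
                // Mark the lock contended (1 -> 2) before sleeping, so the
                // unlocker knows it has to notify.
                c = 1;
                state.compare_exchange_strong(c, 2, std::memory_order_relaxed);
            }
            if (c != 0)                       // unless it was just released,
                state.wait(2);                // sleep while the value is 2
            c = 0;
        } while (!state.compare_exchange_strong(c, 2,   // re-lock as contended
                                                std::memory_order_acquire));
    }
    void unlock() {
        // Skip the notify (a futex wake under the hood on Linux) entirely
        // when nobody was ever marked as waiting.
        if (state.exchange(0, std::memory_order_release) == 2)
            state.notify_one();
    }
};

Note the woken thread re-locks with 2, not 1: it can't know whether other waiters remain, so it conservatively keeps the contended state, at worst costing one spurious notify.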


A mutex implementation can be tuned for certain use-cases, in terms of how long it spin-waits before sleeping (using an OS-assisted mechanism like futex to enable other threads to wake it when releasing the lock), and in exponential backoff for the spin-wait portion.

If std::mutex doesn't perform well for your application on the hardware you care about, it's worth considering an alternative, although IDK exactly how you'd go about measuring whether it worked well or not. Perhaps by figuring out whether it was deciding to sleep too soon, or spin too long, for your workload.

And yes, you could consider rolling your own with std::atomic now that there's a portable mechanism to hopefully expose a way to fall-back to OS-assisted sleep/wake mechanisms like futex. You'd still want to manually use system-specific things like x86 _mm_pause() inside a spin-wait loop, though, since I don't think C++ has anything equivalent to Rust's std::hint::spin_loop() that an implementation can use to expose things like the x86 pause instruction, intended for use in the body of a spin-loop. (See Locks around memory manipulation via inline assembly re: such considerations, and spinning read-only instead of spamming atomic RMW attempts. And for a look at the necessary parts of a spinlock in x86 assembly language, which are the same whether you get a C++ compiler to generate that machine code for you or not.)

See also https://rigtorp.se/spinlock/ re: implementing a mutex in C++ with std::atomic.
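
Along those lines, a hedged sketch of the test-and-test-and-set pattern those links describe: spin read-only so waiters share the cache line instead of bouncing it around with failed RMWs, and drop an x86 pause into the wait loop:

#include <atomic>
#include <immintrin.h>   // _mm_pause(); x86-specific, needs porting elsewhere

struct TtasSpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // Optimistic RMW first: on the uncontended path this succeeds
            // immediately and we never enter the read-only spin loop.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Contended: wait with plain loads, which keep the line in this
            // core's cache in the Shared state; pause is polite to a
            // hyperthread sibling and cheapens leaving the spin loop.
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};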



Linux / libstdc++ behaviour of notify/wait

I tested what system calls std::mutex and std::atomic's wait/notify make when they have to wait for a long time, on Arch Linux (glibc 2.33).

  • std::mutex lock no-contention fast path, and unlock with no waiters: zero system calls, purely user-space atomic operations. Notably, it is able to detect that there are no waiters when unlocking, so it doesn't make a FUTEX_WAKE system call (which would otherwise cost maybe a hundred times more than taking and releasing an uncontended mutex that was still hot in this core's L1d cache).

  • std::mutex lock() on already locked: only a futex(0x55be9461b0c0, FUTEX_WAIT_PRIVATE, 2, NULL) system call. Possibly some spinning in user-space before that; I didn't single-step into it with GDB, but if so probably with a pause instruction.

  • std::mutex unlock() with a waiter: uses futex(0x55ef4af060c0, FUTEX_WAKE_PRIVATE, 1) = 1. (After an atomic RMW, IIRC; not sure why it doesn't just use a release store.)

  • std::atomic::notify_one: always a futex(address, FUTEX_WAKE, 1), even if there are no waiters, so it's up to you to avoid it if there are no waiters when you unlock a lock.

  • std::atomic::wait: spinning a few times in user-space, including 4x sched_yield() calls, before a futex(addr, FUTEX_WAIT, old_val, NULL).

Note the use of FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE by the wait/notify functions: these should work across processes on shared memory. The futex(2) man page says the _PRIVATE versions (only for threads of a single process) allow some additional optimizations.

I don't know about other systems, although I've heard some kinds of locks on Windows / MSVC have to suck (always a syscall even on the fast path) for backwards-compat with some ABI choices, or something. Like perhaps std::lock_guard is slow on MSVC, but std::unique_lock isn't?

Test code:

#include <atomic>
#include <thread>
#include <unistd.h> // for quick & dirty sleep and usleep. TODO: port to non-POSIX.
#include <mutex>

#if 1  // test std::atomic
//std::atomic_unsigned_lock_free gvar;
//std::atomic_uint_fast_wait_t gvar;
std::atomic<unsigned> gvar;

void waiter() {
    volatile unsigned sink;
    while (1) {
        sink = gvar;
        gvar.wait(sink); // block until it's not the old value.
        // on Linux/glibc, 4x sched_yield(), then futex(0x562c3c4881c0, FUTEX_WAIT, 46, NULL) or whatever the current counter value is
    }
}

void notifier() {
    while (1) {
        sleep(1);
        gvar.store(gvar.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
        gvar.notify_one();
    }
}
#else
std::mutex mut;

void waiter() {
    unsigned sink = 0;
    while (1) {
        mut.lock(); // Linux/glibc2.33 - just a futex system call, maybe some spinning in user-space first. But no sched_yield
        mut.unlock();
        sink++;
        usleep(sink); // give the other thread plenty of time to take the lock, so we don't get it twice.
    }
}

void notifier() {
    while (1) {
        mut.lock();
        sleep(1);
        mut.unlock();
    }
}
#endif

int main() {
    std::thread t(waiter); // comment this to test the no-contention case
    notifier();
    // loops forever: kill it with control-C
}

Compile with g++ -Og -std=gnu++20 notifywait.cpp -pthread, run with strace -f ./a.out to see the system calls. (A couple or a few per second, since I used a nice long sleep.)

If there's any spin-waiting in user-space, it's negligible compared to the 1 second sleep interval, so it uses about a millisecond of CPU time including startup, to run for 19 iterations. (perf stat ./a.out)


Usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is extremely important, and a lot of engineering has gone into tuning it for most use-cases.

If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of tuning choices. Different systems are unlikely to have made the same tuning choices for std::mutex and std::atomic::wait as Linux/glibc, unless the retry strategy on the only system you care about happens to be perfect for your use-case.

It doesn't make sense to roll your own mutex without first investigating exactly what std::mutex does on your system, e.g. single-step the asm for the already-locked case to see what retries it makes. Then you'll have a better idea whether you can do any better.
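
On Linux, strace -f -e trace=futex ./a.out is a quick way to watch just those sleep/wake decisions from the outside; single-stepping the locked path in a debugger tells you the rest.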

How to change the locking mechanism in Alternative PHP Cache (APC)?

Yes, the alternative locking mechanisms are included in the source code, available at http://pecl.php.net/package/APC.

Note that you have to choose this at compilation time, more precisely: at ./configure time. Here are the relevant options of ./configure:

--enable-apc-sem            Enable semaphore locks instead of fcntl
--disable-apc-pthreadmutex  Disable pthread mutex locking
--enable-apc-spinlocks      Enable spin locks EXPERIMENTAL

As you see, pthread mutex locking is already the default now.
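
For example (assuming the usual PECL build-from-source flow), opting into the experimental spinlocks would look something like phpize && ./configure --enable-apc-spinlocks && make && make install.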

How does a mutex lock wait for an unlock at a low level?

It's platform dependent. Typically there is a spin lock portion which falls back to blocking in the operating system if a fixed spin limit is reached.

The spinlock is typically implemented by reading a memory address that contains a particular value when the mutex is unlocked. If it is seen as unlocked, an attempt is made to atomically change that value from the unlocked value to the locked value. If that atomic exchange succeeds, the mutex is locked. Typically the number of spins is counted and if a limit is reached, we switch to blocking in the OS.

The block in the OS is typically implemented much the same way, except that instead of spinning, the thread adds itself to a list of things waiting for the lock. When a thread releases the lock, it checks whether anything is waiting in the OS and, if so, unblocks it. This causes the OS to schedule that thread. It typically then tries to perform the same atomic exchange that a spinlock would try, blocking again in the OS if it fails.

In pseudo-code:

Lock:

  1. Check the memory location to see if the lock is locked. If so, go to step 3.
  2. Try to atomically switch the memory location from unlocked to locked. If we succeed, stop, we hold the lock.
  3. Increment the spin count. If we haven't spun too many times, go to step 1.
  4. Atomically increment the number of threads waiting for this lock.
  5. Try to atomically switch the memory location from unlocked to locked. If we succeed, decrement the number of waiting threads and stop, we hold the lock.
  6. Conditionally block in the OS.
  7. Go to step 5.

Unlock:

  1. Atomically set the memory location holding the lock state to unlocked.
  2. If the number of threads waiting for this lock in the OS is greater than zero, tell the OS to unblock any threads waiting for this lock.

Note that the OS has to implement some mechanism to avoid the race where the request to unblock any threads waiting in the OS happens just before a thread manages to block. The method varies from OS to OS. For example, Linux has something called a "futex", which is essentially a way to implement steps 4, 5, and 6 of the locking pseudo-code atomically.
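
As a hedged illustration only (see the caution below), here is roughly what those steps look like as Linux-only C++, with the futex doing the "conditionally block" part. Everything is left at the default seq_cst ordering, which a real implementation would weaken carefully:

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class PseudoMutex {
    std::atomic<int> locked{0};        // 0 = unlocked, 1 = locked
    std::atomic<int> waiters{0};       // threads blocked (or blocking) in OS
    static constexpr int kSpinLimit = 100;    // arbitrary tuning knob

    static long sys_futex(std::atomic<int> *addr, int op, int val) {
        return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                       nullptr, nullptr, 0);
    }

public:
    void lock() {
        for (int spins = 0; spins < kSpinLimit; ++spins) {      // steps 1-3
            int e = 0;
            if (locked.load() == 0 && locked.compare_exchange_weak(e, 1))
                return;
        }
        waiters.fetch_add(1);                                    // step 4
        int e = 0;
        while (!locked.compare_exchange_strong(e, 1)) {          // step 5
            // Step 6: block only if the lock still looks taken; the kernel
            // rechecks locked == 1 atomically, closing the wake/sleep race.
            sys_futex(&locked, FUTEX_WAIT_PRIVATE, 1);
            e = 0;                                               // step 7
        }
        waiters.fetch_sub(1);
    }

    void unlock() {
        locked.store(0);                                         // step 1
        if (waiters.load() > 0)                                  // step 2
            sys_futex(&locked, FUTEX_WAKE_PRIVATE, 1);
    }
};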

Caution: If you attempt to implement this algorithm in code, understand that you will likely produce a toy that will not perform nearly as well as a proper implementation. You need deep, platform-specific knowledge to avoid nasty performance-sucking traps you can fall into. For example, it's easy to code a spinlock such that it steals core execution resources from another thread sharing the physical core in CPUs with hyper-threading. And it's easy to code the successful exchange so that the CPU's branch prediction predicts it will fail and you take a horrible branch misprediction penalty when you acquire the lock.


