Is std::mutex Fair?

Is std::mutex fair?

The standard (§30.4) says nothing about fairness requirements between contending threads on a mutex, so it may or may not be fair.

In practice, std::mutex implementations will probably use whatever mutex facility their platform provides, and those are mostly unfair, as this is usually simpler and more efficient. On Windows, for example, mutexes are mostly fair, but not always. Some libraries, e.g. Intel Threading Building Blocks, provide special mutexes that are fair, but these are not based on the OS's native mutexes and are usually implemented as spin-locks (which have their own caveats).

Does std::mutex favor the thread that owns it?

A mutex makes zero guarantees about fairness.

Unlocking a mutex does not suspend your current thread. Attempting to lock a mutex does not say "wait, someone else has been waiting longer, they should get a go at it".

Blocking on a mutex can sometimes put your thread to sleep.

After you unlock a mutex, you aren't "the owning thread"; you are probably a running thread. And a mutex can (and apparently does) favor running threads over threads that are suspended.

Implementing "fairness" can be done on top of C++ synchronization primitives, but it isn't free, and C++ aims not to make you pay for anything you don't ask for.

When is a mutex preferable to a condition variable?

You should almost always use a mutex. It should be your default means of synchronization. It's light and efficient. There are two cases where you should use a condition variable:

  1. The very rare case where the mutex will be locked most of the time. Typically, most work done by threads doesn't impact shared data so most threads do most of their work holding no mutexes. In the rare case where a lock would be held most of the time, a mutex is not a good fit. Your example falls into this case.

  2. The less rare case is where you need one thread to wait for another thread to finish something. For example, say you have a cache that holds some data. A thread might acquire a lock on the cache and see if some data is in the cache. If so, it'll use the cached copy. If not, it'll work out the data itself and then put it in the cache. But what if a new thread looks for the data in the cache while another thread is working it out? That thread needs to just wait for the other thread to put the data in the cache. A mutex is a bad fit for making one thread wait for another.
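Here is a minimal sketch of that second case. The cache class, its get_or_compute member, and the expensive_computation helper are hypothetical names for illustration only; a thread that finds the entry being computed waits on a condition variable until the computing thread signals that the entry is ready:

#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Hypothetical cache: one thread computes a value while others wait for it.
class cache
{
    std::mutex m;
    std::condition_variable cv;
    std::map<std::string, std::string> data;   // finished entries
    std::map<std::string, bool> in_progress;   // entries currently being computed

    static std::string expensive_computation(const std::string& key)
    {
        return "value-for-" + key;              // stand-in for real work
    }

public:
    std::string get_or_compute(const std::string& key)
    {
        std::unique_lock<std::mutex> lk(m);

        // If another thread is already computing this entry, just wait for it.
        cv.wait(lk, [&] { return !in_progress[key]; });

        auto it = data.find(key);
        if (it != data.end())
            return it->second;                  // use the cached copy

        // Compute it ourselves; drop the lock while doing the slow work.
        in_progress[key] = true;
        lk.unlock();
        std::string value = expensive_computation(key);
        lk.lock();

        data[key] = value;
        in_progress[key] = false;
        cv.notify_all();                        // wake threads waiting on this entry
        return value;
    }
};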

For the typical case where you have some shared data that needs to be briefly accessed by more than one thread, a mutex is the best fit. In the vast majority of cases, the mutex will be unowned when a thread tries to acquire it, so whether it provides fairness or not won't matter at all.

Waiting should be rare. If one thread spends most of its time waiting for another thread, it usually shouldn't be its own thread. In your example, you have two threads where neither can make any progress unless the other is stopped and either can make infinite progress while the other is stopped. That almost never happens in any realistic situation and usually indicates a serious design problem.

If you are worried about fairness, you are doing something wrong. Your code should only do work that you want done. If your code is doing the wrong work, then you didn't code it to do the work you most wanted done. Fix that. Generally speaking, it's the implementation's job to make your code make as much forward progress as possible and it's your job to make your code make the right forward progress.

Here's a quick and dirty implementation of a fair lock that checks if another thread is already waiting for the lock and gives it a chance to acquire the lock:

#include <mutex>
#include <thread>
#include <condition_variable>
#include <iostream>

class fair_lock
{
private:
    std::mutex m;
    std::condition_variable cv;
    int locked = 0;
    int waiter_count = 0;

public:
    void lock()
    {
        std::unique_lock<std::mutex> lk(m);

        ++waiter_count;

        // if someone was already waiting, give them a turn
        if (waiter_count > 1)
            cv.wait(lk);

        // wait for lock to be unlocked
        while (locked != 0)
            cv.wait(lk);

        --waiter_count;
        locked = 1;
    }

    void unlock()
    {
        std::unique_lock<std::mutex> lk(m);
        locked = 0;
        cv.notify_all();
    }
};

int main()
{
    fair_lock m;

    std::thread t([&]()
    {
        while (true)
        {
            std::unique_lock<fair_lock> lk(m);
            std::cerr << "#";
            std::cerr.flush();
        }
    });

    while (true)
    {
        std::unique_lock<fair_lock> lk(m);
        std::cerr << ".";
        std::cerr.flush();
    }
}

..#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.#.

Notice the two dots at the beginning? One thread was able to run twice before the other thread got started. This "fair" lock allows one thread to continue to make forward progress if no other thread is waiting.

This implementation only holds the mutex while the fair lock is being acquired or released, so contention on the mutex is minimal. The condition variable is used to allow one thread to wait for another to make forward progress. The code itself ensures that the desired thread makes forward progress (by blocking the thread we don't want to make forward progress), properly allowing the implementation to focus on allowing the code to make as much forward progress as possible.

Do mutex locks happen in the same order they are requested?

No, you cannot guarantee that your thread loop will ever acquire the lock with your example as is. Use a condition variable to signal to the thread loop that it should wake and take the lock. See std::condition_variable::wait(...).

More on this topic in general can be found at http://en.wikipedia.org/wiki/Condition_variable. If you were using the pthread library directly, the equivalent calls would be pthread_cond_wait in your "Thread loop" and pthread_cond_signal in your runTask function.
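A minimal sketch of that signalling pattern with the standard library; the ready flag and the thread_loop and runTask names here are placeholders standing in for the code in the question:

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool ready = false;   // set when there is work for the thread loop

// Thread loop: sleep until runTask signals that work is available.
void thread_loop()
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready; });   // returns holding the lock, predicate true
    ready = false;
    // ... do the work, either under the lock or after copying it out ...
}

// runTask: publish the work, then wake the waiting thread.
void runTask()
{
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;
    }
    cv.notify_one();
}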

Why does a redundant extra scoping block affect std::lock_guard behaviour?

I tested your code (with the block scope removed) on Linux with GCC (7.3.0) using pthreads and got results similar to yours. The main thread is starved, although if I waited long enough, I would occasionally see the main thread do some work.

However, I ran the same code on Windows with MSVC (19.15) and no thread was starved.

It looks like you're using POSIX, so I'd guess your standard library uses pthreads on the back end (I have to link pthreads even with C++11). Pthreads mutexes don't guarantee fairness, but that's only half the story. Your output seems to be related to the usleep call.

If I take out the usleep, I see fairness (Linux):

// fair again
while (true)
{
    std::lock_guard<std::mutex> thread_lock(m);
    std::cerr << "#";
    std::cerr.flush();
}

My guess is that, due to sleeping so long while holding the mutex, it is virtually guaranteed that the main thread will be as blocked as blocked can be. Imagine that at first the main thread might try to spin in hope that the mutex will become available soon. After a while, it might get put on the waiting list.

In the auxiliary thread, the lock_guard object is destroyed at the end of the loop body, so the mutex is released. That wakes the main thread, but the auxiliary thread immediately constructs a new lock_guard, which locks the mutex again. It's unlikely that the main thread will grab the mutex, because it was only just scheduled. So unless a context switch occurs in this small window, the auxiliary thread will probably get the mutex again.

In the code with the scope block, the mutex in the auxiliary thread is released before the IO call. Printing to the screen takes a long time, so there is plenty of time for the main thread to get a chance to grab the mutex.
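For reference, the scoped version from the question looks roughly like this (a reconstruction; the timing value is assumed from the variant shown further below):

while (true)
{
    {
        std::lock_guard<std::mutex> thread_lock(m);
        usleep(10 * 1000);
    }   // mutex released here, before the slow IO
    std::cerr << "#";
    std::cerr.flush();
}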

As @Ted Lyngmo said in his answer, if you add a sleep before the lock_guard is created, it makes starvation much less likely.

while (true)
{
    usleep(1);
    std::lock_guard<std::mutex> thread_lock(m);
    usleep(10 * 1000);
    std::cerr << "#";
    std::cerr.flush();
}

I also tried this with yield instead of the sleep, but I needed something like five or more yields to make it fairer, which leads me to believe that there are other nuances in the actual library implementation details, the OS scheduler, and caching and memory subsystem effects.
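Roughly what that looked like, assuming the same mutex m and loop as above (the count of five is just what happened to work on my machine):

while (true)
{
    for (int i = 0; i < 5; ++i)
        std::this_thread::yield();   // give the main thread a chance to run
    std::lock_guard<std::mutex> thread_lock(m);
    usleep(10 * 1000);
    std::cerr << "#";
    std::cerr.flush();
}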

By the way, thanks for a great question. It was really easy to test and play around with.

What is the preferred way of using lock_guard and mutex

You are unlocking the mutex only to lock it again immediately afterwards. What happens depends on how the mutex is implemented, but a typical non-fair implementation will wake one waiting thread, and your thread will then grab the mutex again before that thread is able to run, wasting execution time.

If your mutex implementation is fair (think of a ticket lock), then your thread won't be able to lock the mutex right after unlocking it and will have to wait until another thread leaves the critical section. That means that under contention your thread will have to do a context switch at every iteration, wasting execution time.
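For illustration, here is a toy ticket-lock sketch (not how std::mutex is implemented): each thread takes a ticket and spins until its number is served, so lock acquisition is strictly first come, first served and the unlocking thread cannot jump the queue.

#include <atomic>

// Toy ticket lock: strictly FIFO, but waiters spin instead of sleeping.
class ticket_lock
{
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};

public:
    void lock()
    {
        unsigned my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != my_ticket)
            ;   // spin until it's our turn
    }

    void unlock()
    {
        now_serving.fetch_add(1, std::memory_order_release);
    }
};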

So the second case (mutex outside the loop) should be more efficient with both fair and non-fair mutex implementations, and that is what you should do.
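In code, the two shapes under discussion look roughly like this; m, items, and process_item are placeholders for whatever your loop actually does:

// Lock inside the loop: acquire and release on every iteration.
for (auto& item : items)
{
    std::lock_guard<std::mutex> lk(m);
    process_item(item);
}

// Lock outside the loop: acquire once and hold it for the whole batch (preferred).
{
    std::lock_guard<std::mutex> lk(m);
    for (auto& item : items)
        process_item(item);
}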

Now you might consider locking the mutex at every iteration if you care about latency, as this allows other threads to run, but this only makes sense with fair mutex implementation.

C++ does not say anything about whether std::mutex is fair or not and most implementations are not fair. So don't expect much from it.

So the only sane and portable way is to put the lock outside the loop, because even if you care about latency, std::mutex is not something that can help you.

What makes the boost::shared_mutex so slow

I don't know the particulars of how the standard library/boost/etc. implementations differ, although it seems like the standard library version is faster (congrats, whoever wrote it).

So instead I'll try to explain the speed differences between various mutex types on a theoretical level, which will explain why a shared mutex should be slower.

Atomic Spin Lock

More as an academic exercise than anything else, consider the simplest thread-safe "mutex-like" implementation: a simple atomic spin lock.

In essence, this is nothing more than an std::atomic<bool> or an std::atomic_flag. It is initialized to false. To "lock" the mutex, you simply do an atomic compare-and-exchange operation in a loop until you get a false value (i.e. the previous value was false prior to atomically setting it to true).

std::atomic_flag flag = ATOMIC_FLAG_INIT;

// lock it by looping until we observe a false value
while (flag.test_and_set()) ;

// do stuff under "mutex" lock

// unlock by setting it back to false state
flag.clear();
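Wrapped up as a class, a sketch of the same idea satisfies the BasicLockable requirements (lock and unlock), so it can even be used with std::lock_guard:

#include <atomic>
#include <mutex>   // for std::lock_guard

// Minimal spin lock built on std::atomic_flag; meets BasicLockable.
class spin_lock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire))
            ;   // spin until we observe the previously clear flag
    }

    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};

// usage:
// spin_lock sl;
// std::lock_guard<spin_lock> guard(sl);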

However, due to the nature of this construct, it's what we call an "unfair" mutex, because the order in which threads acquire the lock is not necessarily the order in which they began their attempts to lock it. That is, under high contention, it's possible for a thread to try to lock it and simply never succeed because other threads are luckier. It's very timing-sensitive. Imagine musical chairs.

Because of this, although it functions like a mutex, it's not what we consider a "mutex".

Mutex

A mutex can be thought of as building on top of an atomic spin lock (although it's not typically implemented as such, since they typically are implemented with support of the operating system and/or hardware).

In essence, a mutex is a step above atomic spin locks because it has a queue of waiting threads. This allows it to be "fair" because the order of lock acquisition is (more or less) the same as the order of locking attempts.

If you check sizeof(std::mutex), it might be a bit larger than you might expect. On my platform it's 40 bytes. That extra space is used to hold state information, notably including some way of accessing a lock queue for each individual mutex.

When you try to lock a mutex, it performs some low-level thread-safety operation to have thread-safe access to the mutex's status information (e.g. atomic spin lock), checks the state of the mutex, adds your thread to the lock queue, and (typically) puts your thread to sleep while you wait so you don't burn precious CPU time. The low-level thread-safety operation (e.g. the atomic spin lock) is atomically released at the same time the thread goes to sleep (this is typically where OS or hardware support is necessary to be efficient).

Unlocking is done by performing a low-level thread-safe operation (e.g. the atomic spin lock), popping the next waiting thread from the queue, and waking it up. The thread that has been woken now "owns" the lock. Rinse, wash, and repeat.

Shared Mutex

A shared mutex takes this concept a step further. It can be owned by a single thread for read/write permissions (like a normal mutex), or by multiple threads for read-only permissions (semantically, anyway - it's up to the programmer to ensure it's safe).

Thus, in addition to the unique ownership queue (like a normal mutex) it also has a shared ownership state. The shared ownership state could be simply a count of the number of threads that currently have shared ownership. If you inspect sizeof(std::shared_mutex) you'll find it's typically even larger than std::mutex. On my system, for instance, it's 56 bytes.
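A quick way to check these sizes on your own platform (the 40 and 56 bytes above are from mine and will vary by implementation):

#include <iostream>
#include <mutex>
#include <shared_mutex>

int main()
{
    std::cout << "sizeof(std::mutex)        = " << sizeof(std::mutex) << '\n';
    std::cout << "sizeof(std::shared_mutex) = " << sizeof(std::shared_mutex) << '\n';
}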

So when you go to lock a shared mutex, it has to do everything a normal mutex does, but additionally has to verify some other things. For instance, if you're trying to lock uniquely, it must verify that there are no shared owners. And if you're trying to lock in shared mode, it must verify that there are no unique owners.

Because we typically want mutexes to be "fair", once a unique locker is in the queue, future sharing lock attempts must queue instead of acquiring the lock, even though it might currently be under sharing (i.e. non-unique) lock by several threads. This is to ensure shared owners don't "bully" a thread that wants unique ownership.

But this also goes the other way: the queuing logic must ensure a shared locker is never placed into an empty queue during shared ownership (because it should immediately succeed and be another shared owner).

Additionally, if there's a unique locker, followed by a shared locker, followed by a unique locker, it must (roughly) guarantee that acquisition order. So each entry in the lock queue also needs a flag denoting its purpose (i.e. shared vs. unique).

And then we think of the wake-up logic. When you unlock a shared mutex, the logic differs depending on the current ownership type of the mutex. If the unlocking thread has unique ownership or is the last shared owner it might have to wake up some thread(s) from the queue. It will either wake up all threads at the front of the queue who are requesting shared ownership, or a single thread at the front of the queue requesting unique ownership.

As you can imagine, all of this extra logic for who's locking for what reasons, and how it changes depending not only on the current owners but also on the contents of the queue, makes this potentially quite a bit slower. The hope is that you read significantly more frequently than you write, and thus you can have many sharing owners running concurrently, which mitigates the performance hit of coordinating all of that.
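That read-mostly pattern is exactly what std::shared_mutex is for. A minimal usage sketch, with a hypothetical read-mostly value as the shared data:

#include <mutex>
#include <shared_mutex>

// Hypothetical read-mostly data guarded by a shared mutex.
class stats
{
    mutable std::shared_mutex sm;
    long value = 0;

public:
    long read() const
    {
        std::shared_lock<std::shared_mutex> lk(sm);   // many readers may hold this at once
        return value;
    }

    void write(long v)
    {
        std::unique_lock<std::shared_mutex> lk(sm);   // exclusive access for writers
        value = v;
    }
};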


