Std::Mutex Performance Compared to Win32 Critical_Section

Win32 Critical Section vs Mutex Performance

I think there are two factors:

Mainly - Your program is dominated by thread creation overhead. You are creating and destroying 2000 threads, and only accessing the mutex/CS once per thread. The time spent creating threads swamps the difference in lock/unlock times.

Also - You may not be testing the use case that these locks were optimized for. Try spawning two threads that each try to access the mutex/CS thousands of times.

Why is std::mutex twice as slow as CRITICAL_SECTION

A std::mutex provides non-recursive ownership semantics. A CRITICAL_SECTION provides recursive semantics. So I assume the extra layer in the std::mutex implementation is (at least in part) to resolve this difference.

Update: Stepping through the code, it looks like std::mutex is implemented in terms of a queue and InterlockedX instructions rather than a classical Win32 CRITICAL_SECTION. Even though std::mutex is non-recursive, the underlying code in the RTL can optionally handle recursive and even timed locks.

Cost of mutex,critical section etc on Windows

Considering the specific purpose of Critical Sections and Mutexes, I don't think you can ask about their cost in isolation: when you need multiple threads touching the same data, you don't have much of an alternative. Obviously, if you just need to increment/decrement a number, you can use the Interlocked*() functions on a volatile number and you're good to go. But for anything more complex, you need a synchronization object.

Start your reading here on the Synchronization Objects available on Windows. All functions are listed there, nicely grouped and properly explained. Some are Windows 8 only.

Regarding your question: Critical Sections are less expensive than Mutexes, as they are designed to operate within a single process. Read the documentation on critical section objects, or just the following quote.

A critical section object provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads of a single process. Event, mutex, and semaphore objects can also be used in a single-process application, but critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization (a processor-specific test and set instruction). Like a mutex object, a critical section object can be owned by only one thread at a time, which makes it useful for protecting a shared resource from simultaneous access. Unlike a mutex object, there is no way to tell whether a critical section has been abandoned.

I use Critical Sections for same-process synchronization and Mutexes for cross-process synchronization. Only when I REALLY need to know whether a synchronization object was abandoned do I use a Mutex within a single process.

So, if you need a sync object, the question is not what are the costs but which is cheaper :) There's really no alternative but memory corruption.

PS: There might be alternatives like the one mentioned in the selected answer linked above, but I always go for core platform-specific functionality over cross-platformness. It's always faster! So if you use Windows, use the tools of Windows :)

UPDATE

Based on your needs, you might be able to reduce the need for sync objects by doing as much self-contained work in each thread as possible and only combining the data at the end, or every now and then.

Stupid Example: Take a list of URLs. You need to scrape them and analyze them.

  1. Throw in a bunch of threads that pick URLs, one by one, from the input list. For each one you process, you centralize the results as you go. It's real time and cool.
  2. Or throw in the threads with each one owning a slice of the input URLs. This removes the need to synchronize the selection process. You store the analysis results in the thread, and at the end you combine them just once. Or once every 10 URLs, say, rather than for each one. This reduces the sync operations dramatically.

So costs can be lowered by choosing the right tool and by thinking about how to reduce the number of lock and unlock operations. But the costs cannot be removed :)

PS: I only think in URLs :)

UPDATE 2:

Had the need in a project to do some measuring. And the results were quite surprising:

  • A std::mutex is most expensive. (price of cross-platformness)
  • A Windows native Mutex is 2x faster than std.
  • A Critical Section is 2x faster than the native Mutex.
  • A SlimReadWriteLock is within ±10% of the Critical Section.
  • My homemade InterlockedMutex (spinlock) is 1.25x - 1.75x faster than the Critical Section.

What is the difference between mutex and critical section?

For Windows, critical sections are lighter-weight than mutexes.

Mutexes can be shared between processes, but always result in a system call to the kernel which has some overhead.

Critical sections can only be used within one process, but have the advantage that they only switch to kernel mode in the case of contention - Uncontended acquires, which should be the common case, are incredibly fast. In the case of contention, they enter the kernel to wait on some synchronization primitive (like an event or semaphore).

I wrote a quick sample app that compares the time between the two of them. On my system for 1,000,000 uncontended acquires and releases, a mutex takes over one second. A critical section takes ~50 ms for 1,000,000 acquires.

Here's the test code. I ran it with the mutex timed first and timed second and got similar results either way, so we aren't seeing any ordering effects.

#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
    CRITICAL_SECTION critSec;
    InitializeCriticalSection(&critSec);

    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    LARGE_INTEGER start, end;

    // Force code into memory, so we don't see any effects of paging.
    EnterCriticalSection(&critSec);
    LeaveCriticalSection(&critSec);

    QueryPerformanceCounter(&start);
    for (int i = 0; i < 1000000; i++)
    {
        EnterCriticalSection(&critSec);
        LeaveCriticalSection(&critSec);
    }
    QueryPerformanceCounter(&end);

    int totalTimeCS = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);

    // Force code into memory, so we don't see any effects of paging.
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);

    QueryPerformanceCounter(&start);
    for (int i = 0; i < 1000000; i++)
    {
        WaitForSingleObject(mutex, INFINITE);
        ReleaseMutex(mutex);
    }
    QueryPerformanceCounter(&end);

    int totalTime = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);

    printf("Mutex: %d CritSec: %d\n", totalTime, totalTimeCS);

    DeleteCriticalSection(&critSec);
    CloseHandle(mutex);
    return 0;
}

Why is std::mutex so much worse than std::shared_mutex in Visual C++?

TL;DR: unfortunate combination of backward compatibility and ABI compatibility issues makes std::mutex bad until the next ABI break. OTOH, std::shared_mutex is good.


A decent implementation of std::mutex would try an atomic operation to acquire the lock; if the lock is busy, it would possibly spin in a read loop (with some pause on x86), and ultimately resort to an OS wait.

There are a couple of ways to implement such std::mutex:

  1. Directly delegate to corresponding OS APIs that do all of above.
  2. Do spinning and atomic thing on its own, call OS APIs only for OS wait.

Sure, the first way is easier to implement, friendlier to debug, and more robust. So it appears to be the way to go. The candidate APIs are:

  • CRITICAL_SECTION APIs. A recursive mutex that lacks a static initializer and needs explicit destruction.
  • SRWLOCK. A non-recursive shared mutex that has a static initializer and doesn't need explicit destruction.
  • WaitOnAddress. An API to wait on a particular variable to change, similar to a Linux futex.

These primitives have OS version requirements:

  • CRITICAL_SECTION has existed since, I think, Windows 95 (though TryEnterCriticalSection was not present on Windows 9x); the ability to use a CRITICAL_SECTION with a CONDITION_VARIABLE was added in Windows Vista, along with CONDITION_VARIABLE itself.
  • SRWLOCK has existed since Windows Vista, but TryAcquireSRWLockExclusive only since Windows 7, so SRWLOCK can directly implement std::mutex only starting with Windows 7.
  • WaitOnAddress was added in Windows 8.

By the time std::mutex was added, the Visual Studio C++ library still needed to support Windows XP, so it was implemented by doing things on its own. In fact, std::mutex and the other sync primitives were delegated to ConCRT (the Concurrency Runtime).

For Visual Studio 2015, the implementation was switched to use the best available mechanism: SRWLOCK starting with Windows 7, and CRITICAL_SECTION starting with Windows Vista. ConCRT turned out not to be the best mechanism, but it was still used for Windows XP and 2003. The polymorphism was implemented by placement-new'ing classes with virtual functions into a buffer provided by std::mutex and the other primitives.

Note that this implementation makes it impossible for std::mutex to be constexpr, because of the runtime detection, the placement new, and the inability of the pre-Windows 7 implementation to have only a static initializer.

As time passed, support for Windows XP was finally dropped in VS 2019, and support for Windows Vista in VS 2022. A change was made to avoid ConCRT usage, and another is planned to avoid even the runtime detection of SRWLOCK (disclosure: I've contributed these PRs). Still, due to ABI compatibility from VS 2015 through VS 2022, it is not possible to simplify the std::mutex implementation enough to avoid all this placement new of classes with virtual functions.

What is sadder: although SRWLOCK has a static initializer, the said compatibility still prevents a constexpr mutex, because we have to placement-new the implementation there. And it is not possible to avoid the placement new and construct the implementation right inside std::mutex, because std::mutex has to be a standard-layout class (see Why is std::mutex a standard-layout class?).

So the size overhead comes from the size of ConCRT mutex.

And the runtime overhead comes from the chain of calls:

  • library function call to get to the standard library implementation
  • virtual function call to get to SRWLOCK-based implementation
  • finally Windows API call.

The virtual function call is more expensive than usual because the standard library DLLs are built with /guard:cf.

Some part of the runtime overhead is due to std::mutex filling in the ownership count and the owning thread, even though this information is not required for SRWLOCK; it comes from the internal structure shared with recursive_mutex. The extra information may be helpful for debugging, but it does take time to fill in.


std::shared_mutex was designed to support only systems starting with Windows 7, so it uses SRWLOCK directly.

The size of std::shared_mutex is the size of SRWLOCK. SRWLOCK has the same size as a pointer (though internally it is not a pointer).

It still involves some avoidable overhead: it calls into the C++ runtime library just to call the Windows API, instead of calling the Windows API directly. This looks fixable with the next ABI, though.

The std::shared_mutex constructor could be constexpr, as SRWLOCK does not need a dynamic initializer, but the standard prohibits voluntarily adding constexpr to standard classes.

Comparison of Win32 CMutex and Standard Library std::mutex

The functionality difference is clear: CMutex maps directly to the Win32 mutex type, while std::mutex is far more basic, likely implemented on top of a Win32 CRITICAL_SECTION with the recursive behavior removed, and std::recursive_mutex wrapping CRITICAL_SECTION directly. Those would behave similarly to CCriticalSection.

CMutex is a heavyweight that in practice is used to create named mutexes for interprocess communication. You should not use it intraprocess.

If your question comes down to comparing recursive_mutex vs CCriticalSection, I'd bet on practically the same performance. Interface-wise, CSingleLock has a completely braindead interface (it takes a second argument that defaults to FALSE instead of TRUE), so in practice I never used it directly, only through a macro, to avoid mistakes.

In new code I'd first try to solve things using std::future and fiddle with locks only as a last resort. The C++11 threading facilities make perfect sense to use, so unless you need CMultiLock functionality, they are the better choice. I have not yet explored how to cover the latter case, but I would be surprised if it couldn't be done easily.

Does Boost have support for Windows EnterCriticalSection API?

So, it looks like the default Boost mutex doesn't support it, but asio::detail::mutex does.

So I ended up using that:

#include <boost/asio/detail/mutex.hpp>
#include <boost/thread.hpp>

using boost::asio::detail::mutex;
using boost::lock_guard;

int myFunc()
{
    static mutex mtx;
    lock_guard<mutex> lock(mtx);
    . . .
}

