Are Mutex Lock Functions Sufficient Without Volatile

If the above code is correct, how is it invulnerable to caching
issues?

Prior to C++11 (then known as C++0x), it is not: the C++ standard said nothing about threads, and the same was true of C before C11. So it really depends on the compiler. In general, if the compiler does not guarantee that it will respect ordering constraints on memory accesses for functions or operations that involve multiple threads, you will not be able to write thread-safe code with that compiler. See Hans J. Boehm's paper Threads Cannot Be Implemented as a Library.
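Since C++11, this is specified: unlocking a mutex in one thread synchronizes-with a later lock of the same mutex in another thread, so plain, non-volatile shared data is safe as long as every access is guarded. A minimal C++11 sketch (names are illustrative):

    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex m;
    int shared_counter = 0; // plain int: no volatile anywhere

    void increment_many() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m); // lock/unlock order all accesses
            ++shared_counter;
        }
    }

    int main() {
        std::thread t1(increment_many), t2(increment_many);
        t1.join();
        t2.join();
        std::cout << shared_counter << "\n"; // always 200000 under C++11 rules
    }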

As for what abstractions your compiler should support for thread-safe code, the Wikipedia entry on memory barriers is a pretty good starting point.

(As for why people suggested volatile: some compilers treat volatile accesses as a compiler-level memory barrier. That behavior is definitely not standard.)
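For instance, a flag-signaling pattern like the one below is sometimes seen. It may happen to work where the compiler treats volatile that way and the CPU is forgiving, but nothing in the pre-C++11 standard orders the write to data relative to the write to ready; this is a sketch of the anti-pattern, not a recommendation:

    int data = 0;
    volatile bool ready = false;

    void producer() {
        data = 42;     // nothing stops this store being reordered
        ready = true;  // volatile: not elided, but no ordering guarantee
    }

    void consumer() {
        while (!ready) { }  // volatile: re-read on each iteration
        // data == 42 is NOT guaranteed here
        // (in C++11 terms this is a data race: undefined behavior)
    }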

Does a mutex lock apply to functions called after locking?

As an example, if 2 threads called DXRenderer::draw(), would both threads call drawFRect?

No, because only 1 thread at a time can hold the lock on the mutex.

would it wait until the mutex has been unlocked?

Yes. That is the whole point of a mutex. Once a thread owns the lock on a mutex, any other thread that tries to obtain the lock to the same mutex will wait until the owning thread unlocks it.

And FYI, you should not be calling lock() and unlock() manually; use std::lock_guard (or a similar RAII helper) instead, e.g.:

    void DXRenderer::draw(IDirect3DDevice9* _device) {
        std::lock_guard<std::mutex> lock(mutex);
        // do work as needed...
    } // <-- automatically unlocks here
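For illustration, a self-contained analogue of the same pattern (Renderer here is a stand-in for DXRenderer): whichever thread enters draw() second simply blocks in the lock_guard's constructor until the first one returns.

    #include <mutex>
    #include <thread>

    struct Renderer {
        std::mutex mutex;
        void draw() {
            std::lock_guard<std::mutex> lock(mutex);
            // only one thread is ever inside this section at a time
        }
    };

    int main() {
        Renderer r;
        std::thread t1([&] { r.draw(); });
        std::thread t2([&] { r.draw(); }); // blocks here if t1 holds the mutex
        t1.join();
        t2.join();
    }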

Is volatile necessary for the resource used in a critical section?

I am curious about whether volatile is necessary for the resources used in a critical section. Suppose I have two threads executing on two CPUs, competing for a shared resource. I know I need a locking mechanism to make sure only one thread is performing operations on that shared resource.

Making sure that only one thread accesses a shared resource at a time is only part of what an adequate locking mechanism will do. Among other things, such a mechanism will also ensure that all writes to shared objects performed by thread Ti before it releases lock L are visible to every other thread Tj after it subsequently acquires lock L, and it will guarantee that in terms of the C semantics of the program, notwithstanding any questions of compiler optimization, register usage, CPU instruction reordering, or similar.

When such a locking mechanism is used, volatile does not provide any additional benefit for making threads' writes to shared objects be visible to each other. When such a locking mechanism is not used, volatile does not provide a complete substitute.

C's built-in (since C11) mutexes provide a suitable locking mechanism, at least when using C's built-in threads. So do pthreads mutexes, Sys V and POSIX semaphores, and various other, similar synchronization objects available in various environments, each with respect to corresponding multithreading systems. These semantics are pretty consistent across C-like multithreading implementations, extending at least as far as Java. The semantic requirements for C's built-in multithreading are described in section 5.1.2.4 of the current (C17) language spec.
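As a sketch of that guarantee using a pthreads mutex (the same shape applies to C11's mtx_t and to std::mutex): the reader below needs no volatile, because each pthread_mutex_lock makes the writes published by a prior pthread_mutex_unlock visible.

    #include <pthread.h>

    pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    int shared = 0; // plain int: no volatile needed

    void* writer(void*) {
        pthread_mutex_lock(&mtx);
        shared = 42;                    // this write happens-before the unlock
        pthread_mutex_unlock(&mtx);
        return nullptr;
    }

    void* reader(void*) {
        int seen = 0;
        while (seen == 0) {
            pthread_mutex_lock(&mtx);   // acquiring the lock makes writes done
            seen = shared;              // under it by other threads visible
            pthread_mutex_unlock(&mtx);
        }
        return nullptr;
    }

    int main() {
        pthread_t tw, tr;
        pthread_create(&tr, nullptr, reader, nullptr);
        pthread_create(&tw, nullptr, writer, nullptr);
        pthread_join(tw, nullptr);
        pthread_join(tr, nullptr);
    }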

volatile is for indicating that an object might be accessed outside the scope of the C semantics of the program. That may happen to produce properties that interact with multithreaded execution in a way that is taken to be desirable, but that is not the purpose or intended use of volatile. If it were, or if volatile were sufficient for such purposes, then we would not also need _Atomic objects and operations.


The previous remarks focus on language-level semantics, and that is sufficient to answer the question. However, inasmuch as the question asks specifically about accessing variables' values from registers, I observe that compilers don't actually have to do much that is specific to multithreading in that area, as long as acquiring and releasing locks requires calling functions.

In particular, if an execution E of function f writes to an object o that is visible to other functions or other executions of f, then the C implementation must ensure that the write is actually performed on memory before E evaluates any subsequent function call (such as the one needed to release a lock). This is necessary because the value written must be visible to the execution of the called function, regardless of any other threads.

Similarly, if E uses the value of o after return from a function call (such as is needed to acquire a lock) then it must load that value from memory to ensure that it sees the effect of any write that the function may have performed.

The only thing special to multithreading in this regard is that the implementation must ensure that interprocedural analysis optimizations or similar do not subvert the needed memory reads and writes around the lock and unlock functions. In practice, this rarely requires special attention.
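A sketch of why this falls out of ordinary calling rules; external_call() is a placeholder for any function whose body the compiler cannot see, such as an out-of-line pthread_mutex_unlock (declaration only here, so imagine it defined in another translation unit):

    int shared;            // visible to other translation units
    void external_call();  // opaque: body not visible to the optimizer

    void f() {
        shared = 1;        // must be stored to memory before the call,
        external_call();   // because external_call() might read 'shared'
        int x = shared;    // must be reloaded after the call, because
        (void)x;           // external_call() might have modified 'shared'
    }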

How do mutex lock and unlock functions prevent CPU reordering?

The short answer is that the bodies of the pthread_mutex_lock and pthread_mutex_unlock calls include the necessary platform-specific memory barriers, which prevent the CPU from moving memory accesses within the critical section outside of it. The instruction flow moves from the calling code into the lock and unlock functions via a call instruction, and it is this dynamic instruction trace you have to consider for the purposes of reordering, not the static sequence you see in an assembly listing.

On x86 specifically, you probably won't find explicit, standalone memory barriers inside those methods, since you'll already have lock-prefixed instructions in order to perform the actual locking and unlocking atomically, and these instructions imply a full memory barrier, which prevents the CPU reordering you are concerned about.

For example, on my Ubuntu 16.04 system with glibc 2.23, pthread_mutex_lock is implemented using a lock cmpxchg (compare-and-exchange) and pthread_mutex_unlock is implemented using lock dec (decrement), both of which have full barrier semantics.
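You can see the same effect from portable code with a toy spinlock built on an atomic read-modify-write; on x86, the compare-exchange below typically compiles to a lock cmpxchg, and the implied full barrier is what keeps the critical section's accesses from leaking out (a sketch, not production-quality locking):

    #include <atomic>

    std::atomic<int> lock_word{0};

    void spin_lock() {
        int expected = 0;
        // On x86 this typically compiles to a lock cmpxchg loop; the lock
        // prefix implies a full barrier, so no separate fence is needed.
        while (!lock_word.compare_exchange_weak(expected, 1,
                                                std::memory_order_acquire)) {
            expected = 0; // compare_exchange updated it on failure
        }
    }

    void spin_unlock() {
        // Release ordering keeps critical-section accesses from sinking
        // below the unlock; on x86 a plain store already behaves this way.
        lock_word.store(0, std::memory_order_release);
    }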

Should the mutex manipulation functions' parameter be `volatile`?

The answer depends on whether the implementation of these functions is visible at the call site with the while loop. If it isn't (the call site only sees the declarations, and the definitions are in separate source files), then the volatile keyword will change nothing. The optimizer has absolutely no idea what the function does with its argument, whether it has side effects, and so on, so each function call will actually be made.

On the other hand, if the functions to release and acquire the mutex are inline, so that the complete implementation is visible at the call site, then the optimizer may indeed "tweak" things up a little. The problem is in the word "complete": even if the code you posted were inlined, the code which starts and ends the critical section probably is not. And even if it is, it may use assembly statements the optimizer doesn't understand. Even if it's pure C, it probably accesses some volatile memory-mapped registers. Any such piece of code (a call to an external function, an assembly statement, an access to volatile memory) effectively prohibits the optimizer from eliminating the calls, since it must assume that each call has side effects.
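A sketch of the distinction; acquire() and release() here are hypothetical mutex functions visible only as declarations, and flag is the shared state the loop polls:

    int flag = 0;          // shared state polled by the loop

    void acquire(int* m);  // declarations only: the optimizer must
    void release(int* m);  // assume these calls have side effects

    void wait_for_flag(int* m) {
        for (;;) {
            acquire(m);    // opaque call: 'flag' must be re-read after it
            int done = flag;
            release(m);    // opaque call: ordering around it is preserved
            if (done) break;
        }
    }

No volatile on the parameter (or on flag) is required for this; the opacity of the calls alone forces the reloads.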


