Does the C++ Volatile Keyword Introduce a Memory Fence?

Does the C++ volatile keyword introduce a memory fence?

Rather than explaining what volatile does, allow me to explain when you should use volatile.

  • When inside a signal handler, because writing to a variable of type volatile std::sig_atomic_t is pretty much the only thing the standard allows you to do from within a signal handler. Since C++11 you can also use std::atomic for that purpose, but only if the atomic is lock-free.
  • When dealing with setjmp, according to Intel (and the standard): local variables modified between the setjmp call and the longjmp must be volatile-qualified to have well-defined values afterwards.
  • When dealing directly with hardware and you want to ensure that the compiler does not optimize your reads or writes away.

For example:

volatile int *foo = some_memory_mapped_device;
while (*foo)
    ;  // wait until *foo turns false

Without the volatile qualifier, the compiler is allowed to optimize the loop away entirely. The volatile qualifier tells the compiler that it may not assume that two subsequent reads return the same value.

Note that volatile has nothing to do with threads. The above example does not work if a different thread is writing to *foo, because there is no acquire operation involved.
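If a different thread really is writing the flag, the portable way since C++11 is std::atomic with a release store and an acquire load. A minimal sketch (the variable names are made up for illustration):

#include <atomic>

std::atomic<bool> data_ready{false};  // hypothetical flag shared between threads
int payload = 0;                      // ordinary data published via the flag

void producer()
{
    payload = 42;
    data_ready.store(true, std::memory_order_release);  // publish
}

void consumer()
{
    // Acquire load: once 'true' is observed, the write to 'payload' is visible too.
    while (!data_ready.load(std::memory_order_acquire))
        ;  // spin (a real program would yield, wait, or use a condition variable)
    int value = payload;
    (void)value;
}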

In all other cases, usage of volatile should be considered non-portable and should no longer pass code review, except when dealing with pre-C++11 compilers and compiler extensions (such as MSVC's /volatile:ms switch, which is enabled by default when targeting x86/x64).

How do I Understand Read Memory Barriers and Volatile

There are read barriers and write barriers; acquire barriers and release barriers. And more (I/O vs. memory, etc.).

The barriers are not there to control "latest" value or "freshness" of the values. They are there to control the relative ordering of memory accesses.

Write barriers control the order of writes. Because writes to memory are slow (compared to the speed of the CPU), there is usually a write-request queue where writes are posted before they 'really happen'. Although they are queued in order, while inside the queue the writes may be reordered. (So maybe 'queue' isn't the best name...) Unless you use write barriers to prevent the reordering.

Read barriers control the order of reads. Reads may happen out of order because of speculative execution (the CPU looks ahead and loads from memory early) and because of the write buffer (the CPU will read a value from the write buffer instead of memory if it is there; i.e., if the CPU thinks it just wrote X = 5, why read it back, it can just see that the value is still waiting to become 5 in the write buffer).

This is true regardless of what the compiler tries to do with respect to the order of the generated code. ie 'volatile' in C++ won't help here, because it only tells the compiler to output code to re-read the value from "memory", it does NOT tell the CPU how/where to read it from (ie "memory" is many things at the CPU level).

So read/write barriers put up blocks to prevent reordering in the read/write queues (the read isn't usually so much of a queue, but the reordering effects are the same).

What kinds of blocks? - acquire and/or release blocks.

Acquire - eg read-acquire(x) will add the read of x into the read-queue and flush the queue (not really flush the queue, but add a marker saying don't reorder anything before this read, which is as if the queue was flushed). So later (in code order) reads can be reordered, but not before the read of x.

Release - eg write-release(x, 5) will flush (or marker) the queue first, then add the write-request to the write-queue. So earlier writes won't become reordered to happen after x = 5, but note that later writes can be reordered before x = 5.

Note that I paired the read with acquire and write with release because this is typical, but different combinations are possible.

Acquire and Release are considered 'half-barriers' or 'half-fences' because they only stop the reordering from going one way.

A full barrier (or full fence) applies both an acquire and a release - ie no reordering.
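In C++ these half-fences and the full fence map onto std::atomic_thread_fence (a full fence is std::atomic_thread_fence(std::memory_order_seq_cst)). A minimal sketch, with the shared variables and function names made up for illustration:

#include <atomic>

int a = 0, b = 0;                // hypothetical shared data
std::atomic<bool> flag{false};

void writer()
{
    a = 1;
    b = 2;
    // Release fence: earlier writes may not be reordered past the flag store below.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

void reader()
{
    if (flag.load(std::memory_order_relaxed))
    {
        // Acquire fence: later reads may not be reordered before the flag load above.
        std::atomic_thread_fence(std::memory_order_acquire);
        int sum = a + b;         // guaranteed to see 1 and 2
        (void)sum;
    }
}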

Typically for lock-free programming, or for C# or Java 'volatile', what you want/need is read-acquire and write-release.

ie

void threadA()
{
    foo->x = 10;
    foo->y = 11;
    foo->z = 12;
    write_release(foo->ready, true);
    bar = 13;
}

void threadB()
{
    w = some_global;
    ready = read_acquire(foo->ready);
    if (ready)
    {
        q = w * foo->x * foo->y * foo->z;
    }
    else
        calculate_pi();
}

So, first of all, this is a bad way to program threads. Locks would be safer. But just to illustrate barriers...

After threadA() is done writing foo, it needs to write foo->ready LAST, really last, else other threads might see foo->ready early and get the wrong values of x/y/z. So we use a write_release on foo->ready, which, as mentioned above, effectively 'flushes' the write queue (ensuring x,y,z are committed) then adds the ready=true request to the queue. And then adds the bar=13 request. Note that since we just used a release barrier (not a full) bar=13 may get written before ready. But we don't care! ie we are assuming bar is not changing shared data.

Now threadB() needs to know that when we say 'ready' we really mean ready. So we do a read_acquire(foo->ready). This read is added to the read queue, THEN the queue is flushed. Note that w = some_global may also still be in the queue. So foo->ready may be read before some_global. But again, we don't care, as it is not part of the important data that we are being so careful about.
What we do care about is foo->x/y/z. So they are added to the read queue after the acquire flush/marker, guaranteeing that they are read only after reading foo->ready.

Note also, that this is typically the exact same barriers used for locking and unlocking a mutex/CriticalSection/etc. (ie acquire on lock(), release on unlock() ).
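For reference, here is roughly how the same pattern is spelled with C++11 std::atomic, which provides exactly these write-release/read-acquire semantics (a sketch, not a drop-in translation of the pseudocode above):

#include <atomic>

struct Foo {
    int x = 0, y = 0, z = 0;
    std::atomic<bool> ready{false};
};

Foo foo_storage;          // assumed to be shared between both threads
Foo* foo = &foo_storage;

void threadA()
{
    foo->x = 10;
    foo->y = 11;
    foo->z = 12;
    foo->ready.store(true, std::memory_order_release);  // write-release
}

void threadB()
{
    if (foo->ready.load(std::memory_order_acquire))      // read-acquire
    {
        // Guaranteed to observe x == 10, y == 11, z == 12 here.
        int q = foo->x * foo->y * foo->z;
        (void)q;
    }
}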

So,

  • I'm pretty sure this (ie acquire/release) is exactly what MS docs say happens for read/writes of 'volatile' variables in C# (and optionally for MS C++, but this is non-standard). See http://msdn.microsoft.com/en-us/library/aa645755(VS.71).aspx including "A volatile read has "acquire semantics"; that is, it is guaranteed to occur prior to any references to memory that occur after it..."

  • I think Java is the same, although I'm not as familiar with it. I suspect it is exactly the same, because you just don't typically need more guarantees than read-acquire/write-release.

  • In your question you were on the right track when thinking that it is really all about relative order - you just had the orderings backwards (i.e. "the values that are read are at least as up-to-date as the reads before the barrier?" - no, reads before the barrier are unimportant; it's the reads AFTER the barrier that are guaranteed to come AFTER it, and vice versa for writes).

  • And please note, as mentioned, reordering happens on both reads and writes, so only using a barrier on one thread and not the other WILL NOT WORK. ie a write-release isn't enough without the read-acquire. ie even if you write it in the right order, it could be read in the wrong order if you didn't use the read barriers to go with the write barriers.

  • And lastly, note that lock-free programming and CPU memory architectures can be actually much more complicated than that, but sticking with acquire/release will get you pretty far.

C - volatile and memory barriers in lockless shared memory access?

  1. Is my understanding correct and complete?

Yeah, it looks that way, except for not mentioning that C11 <stdatomic.h> made all this obsolete for almost all purposes.

There are more bad/weird things that can happen without volatile (or better, _Atomic) that you didn't list: the LWN article Who's afraid of a big bad optimizing compiler? goes into detail about things like inventing extra loads (and expecting them both to read the same value). It's aimed at Linux kernel code, where C11 _Atomic isn't how they do things.

Other than the Linux kernel, new code should pretty much always use <stdatomic.h> instead of rolling your own atomics with volatile and inline asm for RMWs and barriers. But that does continue to work because all real-world CPUs that we run threads across have coherent shared memory, so making a memory access happen in the asm is enough for inter-thread visibility, like memory_order_relaxed. See When to use volatile with multi threading? (basically never, except in the Linux kernel or maybe a handful of other codebases that already have good implementations of hand-rolled stuff).

In ISO C11, it's data-race undefined behaviour for two threads to do unsynchronized read+write on the same object, but mainstream compilers do define the behaviour, just compiling the way you'd expect so hardware guarantees or lack thereof come into play.


Other than that, yeah, looks accurate except for your final question 2: there are use-cases for memory_order_relaxed atomics, which is like volatile with no barriers, e.g. an exit_now flag.
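A minimal sketch of such a flag (shown in C++ syntax with <atomic>; the C11 <stdatomic.h> version is analogous, and the names are made up):

#include <atomic>

std::atomic<bool> exit_now{false};  // no ordering needed, only atomicity

void worker()
{
    while (!exit_now.load(std::memory_order_relaxed))
    {
        // do a unit of work
    }
}

void request_shutdown()
{
    exit_now.store(true, std::memory_order_relaxed);
}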

Or are there cases where only using barriers suffices?

No, unless you get lucky and the compiler happens to generate correct asm anyway.

Or unless other synchronization means this code only runs while no other threads are reading/writing the object. (C++20 has std::atomic_ref<T> to handle the case where some parts of the code need to do atomic accesses to data, but other parts of your program don't, and you want to let them auto-vectorize or whatever. C doesn't have any such thing yet, other than using plain variables with/without GNU C __atomic_load_n() and other builtins, which is how C++ headers implement std::atomic<T>, and which is the same underlying support that C11 _Atomic compiles to. Probably also the C11 functions like atomic_load_explicit defined in stdatomic.h, but unlike C++, _Atomic is a true keyword not defined in any header.)
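A minimal sketch of the std::atomic_ref pattern just described (C++20; the function and variable names are made up):

#include <atomic>
#include <vector>

// Phase 1: only one thread touches v, so plain accesses can auto-vectorize.
void fill(std::vector<int>& v)
{
    for (int& x : v)
        x = 42;
}

// Phase 2: the same element is now shared, so access it atomically,
// without changing the declared type of the container.
void publish(int& slot, int value)
{
    // The referenced int must be suitably aligned (std::atomic_ref<int>::required_alignment).
    std::atomic_ref<int> ref(slot);
    ref.store(value, std::memory_order_release);
}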

Does volatile guarantee anything at all in portable C code for multi-core systems?

To summarize the problem, it appears (after reading a lot) that "volatile" guarantees something like: the value will be read/written not just from/to a register, but at least to the core's L1 cache, in the same order that the reads/writes appear in the code.

No, it absolutely does not. And that makes volatile almost useless for the purpose of MT safe code.

If it did, then volatile would be quite good for variables shared by multiple threads, since ordering the events in the L1 cache is all you need on a typical CPU (whether multi-core, or multi-CPU on a motherboard) capable of cooperating in a way that makes a normal implementation of C/C++ or Java multithreading possible at the typical expected cost (that is, not a huge cost on most atomic or uncontended mutex operations).

But volatile does not provide any guaranteed ordering (or "memory visibility") in the cache either in theory or in practice.

(Note: the following is based on a sound interpretation of the standard documents, the standard's intent, historical practice, and a deep understanding of the expectations of compiler writers. This approach, based on history, actual practices, and the expectations and understanding of real people in the real world, is much stronger and more reliable than parsing the words of a document that is not known to be stellar specification writing and that has been revised many times.)

In practice, volatile does guarantee ptrace-ability, that is, the ability to use debug information for the running program at any level of optimization, and the fact that the debug information makes sense for these volatile objects:

  • you may use ptrace (or a ptrace-like mechanism) to set meaningful break points at the sequence points after operations involving volatile objects: you can really break at exactly those points (note that this works only if you are willing to set many break points, as any C/C++ statement may be compiled to many different assembly start and end points, as in a massively unrolled loop);
  • while a thread of execution is stopped, you may read the value of all volatile objects, as they have their canonical representation (following the ABI for their respective type); a non-volatile local variable could have an atypical representation, f.ex. a shifted one: a variable used for indexing an array might be multiplied by the size of individual objects for easier indexing, or it might be replaced by a pointer to an array element (as long as all uses of the variable are similarly converted) (think changing dx to du in an integral);
  • you can also modify those objects (as long as the memory mappings allow that, as volatile objects with static storage duration that are const-qualified might be in a memory range mapped read-only).

In practice volatile guarantees a little more than the strict ptrace interpretation: it also guarantees that volatile automatic variables have an address on the stack and are not allocated to a register, a register allocation that would make ptrace manipulations more delicate (the compiler can output debug information to explain how variables are allocated to registers, but reading and changing register state is slightly more involved than accessing memory addresses).

Note that full program debug-ability, that is, treating all variables as volatile at least at sequence points, is provided by the "zero optimization" mode of the compiler, a mode which still performs trivial optimizations like arithmetic simplifications (there is usually no guaranteed "no optimization at all" mode). But volatile is stronger than non-optimization: x - x can be simplified to zero for a non-volatile integer x, but not for a volatile one.
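A tiny illustration of that last point (the function names are made up; a compiler may fold the non-volatile version to a constant, but must perform both reads in the volatile one):

int diff_plain(int x)             { return x - x; }  // may be folded to: return 0;
int diff_volatile(volatile int x) { return x - x; }  // both reads of x must actually happen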

So volatile means guaranteed to be compiled as-is, in the same way that the compiler's translation of a system call from source to binary/assembly isn't reinterpreted, changed, or optimized in any way. Note that library calls may or may not be system calls. Many official system functions are actually library functions that offer a thin layer of interposition and generally defer to the kernel in the end. (In particular, getpid doesn't need to go to the kernel and could well read a memory location provided by the OS containing the information.)

Volatile interactions are interactions with the outside world of the real machine, which must follow the "abstract machine". They aren't internal interactions of program parts with other program parts. The compiler can only reason about what it knows, that is the internal program parts.

The code generation for a volatile access should follow the most natural interaction with that memory location: it should be unsurprising. That means that some volatile accesses are expected to be atomic: if the natural way to read or write the representation of a long on the architecture is atomic, then it's expected that a read or write of a volatile long will be atomic, as the compiler should not generate silly inefficient code to access volatile objects byte by byte, for example.

You should be able to determine that by knowing the architecture. You don't have to know anything about the compiler, as volatile means that the compiler should be transparent.

But volatile does no more than force the emission of the expected assembly, the least optimized for the particular case, to perform a memory operation: volatile semantics means general-case semantics.

The general case is what the compiler does when it doesn't have any information about a construct: f.ex. calling a virtual function on an lvalue via dynamic dispatch is a general case, while making a direct call to the overrider after determining at compile time the type of the object designated by the expression is a particular case. The compiler always has a general-case handling of all constructs, and it follows the ABI.

Volatile does nothing special to synchronize threads or provide "memory visibility": volatile only provides guarantees at the abstract level seen from inside a thread that is executing or stopped, that is, from inside a CPU core:

  • volatile says nothing about which memory operations reach main RAM (you may set specific memory caching types with assembly instructions or system calls to obtain these guarantees);
  • volatile doesn't provide any guarantee about when memory operations will be committed to any level of cache (not even L1).

Only the second point means volatile is not useful in most inter threads communication problems; the first point is essentially irrelevant in any programming problem that doesn't involve communication with hardware components outside the CPU(s) but still on the memory bus.

The property of volatile providing guaranteed behavior from the point of view of the core running the thread means that asynchronous signals delivered to that thread, which run within the execution ordering of that thread, see operations in source-code order.

Unless you plan to send signals to your threads (an extremely useful approach to consolidation of information about currently running threads with no previously agreed point of stopping), volatile is not for you.
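For completeness, the classic signal-flag use case looks something like this (a sketch; the flag and handler names are made up):

#include <csignal>

volatile std::sig_atomic_t got_signal = 0;   // the one classic portable use of volatile

void handler(int)
{
    got_signal = 1;   // writing a volatile sig_atomic_t is allowed in a handler
}

int main()
{
    std::signal(SIGINT, handler);
    while (!got_signal)
    {
        // do work; the volatile read cannot be hoisted out of the loop
    }
}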

Is volatile necessary for the resource used in a critical section?

I am curious about whether volatile is necessary for the resources used in a critical section. Suppose I have two threads executing on two CPUs and competing for a shared resource. I know I need a locking mechanism to make sure only one thread performs operations on that shared resource.

Making sure that only one thread accesses a shared resource at a time is only part of what a locking mechanism adequate for the purpose will do. Among other things, such a mechanism will also ensure that all writes to shared objects performed by thread Ti before it releases lock L are visible to all other threads Tj after they subsequently acquire lock L. And that holds in terms of the C semantics of the program, notwithstanding any questions of compiler optimization, register usage, CPU instruction reordering, or the like.

When such a locking mechanism is used, volatile does not provide any additional benefit for making threads' writes to shared objects be visible to each other. When such a locking mechanism is not used, volatile does not provide a complete substitute.

C's built-in (since C11) mutexes provide a suitable locking mechanism, at least when using C's built-in threads. So do pthreads mutexes, Sys V and POSIX semaphores, and various other, similar synchronization objects available in various environments, each with respect to corresponding multithreading systems. These semantics are pretty consistent across C-like multithreading implementations, extending at least as far as Java. The semantic requirements for C's built-in multithreading are described in section 5.1.2.4 of the current (C17) language spec.

volatile is for indicating that an object might be accessed outside the scope of the C semantics of the program. That may happen to produce properties that interact with multithreaded execution in a way that is taken to be desirable, but that is not the purpose or intended use of volatile. If it were, or if volatile were sufficient for such purposes, then we would not also need _Atomic objects and operations.


The previous remarks focus on language-level semantics, and that is sufficient to answer the question. However, inasmuch as the question asks specifically about accessing variables' values from registers, I observe that compilers don't actually have to do anything much multithreading-specific in that area as long as acquiring and releasing locks requires calling functions.

In particular, if an execution E of function f writes to an object o that is visible to other functions or other executions of f, then the C implementation must ensure that that write is actually performed on memory before E evaluates any subsequent function call (such as the one needed to release a lock). This is necessary because the value written must be visible to the execution of the called function, regardless of any other threads.

Similarly, if E uses the value of o after return from a function call (such as is needed to acquire a lock) then it must load that value from memory to ensure that it sees the effect of any write that the function may have performed.

The only thing special to multithreading in this regard is that the implementation must ensure that interprocedural analysis optimizations or similar do not subvert the needed memory reads and writes around the lock and unlock functions. In practice, this rarely requires special attention.
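As a concrete illustration (in C++ syntax with made-up names), nothing below needs volatile, because the lock and unlock operations already provide all the required ordering and visibility:

#include <mutex>
#include <thread>

std::mutex m;
long counter = 0;   // plain object: neither volatile nor atomic

void work()
{
    for (int i = 0; i < 100000; ++i)
    {
        std::lock_guard<std::mutex> lock(m);  // acquire
        ++counter;                            // protected access
    }                                         // release at end of scope
}

int main()
{
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // counter == 200000 is guaranteed; the mutex supplies all needed visibility.
}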

How compiler enforces C++ volatile in ARM assembly

So, in the example below, the second store can be executed before the first store since the destinations are disjoint, and hence the stores can be freely reordered.

The volatile keyword limits the reordering (and elision) of instructions by the compiler, but its semantics don't say anything about visibility from other threads or processors.

When you see

        str     r1, [r3]
        str     r2, [r3, #4]

then volatile has done everything required. If the addresses of x and y are I/O mapped to a hardware device, it will have received the x store first. If an interrupt pauses operation of this thread between the two instructions, the interrupt handler will see the x store and not the y. That's all that is guaranteed.


The memory ordering model only describes the order in which effects become observable from other processors. It doesn't alter the sequence in which instructions are issued (which is the order they appear in the assembly code), only the order in which they are committed (i.e., when a store becomes externally visible).

It is certainly possible that a different processor could see the result of the y store before the x - but volatile is not and never has been relevant to that problem. The cross-platform solution to this is std::atomic.


There is unfortunately a load of obsolete C code available on the internet that does use volatile for synchronization - but this is always a platform-specific extension, and was never a great idea anyway. Even less fortunately the keyword was given exactly those semantics in Java (which isn't really used for writing interrupt handlers), increasing the confusion.

If you do see something using volatile like this, it's either obsolete or was incompetently translated from Java. Use std::atomic, and for anything more complex than simple atomic load/store, it's probably better (and is certainly easier) to use std::mutex.

What rules does the compiler have to follow when dealing with volatile memory locations?

A particular and very common optimization that is ruled out by volatile is to cache a value from memory into a register, and use the register for repeated access (because this is much faster than going back to memory every time).

Instead the compiler must fetch the value from memory every time (taking a hint from Zach, I should say that "every time" is bounded by sequence points).

Nor can a sequence of writes make use of a register and only write the final value back later on: every write must be pushed out to memory.

Why is this useful? On some architectures certain IO devices map their inputs or outputs to a memory location (i.e. a byte written to that location actually goes out on the serial line). If the compiler redirects some of those writes to a register that is only flushed occasionally then most of the bytes won't go onto the serial line. Not good. Using volatile prevents this situation.
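A sketch of that memory-mapped I/O situation (the device address and register are made up):

#include <cstdint>

// Hypothetical memory-mapped UART transmit register.
volatile std::uint8_t* const uart_tx =
    reinterpret_cast<volatile std::uint8_t*>(0x40001000);

void send(const char* s)
{
    while (*s)
    {
        // Each store must be emitted: without volatile the compiler could
        // keep the value in a register and only the last byte would ever
        // reach the serial line.
        *uart_tx = static_cast<std::uint8_t>(*s++);
    }
}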


