Memory Barrier Generators

Memory barrier generators

Here is my take on the subject and an attempt to provide a quasi-complete list in one answer. If I run across any others, I will edit my answer from time to time.

Mechanisms that are generally agreed upon to cause implicit barriers:

All Monitor class methods, including the C# keyword lock.
All Interlocked class methods.
All Volatile class methods (.NET 4.5+).
Most SpinLock methods including Enter and Exit.
Thread.Join.
Thread.VolatileRead and Thread.VolatileWrite.
Thread.MemoryBarrier.
The volatile keyword.
Anything that starts a thread or causes a delegate to execute on another thread including QueueUserWorkItem, Task.Factory.StartNew, Thread.Start, compiler supplied BeginInvoke methods, etc.
Using a signaling mechanism such as ManualResetEvent, AutoResetEvent, CountdownEvent, Semaphore, Barrier, etc.
Using marshaling operations such as Control.Invoke, Dispatcher.Invoke, SynchronizationContext.Post, etc.
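As a minimal sketch of one mechanism from the list above (the Volatile class), the following publishes a value from one thread to another. The type and member names (PublishExample, _payload, _ready) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical example type; the names are illustrative only.
public static class PublishExample
{
    private static int _payload;   // plain field, written before the flag
    private static bool _ready;    // flag accessed only through Volatile

    public static int Run()
    {
        _payload = 0;
        Volatile.Write(ref _ready, false);

        int observed = -1;
        var reader = new Thread(() =>
        {
            // Volatile.Read has acquire semantics: once the flag is seen as
            // true, the earlier write to _payload is visible as well.
            while (!Volatile.Read(ref _ready))
            {
                Thread.Yield();
            }
            observed = _payload;
        });
        reader.Start();

        _payload = 42;
        // Volatile.Write has release semantics: the payload write above
        // cannot move after this flag write.
        Volatile.Write(ref _ready, true);

        reader.Join();   // Join is itself one of the mechanisms listed above
        return observed;
    }
}
```

Calling PublishExample.Run() should return 42, because the reader copies _payload only after it has seen the flag.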

Mechanisms that are speculated (but not known for certain) to cause implicit barriers:

Thread.Sleep (proposed by me and possibly others, because code that exhibits a memory barrier problem can be fixed with this method)
Thread.Yield
Thread.SpinWait
Lazy<T> depending on which LazyThreadSafetyMode is specified
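For the Lazy<T> entry, here is a small sketch of where the LazyThreadSafetyMode is specified. It only demonstrates the documented behavior of ExecutionAndPublication (the factory runs once even under contention). Whether a barrier is implied remains the speculation stated above. LazyExample and _factoryCalls are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical example type; the names are illustrative only.
public static class LazyExample
{
    private static int _factoryCalls;

    public static int Run()
    {
        _factoryCalls = 0;
        var lazy = new Lazy<int[]>(
            () =>
            {
                // Count how many times the factory actually runs.
                Interlocked.Increment(ref _factoryCalls);
                return new[] { 1, 2, 3 };
            },
            LazyThreadSafetyMode.ExecutionAndPublication);

        var threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
        {
            // Every thread races to read the same lazily created value.
            threads[i] = new Thread(() => GC.KeepAlive(lazy.Value));
            threads[i].Start();
        }
        foreach (Thread t in threads)
        {
            t.Join();
        }
        return _factoryCalls;   // number of factory invocations
    }
}
```

With ExecutionAndPublication, LazyExample.Run() should return 1. With LazyThreadSafetyMode.None, concurrent access is not safe at all.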

Other notable mentions:

Default add and remove handlers for events in C# since they use lock or Interlocked.CompareExchange.
x86 stores have release fence semantics.
Microsoft's implementation of the CLI has release fence semantics on writes despite the fact that the ECMA specification does not mandate it.
MarshalByRefObject seems to suppress certain optimizations in subclasses which may make it appear as if an implicit memory barrier were present. Thanks to Hans Passant for discovering this and bringing it to my attention.¹

¹This explains why BackgroundWorker works correctly without having volatile on the underlying field for the CancellationPending property.

Memory barrier on single core ARM

Why would a context switch on a single core behave differently compared to 2 threads on different cores ? (except any cache coherency issues)

The threads on separate cores may act at exactly the same time. You still have issues on a single core.

Somewhere here on Stack Overflow it is also stated that memory barriers are not required on single core processors.

This information may be taken out of context (or may not provide enough context).

Wikipedia's Memory barrier and Memory ordering pages have sections Out-of-order execution versus compiler reordering optimizations and Compile time/Run time ordering. There are many places in a pipeline where the ordering of memory may matter. In some cases, this may be taken care of by the compiler, by the OS, or by our own code.

Compiler memory barriers apply to a single CPU. They are especially useful with hardware where the ordering and timing of writes and reads matter.

Linux defines some more types of memory barriers:

Write/Store.
Data dependency.
Read/Load.
General memory barriers.

These mostly map fairly well to DMB (DSB and ISB are more for code modification).

The more advanced ARM CPUs have multiple load/store units. In theory, a non-preemptive thread switch (see Note1), especially with aliased memory, could cause an issue in a multi-threaded application on a single CPU. However, it would be fairly hard to construct this case.

For the most part, good memory ordering is handled by the CPU by scheduling instructions. A common case where it does matter with a single CPU is for system level programmers altering CP15 registers. For instance, an ISB should be issued when turning on the MMU. The same may be true for certain hardware/device registers. Finally, a program loader will need barriers as well as cache operations, even on single CPU systems.

UnixSmurf wrote these blogs on memory access ordering:

Intro
Barriers and the Linux kernel
Memory access and the ARM architecture

The topic is complex and you have to be specific about the types of barriers you are discussing.

Note1: I say non-preemptive because, if an interrupt occurs, the single CPU will probably ensure that all outstanding memory requests are complete. With a non-preemptive switch, you do something like longjmp to change threads. In theory, you could change contexts before all writes had completed. The system would only need a DMB in the yield() to avoid it.

Is function call an effective memory barrier for modern platforms?

Memory barriers aren't just there to prevent instruction reordering. Even if instructions aren't reordered, you can still run into problems with cache coherence. As for the reordering, it depends on your compiler and settings. ICC is particularly aggressive with reordering. MSVC with whole program optimization can be, too.

If your shared data variable is declared as volatile, most compilers will generate a memory barrier around reads and writes of the variable and prevent reordering, even though that is not in the spec. This is not the correct way of using volatile, nor what it was meant for.

(If I had any votes left, I'd +1 your question for the narration.)

Thread safe usage of lock helpers (concerning memory barriers)

No, you do not need to do anything special to guarantee that memory barriers are created. This is because almost any mechanism used to get a method executing on another thread produces a release-fence barrier on the calling thread and an acquire-fence barrier on the worker thread (actually, they may be full fence barriers). So either QueueUserWorkItem or Thread.Start will automatically insert the necessary barriers. Your code is safe.
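A minimal sketch of that claim, using Thread.Start: a plain write made before the thread starts is read on the worker without volatile, lock, or an explicit barrier. StartFenceExample and Job are hypothetical names and are not taken from the question's code.

```csharp
using System;
using System.Threading;

// Hypothetical example type; the names are illustrative only.
public static class StartFenceExample
{
    private sealed class Job
    {
        public int Input;    // written by the caller before Start
        public int Output;   // written by the worker before it exits
    }

    public static int Run()
    {
        var job = new Job();
        job.Input = 21;              // plain write, no volatile or lock

        var worker = new Thread(() =>
        {
            // Starting the thread publishes the caller's earlier writes,
            // so Input is expected to be 21 here.
            job.Output = job.Input * 2;
        });

        worker.Start();
        worker.Join();               // Join makes the worker's write visible
        return job.Output;
    }
}
```

StartFenceExample.Run() should return 42. The same reasoning is expected to hold for ThreadPool.QueueUserWorkItem, although the result would then need a separate signal in place of Join.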

As a matter of tangential interest, Thread.Sleep also generates a memory barrier. This is interesting because some people naively use Thread.Sleep to simulate thread interleaving. If this strategy were used to troubleshoot low-lock code, it could very well mask the problem you were trying to find.

Explanation of Thread.MemoryBarrier() Bug with OoOP

It doesn't fix any issues. It's a fake fix, rather dangerous in production code, as it may work, or it may not work.

The core problem is in this line

static bool stop = false;

The variable that stops the while loop is not volatile, which means it may or may not be re-read from memory on every iteration. Its value can be cached, so the loop may only ever see the last value it read (which may not be the actual current value).

This code

// Thread.MemoryBarrier() or Console.WriteLine() fixes issue

may or may not fix the issue on different platforms. A memory barrier or a console write just happens to force the application to read fresh values on a particular system. It may not behave the same way elsewhere.

Additionally, volatile and Thread.MemoryBarrier() only provide weak guarantees, which means they don't provide 100% assurance that a read value will always be the latest on all systems and CPUs.

Eric Lippert says

The true semantics of volatile reads
and writes are considerably more complex than I've outlined here; in
fact they do not actually guarantee that every processor stops what it
is doing and updates caches to/from main memory. Rather, they provide
weaker guarantees about how memory accesses before and after reads and
writes may be observed to be ordered with respect to each other.
Certain operations such as creating a new thread, entering a lock, or
using one of the Interlocked family of methods introduce stronger
guarantees about observation of ordering. If you want more details,
read sections 3.10 and 10.5.3 of the C# 4.0 specification.
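As a hedged sketch of the conventional fix for the stop flag discussed above, the field can be declared volatile so that the loop re-reads it on every iteration. StopFlagExample is a hypothetical reconstruction, not the original code from the question, and the caveats about the weak guarantees of volatile still apply.

```csharp
using System;
using System.Threading;

// Hypothetical reconstruction of the stop-flag loop; not the original code.
public static class StopFlagExample
{
    // volatile keeps the JIT from hoisting the read out of the loop.
    private static volatile bool stop;

    public static bool Run()
    {
        stop = false;

        var worker = new Thread(() =>
        {
            while (!stop)
            {
                // Busy-wait; each iteration performs a fresh volatile read.
            }
        });
        worker.IsBackground = true;   // do not keep the process alive on failure
        worker.Start();

        Thread.Sleep(50);             // let the worker enter its loop
        stop = true;                  // volatile write

        // True if the worker noticed the flag and exited in time.
        return worker.Join(TimeSpan.FromSeconds(5));
    }
}
```

StopFlagExample.Run() is expected to return true. For anything beyond a simple flag, a lock or the Interlocked methods mentioned in the quote give the stronger guarantees.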

Why does multithreaded code using CancellationTokenSource.Cancel require less anti-reordering measures

As the author of The Old New Thing noted in his comment, a source.Cancel(); call placed in multithreaded code is protected from reordering by its internal implementation.

https://referencesource.microsoft.com/#mscorlib/system/threading/CancellationTokenSource.cs,723 states that CancellationTokenSource relies upon Interlocked class methods.

According to Joe Albahari, all methods on the Interlocked class in C# implicitly generate full fences: http://www.albahari.com/threading/part4.aspx#_Memory_Barriers_and_Volatility

So one can freely place a call to the CancellationTokenSource.Cancel method inside a delegate body, without an additional lock or memory barrier, even when it is accessed by multiple tasks.
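A minimal sketch of that usage, assuming several tasks race to cancel the same source while another task polls the token. CancelExample and the task count of four are hypothetical choices for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example type; the names are illustrative only.
public static class CancelExample
{
    public static bool Run()
    {
        using (var source = new CancellationTokenSource())
        {
            CancellationToken token = source.Token;

            // Worker polls the token with no extra lock or barrier.
            Task<bool> worker = Task.Run(() =>
            {
                while (!token.IsCancellationRequested)
                {
                    Thread.Yield();
                }
                return true;
            });

            // Several tasks call Cancel concurrently, again without a lock.
            var cancellers = new Task[4];
            for (int i = 0; i < cancellers.Length; i++)
            {
                cancellers[i] = Task.Run(() => source.Cancel());
            }

            Task.WaitAll(cancellers);

            // True if the worker observed the cancellation in time.
            return worker.Wait(TimeSpan.FromSeconds(5)) && worker.Result;
        }
    }
}
```

CancelExample.Run() is expected to return true. Calling Cancel more than once is harmless, which is why no coordination between the cancelling tasks is needed here.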
