Does Interlocked.CompareExchange Use a Memory Barrier?

Does Interlocked.CompareExchange use a memory barrier?

Any x86 instruction that carries the lock prefix acts as a full memory barrier. As shown in Abel's answer, the Interlocked* APIs, including CompareExchange, use lock-prefixed instructions such as lock cmpxchg, so they imply a full memory fence.

Yes, Interlocked.CompareExchange uses a memory barrier.

Why? Because the x86 processors guarantee it. From Intel's Volume 3A: System Programming Guide Part 1, Section 7.1.2.2:

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

volatile has nothing to do with this discussion. This is about atomic operations; to support atomic operations, x86 guarantees that all previous loads and stores are completed.
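For example, this full-fence behavior is what makes the classic publish/consume handshake below safe. This is a minimal C# sketch (the class and field names are illustrative, not from the question):

using System.Threading;

class FencePublish
{
    int _ready;    // 0 = not ready, 1 = ready
    int _payload;

    public void Produce()
    {
        _payload = 42;
        // lock cmpxchg acts as a full fence: the _payload store above
        // cannot be reordered past this point.
        Interlocked.CompareExchange(ref _ready, 1, 0);
    }

    public bool TryConsume(out int value)
    {
        // The same full fence on the reader side keeps the _payload load
        // from moving before the _ready check. Comparing and exchanging
        // with the same value (1, 1) makes this a fenced read.
        if (Interlocked.CompareExchange(ref _ready, 1, 1) == 1)
        {
            value = _payload;
            return true;
        }
        value = 0;
        return false;
    }
}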

Interlocked and Memory Barriers

The usual pattern for memory barrier usage matches what you would put in the implementation of a critical section, but split into pairs for the producer and consumer. As an example, your critical section implementation would typically be of the form:


while (!pShared->lock.testAndSet_Acquire()) ;
// (This loop should include all the normal critical section stuff:
// spin, waste, pause() instructions, and a last-resort
// give-up-and-block on a resource until the lock is made available.)

// Access to shared memory.

pShared->foo = 1;
v = pShared->goo;

pShared->lock.clear_Release();

The acquire memory barrier above makes sure that any loads (pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if necessary.

The release memory barrier ensures that the load from goo into the (local say) variable v is complete before the lock word protecting the shared memory is cleared.

You have a similar pattern in the typical producer/consumer atomic flag scenario (it is difficult to tell from your sample whether that is what you are doing, but it should illustrate the idea).

Suppose your producer used an atomic variable to indicate that some other state is ready to use. You'll want something like this:


pShared->goo = 14;

pShared->atomic.setBit_Release();

Without a "write" barrier here in the producer you have no guarantee that the hardware isn't going to get to the atomic store before the goo store has made it through the cpu store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).

In the consumer:


if (pShared->atomic.compareAndSwap_Acquire(1, 1))
{
    v = pShared->goo;
}

Without a "read" barrier here you won't know that the hardware hasn't gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only "atomic" with respect to itself, not other memory.

Now, the remaining thing that has to be mentioned is that the barrier constructs are highly unportable. Your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods, and these are the sorts of ways you would use them. Depending on the platform you are using (i.e., ia32), these may very well be exactly what you would get without the _acquire() or _release() suffixes. Platforms where this matters are ia64 (effectively dead except on HP, where it's still twitching slightly) and PowerPC. ia64 had .acq and .rel instruction modifiers on most load and store instructions (including the atomic ones like cmpxchg). PowerPC has separate instructions for this (isync and lwsync give you the read and write barriers, respectively).

Now, having said all this: do you really have a good reason for going down this path? Doing all this correctly can be very difficult. Be prepared for a lot of self-doubt and insecurity in code reviews, and make sure you have a lot of high-concurrency testing with all sorts of random timing scenarios. Use a critical section unless you have a very, very good reason to avoid it, and don't write that critical section yourself.

Do concurrent interlocked operations and reads require a memory barrier or locking?

Is a memory barrier or a lock before return _counter; necessary?

Yes, absolutely. Consider the following code.

while (foo.Counter == 0)
{
    // Do something
}

The problem is that if the contents inside the loop are simple enough, then the C# compiler, the JIT compiler, or the hardware may optimize the code in this manner.

int register = foo._counter;
while (register == 0)
{
    // Do something
}

Or even this.

if (foo._counter == 0)
{
START:
    // Do something
    goto START;
}

Notice that I use _counter instead of Counter as a way of implying that the property will probably be inlined. Then, more importantly, the JIT compiler may "lift" the read of _counter outside of the loop so that it is only read once.

Memory barriers do not provide a freshness guarantee per se. What they do is prevent certain software or hardware optimizations that reorder reads and writes to memory. The freshness guarantee is more of a side effect really.

So, to wrap things up, your Counter property should look like this.

public int Counter
{
    get { return Interlocked.CompareExchange(ref _counter, 0, 0); }
}
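With that getter in place, the earlier polling loop behaves as intended, because every iteration performs a fenced read of _counter instead of reusing a register (Thread.SpinWait here is just a placeholder loop body):

while (foo.Counter == 0)
{
    Thread.SpinWait(1); // each pass re-reads _counter through the full fence
}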

Does one need volatile and/or memory barrier with the following example using Interlocked.CompareExchange?

No, you do not need to use volatile or a memory barrier, because the Interlocked.CompareExchange method implicitly generates a full fence, which prevents both instruction reordering and variable caching.
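The original example from the question is not reproduced here, but the canonical compare-and-swap retry loop illustrates the point; no volatile is needed on the field because every access goes through Interlocked (a sketch, with illustrative names):

using System.Threading;

class Accumulator
{
    double _total; // no volatile: all reads and writes go through Interlocked

    public double Add(double addend)
    {
        double initialValue, computedValue;
        do
        {
            initialValue = _total;                    // optimistic read
            computedValue = initialValue + addend;    // compute on a local copy
        }
        // Publish only if _total is still what we read; the CAS is a full fence.
        // (Comparing doubles this way assumes _total never holds NaN.)
        while (Interlocked.CompareExchange(ref _total, computedValue, initialValue)
               != initialValue);
        return computedValue;
    }
}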

I recommend reading these:

  1. http://www.albahari.com/threading/part4.aspx
  2. http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/

Difference between Interlocked.Exchange and Volatile.Write?

Interlocked.Exchange uses a processor instruction that guarantees an atomic operation.

Volatile.Write also stores the value atomically, but in addition it includes a memory barrier operation: the store has release semantics. I think Microsoft added Volatile.Write in .NET 4.5 because of the support for ARM processors in Windows 8. Intel and ARM processors differ in how they reorder memory operations.
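A sketch of the difference at the call site (the class and field names are illustrative):

using System.Threading;

class Flags
{
    int _state;
    int _previous;

    public void SetWithExchange()
    {
        // Atomic swap with a full fence; also hands back the previous value.
        _previous = Interlocked.Exchange(ref _state, 1);
    }

    public void SetWithVolatileWrite()
    {
        // Plain store with release semantics: earlier writes cannot be
        // reordered past it, but nothing is returned and it is not a full fence.
        Volatile.Write(ref _state, 1);
    }
}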

On Intel, you have a guarantee that memory access operations will be performed in the same order they are issued, or at least that a write operation won't be reordered with other writes.

From Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8:

8.2.2 Memory Ordering in P6 and More Recent Processor Families

The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, and P6 family processors also use a processor-ordered memory-ordering model that can be further defined as “write ordered with store-buffer forwarding.” This model can be characterized as follows.

On ARM you don't have this kind of guarantee, so a memory barrier is required. An ARM blog explaining this can be found here: http://blogs.arm.com/software-enablement/594-memory-access-ordering-part-3-memory-access-ordering-in-the-arm-architecture/

In your example, as the operation on a double is not guaranteed to be atomic, I would recommend a lock to access it. Remember that you have to use the lock in both parts of your code, both when reading and when setting the value.
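A minimal sketch of that recommendation (names are illustrative):

class PriceHolder
{
    readonly object _sync = new object();
    double _value; // double reads/writes are not atomic on 32-bit platforms

    public double Value
    {
        get { lock (_sync) { return _value; } }  // lock on the read side too
        set { lock (_sync) { _value = value; } }
    }
}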

A more complete example would be better to answer your question, as it is not clear what happens after these values are set. For a vector, if you have more readers than writers, consider the use of a ReaderWriterLockSlim object: http://msdn.microsoft.com/en-us/library/system.threading.readerwriterlockslim.aspx

The number of threads and the frequency of reads/writes can dramatically change your locking strategy.

Interlocked.CompareExchange instruction reordering of the initial value

A CPU never "reorders" instructions in a way that could affect the logic of single-threaded execution. In a case like

initialValue = totalValue;
computedValue = initialValue + addend;

the second operation clearly depends on the value set by the previous operation. The CPU "understands" this from its single-threaded perspective, so this sequence will never be reordered. However, the following sequences can be reordered:

initialValue = totalValue;
anotherValue = totalValue;

or

varToInitialize = someVal;
initialized = true;

As you can see, single-core execution won't be affected, but on multiple cores this can cause problems. For example, if we build our logic around the assumption that whenever initialized is set to true, varToInitialize holds its initialized value, we can get into trouble in a multi-core environment:

if (initialized)
{
    var storageForVal = varToInitialize; // can still be uninitialized
    ...
    // do something with storageForVal, assuming we have the correct value
}
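In C#, the usual fix is to publish the flag with release semantics and read it with acquire semantics. A sketch, with field names following the snippet above (ComputeValue is a placeholder):

using System.Threading;

class Publication
{
    int _varToInitialize;
    bool _initialized;

    public void Publish()
    {
        _varToInitialize = ComputeValue();
        // Release store: the _varToInitialize write cannot move after this.
        Volatile.Write(ref _initialized, true);
    }

    public bool TryRead(out int storageForVal)
    {
        // Acquire load: the _varToInitialize read cannot be hoisted above this.
        if (Volatile.Read(ref _initialized))
        {
            storageForVal = _varToInitialize;
            return true;
        }
        storageForVal = 0;
        return false;
    }

    static int ComputeValue() { return 42; } // placeholder
}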

As for local variables: the problem of reordering is a problem of global visibility, that is, the visibility of changes made by one core/CPU to other cores/CPUs. Local variables tend to be visible only to a single thread (apart from some rare scenarios such as closures which, when exposed outside a method, effectively aren't local variables), so other threads have no access to them and other cores/CPUs don't require their global visibility. In other words, in the vast majority of cases you don't need to worry about reordering of operations on local variables.

Threading & implicit memory barriers

Aside from the intricacies of using the phrase "fresh read" too loosely, yes, _task will be reacquired from main memory. However, there may be a separate and even more subtle problem with your code. Consider an alternate, but exactly equivalent, structure for your code, which should make it easier to spot the potential problem.

public void Dispose()
{
    int register = _working;
    if (Interlocked.CompareExchange(ref _working, register, 0) == 1)
    {
        _task.ContinueWith(antecedent => { /* do some other work */ });
    }
}

The second parameter of CompareExchange is passed by value, so it could be cached in a register. I am envisioning the following scenario.

  • Thread A calls Run
  • Thread A does something else with _working that causes it to cache it in a register.
  • Thread B completes the task and calls Exchange from the ContinueWith delegate.
  • Thread A calls Dispose.

In the above scenario _working would change to 1, then to 0, followed by Dispose flipping it back to 1 (because that value was cached in a register) without even entering the if statement. At that point _working could be in an inconsistent state.

Personally, I think this scenario is highly unlikely, mostly because I do not think _working would get cached in that manner, especially if you always made sure to protect accesses to it with interlocked operations.
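Still, if you wanted to rule the scenario out entirely, one option is to pass constants for both arguments so that no plain read of _working ever feeds the call, turning it into a pure fenced read. This is only a sketch; whether it fits the rest of the original protocol cannot be verified from the fragment:

public void Dispose()
{
    // Compare against 1 and "exchange" for 1 (a no-op), so the check is a
    // fenced read and no register-cached copy of _working is involved.
    if (Interlocked.CompareExchange(ref _working, 1, 1) == 1)
    {
        _task.ContinueWith(antecedent => { /* do some other work */ });
    }
}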

If nothing else, I hope it gives you some food for thought regarding how complicated lock-free techniques can get.

Memory barriers vs. interlocked operations

You are right that a memory barrier cannot ensure that a read sees up-to-date values. What it does do is enforce an ordering between operations, both on a single thread, and between threads.

For example, if thread A does a series of stores and then executes a release barrier before a final store to a flag location, and thread B reads from the flag location and then executes an acquire barrier before reading the other values, then the other variables will have the values stored by thread A:

// initially x=y=z=flag=0

// thread A
x=1;
y=2;
z=3;
release_barrier();
flag=1;

// thread B
while(flag==0) ; // loop until flag is 1
acquire_barrier();
assert(x==1); // asserts will not fire
assert(y==2);
assert(z==3);

Of course, you need to ensure that the load and store to flag are atomic (which simple load and store instructions are on most common CPUs, provided the variables are suitably aligned). Without the while loop on thread B, thread B may well read a stale value (0) for flag, and thus you cannot guarantee any of the values read for the other variables.
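The same pattern expressed in C# might look like this (a sketch using Volatile for the acquire/release pairing; names follow the pseudocode):

using System.Threading;

class ReleaseAcquireDemo
{
    int _x, _y, _z;
    int _flag;

    public void ThreadA()
    {
        _x = 1;
        _y = 2;
        _z = 3;
        Volatile.Write(ref _flag, 1);  // release: the stores above stay before this
    }

    public void ThreadB()
    {
        while (Volatile.Read(ref _flag) == 0) { }  // acquire on each read
        // After the loop, the acquire ordering guarantees A's values are visible.
        System.Diagnostics.Debug.Assert(_x == 1 && _y == 2 && _z == 3);
    }
}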

Fences can thus be used to enforce synchronization in Dekker's algorithm.

Here's an example implementation in C++ (using C++0x atomic variables):

std::atomic<bool> flag0(false),flag1(false);
std::atomic<int> turn(0);

void p0()
{
    flag0.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(1,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false,std::memory_order_relaxed);
}

void p1()
{
    flag1.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(0,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false,std::memory_order_relaxed);
}

For a full analysis see my blog entry at http://www.justsoftwaresolutions.co.uk/threading/implementing_dekkers_algorithm_with_fences.html


