Lock-Free Multi-Threading Is for Real Threading Experts

Current "lock-free" implementations follow the same pattern most of the time:

  • read some state and make a copy of it*
  • modify the copy*
  • do an interlocked operation
  • retry if it fails

(*optional: depends on the data structure/algorithm)
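
As a minimal sketch of this pattern in C# (the LockFreeUpdate class and the modify delegate are my own scaffolding; Interlocked.CompareExchange is the real .NET CAS primitive):

    using System;
    using System.Threading;

    static class LockFreeUpdate
    {
        static int _state; // the shared state

        // Applies 'modify' to _state via the read-copy-CAS-retry pattern above.
        public static void Update(Func<int, int> modify)
        {
            while (true)
            {
                int observed = _state;          // 1. read the state (our "copy")
                int desired = modify(observed); // 2. modify the copy
                // 3. interlocked operation: publish only if nothing changed meanwhile
                if (Interlocked.CompareExchange(ref _state, desired, observed) == observed)
                    return;
                // 4. retry if it failed
            }
        }
    }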

The last bit is eerily similar to a spinlock. In fact, it is a basic spinlock. :)

I agree with @nobugz on this: the cost of the interlocked operations used in lock-free multi-threading is dominated by the cache and memory-coherency work they must carry out.

What you gain, however, with a "lock-free" data structure is that your "locks" are very fine-grained. This decreases the chance that two concurrent threads access the same "lock" (memory location).

The trick, most of the time, is that you do not have dedicated locks - instead, you treat e.g. all elements in an array or all nodes in a linked list as a "spin-lock". You read, modify, and try to update if there was no update since your last read. If there was, you retry.
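
As a sketch of that idea, here is the push half of a Treiber-style lock-free stack in C#, where the head pointer is the single, very fine-grained memory location being contended (the class itself is hypothetical, not taken from any library):

    using System.Threading;

    class LockFreeStack<T>
    {
        class Node { public T Value; public Node Next; }

        Node _head; // the one contended location - our fine-grained "spin-lock"

        public void Push(T value)
        {
            var node = new Node { Value = value };
            while (true)
            {
                Node observed = _head;   // read
                node.Next = observed;    // modify (link the new node to the snapshot)
                // try to update; retry if _head changed since our read
                if (Interlocked.CompareExchange(ref _head, node, observed) == observed)
                    return;
            }
        }
    }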

This makes your "locking" (oh, sorry, non-locking :) very fine-grained, without introducing additional memory or resource requirements.

Making it more fine-grained decreases the probability of waits. Making it as fine-grained as possible without introducing additional resource requirements sounds great, doesn't it?

Most of the fun, however, comes from ensuring correct load/store ordering.

Contrary to one's intuition, CPUs are free to reorder memory reads/writes - and they are very smart about it, by the way: you will have a hard time observing this from a single thread. You will, however, run into issues when you start doing multi-threading on multiple cores. Your intuitions will break down: just because an instruction comes earlier in your code does not mean it will actually happen earlier. CPUs can process instructions out of order, and they especially like to do this to instructions with memory accesses, to hide main-memory latency and make better use of their caches.

Now, it certainly runs against intuition that a sequence of code does not flow "top-down" but instead runs as if there were no sequence at all - it may well be called the devil's playground. I believe it is infeasible to give an exact answer as to which load/store re-orderings will take place. Instead, one always speaks in terms of mays and mights and cans, and prepares for the worst. "Oh, the CPU might reorder this read to come before that write, so it is best to put a memory barrier right here, on this spot."

Matters are complicated by the fact that even these mays and mights differ across CPU architectures. It might be the case, for example, that something that is guaranteed not to happen on one architecture might happen on another.
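
To make this concrete, here is the classic store-load litmus test in C# - a sketch only: whether and how often you observe the reordered outcome depends on the CPU, the JIT, and luck, but uncommenting the Thread.MemoryBarrier() calls rules it out:

    using System;
    using System.Threading;

    class ReorderingDemo
    {
        static int x, y, r1, r2;

        static void Main()
        {
            for (long i = 0; ; i++)
            {
                x = y = r1 = r2 = 0;
                var t1 = new Thread(() => { x = 1; /* Thread.MemoryBarrier(); */ r1 = y; });
                var t2 = new Thread(() => { y = 1; /* Thread.MemoryBarrier(); */ r2 = x; });
                t1.Start(); t2.Start();
                t1.Join();  t2.Join();
                // Intuitively impossible: each thread stores before it loads,
                // yet both loads can still see the initial zeros.
                if (r1 == 0 && r2 == 0)
                {
                    Console.WriteLine($"reordering observed in iteration {i}");
                    break;
                }
            }
        }
    }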


To get "lock-free" multi-threading right, you have to understand memory models.

Getting the memory model and its guarantees correct is not trivial, however, as demonstrated by the story of Intel and AMD making corrections to the documentation of MFENCE, which caused some stir among JVM developers. As it turned out, the documentation that developers had relied on from the beginning was not so precise in the first place.

Locks in .NET result in an implicit memory barrier, so you are safe using them (most of the time, that is... see for example this Joe Duffy - Brad Abrams - Vance Morrison greatness on lazy initialization, locks, volatiles and memory barriers. :) (Be sure to follow the links on that page.)

As an added bonus, you will get introduced to the .NET memory model on a side quest. :)
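
To make the lazy-initialization pitfall concrete, here is the classic double-checked locking pattern in C# - a sketch; the volatile on the field is exactly the kind of detail those discussions revolve around:

    class Singleton
    {
        // volatile prevents the publication of _instance from being reordered
        // with the writes performed inside the constructor
        static volatile Singleton _instance;
        static readonly object _gate = new object();

        public static Singleton Instance
        {
            get
            {
                if (_instance == null)         // first check: lock-free fast path
                {
                    lock (_gate)               // the lock implies memory barriers
                    {
                        if (_instance == null) // second check: another thread may have won
                            _instance = new Singleton();
                    }
                }
                return _instance;
            }
        }
    }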

There is also an "oldie but goldie" from Vance Morrison: What Every Dev Must Know About Multithreaded Apps.

...and of course, as @Eric mentioned, Joe Duffy is a definitive read on the subject.

A good STM can get as close to fine-grained locking as it gets and will probably provide performance that is close to, or on par with, a hand-made implementation.
One such implementation is STM.NET from Microsoft's DevLabs projects.

If you are not a .NET-only zealot, Doug Lea did some great work in JSR-166.

Cliff Click has an interesting take on hash tables that does not rely on lock-striping - as the Java and .NET concurrent hash tables do - and seems to scale well to 750 CPUs.

If you are not afraid to venture into Linux territory, the following article provides more insight into the internals of current memory architectures and how cache-line sharing can destroy performance: What every programmer should know about memory.

@Ben made many comments about MPI: I sincerely agree that MPI may shine in some areas. An MPI-based solution can be easier to reason about, easier to implement, and less error-prone than a half-baked locking implementation that tries to be smart. (Subjectively, however, the same is true of an STM-based solution.) I would also bet that it is light-years easier to correctly write a decent distributed application in e.g. Erlang, as many successful examples suggest.

MPI, however, has its own costs and its own troubles when it is run on a single, multi-core system. E.g. in Erlang, there are issues to be solved around the synchronization of process scheduling and message queues.

Also, at their core, MPI systems usually implement a kind of cooperative N:M scheduling for "lightweight processes". This, for example, means that there is an inevitable context switch between lightweight processes. It is true that this is not a "classic context switch" but mostly a user-space operation, and it can be made fast - however, I sincerely doubt that it can be brought under the 20-200 cycles an interlocked operation takes. User-mode context switching is certainly slower, even in the Intel McRT library.
N:M scheduling with lightweight processes is not new. LWPs were in Solaris for a long time; they were abandoned. There were fibers in NT; they are mostly a relic now. There were "activations" in NetBSD; they were abandoned. Linux had its own take on the subject of N:M threading; it seems to be somewhat dead by now.

From time to time, there are new contenders: for example McRT from Intel, or most recently User-Mode Scheduling together with ConCRT from Microsoft.

At the lowest level, they do what an N:M MPI scheduler does. Erlang - or any MPI system - might benefit greatly on SMP systems by exploiting the new UMS.

I guess the OP's question is not about the merits of, and subjective arguments for/against, any particular solution, but if I had to answer that, I guess it depends on the task: for building low-level, high-performance basic data structures that run on a single system with many cores, either low-lock/"lock-free" techniques or an STM will yield the best results, and would probably beat an MPI solution performance-wise any time, even once the above wrinkles are ironed out, e.g. in Erlang.

For building anything moderately more complex that runs on a single system, I would perhaps choose classic coarse-grained locking or, if performance is of great concern, an STM.

For building a distributed system, an MPI system would probably make a natural choice.

Note that there are MPI implementations for .NET as well (though they seem to be less active).

Is the meaning of lock-free even defined by the C++ standard?

In the C++11 standard, the term "lock-free" was not well defined, as reported in issue LWG #2075.

The C++14 standard defines what a lock-free execution is in the C++ language (N3927 approved).

Quoting C++14 1.10 [intro.multithread], paragraph 4:

Executions of atomic functions that are either defined to be lock-free (29.7) or indicated as lock-free (29.4) are lock-free executions.

  • If there is only one unblocked thread, a lock-free execution in that thread shall complete. [ Note: Concurrently executing threads may prevent progress of a lock-free execution. For example, this situation can occur with load-locked store-conditional implementations. This property is sometimes termed obstruction-free. -- end note ]
  • When one or more lock-free executions run concurrently, at least one should complete. [ Note: It is difficult for some implementations to provide absolute guarantees to this effect, since repeated and particularly inopportune interference from other threads may prevent forward progress, e.g., by repeatedly stealing a cache line for unrelated purposes between load-locked and store-conditional instructions. Implementations should ensure that such effects cannot indefinitely delay progress under expected operating conditions, and that such anomalies can therefore safely be ignored by programmers. Outside this International Standard, this property is sometimes termed lock-free. -- end note ]

The above definition of "lock-free" depends on how an unblocked thread behaves. The C++ standard does not define an unblocked thread directly, but 17.3.3 [defns.blocked] defines a blocked thread:

a thread that is waiting for some condition (other than the availability of a processor) to be satisfied before it can continue execution


(How) can the lock-free-ness of atomics affect program semantics?

I think the answer is NO (except for signal handlers, as paxdiablo's answer points out), when "program semantics" means the side effects of atomic operations.
The lock-free-ness of atomics affects the strength of the progress guarantee for the whole multithreaded program.
When two (or more) threads concurrently execute lock-free atomic operations on the same object, at least one of these operations should complete under any worst-case thread scheduling.
In other words, an 'evil' thread scheduler could, in theory, intentionally block the progress of lock-based atomic operations.

Lock free synchronization

Here are some general approaches that can minimize the use of locks, assuming your algorithm has some particular exploitable features:

  1. When updating a single numeric variable, you can use non-blocking primitives such as CAS, atomic_increment, etc. They are usually much faster than a classic blocking critical section (lock, mutex). See the sketch after this list.

  2. When a data structure is read by multiple threads, but only written by one or few threads, an obvious solution would be a read-write lock, instead of a full lock.

  3. Try to exploit fine-grained locking. For example, instead of locking an entire data structure with a single lock, see if you can use multiple different locks to protect distinct sections of the data structure.

  4. If you're relying on the implicit memory fence effect of locks to ensure visibility of a single variable across threads, just use volatile1, if available.

  5. Sometimes, using a condition variable (and the associated lock) is too slow in practice. In this case, a volatile busy spin is much more efficient.
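
As a sketch of points 1 and 5 in C# (Interlocked, volatile, and SpinWait are real .NET facilities; the surrounding class and the producer/consumer split are just scaffolding):

    using System.Threading;

    static class LowLockExamples
    {
        static int _counter;
        static volatile bool _ready;

        // Point 1: atomic update of a single numeric variable - no lock needed.
        public static void Hit() => Interlocked.Increment(ref _counter);

        // Point 5: a volatile busy spin instead of a condition variable.
        public static void Publish() { /* ... produce the data ... */ _ready = true; }

        public static void Consume()
        {
            var spinner = new SpinWait();
            while (!_ready)
                spinner.SpinOnce(); // backs off gradually instead of burning a core
            /* ... consume the data ... */
        }
    }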

More good advice on this topic here: http://software.intel.com/en-us/articles/intel-guide-for-developing-multithreaded-applications/

A nice read in another SO question: Lock-free multi-threading is for real threading experts (don't be scared by the title).

And a recently discussed lock-free Java implementation of atomic_decrement: Starvation in non-blocking approaches


1 The use of volatile here applies to languages such as Java where volatile has defined semantics in the memory model, but not to C or C++ where volatile preceded the introduction of the cross-thread memory model and doesn't integrate with it. Similar constructs are available in those languages, such as the various std::memory_order specifiers in C++.

Ensure thread safety without lock statement in C#

Yes, the Interlocked class of functions is the way to go for lock-free multi-threaded synchronization. In your specific case, you want Interlocked.Increment.
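
For example - a sketch, with _count standing in for whatever shared counter the question is about:

    using System.Threading;

    class Counter
    {
        int _count;

        // Atomically increments and returns the new value - the lock-free
        // equivalent of lock (...) { return ++_count; }
        public int Next() => Interlocked.Increment(ref _count);
    }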

How can I write a lock free structure?

Short answer is:

You cannot.

Long answer is:

If you are asking this question, you probably do not know enough to be able to create a lock-free structure. Creating lock-free structures is extremely hard, and only experts in this field can do it. Instead of writing your own, search for an existing implementation. When you find one, check how widely it is used, how well it is documented, whether it is well proven, and what its limitations are - even some lock-free structures other people have published are broken.

If you do not find a lock-free structure corresponding to the structure you are currently using, rather adapt the algorithm so that you can use an existing one.
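
In .NET, for instance, that usually means reshaping the code around the well-proven System.Collections.Concurrent types instead of hand-rolling a structure - a sketch:

    using System.Collections.Concurrent;

    class WorkDistributor
    {
        // A widely used, heavily tested concurrent queue from the BCL,
        // instead of a home-grown lock-free linked list.
        readonly ConcurrentQueue<string> _work = new ConcurrentQueue<string>();

        public void Add(string item) => _work.Enqueue(item);

        public bool TryProcessOne()
        {
            if (_work.TryDequeue(out var item))
            {
                // ... process item ...
                return true;
            }
            return false;
        }
    }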

If you still insist on creating your own lock free structure, be sure to:

  • start with something very simple
  • understand the memory model of your target platform (including read/write reordering constraints and which operations are atomic)
  • study a lot about the problems other people encountered when implementing lock-free structures
  • do not just guess that it will work - prove it
  • heavily test the result

More reading:

Lock free and wait free algorithms at Wikipedia

Herb Sutter: Lock-Free Code: A False Sense of Security

Lock-free, awaitable, exclusive access methods

Update: a great related article:

Creating High-Performance Locks and Lock-free Code (for .NET)


  1. The main point about lock-free algorithms is not that they are for experts.

    The main point is: do you really need a lock-free algorithm here? I can't follow your logic here:

    Since it does not make sense to initialize more than once, whether it be from multiple threads or a single one, multiple calls should return immediately (or even throw an exception).

    Why can't your users simply wait for the result of the initialization and use your resource after that? If you can, simply use the Lazy<T> class or even Asynchronous Lazy Initialization (see the sketch after this list).

  2. You really should read about consensus numbers and CAS operations and why they matter when implementing your own synchronization primitive.

    In your code you are using the Interlocked.Exchange method, which isn't a real CAS, as it always exchanges the value, and it has a consensus number of 2. This means that a primitive built on it will work correctly for no more than 2 threads (not in your situation, but still 2).

    I tried to determine whether your code works correctly for 3 threads, or whether there are circumstances that could lead your application into a damaged state, but after 30 minutes I stopped. And any of your team members will stop, like me, after some time spent trying to understand your code. This is a waste of time - not only yours, but your team's. Don't reinvent the wheel unless you really have to.

  3. My favorite book in this area is Writing High-Performance .NET Code by Ben Watson, and my favorite blog is Stephen Cleary's. If you can be more specific about what kind of book you are interested in, I can add some more references.

  4. Having no locks in a program doesn't make your application lock-free. In a .NET application you really should not use exceptions for your internal program flow. Consider what happens if the initializing thread isn't scheduled for a while by the OS (for various reasons, no matter what they are exactly).

    In this case, all the other threads in your app will die one by one while trying to access your shared resource. I can't say that this is lock-free code. Yes, there are no locks in it, but it doesn't guarantee the progress of the program, and thus it isn't lock-free by definition.
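
To make point 1 concrete, here is a sketch of the Lazy<T> route; ExpensiveResource and its construction are hypothetical placeholders for the shared resource in question:

    using System;
    using System.Threading;

    class ExpensiveResource { /* hypothetical: whatever needs one-time initialization */ }

    static class ResourceHolder
    {
        // All the double-checked initialization logic lives inside Lazy<T>.
        // ExecutionAndPublication guarantees the factory runs exactly once,
        // no matter how many threads race on .Value.
        static readonly Lazy<ExpensiveResource> _resource =
            new Lazy<ExpensiveResource>(
                () => new ExpensiveResource(),
                LazyThreadSafetyMode.ExecutionAndPublication);

        // Callers that arrive during initialization simply wait for the result.
        public static ExpensiveResource Resource => _resource.Value;
    }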


