How to Implement Lock Free Map in C++

Is it possible to implement lock free map in C++

Actually there's a way, although I haven't implemented it myself there's a paper on a lock free map using hazard pointers from eminent C++ expert Andrei Alexandrescu.

Fine-grained locks on C++ map of maps

Dividing your map into multiple maps, each of which can be locked independently, is a good and practical approach. The key to doing this efficiently is to remember that, while you may have 10s of thousands of entries in your map, you probably don't have that many threads or that many cores.

If your machine has, say 8 cores, then hashing your keys into 64 different buckets, each with its own map and mutex, will ensure that contention is unlikely and won't significantly slow down the application.

At most 8 cores could be trying to insert at the same time, and even if they were doing that constantly, they'd only be blocked 12% of the time. Your threads will probably have lots of other things to do, though, so real contention would be much less than that.

As @Eric indicates, the magic Google words for this are "lock striping".

Regarding false sharing of adjacent mutexes: std::map insertions aren't fast enough for that to be a real problem.

One thing you might have to worry about is contention in the memory allocator that is used to allocate map nodes. They all come from the same heap, after all.

Algorithm for a simple non-locking-on-read concurrent map?

Your code contains a data race on the std::shared_ptr, and the behaviour of programs with data races is undefined in C++.

The problem is: The class std::shared_ptr is not thread-safe. However, there are special atomic operations for std::shared_ptr, which can be used to solve the problem.

You can find more information about these atomic operations on the following webpage:

http://en.cppreference.com/w/cpp/memory/shared_ptr/atomic

Trying to build lock-free data structure C++

This wont work.

There is nothing stopping a thread accessing _userThreadToData [outside of a lock] whilst another is mutating the same member.

Consider thread '1' gets all the way and is about to run this line:

auto it =  _userThreadToData.find (std::this_thread::get_id());

Then thread '2' comes along and runs everything up to and including:

_userThreadToData.emplace(threadToRegister, std::unique_ptr<UserData> (new UserData));

You say that you dont access the data unless there "are no incoming requests for registration" but thats not the case - you check whether there are incoming requests and then sometime later you access the map. To write real lock free code (which is very hard) that check and access must be an atomic operation.

Just use locks.

Lock-free Progress Guarantees in a circular buffer queue

This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:

A structure is lock-free if only if any thread can be indefinitely
suspended at any point while still leaving the structure usable by the
remaining threads.

Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.

In this case a thread that has succeeded in incrementing m_write_increment, but hasn't yet written s.sequence_number leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed size queue.

There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their element into the queue since no reader will ever see it.

Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it was lock free. For example, this structure may have most of the useful performance properties of a truly lock-free structure, but at the same time it lacks some of the useful correctness properties. Basically the term lock-free usually implies a whole bunch of properties, only a subset of which will usually be important for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.

Performance

Uncontended Performance

The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.

This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation: the compare_exchange_weak, and a couple of possibly expensive operations (the memory_order_acquire load and memory_order_release store)¹, and little other overhead.

This compares to something like std::mutex which would imply something like one atomic operation for lock and another for unlock, and in practice on Linux the pthread calls have non-negligible overhead as well.

So I expect this queue to perform reasonably well in the uncontended fast-path.

Contended Performance

One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually that case that lock-free structures exhibit better scaling that a simple one-lock-to-rule-them-all alternative.

This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all readers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.

Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.

Context-switch Immunity

One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin which may also produce poor behavior). Even though such "unluckly scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.

Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.

This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.

While the push behavior may not be a problem for high-capacity queues, the pop behavior can be a problem: if the queue has a high throughput compared to the average time a thread is context switched out, and the average fullness, the queue will quickly appear empty to all consumer threads, even if there are many elements added beyond the in-progress element. This isn't affected by the queue capacity, but simply the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!

Functional Aspects

Async Thread Termination

On advantage of lock-free structures it they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure is a consistent state.

This is not the case for this queue, as described above.

Queue Access from Interrupt or Signal

A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.

This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.

The behavior isn't as complete as a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce and which then drains all the remaining elements of the queue. With a true lock-free structure, this would allow the signal handler to full drain all the elements, but this queue might fail to do that in the case a thread was interrupted or switched out in the critical region.

¹ In particular, on x86, this will only use an atomic operation for the CAS as the memory model is strong enough to avoid the need for atomics or fencing for the other operations. Recent ARM can do acquire and release fairly efficiently as well.