Lock-Free Progress Guarantees in a Circular Buffer Queue


This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:

A structure is lock-free if and only if any thread can be indefinitely
suspended at any point while still leaving the structure usable by the
remaining threads.

Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.

In this case a thread that has succeeded in incrementing m_write_increment, but hasn't yet written s.sequence_number leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed size queue.
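To make that window concrete, here is a minimal sketch of the push path of such a queue. This is illustrative only, not the actual liblfds source: the class shape, names, and the Vyukov-style per-slot sequence numbers are assumptions for exposition (the answer itself only names m_write_index and s.sequence_number).

#include <atomic>
#include <cstdint>
#include <vector>

template <typename T>
class RingQueue {
    struct Slot {
        std::atomic<uint64_t> sequence_number{0};
        T data{};
    };
    std::vector<Slot> m_slots;
    uint64_t m_mask;
    std::atomic<uint64_t> m_write_index{0};

public:
    explicit RingQueue(uint64_t capacity_pow2)
        : m_slots(capacity_pow2), m_mask(capacity_pow2 - 1) {
        for (uint64_t i = 0; i < capacity_pow2; ++i)
            m_slots[i].sequence_number = i;  // slot i is free at position i
    }

    bool push(const T& value) {
        uint64_t idx = m_write_index.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = m_slots[idx & m_mask];
            uint64_t seq = s.sequence_number.load(std::memory_order_acquire);
            if (seq != idx)
                return false;  // full, or stale index (real code distinguishes)
            if (m_write_index.compare_exchange_weak(idx, idx + 1)) {
                // ---- the critical region ----
                // The slot is claimed but not yet published. If this thread
                // is killed or parked here, no reader ever sees the element,
                // and once writers wrap around, no writer can claim the slot
                // either.
                s.data = value;
                s.sequence_number.store(idx + 1, std::memory_order_release);
                return true;
            }
            // CAS failure reloaded idx with the current index; retry.
        }
    }
};

A thread that dies between the successful compare_exchange_weak and the sequence_number store is exactly the killed-thread scenario described above.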

There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their elements into the queue, since no reader will ever see them.

Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it were lock-free. For example, it may have most of the useful performance properties of a truly lock-free structure while lacking some of the useful correctness properties. Basically, the term lock-free usually implies a whole bunch of properties, only a subset of which will matter for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.

Performance

Uncontended Performance

The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.

This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation (the compare_exchange_weak), a couple of possibly expensive operations (the memory_order_acquire load and the memory_order_release store) [1], and little other overhead.

This compares favorably to something like std::mutex, which implies at least one atomic operation for lock and another for unlock; in practice on Linux the pthread calls have non-negligible overhead as well.
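For comparison, a sketch of a mutex-guarded push (hypothetical; the names here are invented): lock and unlock each imply at least one atomic read-modify-write, plus possible futex system calls under contention.

#include <mutex>
#include <queue>

template <typename T>
class LockedQueue {
    std::mutex m_lock;
    std::queue<T> m_items;
public:
    void push(const T& t) {
        // lock(): at least one atomic RMW, plus a futex wait if contended
        std::lock_guard<std::mutex> guard(m_lock);
        m_items.push(t);
    }   // unlock(): another atomic RMW (and possibly a futex wake)
};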

So I expect this queue to perform reasonably well in the uncontended fast-path.

Contended Performance

One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually the case that lock-free structures exhibit better scaling than a simple one-lock-to-rule-them-all alternative.

This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all writers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.

Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.

Context-switch Immunity

One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin, which may also produce poor behavior). Even though such "unlucky scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.

Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.

This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.

While the push behavior may not be a problem for high-capacity queues, the pop behavior can be a problem: if the queue throughput is high relative to the average time a thread stays switched out and to the queue's average fullness, the queue will quickly appear empty to all consumer threads, even if many elements were added beyond the in-progress element. This isn't affected by the queue capacity, but only by the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!

Functional Aspects

Async Thread Termination

One advantage of lock-free structures is that they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure in a consistent state.

This is not the case for this queue, as described above.

Queue Access from Interrupt or Signal

A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.

This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.

The behavior isn't as complete as a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce, and which then drains all the remaining elements of the queue. With a true lock-free structure, this would allow the signal handler to fully drain all the elements, but this queue might fail to do that if a thread was interrupted or switched out in the critical region.


[1] In particular, on x86, this will use an atomic instruction only for the CAS, since the memory model is strong enough that the other operations need no special instructions or fencing. Recent ARM can implement acquire and release fairly efficiently as well.

Truly Lock-free MPMC Ring Buffer? Threads able to assist each other to avoid blocking?

As a good example of how cross-thread assist often ends up working in real life, consider that a lock-free MPMC queue can be obtained by changing the liblfds algorithm along these lines:

Use 3 counters:

  • alloc_pos: the total number of push operations that have been started. This is incremented atomically when a push starts.
  • write_pos: all write operations at lower positions are known to be complete.
  • read_pos: all items written at lower positions are known to have been consumed.

In this scheme, a push or pop operation is completed by a CAS in the affected slot. The write_pos and read_pos variables are, strictly speaking, redundant: they only cache progress that could be recomputed by scanning the slot states, but they keep the scans short.

So to push, a thread first increments alloc_pos, and then increments write_pos past all the slots in front of it that it can see are complete. This is an assist -- it is completing previous writes that were started in other threads. Then the thread has to scan the slots between write_pos and alloc_pos until it finds a free one and manages to reserve it with a CAS.

To pop, the reader first increments read_pos past all the items written at lower positions that it can see are consumed. Again this is an assist -- completing previous reads. Then it scans from read_pos to alloc_pos to see if it can find an item that is finished being written.

As mentioned in comments, doing this for real gets annoying, with implementation decisions trading off performance against which ordering and availability guarantees you need, as well as jumping through hoops to prevent ABA problems.
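Here is a minimal sketch of that scheme, under stated assumptions: all names are invented, the per-slot state packs a lap counter together with a phase to sidestep the ABA problem within the range of a 64-bit counter, and full-queue handling is simplified. It is meant to show the shape of the assists, not to be a production implementation.

#include <atomic>
#include <cstdint>
#include <optional>

template <typename T, uint64_t N>
class AssistQueue {
    // Per-slot state encodes a phase plus a lap count, so the state of a
    // slot is unambiguous across wrap-arounds:
    //   4*lap + 0: empty (writable during this lap)
    //   4*lap + 1: being written
    //   4*lap + 2: full (readable during this lap)
    //   4*lap + 3: being read
    struct Slot {
        std::atomic<uint64_t> state{0};
        T value{};
    };
    Slot slots_[N];
    std::atomic<uint64_t> alloc_pos_{0}; // pushes started
    std::atomic<uint64_t> write_pos_{0}; // all lower positions written
    std::atomic<uint64_t> read_pos_{0};  // all lower positions consumed

public:
    bool push(const T& item) {
        alloc_pos_.fetch_add(1);                         // start a push
        uint64_t end = alloc_pos_.load();
        // Assist: complete other threads' bookkeeping by advancing
        // write_pos past every position whose write has finished.
        uint64_t p = write_pos_.load();
        while (p < end && slots_[p % N].state.load() >= 4 * (p / N) + 2)
            if (write_pos_.compare_exchange_weak(p, p + 1)) ++p;
        // Scan [write_pos, alloc_pos) for a slot we can reserve.
        for (uint64_t i = write_pos_.load(); i < end; ++i) {
            uint64_t empty = 4 * (i / N);
            if (slots_[i % N].state.compare_exchange_strong(empty, empty + 1)) {
                slots_[i % N].value = item;
                slots_[i % N].state.store(empty + 2, std::memory_order_release);
                return true;
            }
        }
        return false;  // no free slot seen: report full (the abandoned
                       // position is harmless; completion is per slot)
    }

    std::optional<T> pop() {
        uint64_t end = alloc_pos_.load();
        // Assist: advance read_pos past every position already consumed.
        uint64_t p = read_pos_.load();
        while (p < end && slots_[p % N].state.load() >= 4 * (p / N + 1))
            if (read_pos_.compare_exchange_weak(p, p + 1)) ++p;
        // Scan [read_pos, alloc_pos) for an item that finished being written.
        for (uint64_t i = read_pos_.load(); i < end; ++i) {
            uint64_t full = 4 * (i / N) + 2;
            if (slots_[i % N].state.compare_exchange_strong(full, full + 1)) {
                T out = slots_[i % N].value;
                slots_[i % N].state.store(4 * (i / N + 1),  // empty, next lap
                                          std::memory_order_release);
                return out;
            }
        }
        return std::nullopt;
    }
};

Note how this matches the description above: both assists are pure helping (any thread can perform them for any other), and the actual hand-off of a slot is always a single CAS on that slot's state, never on a shared counter.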

The definition of lock-free

Your quote from Concurrency in Action is taken out of context.

In fact, what the book actually says is:

7.1 Definitions and consequences

Algorithms and data structures that use mutexes, condition variables,
and futures to synchronize the data are called blocking data
structures and algorithms.

Data structures and algorithms that don’t use blocking library
functions are said to be nonblocking. Not all these data structures
are lock-free ...

Then it proceeds to further break down nonblocking algorithms into Obstruction-Free, Lock-Free and Wait-Free.

So a Lock-Free algorithm is

  1. a nonblocking algorithm (it does not use locks like a mutex) and
  2. one that is able to make progress in at least one thread in a bounded number of steps.

So both Herb Sutter and Anthony Williams are correct.
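As a concrete illustration of point 2 (a textbook example, not taken from either book): a bare CAS retry loop is lock-free but not wait-free, because an individual thread may retry forever, yet every failed compare_exchange implies that some other thread's succeeded.

#include <atomic>

void lock_free_increment(std::atomic<int>& counter) {
    int observed = counter.load(std::memory_order_relaxed);
    // Each failed CAS refreshes 'observed' and means another thread's CAS
    // succeeded, so the system as a whole always makes progress.
    while (!counter.compare_exchange_weak(observed, observed + 1)) {
    }
}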

Spurious underflow in C++ lock-free queue implementation

You have two bugs, one of which can cause the failure you observe.

Let's look at your push code, except we'll allow only one operation per statement:

void push(T t)
{
    auto const claimed_index = m_wr_ptr++;              /* 1 */
    auto const claimed_offset = claimed_index & m_mask; /* 2 */
    auto& claimed_data = m_data[claimed_offset];        /* 3 */
    claimed_data.store(t);                              /* 4 */
    m_size++;                                           /* 5 */
}

Now, for a queue with two producers, there is a window of vulnerability to a race condition between operations 1 and 4:

Before:

m_rd_ptr == 1
m_wr_ptr == 1
m_size == 0

Producer A:

/* 1 */ claimed_index = 1; m_wr_ptr = 2;
/* 2 */ claimed_offset = 1;
  • Scheduler puts Producer A to sleep here

Producer B:

/* 1 */ claimed_index = 2; m_wr_ptr = 3;
/* 2 */ claimed_offset = 2;
/* 3 */ claimed_data = m_data[2];
/* 4 */ claimed_data.store(t);
/* 5 */ m_size = 1;

After:

m_size == 1
m_rd_ptr == 1
m_wr_ptr == 3
m_data[1] == 0xDEADBEEF (still unwritten garbage)
m_data[2] == value_produced_by_B

The consumer now runs, sees m_size > 0, and reads from m_data[1] while increasing m_rd_ptr from 1 to 2. But m_data[1] hasn't been written by Producer A yet, and Producer B wrote to m_data[2].

The second bug is the complementary case in pop() when a consumer thread is interrupted between the m_rd_ptr++ action and the .load() call. It can result in reading values out of order, potentially so far out of order that the queue has completely circled and overwritten the original value.
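For concreteness, here is the same one-operation-per-statement decomposition applied to pop(). The original pop() isn't shown in the question, so this shape is an assumption, symmetric with the push() above:

T pop()
{
    auto const claimed_index = m_rd_ptr++;              /* 1 */
    auto const claimed_offset = claimed_index & m_mask; /* 2 */
    auto& claimed_data = m_data[claimed_offset];        /* 3 */
    auto const t = claimed_data.load();                 /* 4 */
    m_size--;                                           /* 5 */
    return t;
}

A consumer interrupted between /* 1 */ and /* 4 */ has already claimed its slot; producers that lap the buffer can then overwrite m_data[claimed_offset] before the load at /* 4 */ runs, which is how values can come out arbitrarily stale or out of order.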

Just because two operations in a single source statement are atomic does not make the entire statement atomic.

Lock free single producer/single consumer circular buffer - Can CPU speculation break the memory barrier logic?

re: your proposed reordering: no, the compiler can't invent writes to atomic variables.

Runtime speculation also can't invent writes that actually become visible to other threads. It can put whatever it wants in its own private store buffer, but the correctness of earlier branches must be checked before a store can become visible to other threads.

Normally this works by in-order retirement: an instruction can only retire (become non-speculative) once all previous instructions are retired/non-speculative. A store can't commit from the store buffer to L1d cache until after the store instruction retires.


re: the title: no, speculative execution still has to respect the memory model. If a CPU wants to speculatively load past an incomplete acquire-load, it can, but only if it checks to make sure those load results are still valid when they're "officially" allowed to happen.

x86 CPUs do this in practice, because the strong x86 memory model means that all loads are acquire-loads, so any out-of-order loading has to be speculative and rolled back if it's not valid. (This is why you can get memory-order mis-speculation pipeline nukes.)


So asm works the way the ISA rules say it works, and C++ compilers know that. Compilers use this to implement the C++ memory model on top of the target ISA.

If you do an acquire-load in C++, it really works as an acquire-load.

You can mentally model your logic for possible compile-time + run-time reordering according to the C++ reordering rules as written. See http://preshing.com/20120913/acquire-and-release-semantics/.
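A minimal illustration (hypothetical names) of what that guarantee buys you; the same pattern underlies the _head/_tail handshake in the circular buffer:

#include <atomic>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // publish
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // Once the acquire-load observes true, the release store and everything
    // sequenced before it (the payload write) are visible. Speculation may
    // happen under the hood, but it can never change this outcome.
    return payload;
}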

Lock free single producer/single consumer circular buffer

We need to prove that

_array[current_tail] = item;                                      // push(3)

is executed only after

item = _array[current_head];                                      // pop(3)

has completed (the interesting case being current_head == current_tail, where the producer is about to reuse the very cell the consumer is reading). We may overwrite a cell only after the data in it has already been copied to item.

The

_head.load(std::memory_order_acquire)                             // push(2)

synchronizes with

_head.store(increment(current_head), std::memory_order_release);  // pop(4)

via Release-Acquire ordering:

all memory writes (pop(3)) that happened-before the atomic store-release (pop(4)) on _head become visible side effects once the atomic load-acquire (push(2)) on _head is completed.

So producer code after push(2) has completed is guaranteed to see the result of pop(3). This means the data from _array[current_head] has been copied to item, that result is visible to the producer after push(2), and therefore _array[current_head] is already free.

From the other side, by the description of a memory_order_acquire load, no reads or writes (push(3)) in the current thread can be reordered before this load. So push(3) executes only after the push(2) load has completed, and at that point pop(3) is already complete:

item = _array[current_head];                                      // pop(3)
_head.store(increment(current_head), std::memory_order_release);  // pop(4)
-----
_head.load(std::memory_order_acquire);                            // push(2)
_array[current_tail] = item;                                      // push(3)

