Circular Lock-Free Buffer

Lock free single producer/single consumer circular buffer

We need to prove that

_array[current_tail] = item; // push(3)

is executed, after the (current_head == current_tail) check, only once

item = _array[current_head]; // pop(3)

has completed. We may overwrite a cell only after the data in it has already been copied to item.

The

_head.load(std::memory_order_acquire) // push(2)

synchronizes with

_head.store(increment(current_head), std::memory_order_release); // pop(4)

via Release-Acquire ordering:

all memory writes ( pop(3) ) that happened-before the atomic store-release ( pop(4) ) on _head become visible side effects once the atomic load-acquire ( push(2) ) on _head is completed.

So once push(2) has completed, the producer code is guaranteed to see the result of pop(3). This means the data from _array[current_head] has been copied to item, that copy is visible to the producer code after push(2), and therefore _array[current_head] is already free.

From the other side, per the memory_order_acquire load description, no reads or writes ( push(3) ) in the current thread can be reordered before this load. So push(3) executes only after the push(2) load has completed, and at that point pop(3) has already completed:

item = _array[current_head];                                     // pop(3)
_head.store(increment(current_head), std::memory_order_release); // pop(4)
-----
_head.load(std::memory_order_acquire);                           // push(2)
_array[current_tail] = item;                                     // push(3)
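To make the argument above concrete, here is a minimal SPSC ring buffer sketch in which the numbered comments match the push/pop steps discussed. The steps not quoted in the excerpt (push(1), push(4), pop(1), pop(2)) are filled in following the common formulation of this buffer; treat it as an illustration, not a drop-in implementation.

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer; one slot is kept free to distinguish full from empty.
template <typename T, size_t Size>
class RingBuffer {
public:
    bool push(const T& item) {
        const size_t current_tail = _tail.load(std::memory_order_relaxed); // push(1)
        const size_t next_tail = increment(current_tail);
        if (next_tail == _head.load(std::memory_order_acquire))            // push(2)
            return false;                                                  // full
        _array[current_tail] = item;                                       // push(3)
        _tail.store(next_tail, std::memory_order_release);                 // push(4)
        return true;
    }

    bool pop(T& item) {
        const size_t current_head = _head.load(std::memory_order_relaxed); // pop(1)
        if (current_head == _tail.load(std::memory_order_acquire))         // pop(2)
            return false;                                                  // empty
        item = _array[current_head];                                       // pop(3)
        _head.store(increment(current_head), std::memory_order_release);   // pop(4)
        return true;
    }

private:
    static size_t increment(size_t idx) { return (idx + 1) % Size; }
    T _array[Size];
    std::atomic<size_t> _head{0};
    std::atomic<size_t> _tail{0};
};
```

Note that push(2)'s acquire load pairs with pop(4)'s release store exactly as argued above: by the time the producer observes the advanced _head, the copy in pop(3) is a visible side effect, so overwriting the slot in push(3) is safe.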

Truly Lock-Free MPMC Ring Buffer? Can threads assist each other to avoid blocking?

As a good example of how cross-thread assist often ends up working in real life, consider that a lock-free MPMC queue can be obtained by changing the liblfds algorithm along these lines:

Use 3 counters:

  • alloc_pos: the total number of push operations that have been started. This is incremented atomically when a push starts.
  • write_pos: all write operations at lower positions are known to be complete.
  • read_pos: all items written at lower positions are known to have been consumed.

In this scheme, a push or pop operation is completed by a CAS in the affected slot. The write_pos and read_pos variables are redundant.

So to push, a thread first increments alloc_pos, and then increments write_pos past all the slots in front of it that it can see are complete. This is an assist -- it is completing previous writes that were started in other threads. Then the thread has to scan the slots between write_pos and alloc_pos until it finds a free one and manages to reserve it with a CAS.

To pop, the reader first increments read_pos past all the items written at lower positions that it can see are consumed. Again this is an assist -- completing previous reads. Then it scans from read_pos to alloc_pos to see if it can find an item that is finished being written.

As mentioned in comments, doing this for real gets annoying, with implementation decisions trading off performance against which ordering and availability guarantees you need, as well as jumping through hoops to prevent ABA problems.
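Since the assist-based scheme above is, as noted, annoying to get right, it may help to see a compact sketch of a different, widely known lock-free MPMC ring buffer for comparison: Dmitry Vyukov's bounded queue. It is not the liblfds modification described above; instead of assisting counters it sidesteps the ABA problem with a per-slot sequence number, which plays the role of the "completed by a CAS in the affected slot" state.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch of Vyukov's bounded MPMC queue. Each cell carries a sequence
// number; a cell is writable when seq == pos, readable when seq == pos + 1.
template <typename T, size_t N> // N must be a power of 2
class MpmcQueue {
    struct Cell {
        std::atomic<size_t> seq;
        T data;
    };
    Cell cells[N];
    std::atomic<size_t> enqueue_pos{0};
    std::atomic<size_t> dequeue_pos{0};

public:
    MpmcQueue() {
        for (size_t i = 0; i < N; ++i)
            cells[i].seq.store(i, std::memory_order_relaxed);
    }

    bool push(const T& v) {
        size_t pos = enqueue_pos.load(std::memory_order_relaxed);
        Cell* cell;
        for (;;) {
            cell = &cells[pos & (N - 1)];
            const size_t seq = cell->seq.load(std::memory_order_acquire);
            const intptr_t dif = (intptr_t)seq - (intptr_t)pos;
            if (dif == 0) { // cell is free: try to claim this position
                if (enqueue_pos.compare_exchange_weak(pos, pos + 1,
                                                      std::memory_order_relaxed))
                    break;
            } else if (dif < 0) {
                return false; // queue full
            } else {
                pos = enqueue_pos.load(std::memory_order_relaxed);
            }
        }
        cell->data = v;
        cell->seq.store(pos + 1, std::memory_order_release); // publish
        return true;
    }

    bool pop(T& v) {
        size_t pos = dequeue_pos.load(std::memory_order_relaxed);
        Cell* cell;
        for (;;) {
            cell = &cells[pos & (N - 1)];
            const size_t seq = cell->seq.load(std::memory_order_acquire);
            const intptr_t dif = (intptr_t)seq - (intptr_t)(pos + 1);
            if (dif == 0) { // cell holds a finished write: try to claim it
                if (dequeue_pos.compare_exchange_weak(pos, pos + 1,
                                                      std::memory_order_relaxed))
                    break;
            } else if (dif < 0) {
                return false; // queue empty
            } else {
                pos = dequeue_pos.load(std::memory_order_relaxed);
            }
        }
        v = cell->data;
        cell->seq.store(pos + N, std::memory_order_release); // recycle cell
        return true;
    }
};
```

Note this design is lock-free but not assisting: a thread that claims a position and stalls before publishing delays consumers of that slot, which is exactly the gap the assist scheme above tries to close.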

Lock Free Circular Array

A simpler approach is to use a power of 2 size and do the following.

final double[] array;
final int sizeMask;
final AtomicInteger i = new AtomicInteger();

public CircularBuffer(int size) {
    assert size > 1 && ((size & (size - 1)) == 0); // test power of 2
    array = new double[size];
    sizeMask = size - 1;
}

void add(double value) {
    array[i.getAndIncrement() & sizeMask] = value;
}

Lockless circular buffer with single producer, single consumer

Writing multi-threaded code that shares variables without using a mutex is very difficult to get right.
See An Introduction to Lock-Free Programming, Lock Free Buffer.

If you absolutely must avoid using mutexes, then I would highly recommend using a pre-made lock-free queue, e.g. Boost.lockfree or MPMCQueue as a lightweight non-Boost alternative.

I know reading and writing to variables can be non-atomic, so I should use an atomic storage, but those can cause locks.

std::atomic is generally lock-free (doesn't use a mutex) for all primitive types (up to the native word size of your CPU).
You can check whether std::atomic will use a mutex for a given type by calling std::atomic<T>::is_lock_free.

Is the use of atomic storage needed here?

Yes, absolutely. You either need to use mutexes or atomics.

Also, I'd like to eliminate the active wait in the producer, and I thought I could use a std::condition_variable

When you can't use mutexes, your only option is to use a spin lock.
If it is allowed in your context, you could use std::this_thread::yield() in the spin lock to reduce CPU load (although a mutex might then be faster).
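A minimal sketch of that spin wait, using an atomic flag (the names dataReady, waitForData, and publishData are illustrative). yield() only hints to the scheduler; it does not block, so the loop still busy-waits:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> dataReady{false};

// Consumer side: spin until the flag is set, yielding to reduce CPU load.
void waitForData() {
    while (!dataReady.load(std::memory_order_acquire))
        std::this_thread::yield();
}

// Producer side: release-store pairs with the consumer's acquire-load,
// so any data written before this store is visible after the wait returns.
void publishData() {
    dataReady.store(true, std::memory_order_release);
}
```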

Edit:
A potential solution with only 2 atomics would be:

std::atomic<foo*> currentData = new foo();
std::atomic<foo*> newData = new foo();

void consumer() {
    foo* activeData = currentData;
    while (true) {
        activeData->doSomething();
        foo* newItem = currentData;
        if (newItem != activeData) {
            // The producer published a new item: hand the old buffer back
            // through newData for reuse, then switch to the new one.
            newData = activeData;
            activeData = newItem;
        }
    }
}

void producer() {
    while (true) {
        foo* reusedData = newData;
        if (!reusedData)
            continue; // spin until the consumer returns a buffer
        newData = nullptr;
        reusedData->init();
        currentData = reusedData; // publish
    }
}

Lock free single producer/single consumer circular buffer - Can CPU speculation break the memory barrier logic?

re: your proposed reordering: no, the compiler can't invent writes to atomic variables.

Runtime speculation also can't invent writes that actually become visible to other threads. It can put whatever it wants in its own private store buffer, but the correctness of earlier branches must be checked before a store can become visible to other threads.

Normally this works by in-order retirement: an instruction can only retire (become non-speculative) once all previous instructions are retired/non-speculative. A store can't commit from the store buffer to L1d cache until after the store instruction retires.


re: the title: no, speculative execution still has to respect the memory model. If a CPU wants to speculatively load past an incomplete acquire-load, it can, but only if it checks to make sure those load results are still valid when they're "officially" allowed to happen.

x86 CPUs do in fact do this, because the strong x86 memory model means that all loads are acquire-loads, so any out-of-order loading has to be speculative and rolled back if it's not valid. (This is why you can get memory-order mis-speculation pipeline nukes.)


So asm works the way the ISA rules say it works, and C++ compilers know that. Compilers use this to implement the C++ memory model on top of the target ISA.

If you do an acquire-load in C++, it really works as an acquire-load.

You can mentally model your logic for possible compile-time + run-time reordering according to the C++ reordering rules as written. See http://preshing.com/20120913/acquire-and-release-semantics/.

Circular buffer Vs. Lock free stack to implement a Free List

Just in case, there is a difference between lock-free and wait-free. The former means there is no locking but the thread can still busy-spin not making any progress. The latter means that the thread always makes progress with no locking or busy-spinning.

With one reader and one writer, a lock-free and wait-free FIFO circular buffer is trivial to implement.

I hear that a LIFO stack can also be made wait-free, but I'm not so sure about a FIFO list. And it sounds like you need a queue here rather than a stack.

How to distinguish empty from full on a lock-free SPSC ring buffer that overwrites when full?

This is relatively simple as long as your buffer is no bigger than 64kB (so that the pointers fit in 16 bits).

Put your read and write pointers in two 16-bit halves of a 32-bit word. Use LDREX and STREX operations in the writer to atomically adjust the read pointer to discard the oldest byte if the buffer was already full.

Modifying your pseudocode:

void write(data) {
    buf[writeptr] = data;
    tmp = (writeptr + 1) % BUF_SIZE;

    ATOMIC {
        writeptr = tmp;
        if (readptr == tmp) {
            // Buffer was full: discard the oldest byte.
            readptr = (tmp + 1) % BUF_SIZE;
        }
    }
}
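On targets without LDREX/STREX, the same idea can be expressed with a 32-bit compare-exchange; the retry loop plays the role of the exclusive monitor. A sketch under illustrative assumptions (BUF_SIZE kept tiny here; single writer, single reader), which shares the original's caveat that the writer can wrap onto a byte the reader is still copying:

```cpp
#include <atomic>
#include <cstdint>

constexpr uint16_t BUF_SIZE = 4; // small only for illustration
uint8_t buf[BUF_SIZE];

// Read and write pointers packed into one 32-bit word:
// low 16 bits = readptr, high 16 bits = writeptr.
std::atomic<uint32_t> ptrs{0};

void write(uint8_t data) {
    uint32_t old = ptrs.load(std::memory_order_relaxed);
    const uint16_t wr = static_cast<uint16_t>(old >> 16);
    buf[wr] = data; // safe outside the loop: only the writer moves writeptr
    const uint16_t newWr = (wr + 1) % BUF_SIZE;
    uint32_t desired;
    do {
        uint16_t rd = static_cast<uint16_t>(old & 0xFFFF);
        if (rd == newWr)                 // buffer was full:
            rd = (newWr + 1) % BUF_SIZE; // discard the oldest byte
        desired = (static_cast<uint32_t>(newWr) << 16) | rd;
    } while (!ptrs.compare_exchange_weak(old, desired,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}

bool read(uint8_t& data) {
    uint32_t old = ptrs.load(std::memory_order_acquire);
    uint32_t desired;
    do {
        const uint16_t rd = static_cast<uint16_t>(old & 0xFFFF);
        const uint16_t wr = static_cast<uint16_t>(old >> 16);
        if (rd == wr)
            return false; // empty
        data = buf[rd];   // re-read on retry in case the writer wrapped
        desired = (old & 0xFFFF0000u) | ((rd + 1) % BUF_SIZE);
    } while (!ptrs.compare_exchange_weak(old, desired,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed));
    return true;
}
```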

