Shared-Memory IPC Synchronization (Lock-Free)

Boost Interprocess has support for Shared Memory.

Boost Lockfree has a Single-Producer Single-Consumer queue type (spsc_queue). This is basically what you refer to as a circular buffer.

Here's a demonstration that passes IPC messages (in this case, of type string) using this queue, in a lock-free fashion.

Defining the types

First, let's define our types:

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/containers/string.hpp>
#include <boost/lockfree/spsc_queue.hpp>

namespace bip = boost::interprocess;

namespace shm
{
    template <typename T>
    using alloc = bip::allocator<T, bip::managed_shared_memory::segment_manager>;

    using char_alloc    = alloc<char>;
    using shared_string = bip::basic_string<char, std::char_traits<char>, char_alloc>;
    using string_alloc  = alloc<shared_string>;

    using ring_buffer = boost::lockfree::spsc_queue<
        shared_string,
        boost::lockfree::capacity<200>
        // alternatively, pass
        // boost::lockfree::allocator<string_alloc>
    >;
}

For simplicity I chose to demo the compile-time-capacity spsc_queue implementation, arbitrarily requesting a capacity of 200 elements.

The shared_string typedef defines a string that will transparently allocate from the shared memory segment, so its contents are also "magically" shared with the other process.

The consumer side

This is the simpler of the two, so:

int main()
{
    // create segment and corresponding allocator
    bip::managed_shared_memory segment(bip::open_or_create, "MySharedMemory", 65536);
    shm::char_alloc char_alloc(segment.get_segment_manager());

    shm::ring_buffer *queue = segment.find_or_construct<shm::ring_buffer>("queue")();

This opens (or creates) the shared memory area and locates the shared queue, constructing it if it doesn't exist yet. NOTE This should be synchronized in real life.

Now for the actual demonstration:

    while (true)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));

        shm::shared_string v(char_alloc);
        if (queue->pop(v))
            std::cout << "Processed: '" << v << "'\n";
    }
}

The consumer just monitors the queue indefinitely for pending jobs, polling roughly every 10ms and processing one message per iteration.

The producer side

The producer side is very similar:

int main()
{
    bip::managed_shared_memory segment(bip::open_or_create, "MySharedMemory", 65536);
    shm::char_alloc char_alloc(segment.get_segment_manager());

    shm::ring_buffer *queue = segment.find_or_construct<shm::ring_buffer>("queue")();

Again, add proper synchronization to the initialization phase. Also, you would probably make the producer responsible for removing the shared memory segment in due time. In this demonstration, I just "let it hang". This is nice for testing, see below.

So, what does the producer do?

    for (const char* s : { "hello world", "the answer is 42", "where is your towel" })
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(250));
        queue->push({s, char_alloc});
    }
}

Right, the producer produces precisely 3 messages in ~750ms and then exits.

Note that, consequently, if we do (assuming a POSIX shell with job control):

./producer& ./producer& ./producer&
wait

./consumer&

this will print 3x3 messages "immediately", while leaving the consumer running. Doing

./producer& ./producer& ./producer&

again after this will show the messages "trickle in" in realtime (in bursts of 3 at ~250ms intervals), because the consumer is still running in the background.

See the full code online in this gist: https://gist.github.com/sehe/9376856

SRW lock in shared memory

SRW locks cannot be shared between processes. This is implied by the pointed omission in the opening sentence of the documentation, which says:

Slim reader/writer (SRW) locks enable the threads of a single process to access shared resources...

These objects take advantage of the fact that they are used within a single process. For example, the threads waiting to enter the lock are tracked in the form of a linked list. This list of waiting threads obviously has to be kept somewhere outside the SRWLock, seeing as the SRWLock is only the size of a single pointer, and you can't put a list of 10 threads inside a single pointer. That linked list won't be accessible to other processes.

Synchronized access to data in shared memory between two processes

It is possible; you have to set the PTHREAD_PROCESS_SHARED attribute on the mutex:

pthread_mutexattr_t mattr;
pthread_mutexattr_init(&mattr);
pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);

// Init the shared-memory mutex
if ( (rv = pthread_mutex_init(&nshared, &mattr)) != 0 ) {
    fprintf(stderr, "Failed to initialize the shared mutex.\n");
    return rv;
}

Where the variable nshared is mapped in shared memory.

Take a look at this documentation. Also, keep in mind that the default for this attribute is PTHREAD_PROCESS_PRIVATE, i.e. the mutex is not shared among processes.

Also, take a look at these posts: post1, post2.

Bonus code to check the status of the mutex attribute:

void showPshared(pthread_mutexattr_t *mta) {
    int rc;
    int pshared;

    printf("Check pshared attribute\n");
    rc = pthread_mutexattr_getpshared(mta, &pshared);
    if (rc != 0) {
        printf("pthread_mutexattr_getpshared failed\n");
        exit(1);
    }

    printf("The pshared attribute is: ");
    switch (pshared) {
    case PTHREAD_PROCESS_PRIVATE:
        printf("PTHREAD_PROCESS_PRIVATE\n");
        break;
    case PTHREAD_PROCESS_SHARED:
        printf("PTHREAD_PROCESS_SHARED\n");
        break;
    default:
        printf("! pshared Error !\n");
        exit(1);
    }
}

I don't remember where I took this piece of code from... found it! Here is the source of all knowledge.

Are lock-free atomics address-free in practice?

Yes, lock-free atomics are address-free on all C++ implementations on all normal CPUs, and can safely be used on shared memory between processes. Non-lock-free atomics (see footnote 1) won't be safe between processes, though. Each process will have its own hash table of locks (Where is the lock for a std::atomic?).

The C++ standard intends lock-free atomics to work in shared memory between processes, but it can only go as far as "should" without defining terms and so on.

C++draft 29.5 Lock-free property

[ Note: Operations that are lock-free should also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation should not depend on any per-process state. This restriction enables communication by memory that is mapped into a process more than once and by memory that is shared between two processes. — end note ]

This is a quality-of-implementation recommendation that is very easy to implement on current hardware, and in fact you'd have to try hard to design a deathstation9000 C++ implementation that violates it on x86 / ARM / PowerPC / other mainstream CPU while actually being lock-free.


The mechanism hardware exposes for atomic read-modify-write operations is based on MESI cache coherency which only cares about physical addresses. x86 lock cmpxchg / lock add / etc. makes a core hang on to a cache line in Modified state so no other core can read/write it in the middle of the atomic operation. (Can num++ be atomic for 'int num'?).

Most non-x86 architectures use LL/SC, which lets you write a retry loop that only does the store if it will be atomic. LL/SC can emulate CAS with O(1) overhead in a wait-free manner without introducing addresses.

On those architectures, C++ lock-free atomics compile to use LL/SC instructions directly. See my answer on the num++ question for x86 examples. See Atomically clearing lowest non-zero bit of an unsigned integer for some examples of AArch64 code-gen for compare_exchange_weak vs fetch_add using LL/SC instructions.

Atomic pure-load or pure-store are easier and happen for free with aligned data. On x86, see Why is integer assignment on a naturally aligned variable atomic on x86? Other ISAs have similar rules.


Related: I included some comments about address-free in my answer on Genuinely test std::atomic is lock-free or not. I'm not sure whether they're helpful or correct. :/


Footnote 1:

All mainstream CPUs have lock-free atomics for objects up to the width of a pointer. Some have wider atomics, like x86 has lock cmpxchg16b, but not all implementations choose to use it for double-width lock-free atomic objects. Check C++17 std::atomic::is_always_lock_free, or ATOMIC_xxx_LOCK_FREE if defined, for compile-time detection.

(Some microcontrollers can't hold a pointer in a single register (or copy it around with a single operation), but there aren't usually multi-core implementations of such ISAs.)



Why on earth would an implementation use non-address-free atomics that are lock-free?

I don't know any plausible reason on hardware that works anything like normal modern CPUs. You could maybe imagine some exotic architecture where atomic operations are done by submitting the address to some dedicated hardware unit tied to per-process state, but nothing mainstream works that way.

I think the C++ standard wants to avoid constraining non-mainstream implementations as much as possible, e.g. C++ on top of some kind of interpreter, rather than compiled to machine code for a "normal" CPU architecture.

IDK if you could usefully implement C++ on a loosely-coupled shared memory system like a cluster with ethernet links instead of shared memory, or non-coherent shared memory (that has to be flushed explicitly for other threads to see your stores).

I think it's mostly that the C++ committee can't say much about how atomics must be implemented without assuming that implementations will run programs under an OS where multiple processes can set up shared memory.

They might be imagining some future ISA where address-free atomics aren't possible, but I think more likely they don't want to talk about shared-memory between multiple C++ programs. The standard only requires that an implementation run one program.

Apparently std::atomic_flag is actually guaranteed to be address-free (Why only std::atomic_flag is guaranteed to be lock-free?), so IDK why they don't make the same requirement for any atomic<T> that the implementation chooses to implement as lock-free.


