How to Implement Aba Counter With C++11 Cas

How can I implement ABA counter with c++11 CAS?

To atomically modify two things at once with a single atomic operation, you need to put them in adjacent memory, e.g. in a two-member struct. Then you can use std::atomic<my_struct> to get gcc to emit lock cmpxchg16b on x86-64, for example.

You don't need inline asm for this, and it's worth a bit of C++ syntax pain to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm.

Unfortunately, with current compilers, you need to use a union to get efficient code for reading just one of the pair. The "obvious" way of doing an atomic load of the struct and then only using one member still leads to a lock cmpxchg16b to read the whole struct even though we only need one member. (Much slower, and dirties the cache line so readers contend with other readers). I'm confident that a normal 64b load of the pointer would still correctly implement the acquire memory-ordering semantics on x86 (as well as atomicity), but current compilers don't do that optimization even for std::memory_order_relaxed, so we trick them into it with a union.

(submitted GCC bug 80835 about this. TODO: same for clang if this is a useful idea.)

Checklist:

Make sure your compiler is generating efficient code for loading just one member in the read-only case, not a lock cmpxchg16b of the pair. e.g. using a union.
Make sure your compiler guarantees that accessing one member of a union after writing a different union member has well-defined behaviour in that implementation. Union type-punning is legal in in C99 (so this should work well with C11 stdatomic), but it's UB in ISO C++11. However, it's legal in the GNU dialect of C++ (supported by gcc, clang, and ICC, among others).
Make sure your object is 16B-aligned, or 8B-aligned for 32-bit pointers. More generally, alignas(2*sizeof(void*)) should work. Misaligned locked instructions can be very slow on x86, especially if they cross a cache-line boundary. clang3.8 even compiles it to a library call if the object isn't aligned.
Compile with -mcx16 for x86-64 builds. cmpxchg16b was not supported by the earliest x86-64 CPUs (AMD K8), but should be on everything after that. Without -mcx16, you get a library function call (which probably uses a global lock). The 32-bit equivalent, cmpxchg8b, is old enough that modern compilers assume its support. (And can use SSE, MMX, or even x87 to do 64b atomic loads/stores, so using a union is somewhat less important for good performance when reading one member).
Make sure that a pointer+uintptr_t atomic object is lock-free. This is pretty much guaranteed for x32 and 32-bit ABIs (8B object), but not for 16B objects. e.g. MSVC uses a lock for x86-64.
gcc7 and later will call libatomic instead of inlining lock cmpxchg16b, and will return false from atomic_is_lock_free (for reasons including that it's so slow it's not what users expect is_lock_free to mean), but at least for now the libatomic implementation still uses lock cmpxchg16b on targets where that instruction is available. (It can even segfault for read-only atomic objects, so it's really not ideal. More importantly, readers contend with other readers for exclusive access to the cache line. That's why we're going to so much trouble to avoid lock cmpxchg16b for the read side here when we only want one 8-byte half.)

Here's an example of code with a CAS retry-loop that compiles to asm that looks right, and I think is free of UB or other unsafe C++ for implementations that allow union type punning. It's written in C style (non-member functions, etc.) but it would be the same if you wrote member functions.

See the code with asm output from gcc6.3 on the Godbolt compiler explorer. With -m32, it uses cmpxchg8b the same way 64-bit code uses cmpxchg16b. With -mx32 (32-bit pointers in long mode) it can simply use a 64-bit cmpxchg, and normal 64-bit integer loads to grab both members in one atomic load.

This is portable C++11 (except the union type-punning), with nothing x86-specific. It's only efficient on targets that can CAS an object the size of two pointers, though. e.g. it compiles to a call to a __atomic_compare_exchange_16 library function for ARM / ARM64 and MIPS64, as you can see on Godbolt.

It doesn't compile on MSVC, where atomic<counted_ptr> is larger than counted_ptr_separate, so the static_assert catches it. Presumably MSVC includes a lock member in the atomic object.

#include <atomic>
#include <stdint.h>

using namespace std;

struct node {
  // This alignas is essential for clang to use cmpxchg16b instead of a function call
  // Apparently just having it on the union member isn't enough.
  struct alignas(2*sizeof(node*)) counted_ptr {
    node *    ptr;
    uintptr_t count;  // use pointer-sized integers to avoid padding
  };
  
  // hack to allow reading just the pointer without lock-cmpxchg16b,
  // but still without any C++ data race
  struct counted_ptr_separate {
    atomic<node *>    ptr;
    atomic<uintptr_t> count_separate;  // var name emphasizes that accessing this way isn't atomic with ptr
  };

  static_assert(sizeof(atomic<counted_ptr>) == sizeof(counted_ptr_separate), "atomic<counted_ptr> isn't the same size as the separate version; union type-punning will be bogus");
  //static_assert(std::atomic<counted_ptr>{}.is_lock_free());
  
  union {  // anonymous union: the members are directly part of struct node
    alignas(2*sizeof(node*)) atomic<counted_ptr> next_and_count;
    counted_ptr_separate  next;
  };
  // TODO: write member functions to read next.ptr or read/write next_and_count

  int data[4];
};


// make sure read-only access is efficient.
node *follow(node *p) {   // good asm, just a mov load
  return p->next.ptr.load(memory_order_acquire);
}
node *follow_nounion(node *p) {  // really bad asm, using cmpxchg16b to load the whole thing
  return p->next_and_count.load(memory_order_acquire).ptr;
}


void update_next(node &target, node *desired)
{
  // read the old value efficiently to avoid overhead for the no-contention case
  // tearing (or stale data from a relaxed load) will just lead to a retry
  node::counted_ptr expected = {
      target.next.ptr.load(memory_order_relaxed),
      target.next.count_separate.load(memory_order_relaxed) };

  bool success;
  do {
    node::counted_ptr newval = { desired, expected.count + 1 };
    // x86-64: compiles to cmpxchg16b
    success = target.next_and_count.compare_exchange_weak(
                           expected, newval, memory_order_acq_rel);
    // updates exected on failure
  } while( !success );
}

The asm output from clang 4.0 -O3 -mcx16 is:

update_next(node&, node*):
    push    rbx             # cmpxchg16b uses rbx implicitly so it has to be saved/restored
    mov     rbx, rsi
    mov     rax, qword ptr [rdi]     # load the pointer
    mov     rdx, qword ptr [rdi + 8] # load the counter
.LBB2_1:                        # =>This Inner Loop Header: Depth=1
    lea     rcx, [rdx + 1]
    lock
    cmpxchg16b      xmmword ptr [rdi]
    jne     .LBB2_1
    pop     rbx
    ret

gcc does some clunky store/reloads, but is basically the same logic.

follow(node*) compiles to mov rax, [rdi] / ret, so read-only access to the pointer is as cheap as it should be, thanks to the union hack.

It does depend on writing a union through one member and reading it through another, to do efficient reads of just the pointer without using a lock cmpxchg16b. This is guaranteed to work in GNU C++ (and ISO C99/C11), but not in ISO C++. Many other C++ compilers do guarantee that union type-punning works, but even without that it would probably still work: We're always using std::atomic loads which have to assume that the value was modified asynchronously. So we should be immune from aliasing-like problems where values in registers are still considered live after writing the value through another pointer (or union member). Compile-time reordering of things the compiler thinks are independent could be a problem, though.

Atomically reading just the pointer after an atomic cmpxchg of the pointer+counter should still give you acquire/release semantics on x86, but I don't think ISO C++ says anything about it. I would guess that a wide release-store (as part of the compare_exchange_weak will synchronize with a narrower load from the same address on most architectures (like it does on x86), but AFAIK C++ std::atomic doesn't guarantee anything about type-punning.

Not relevant for pointer + ABA-counter, but could come up for other applications of using a union to allow access to subsets of larger atomic object: Don't use the union to allow atomic stores to just the pointer or just the counter. At least not if you care about synchronization with an acquire-load of the pair. Even strongly-ordered x86 can reorder a narrow store with a wider load that fully contains it. Everything is still atomic, but you get into weird territory as far as memory-ordering goes.

On x86-64, an atomic 16B load requires a lock cmpxchg16b (which is a full memory barrier, preventing a preceding narrow store from becoming globally visible after it). But you could easily have a problem if you were using this with 32-bit pointers (or 32-bit array indices), since both halves could be loaded with a regular 64b load. And I have no idea what problems you could see on other architectures, if you need synchronization with other threads and not just atomicity.

To learn more about std::memory_order acquire and release, see Jeff Preshing's excellent articles.

ABA Race Condition

Your example does not really show an ABA problem. Wikipedia:

[..] the ABA problem occurs during synchronization, when a location is read twice, has the same value for both reads, and "value is the same" is used to indicate "nothing has changed". However, another thread can execute between the two reads and change the value, do other work, then change the value back, thus fooling the first thread into thinking "nothing has changed" even though the second thread did work that violates that assumption.

That said, it is your responsibility to ensure that a pointer is still valid (i.e., points to an existing object/memory that has not been freed) at the time it is dereferenced. In case multiple threads are involved, it is also your responsibility to ensure the required happens-before relations (if any) that are necessary to avoid potential data races.

Avoiding the ABA problem in a lock-free algorithm can be quite tricky. It is usually simplest to delegate this to an established memory reclamation scheme. My xenium library provides implementations of various reclamation schemes that can be used for just that.

Partial-compare-and-full-swap for atomic values

I think you have to do a load, then your custom condition, then a compare-and-swap, where the comparison is that the current value is totally equal to the read value. If the last step fails, loop.

template<class T, class F>
bool swap_if(std::atomic<T>& atomic, T desired, F&& condition) {
    for (;;) {
        T data = atomic.load();
        if (!condition(data)) break;
        if (atomic.compare_exchange_weak(data, desired)) return true;
    }
    return false;
}

http://coliru.stacked-crooked.com/a/a394e336628246a9

Due to the complexity, you should probably just use a mutex.
Separately, std::atomic<Data> may already be using a mutex under the covers, since Data is so big.

What is the C++ 11 atomic library equivalent of Java's AtomicMarkableReferenceT

Assuming object's alignement be more than 1, you may use the last bit of the pointer as a boolean mark:

template<class T>
class MarkableReference
{
private:
    uintptr_t val;
    static const uintptr_t mask = 1;
public:
    MarkableReference(T* ref = NULL, bool mark = false)
    {
        val = ((uintptr_t)ref & ~mask) | (mark ? 1 : 0);
    }
    T* getRef(void)
    {
        return (T*)(val & ~mask);
    }
    bool getMark(void)
    {
        return (val & mask);
    }
};

For perform atomic operations, you need to create atomic variable from this class. E.g., if type of object, pointed by a reference, should be int, you may create this variable:

atomic<MarkableReference<int>> mRef;

For variable mRef you may apply any operation, which is applied for normal atomic. For example, Compare-and-Set (CAS):

int a;
int b;
...
// Atomically change pair (&a, true) to (&b, false).
MarkableReference<int> old = mRef.load();
if(old == MarkableReference(&a, true))
{
    if(mRef.compare_exchange_strong(old, MarkableReference(&b, false)))
    {
        // Successful CAS
    }
}
// 'old' contains other value. (Unsuccessfull CAS)

Is there anything like Java's AtomicStampedReference in C++?

There isn't a direct equivalent. You could implement it yourself, the source for AtomicStampedReference is here: https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/util/concurrent/atomic/AtomicStampedReference.java

You could probably implement this in c++ maybe making use of std::atomic<std::shared_ptr> to implement the private volatile Pair<V> pair.

If you don't need the full functionality of AtomicStampedReference you can probably use std::atomic<std::shared_ptr> directly in your code. If you don't have c++20 then you can use the previous stand-alone atomic shared_ptr functions