Is There a Non-Atomic Equivalent of Std::Shared_Ptr? and Why Isn't There One in <Memory>

Is there a non-atomic equivalent of std::shared_ptr? And why isn't there one in memory ?


1. I'm wondering if there is a non-atomic version of std::shared_ptr available

Not provided by the standard. There may well be one provided by a "3rd party" library. Indeed, prior to C++11, and prior to Boost, it seemed like everyone wrote their own reference counted smart pointer (including myself).

2. My second question is why wasn't a non-atomic version of std::shared_ptr provided in C++11?

This question was discussed at the Rapperswil meeting in 2010. The subject was introduced by a National Body Comment #20 by Switzerland. There were strong arguments on both sides of the debate, including those you provide in your question. However, at the end of the discussion, the vote was overwhelmingly (but not unanimous) against adding an unsynchronized (non-atomic) version of shared_ptr.

Arguments against included:

  • Code written with the unsynchronized shared_ptr may end up being used in threaded code down the road, ending up causing difficult to debug problems with no warning.

  • Having one "universal" shared_ptr that is the "one way" to traffic in reference counting has benefits: From the original proposal:

    Has the same object type regardless of features used, greatly facilitating interoperability between libraries, including third-party libraries.

  • The cost of the atomics, while not zero, is not overwhelming. The cost is mitigated by the use of move construction and move assignment which do not need to use atomic operations. Such operations are commonly used in vector<shared_ptr<T>> erase and insert.

  • Nothing prohibits people from writing their own non-atomic reference-counted smart pointer if that's really what they want to do.

The final word from the LWG in Rapperswil that day was:

Reject CH 20. No consensus to make a change at this time.

Is there a non-atomic equivalent of std::shared_ptr? And why isn't there one in memory ?


1. I'm wondering if there is a non-atomic version of std::shared_ptr available

Not provided by the standard. There may well be one provided by a "3rd party" library. Indeed, prior to C++11, and prior to Boost, it seemed like everyone wrote their own reference counted smart pointer (including myself).

2. My second question is why wasn't a non-atomic version of std::shared_ptr provided in C++11?

This question was discussed at the Rapperswil meeting in 2010. The subject was introduced by a National Body Comment #20 by Switzerland. There were strong arguments on both sides of the debate, including those you provide in your question. However, at the end of the discussion, the vote was overwhelmingly (but not unanimous) against adding an unsynchronized (non-atomic) version of shared_ptr.

Arguments against included:

  • Code written with the unsynchronized shared_ptr may end up being used in threaded code down the road, ending up causing difficult to debug problems with no warning.

  • Having one "universal" shared_ptr that is the "one way" to traffic in reference counting has benefits: From the original proposal:

    Has the same object type regardless of features used, greatly facilitating interoperability between libraries, including third-party libraries.

  • The cost of the atomics, while not zero, is not overwhelming. The cost is mitigated by the use of move construction and move assignment which do not need to use atomic operations. Such operations are commonly used in vector<shared_ptr<T>> erase and insert.

  • Nothing prohibits people from writing their own non-atomic reference-counted smart pointer if that's really what they want to do.

The final word from the LWG in Rapperswil that day was:

Reject CH 20. No consensus to make a change at this time.

Why is shared_ptr implemented using control block and not a static map?


So why does a committee decided to stick with a control block implementation?

It doesn't. The committee writes requirements that implementers must follow. They do not specify that std::shared_ptr be implemented in any particular way, so long as that way meets the requirements.

Having said that, your proposed static std::map<T*, size_t> runs foul of this general (unless otherwise specified) requirement:

A C++ standard library function shall not directly or indirectly
access objects ([intro.multithread]) accessible by threads other than
the current thread unless the objects are accessed directly or
indirectly via the function's arguments, including this.

[res.on.data.races#2]

Another compelling reason is that std::shared_ptr type-erases the deleter, so at best you'd have a static std::map<void *, struct { size_t count; size_t weak_count; std::function<void(void*)> deleter; }>, at which point you have two dynamic allocations for the control block, rather than the one (possibly merged with the owned object) that current implementers prefer.

Do shared pointers use atomic operations in reference counting even in a single threaded program


  1. It will depend on the compiler. Visual Studio 2017 is not smart enough. I cannot tell what clang will do (I am not using them on daily basis) but I will bet that their are not that smart either. As @yachoor pointed out in comment "g++ on Linux is smart enough - it doesn't use atomic operations for std::shared_ptr if program isn't linked with pthread"

  2. Not sure but there is no standard way to do this. Take a look at this. You can use std::move operator so there reference will not be increased. If that is not the case I think that there is no easy way to do this.

Regarding point 2, there are other possibilities. You can extract the pointer to that object and pass it as reference everywhere it is needed in your program. As it is single threaded you should be pretty much sure about lifetime of this object. Otherwise you might want to rethink you memory ownership desing.

Concurrent reads on non-atomic variable

Concurrent reads on any variable, whether atomic or not, do not constitute a data race, because of the definition of conflicting evaluations, found in [intro.multithread]:

Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.

Recently, this has moved to [intro.races] with a very subtle change in wording

Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.

The change from accesses to reads took place between draft n4296 and n4431. The splitting of the multithreading section took place between n4582 and n4604.

Why don't compilers merge redundant std::atomic writes?

The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:

  y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code

The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data race bug, but only the garden-variety bug kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables). A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)

Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.

It doesn't depend on the target architecture or hardware; just like compile-time reordering of relaxed atomic operations are allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.


So why don't compilers do this optimization?

It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.

The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.

There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)

Compiler-writers have correctly noticed that programmers expect that an atomic store will actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) i.e. It violates the principle of least surprise.

However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.

Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.


Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:

  • http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
  • http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?

See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)


Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.

Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).

Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.

I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing code C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].

wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:

if(x) {
foo();
y.store(0);
} else {
bar();
y.store(0); // release a lock before a long-running loop
for() {...} // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.

Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.

volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.


Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). This is not sufficient, because N4455 says LLVM already implements or could easily implement several of the optimizations it mentioned.

The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.

Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).

Single threaded shared pointer for simple inclusion in large project

It is quite possible to write/adapt your own smart pointer class for use in legacy projects that the newer libraries do not support. I have written my own for just such a reason and it works well with MSVC++ 4.2 and onwards.

See ootips.org/yonat/4dev/smart-pointers.html which is what I based mine on. Definitely a possibility if you want a small solution. Just the header and .cpp file required.

The key point to watch out for is the lack of the explicit keyword in older compilers. Another is that you may want to allow implicit conversion to the raw pointer to allow your APIs to remain less affected (we did this) and in that case you should also take care to prevent conversion to other types of pointers as well.

Linking pthread disables lock-free shared_ptr implementation

If you use shared_ptr in a threaded environment, you NEED to have locks [of some kind - they could be implemented as atomic increment and decrement, but there may be places where a "bigger" lock is required to ensure no races]. The lockless version only works when there is only one thread. If you are not using threads, don't link with -lpthread.

I'm sure there is some tricky way to convince the compiler that you are not REALLY using the threads for your shared pointers, but you are REALLY in fragile territory if you do - what happens if a shared_ptr is passed to a thread? You may be able to guarantee that NOW, but someone will probably accidentally or on purpose introduce one into something that runs in a different thread, and it all breaks.



Related Topics



Leave a reply



Submit