Shared_Ptr: Horrible Speed

How much is the overhead of smart pointers compared to normal pointers in C++?

std::unique_ptr has memory overhead only if you provide it with some non-trivial deleter.

std::shared_ptr always has memory overhead for reference counter, though it is very small.

std::unique_ptr has time overhead only during constructor (if it has to copy the provided deleter and/or null-initialize the pointer) and during destructor (to destroy the owned object).

std::shared_ptr has time overhead in constructor (to create the reference counter), in destructor (to decrement the reference counter and possibly destroy the object) and in assignment operator (to increment the reference counter). Due to thread-safety guarantees of std::shared_ptr, these increments/decrements are atomic, thus adding some more overhead.

Note that none of them has time overhead in dereferencing (in getting the reference to owned object), while this operation seems to be the most common for pointers.

To sum up, there is some overhead, but it shouldn't make the code slow unless you continuously create and destroy smart pointers.

Which is faster on Visual C++ 2010 - std::shared_ptr or boost::shared_ptr?

Ok, so it doesn't look like anyone has done this. Here's what I found using the standard VC 10 optimized settings for a WIN32 console app:

Visual C++ 2010 SP1 std::make_shared and std::shared_ptr were faster than the Boost 1.46.1 equivalents when populating a vector of 10 million pointer entries ( 1.96 secs versus 0.92 secs averaged across 20 runs)
Boost 1.46.1 was slightly faster than Visual C++ 2010 SP1 when copying an array of 10 million pointer entries ( 0.15 secs versus 0.17 secs averaged over 20 runs)
Visual C++ 2010 SP1 was slightly faster than the Boost 1.46.1 equivalents when dereferencing a vector of 10 million pointer entries 20 times ( 0.72 secs versus 0.811 secs averaged over 20 runs)

CONCLUSION: There was a significant difference when creating shared_ptrs to populate a vector. The Visual C++ 2010 shared_ptr was nearly twice as fast indicating a substantial difference in implementation compared to Boost 1.46.1.

The other tests didn't show a significant difference.

Here's the code I used:

#include "stdafx.h"

struct A
{
    A( const unsigned A) : m_value(A)
    {
    }

    const unsigned m_value;
};

typedef std::shared_ptr<A> APtr;
typedef boost::shared_ptr<A> ABoostPtr;

double TestSTLCreateSpeed()
{
    const unsigned NUM_ENTRIES = 10000000;
    std::vector<APtr> buffer;
    buffer.reserve(NUM_ENTRIES);

    boost::timer timer;

    for( unsigned nEntry = 0; nEntry < NUM_ENTRIES; ++nEntry)
    {
        buffer.emplace_back( std::make_shared<A>(nEntry) );
    }

    const double timeTaken = timer.elapsed();

    std::cout << "STL create test took " << timeTaken << " secs.\r\n";
    return timeTaken;
}

double BoostSTLCreateSpeed()
{
    const unsigned NUM_ENTRIES = 10000000;
    std::vector<ABoostPtr> buffer;
    buffer.reserve(NUM_ENTRIES);

    boost::timer timer;

    for( unsigned nEntry = 0; nEntry < NUM_ENTRIES; ++nEntry)
    {
        buffer.emplace_back( boost::make_shared<A>(nEntry) );
    }

    const double timeTaken = timer.elapsed();

    std::cout << "BOOST create test took " << timeTaken << " secs.\r\n";
    return timeTaken;
}

double TestSTLCopySpeed()
{
    const unsigned NUM_ENTRIES = 10000000;
    std::vector<APtr> buffer;
    buffer.reserve(NUM_ENTRIES);

    for( unsigned nEntry = 0; nEntry < NUM_ENTRIES; ++nEntry)
    {
        buffer.emplace_back( std::make_shared<A>(nEntry) );
    }

    boost::timer timer;
    std::vector<APtr> buffer2 = buffer;

    const double timeTaken = timer.elapsed();

    std::cout << "STL copy test took " << timeTaken << " secs.\r\n";
    return timeTaken;
}

double TestBoostCopySpeed()
{
    const unsigned NUM_ENTRIES = 10000000;
    std::vector<ABoostPtr> buffer;
    buffer.reserve(NUM_ENTRIES);

    for( unsigned nEntry = 0; nEntry < NUM_ENTRIES; ++nEntry)
    {
        buffer.emplace_back( boost::make_shared<A>(nEntry) );
    }

    boost::timer timer;
    std::vector<ABoostPtr> buffer2 = buffer;

    const double timeTaken = timer.elapsed();

    std::cout << "BOOST copy test took " << timeTaken << " secs.\r\n";
    return timeTaken;
}

double TestBoostDerefSpeed()
{
    const unsigned NUM_ENTRIES = 10000000;
    std::vector<ABoostPtr> buffer;
    buffer.reserve(NUM_ENTRIES);

    for( unsigned nEntry = 0; nEntry < NUM_ENTRIES; ++nEntry)
    {
        buffer.emplace_back( boost::make_shared<A>(nEntry) );
    }

    boost::timer timer;

    unsigned total = 0;

    for(unsigned nIter = 0; nIter < 20; ++nIter)
    {
        std::for_each( buffer.begin(), buffer.end(),
            [&](const ABoostPtr& pA){ 
                total += pA->m_value;
        });
    }

    const double timeTaken = timer.elapsed();

    std::cout << "BOOST deref total =  " << total << ".\r\n";

    std::cout << "BOOST deref test took " << timeTaken << " secs.\r\n";
    return timeTaken;
}

double TestSTLDerefSpeed()
{
    const unsigned NUM_ENTRIES = 10000000;
    std::vector<APtr> buffer;
    buffer.reserve(NUM_ENTRIES);

    for( unsigned nEntry = 0; nEntry < NUM_ENTRIES; ++nEntry)
    {
        buffer.emplace_back( std::make_shared<A>(nEntry) );
    }

    boost::timer timer;

    unsigned total = 0;
    for(unsigned nIter = 0; nIter < 20; ++nIter)
    {
        std::for_each( buffer.begin(), buffer.end(),
            [&](const APtr& pA){ 
                total += pA->m_value;
        });
    }

    const double timeTaken = timer.elapsed();

    std::cout << "STL deref total =  " << total << ".\r\n";

    std::cout << "STL deref test took " << timeTaken << " secs.\r\n";
    return timeTaken;
}

int _tmain(int argc, _TCHAR* argv[])
{
    double totalTime = 0.0;
    const unsigned NUM_TESTS = 20;

    totalTime = 0.0;

    for ( unsigned nTest = 0; nTest < NUM_TESTS; ++nTest)
    {
        totalTime += BoostSTLCreateSpeed();
    }

    std::cout << "BOOST create test took " << totalTime / NUM_TESTS << " secs average.\r\n";

    totalTime = 0.0;
    for ( unsigned nTest = 0; nTest < NUM_TESTS; ++nTest)
    {
        totalTime += TestSTLCreateSpeed();
    }

    std::cout << "STL create test took " << totalTime / NUM_TESTS << " secs average.\r\n";

    totalTime = 0.0;
    for ( unsigned nTest = 0; nTest < NUM_TESTS; ++nTest)
    {
        totalTime += TestBoostCopySpeed();
    }

    std::cout << "BOOST copy test took " << totalTime / NUM_TESTS << " secs average.\r\n";

    totalTime = 0.0;
    for ( unsigned nTest = 0; nTest < NUM_TESTS; ++nTest)
    {
        totalTime += TestSTLCopySpeed();
    }

    std::cout << "STL copy test took " << totalTime / NUM_TESTS << " secs average.\r\n";

    totalTime = 0.0;
    for ( unsigned nTest = 0; nTest < NUM_TESTS; ++nTest)
    {
        totalTime += TestBoostDerefSpeed();
    }

    std::cout << "Boost deref test took " << totalTime / NUM_TESTS << " secs average.\r\n";

    totalTime = 0.0;
    for ( unsigned nTest = 0; nTest < NUM_TESTS; ++nTest)
    {
        totalTime += TestSTLDerefSpeed();
    }

    std::cout << "STL deref test took " << totalTime / NUM_TESTS << " secs average.\r\n";

    return 0;
}

I'll wait a while and if no one has refuted my results or come up with some better conclusions I'll accept my own answer.

Should we pass a shared_ptr by reference or by value?

This question has been discussed and answered by Scott, Andrei and Herb during Ask Us Anything session at C++ and Beyond 2011. Watch from 4:34 on shared_ptr performance and correctness.

Shortly, there is no reason to pass by value, unless the goal is to share ownership of an object (eg. between different data structures, or between different threads).

Unless you can move-optimise it as explained by Scott Meyers in the talk video linked above, but that is related to actual version of C++ you can use.

A major update to this discussion has happened during GoingNative 2012 conference's Interactive Panel: Ask Us Anything! which is worth watching, especially from 22:50.

shared_ptr, unique_ptr, ownership, Am I overdoing it in this particular case?

It looks to me like your Grid owns the triangles. I am assuming the triangles are relatively light (3-5 dimensions?).

I would think something like this might suit. I am using a container in the Grid to take ownership of the triangles by value. The container will delete the triangles when the Grid goes out of scope.

Then each Cell simply uses raw pointers to keep track of which triangles it references. The Cells don't own the triangles they simply hold pointers to the triangles that are owned by the Grid.

class Grid
{
    struct Cell
    {
        std::vector<Triangle*> triList; // non owning
    };

    void insert(Triangle tri) // pass by value
    {
        tris.push_back(tri); // Grid owns this by value

        for(each cell overlapped by tri) {
            // compute cell index
            uint32_t i = ...

            cells[i].triList.push_back(&tris.back());
        }
    }

    // Use a deque because it won't re-allocate when adding
    // new elements to the end
    std::deque<Triangle> tris; // elements owned by value

   Cell cells[RES * RES * RES]; // point to owned elements
};

void createPoolOfTrianglesAndInsertIntoGrid()
{
    Grid grid; // owns the triangles (by value)

    uint32_t maxTris = 32;

    std::vector<Triangle> tris(maxTris);
    // process the triangles
    // ...
    // now insert into grid
    for(auto tri: tris)
        grid.insert(tri);
    }

    // no need to delete tris here ... it should be done by
    // the grid when we go out of this function's scope
}

NOTE: I use a std::deque to store the triangles by value in the Grid. That's because it won't ever reallocate its internal memory when adding new triangles. If you used a std::vector here your raw pointers would end up dangling when the std::vector resized itself.

Alternatively:

Given that it looks like you build all your triangles in your function and then pass all of them into Grid, why do that one at a time? You could pass the whole container in in one go. If you do this using move semantics you won't even have to copy anything:

class Grid
{
    struct Cell
    {
        std::vector<Triangle*> triList; // non owning
    };

    // Accept the entire container in-tack
    // (no reallocations allowed after this point)
    void insert(std::vector<Triangle> tris) // pass by value (able to move in)
    {
        //
        for(auto& tri: tris)
        {
            for(each cell overlapped by tri) {
                // compute cell index
                uint32_t i = ...

                cells[i].triList.push_back(&tri);
            }
        }
    }

    // Using a vector so it MUST NOT be resized after
    // Cells have been pointed to its elements!!!
    std::vector<Triangle> tris; // elements owned by value
    Cell cells[RES * RES * RES]; // point to owned elements
};

void createPoolOfTrianglesAndInsertIntoGrid()
{
    Grid grid; // owns the triangles (by value)

    uint32_t maxTris = 32;

    // Build the triangles into
    std::vector<Triangle> tris(maxTris);
    // process the triangles
    // ...
    // now insert into grid
    grid.insert(std::move(tris)); // move the whole darn container (very efficient)

    // no need to delete tris here ... it should be done by
    // the grid when we go out of this function's scope
}

NOTE: Now I used a std::vector because you are not adding triangles one by one after they arrive in Grid. But you MUST make sure that inside Grid the owning `std::vector is not resized.

C++ shared_ptr order of release

Assuming you reversed the order of objects in the original question (otherwise, the question doesn't make any sense), you do not need to change the order of release, instead, you should use aliasing form of shared_ptr constructor. Something like that:

std::shared_ptr<B> b(a, a->getSomething());

Why would I std::move an std::shared_ptr?

I think that the one thing the other answers did not emphasize enough is the point of speed.

std::shared_ptr reference count is atomic. increasing or decreasing the reference count requires atomic increment or decrement. This is hundred times slower than non-atomic increment/decrement, not to mention that if we increment and decrement the same counter we wind up with the exact number, wasting a ton of time and resources in the process.

By moving the shared_ptr instead of copying it, we "steal" the atomic reference count and we nullify the other shared_ptr. "stealing" the reference count is not atomic, and it is hundred times faster than copying the shared_ptr (and causing atomic reference increment or decrement).

Do note that this technique is used purely for optimization. copying it (as you suggested) is just as fine functionality-wise.

Moving a std::shared_ptr crashes the program

I don't see anything wrong with this code, and it tests out OK on http://ideone.com/jlShgB too:

#include <memory>
#include <utility>
#include <string>
#include <cassert>

enum GLenum { foo };

// this class doesn't have any default, copy constructors.
struct Dep
{
    Dep(std::string path, GLenum type) {}
    Dep() = delete;
    Dep(Dep const&) = delete;
};

struct Program
{
    std::shared_ptr<Dep> dep1;
    std::shared_ptr<Dep> dep2;

#if 1
    template <class T, class = typename std::enable_if<std::is_constructible<std::shared_ptr<Dep>, T>::value>::type>
    Program(T&& dep1, T&& dep2)
        : dep1(std::forward<T>(dep1)), dep2(std::forward<T>(dep2))
    {
    }
#else
    Program(std::shared_ptr<Dep> dep1, std::shared_ptr<Dep> dep2)
        : dep1(std::move(dep1)), dep2(std::move(dep2))
    {
    }
#endif
};

int main()
{
    auto dep1 = std::make_shared<Dep>("dep1", foo);
    auto dep2 = std::make_shared<Dep>("dep2", foo);
    Program p(std::move(dep1), std::move(dep2));

    assert(!dep1 && !dep2);
}

Of course if you change #if 1 to #if 0, the assert will raise an exception because the dep1/dep2 will not have been moved from.

This leads me to suspect another issue somewhere else. If you can isolate a SSCCE that exhibits the problem, please let me know.

Passing const shared_ptrT& versus just shared_ptrT as parameter

The advantage of passing the shared_ptr by const& is that the reference count doesn't have to be increased and then decreased. Because these operations have to be thread-safe, they can be expensive.

You are quite right that there is a risk that you can have a chain of passes by reference that later invalidates the head of the chain. This happened to me once in a real-world project with real-world consequences. One function found a shared_ptr in a container and passed a reference to it down a call stack. A function deep in the call stack removed the object from the container, causing all the references to suddenly refer to an object that no longer existed.

So when you pass something by reference, the caller must ensure it survives for the life of the function call. Don't use a pass by reference if this is an issue.

(I'm assuming you have a use case where there's some specific reason to pass by shared_ptr rather than by reference. The most common such reason would be that the function called may need to extend the life of the object.)

Update: Some more details on the bug for those interested: This program had objects that were shared and implemented internal thread safety. They were held in containers and it was common for functions to extend their lifetimes.

This particular type of object could live in two containers. One when it was active and one when it was inactive. Some operations worked on active objects, some on inactive objects. The error case occurred when a command was received on an inactive object that made it active while the only shared_ptr to the object was held by the container of inactive objects.

The inactive object was located in its container. A reference to the shared_ptr in the container was passed, by reference, to the command handler. Through a chain of references, this shared_ptr ultimately got to the code that realized this was an inactive object that had to be made active. The object was removed from the inactive container (which destroyed the inactive container's shared_ptr) and added to the active container (which added another reference to the shared_ptr passed to the "add" routine).

At this point, it was possible that the only shared_ptr to the object that existed was the one in the inactive container. Every other function in the call stack just had a reference to it. When the object was removed from the inactive container, the object could be destroyed and all those references were to a shared_ptr that that no longer existed.

It took about a month to untangle this.