Compelling Examples of Custom C++ Allocators

Compelling examples of custom C++ allocators?

As I mention here, I've seen Intel TBB's custom STL allocator significantly improve performance of a multithreaded app simply by changing a single

std::vector<T>

std::vector<T,tbb::scalable_allocator<T> >

(this is a quick and convenient way of switching the allocator to use TBB's nifty thread-private heaps; see page 7 in this document)

Custom allocator performance

Your method only allocates the raw memory in one chunk and then has to do a placement new for each element. Combine that with all the overhead in your bitmap and its not too surprising that the default new allocation beats yours assuming an empty heap.

To get the most speed improvement when allocating you can allocate the entire object in one large array and then assign to it from there. If you look at a very simple and contrived benchmark:

struct test_t {
    float f;
    int i;
    test_t* pNext;
};

const size_t NUM_ALLOCS = 50000000;

void TestNew (void)
{
    test_t* pPtr = new test_t;

    for (int i = 0; i < NUM_ALLOCS; ++i)
    {
        pPtr->pNext = new test_t;
        pPtr = pPtr->pNext;
    }

}

void TestBucket (void)
{
    test_t* pBuckets = new test_t[NUM_ALLOCS + 2];
    test_t* pPtr = pBuckets++;

    for (int i = 0; i < NUM_ALLOCS; ++i)
    {
        pPtr->pNext = pBuckets++;
        pPtr = pPtr->pNext;
    }

}

With this code on MSVC++ 2013 with 50M allocations TestBucket() outperforms TestNew() by over a factor of x16 (130 vs 2080 ms). Even if you add a std::bitset<> to track allocations it is still x4 faster (400 ms).

An important thing to remember about new is that the time its takes to allocate an object generally depends on the state of the heap. An empty heap will be able to allocate a bunch of constant sized objects like this relatively fast, which is probably one reason why your code seems slower than new. If you have a program that runs for a while and allocates a large number of differently sized objects the heap can become fragmented and allocating objects can take much (much) longer.

As an example, one program I wrote was loading a 200MB file with millions of records...lots of differently sized allocations. On the first load it took ~15 seconds but if I deleted that file and tried to load it again it took something like x10-x20 longer. This was entirely due to memory allocation and switching to a simple bucket/arena allocator fixed the issue. So, that contrived benchmark I did showing a x16 speedup might actually get show a significantly larger difference with a fragmented heap.

It gets even more tricky when you realize that different systems/platforms may use different memory allocation schemes so the benchmark results on one system may be different from another.

To distil this into a few short points:

Benchmarking memory allocation is tricky (performance depends on a lot of things)
In some cases you can get better performance with a custom allocator. In a few cases you can get much better.
Creating a custom allocator can be tricky and takes time to profile/benchmark your specific use case.

Note -- Benchmarks like this aren't meant to be realistic but are useful to determine the upper bound of how fast something can be. It can be used along with the profile/benchmark of your actual code to help determine what should/shouldn't be optimized.

Update -- I can't seem to replicate your results in my code under MSVC++ 2013. Using the same structure as your avlnode and trying a placement new yields the same speed as my non-placement bucket allocator tests (placement new was actually a little bit faster). Using a class similar to your avltree doesn't affect the benchmark much. With 10 million allocations/deallocations I'm getting ~800 ms for the new/delete and ~200ms for the custom allocator (both with and without placement new). While I'm not worried about the difference in absolute times, the relative time difference seems odd.

I would suggest taking a closer look at your benchmark and make sure you are measuring what you think you are. If the code exists in a larger code-base then create a minimal test case to benchmark it. Make sure that your compiler optimizer is not doing something that would invalidate the benchmark (it happens too easily these days).

Note that it would be far easier to answer your question if you had reduced it to a minimal example and included the complete code in the question, including the benchmark code. Benchmarking is one of those things that seems easy but there are a lot of "gotchas" involved in it.

Update 2 -- Including the basic allocator class and benchmark code I'm using so others can try to duplicate my results. Note that this is for testing only and is far from actual working/production code. It is far simpler than your code which may be why we're getting different results.

#include <string>
#include <Windows.h>

struct test_t 
{
    __int64 key;
    __int64 weight;
    __int64 left;
    __int64 right;
    test_t* pNext;     // Simple linked list

    test_t() : key(0), weight(0), pNext(NULL), left(0), right(0) { }
    test_t(const __int64 k) : key(k), weight(0), pNext(NULL), left(0), right(0) { }
};

const size_t NUM_ALLOCS = 10000000;
test_t* pLast; //To prevent compiler optimizations from being "smart"

struct CTest 
{
    test_t* m_pBuffer;
    size_t  m_MaxSize;
    size_t  m_FreeIndex;
    test_t* m_pFreeList;

    CTest(const size_t Size) :
            m_pBuffer(NULL),
            m_MaxSize(Size),
            m_pFreeList(NULL),
            m_FreeIndex(0)
    {
        if (m_MaxSize > 0) m_pBuffer = (test_t *) new char[sizeof(test_t) * (m_MaxSize + 1)];
    }

    test_t* NewNode(__int64 key)
    {
        if (!m_pBuffer || m_FreeIndex >= m_MaxSize) return new test_t(key);

        size_t Pos = m_FreeIndex;
        ++m_FreeIndex;
        return new (&m_pBuffer[Pos]) test_t(key);
    }

    void DeleteNode(test_t* pNode)
    {
        if (!m_pBuffer) {
            delete pNode;
        }
        else
        {
            pNode->pNext = m_pFreeList;
            m_pFreeList = pNode;
        }
    }

};


void TestNew(void)
{
    test_t* pPtr = new test_t;
    test_t* pFirst = pPtr;

    for (int i = 0; i < NUM_ALLOCS; ++i)
    {
        pPtr->pNext = new test_t;
        pPtr = pPtr->pNext; 
    }

    pPtr = pFirst;

    while (pPtr)
    {
        test_t* pTemp = pPtr;
        pPtr = pPtr->pNext;
        delete pTemp;
    }

    pLast = pPtr;    
}


void TestClass(const size_t BufferSize)
{
    CTest Alloc(BufferSize);
    test_t* pPtr = Alloc.NewNode(0);
    test_t* pFirstPtr = pPtr;

    for (int i = 0; i < NUM_ALLOCS; ++i)
    {
        pPtr->pNext = Alloc.NewNode(i);
        pPtr = pPtr->pNext;
    }

    pLast = pPtr;
    pPtr = pFirstPtr;

    while (pPtr != NULL)
    {
        test_t* pTmp = pPtr->pNext;
        Alloc.DeleteNode(pPtr);
        pPtr = pTmp;
    }
}


int main(void)
{
    DWORD StartTick = GetTickCount();
    TestClass(0);
    //TestClass(NUM_ALLOCS + 10);
    //TestNew();
    DWORD EndTick = GetTickCount();

    printf("Time = %u ms\n", EndTick - StartTick);
    printf("Last = %p\n", pLast);

    return 0;
}

Currently I'm getting ~800ms for both TestNew() and TestClass(0) and under 200ms for TestClass(NUM_ALLOCS + 10). The custom allocator is pretty fast as it operates on the memory in a completely linear fashion which allows the memory cache to work its magic. I'm also using GetTickCount() for simplicity and it is accurate enough so long as times are above ~100ms.

How can I write a stateful allocator in C++11, given requirements on copy construction?

1) Are my interpretations above correct?

You are right that your free-list might not be a good fit for allocators, it need be able to handle multiple sizes (and alignments) to fit. That's a problem for the free-list to solve.

2) I've read in a few places that C++11 improved support for "stateful allocators". How is that the case, given these restrictions?

It is not so much improved, than born. In C++03 the standard only nudged implementers toward providing allocators which could support non-equal instances and implementers, effectively making stateful allocators non-portable.

3) Do you have any suggestions for how to do the sort of thing I'm trying to do? That is, how do I include allocated-type-specific state in my allocator?

Your allocator may have to be flexible, because you are not supposed to know exactly what memory (and what types) it is supposed to allocate. This requirement is necessary to insulate you (the user) from the internals of some of the container that uses the allocator such as std::list, std::set or std::map.

You can still use such allocators with simple containers such as std::vector or std::deque.

Yes, it is a costly requirement.

4) In general, the language around allocators seems sloppy. (For example, the prologue to Table 28 says to assume that a is of type X&, but some of the expressions redefine a.) Also, at least GCC's support is non-conformant. What accounts for this weirdness around allocators? Is it just an infrequently used feature?

The Standard in general is not exactly easy to read, not only allocators. You do have to be careful.

To be pedant, gcc does not support allocators (it's a compiler). I surmise that you are speaking about libstdc++ (the Standard Library implementation shipped with gcc). libstdc++ is old, and thus it was tailored to C++03. It has been adapted toward C++11, but is not fully conformant yet (still uses Copy-On-Write for strings, for example). The reason is that libstdc++ has a huge focus on binary compatibility, and a number of changes required by C++11 would break this compatibility; they must therefore be introduced carefully.

C++0x allocators

According to the giant table of allocator requirements in section 17.6.3.5, an allocator must be copyable. Containers are allowed to copy them freely. So you need to store the pool in a std::shared_ptr or something similar in order to prevent deletion while one of the allocators is in existence.

Compelling Examples of Custom C++ Allocators