Performance Degradation Due to Default Initialisation of Elements in Standard Containers

Performance degradation due to default initialisation of elements in standard containers

Okay, here is what I've learned since asking this question.

Q1 (Is there any legal way to use a standard library container which would give these latter timings?) Yes to some degree, as shown in the answers by Mark and Evgeny. The method of providing a generator to the constructor of std::vector elides the default construction.

Q2 (Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?) No, I don't think so. The reason is that on construction any standard-compliant container must initialise its elements, already to ensure that the call to the element destructors (upon destruction or resizing of the container) is well-formed. As the std library containers do not support the usage of multi-threading in constructing their elements, the trick of Q1 cannot be replicated here, so we cannot elide the default construction.

Thus, if we want to use C++ for high-performance computing our options are somewhat limited when it comes to managing large amounts of data. We can

1 declare a container object and, in the same compilation unit, immediately fill it (concurrently), when the compiler hopefully optimizes the initialization at construction away;

2 resort to new[] and delete[] or even malloc() and free(), when all the memory management and, in the latter case, construction of elements is our responsibility and our potential usage of the C++ standard library very limited.

3 trick a std::vector to not initialise its elements using a custom unitialised_allocator that elides the default construction. Following the ideas of Jared Hoberock such an allocator could look like this (see also here):

// based on a design by Jared Hoberock
// edited (Walter) 10-May-2013, 23-Apr-2014
template<typename T, typename base_allocator = std::allocator<T> >
struct uninitialised_allocator
  : base_allocator
{
  static_assert(std::is_same<T,typename base_allocator::value_type>::value,
                "allocator::value_type mismatch");

  template<typename U>
  using base_t =
    typename std::allocator_traits<base_allocator>::template rebind_alloc<U>;

  // rebind to base_t<U> for all U!=T: we won't leave other types uninitialised!
  template<typename U>
  struct rebind
  {
    typedef typename
    std::conditional<std::is_same<T,U>::value,
                     uninitialised_allocator, base_t<U> >::type other; 
  }

  // elide trivial default construction of objects of type T only
  template<typename U>
  typename std::enable_if<std::is_same<T,U>::value && 
                          std::is_trivially_default_constructible<U>::value>::type
  construct(U*) {}

  // elide trivial default destruction of objects of type T only
  template<typename U>
  typename std::enable_if<std::is_same<T,U>::value && 
                          std::is_trivially_destructible<U>::value>::type
  destroy(U*) {}

  // forward everything else to the base
  using base_allocator::construct;
  using base_allocator::destroy;
};

Then an unitialised_vector<> template could be defined like this:

template<typename T, typename base_allocator = std::allocator<T>>
using uninitialised_vector = std::vector<T,uninitialised_allocator<T,base_allocator>>;

and we can still use almost all of the standard library's functionality. Though it must be said that the uninitialised_allocator, and hence by implication the unitialised_vector are not standard compliant, because its elements are not default constructed (e.g. a vector<int> will not have all 0 after construction).

When using this tool for my little test problem, I get excellent results:

timing vector::vector(n) + set_v0();
n=10000 time: 3.7e-05 sec
n=100000 time: 0.000334 sec
n=1000000 time: 0.002926 sec
n=10000000 time: 0.028649 sec
n=100000000 time: 0.293433 sec

timing vector::vector() + vector::reserve() + set_v1();
n=10000 time: 2e-05 sec
n=100000 time: 0.000178 sec
n=1000000 time: 0.001781 sec
n=10000000 time: 0.020922 sec
n=100000000 time: 0.428243 sec

timing vector::vector() + vector::reserve() + set_v0();
n=10000 time: 9e-06 sec
n=100000 time: 7.3e-05 sec
n=1000000 time: 0.000821 sec
n=10000000 time: 0.011685 sec
n=100000000 time: 0.291055 sec

timing vector::vector(n) + omp parllel set_v0();
n=10000 time: 0.00044 sec
n=100000 time: 0.000183 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.00892 sec
n=100000000 time: 0.088051 sec

timing vector::vector() + vector::reserve() + omp parallel set_v0();
n=10000 time: 0.000192 sec
n=100000 time: 0.000202 sec
n=1000000 time: 0.00067 sec
n=10000000 time: 0.008596 sec
n=100000000 time: 0.088045 sec

when there is no difference any more between the cheating and "legal" versions.

Avoiding default construction of elements in standard containers

I think the problem boils down to the type of initialization that the container performs on elements. Compare:

T * p1 = new T;   // default-initalization
T * p2 = new T(); // value-initialization

The problem with the standard containers is that they take the default argument to be value initialized, as in resize(size_t, T = T()). This means that there's no elegant way to avoid value-initialization or copying. (Similarly for the constructor.)

Even using the standard allocators doesn't work, because their central construct() function takes an argument that becomes value-initialized. What you would rather need is a construct() that uses default-initialization:

template <typename T>
void definit_construct(void * addr)
{
  new (addr) T;  // default-initialization
}

Such a thing wouldn't be a conforming standard allocator any more, but you could build your own container around that idea.

C++ vector that doesn't initialize its members?

For default and value initialization of structs with user-provided default constructors which don't explicitly initialize anything, no initialization is performed on unsigned char members:

struct uninitialized_char {
    unsigned char m;
    uninitialized_char() {}
};

// just to be safe
static_assert(1 == sizeof(uninitialized_char), "");

std::vector<uninitialized_char> v(4 * (1<<20));

GetMyDataFromC(reinterpret_cast<unsigned char*>(&v[0]), v.size());

I think this is even legal under the strict aliasing rules.

When I compared the construction time for v vs. a vector<unsigned char> I got ~8 µs vs ~12 ms. More than 1000x faster. Compiler was clang 3.2 with libc++ and flags: -std=c++11 -Os -fcatch-undefined-behavior -ftrapv -pedantic -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-missing-prototypes

C++11 has a helper for uninitialized storage, std::aligned_storage. Though it requires a compile time size.

Here's an added example, to compare total usage (times in nanoseconds):

VERSION=1 (vector<unsigned char>):

clang++ -std=c++14 -stdlib=libc++ main.cpp -DVERSION=1 -ftrapv -Weverything -Wno-c++98-compat -Wno-sign-conversion -Wno-sign-compare -Os && ./a.out

initialization+first use: 16,425,554
array initialization: 12,228,039
first use: 4,197,515
second use: 4,404,043

VERSION=2 (vector<uninitialized_char>):

clang++ -std=c++14 -stdlib=libc++ main.cpp -DVERSION=2 -ftrapv -Weverything -Wno-c++98-compat -Wno-sign-conversion -Wno-sign-compare -Os && ./a.out

initialization+first use: 7,523,216
array initialization: 12,782
first use: 7,510,434
second use: 4,155,241

#include <iostream>
#include <chrono>
#include <vector>

struct uninitialized_char {
  unsigned char c;
  uninitialized_char() {}
};

void foo(unsigned char *c, int size) {
  for (int i = 0; i < size; ++i) {
    c[i] = '\0';
  }
}

int main() {
  auto start = std::chrono::steady_clock::now();

#if VERSION==1
  using element_type = unsigned char;
#elif VERSION==2
  using element_type = uninitialized_char;
#endif

  std::vector<element_type> v(4 * (1<<20));

  auto end = std::chrono::steady_clock::now();

  foo(reinterpret_cast<unsigned char*>(v.data()), v.size());

  auto end2 = std::chrono::steady_clock::now();

  foo(reinterpret_cast<unsigned char*>(v.data()), v.size());

  auto end3 = std::chrono::steady_clock::now();

  std::cout.imbue(std::locale(""));
  std::cout << "initialization+first use: " << std::chrono::nanoseconds(end2-start).count() << '\n';
  std::cout << "array initialization: " << std::chrono::nanoseconds(end-start).count() << '\n';
  std::cout << "first use: " << std::chrono::nanoseconds(end2-end).count() << '\n';
  std::cout << "second use: " << std::chrono::nanoseconds(end3-end2).count() << '\n';
}

I'm using clang svn-3.6.0 r218006

How to make my uninitialised_allocator safe?

Fwiw, I think the design can be simplified, assuming a C++11 conforming container:

template <class T>
class no_init_allocator
{
public:
    typedef T value_type;

    no_init_allocator() noexcept {}
    template <class U>
        no_init_allocator(const no_init_allocator<U>&) noexcept {}
    T* allocate(std::size_t n)
        {return static_cast<T*>(::operator new(n * sizeof(T)));}
    void deallocate(T* p, std::size_t) noexcept
        {::operator delete(static_cast<void*>(p));}
    template <class U>
        void construct(U*) noexcept
        {
            static_assert(std::is_trivially_default_constructible<U>::value,
            "This allocator can only be used with trivally default constructible types");
        }
    template <class U, class A0, class... Args>
        void construct(U* up, A0&& a0, Args&&... args) noexcept
        {
            ::new(up) U(std::forward<A0>(a0), std::forward<Args>(args)...);
        }
};

I see little advantage to deriving from another allocator.
Now you can let allocator_traits handle rebind.
Template the construct members on U. This helps if you want to use this allocator with some container that needs to allocate something other than a T (e.g. std::list).
Move the static_assert test into the single construct member where it is important.

You can still create a using:

template <class T>
using uninitialised_vector = std::vector<T, no_init_allocator<T>>;

And this still fails to compile:

unitialised_vector< std::vector<int> > x(10);

test.cpp:447:17: error: static_assert failed "This allocator can only be used with trivally default constructible types"
                static_assert(std::is_trivially_default_constructible<U>::value,
                ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I think the test for is_trivially_destructible is overkill, unless you also optimize destroy to do nothing. But I see no motivation in doing that since I believe it should get optimized anyway whenever appropriate. Without such a restriction you can:

class A
{
    int data_;
public:
    A() = default;
    A(int d) : data_(d) {}
};

int main()
{
    uninitialised_vector<A> v(10);
}

And it just works. But if you make ~A() non trivial:

    ~A() {std::cout << "~A(" << data_ << ")\n";}

Then, at least on my system, you get an error on construction:

test.cpp:447:17: error: static_assert failed "This allocator can only be used with trivally default constructible types"
                static_assert(std::is_trivially_default_constructible<U>::value,
                ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I.e. A is no longer trivially constructible if it has a non-trivial destructor.

However even with the non-trivial destructor you can still:

    uninitialised_vector<A> v;
    v.push_back(A());

This works, only because I didn't overreach with requiring a trivial destructor. And when executing this I get ~A() to run as expected:

~A(0)
~A(0)

Is this behavior of vector::resize(size_type n) under C++11 and Boost.Container correct?

Not an answer, but a lengthy addendum to Howard's: I use an allocator adapter that basically works the same as Howard's allocator, but is safer since

it only interposes on value-initialization and not all initializations,
it correctly default-initializes.

// Allocator adaptor that interposes construct() calls to
// convert value initialization into default initialization.
template <typename T, typename A=std::allocator<T>>
class default_init_allocator : public A {
  typedef std::allocator_traits<A> a_t;
public:
  template <typename U> struct rebind {
    using other =
      default_init_allocator<
        U, typename a_t::template rebind_alloc<U>
      >;
  };

  using A::A;

  template <typename U>
  void construct(U* ptr)
    noexcept(std::is_nothrow_default_constructible<U>::value) {
    ::new(static_cast<void*>(ptr)) U;
  }
  template <typename U, typename...Args>
  void construct(U* ptr, Args&&... args) {
    a_t::construct(static_cast<A&>(*this),
                   ptr, std::forward<Args>(args)...);
  }
};

Why does brace initialization in C++ solve the problem of initialization of STL containers?

std::vector<int> v(1,3,5)

It still isn't possible to do the above and that's because when you try to list initialize the standard expects that you use curly braces.

And for this

MyType obj1, obj2;
std::vector<MyType> v{ obj1, obj2 };
//std::vector<MyType> v = { obj1, obj2 }; //similar

if you consider the non-built in types it will become very obvious how list initialization works. In the case above the copy constructor comes into play twice for each element being initialized into the vector. First obj1 is copy constructed into a temporary and this temporary is then copied once again into the vector.

Consider the similarity in syntax to regular copy construction.

class Type {}

Type obj1;
Type obj2{obj1};
Type obj3 = {obj1};

STL vectors with uninitialized storage?

std::vector must initialize the values in the array somehow, which means some constructor (or copy-constructor) must be called. The behavior of vector (or any container class) is undefined if you were to access the uninitialized section of the array as if it were initialized.

The best way is to use reserve() and push_back(), so that the copy-constructor is used, avoiding default-construction.

Using your example code:

struct YourData {
    int d1;
    int d2;
    YourData(int v1, int v2) : d1(v1), d2(v2) {}
};

std::vector<YourData> memberVector;

void GetsCalledALot(int* data1, int* data2, int count) {
    int mvSize = memberVector.size();

    // Does not initialize the extra elements
    memberVector.reserve(mvSize + count);

    // Note: consider using std::generate_n or std::copy instead of this loop.
    for (int i = 0; i < count; ++i) {
        // Copy construct using a temporary.
        memberVector.push_back(YourData(data1[i], data2[i]));
    }
}

The only problem with calling reserve() (or resize()) like this is that you may end up invoking the copy-constructor more often than you need to. If you can make a good prediction as to the final size of the array, it's better to reserve() the space once at the beginning. If you don't know the final size though, at least the number of copies will be minimal on average.

In the current version of C++, the inner loop is a bit inefficient as a temporary value is constructed on the stack, copy-constructed to the vectors memory, and finally the temporary is destroyed. However the next version of C++ has a feature called R-Value references (T&&) which will help.

The interface supplied by std::vector does not allow for another option, which is to use some factory-like class to construct values other than the default. Here is a rough example of what this pattern would look like implemented in C++:

template <typename T>
class my_vector_replacement {

    // ...

    template <typename F>
    my_vector::push_back_using_factory(F factory) {
        // ... check size of array, and resize if needed.

        // Copy construct using placement new,
        new(arrayData+end) T(factory())
        end += sizeof(T);
    }

    char* arrayData;
    size_t end; // Of initialized data in arrayData
};

// One of many possible implementations
struct MyFactory {
    MyFactory(int* p1, int* p2) : d1(p1), d2(p2) {}
    YourData operator()() const {
        return YourData(*d1,*d2);
    }
    int* d1;
    int* d2;
};

void GetsCalledALot(int* data1, int* data2, int count) {
    // ... Still will need the same call to a reserve() type function.

    // Note: consider using std::generate_n or std::copy instead of this loop.
    for (int i = 0; i < count; ++i) {
        // Copy construct using a factory
        memberVector.push_back_using_factory(MyFactory(data1+i, data2+i));
    }
}

Doing this does mean you have to create your own vector class. In this case it also complicates what should have been a simple example. But there may be times where using a factory function like this is better, for instance if the insert is conditional on some other value, and you would have to otherwise unconditionally construct some expensive temporary even if it wasn't actually needed.

Performance Degradation Due to Default Initialisation of Elements in Standard Containers