Making std::vector allocate aligned memory

Starting in C++17, just use std::vector<__m256i>, or a vector of any other over-aligned type. C++17 added an aligned version of operator new, and std::allocator uses it for over-aligned types (as does a plain new-expression, so new __m256i[N] is also safe starting in C++17).

@MarcGlisse said this in a comment; I'm posting it as an answer to make it more visible.

Usage of alignas in template argument of std::vector

If alignas(32)double compiled, it would require that each element separately had 32-byte alignment, i.e. pad each double out to 32 bytes, completely defeating SIMD. (I don't think it will compile, but similar things with GNU C typedef double da __attribute__((aligned(32))) do compile that way, with sizeof(da) == 32.)
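To make the padding problem concrete, here is a small sketch. It does not use the GNU typedef from the paragraph above; it assumes a hypothetical wrapper struct, PaddedDouble, because alignas on a class type is standard C++11 and shows the same effect. The helper names are illustrative only.

```cpp
#include <cstddef>

// Hypothetical wrapper: every object of this type must start on a 32-byte boundary.
struct alignas(32) PaddedDouble {
    double value;
};

// sizeof is always a multiple of alignof, so the 8-byte payload is padded out to 32 bytes.
static_assert(alignof(PaddedDouble) == 32, "each element is 32-byte aligned");
static_assert(sizeof(PaddedDouble) == 32, "each element occupies a full 32 bytes");

// Bytes occupied by n contiguous elements, as in an array or a std::vector.
constexpr std::size_t padded_bytes(std::size_t n) { return n * sizeof(PaddedDouble); }
constexpr std::size_t packed_bytes(std::size_t n) { return n * sizeof(double); }
```

With this layout a 32-byte vector load would pick up one useful double plus padding instead of several packed doubles, which is the opposite of what SIMD code wants. What is needed is alignment of the start of the buffer, not of every element.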

See Modern approach to making std::vector allocate aligned memory for working code.

As of C++17, std::vector<__m256d> would work, but is usually not what you want because it makes scalar access a pain.

C++ sucks for this in my experience, although there might be a standard (or Boost) allocator that takes an over-alignment you can use as the second (usually defaulted) template param.

std::vector<double, some_aligned_allocator<32> > still isn't type-compatible with a normal std::vector, which makes sense because any function that might reallocate it has to maintain alignment. Unfortunately, that also makes it type-incompatible for passing to functions that only want read-only access to a std::vector of double elements.
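As a rough illustration of what such an allocator could look like, here is a minimal sketch that assumes C++17 and a global operator new that has not been replaced. AlignedAllocator, AlignedDoubles, sum and data_is_aligned are hypothetical names, not a standard or Boost API. The sum function shows one way around the type incompatibility for read-only access: take a pointer and a length instead of a specific vector type.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <new>
#include <vector>

// Hypothetical minimal allocator: every buffer starts on an Align-byte boundary.
template <typename T, std::size_t Align>
struct AlignedAllocator {
    static_assert(Align >= alignof(T), "Align must not weaken T's own alignment");
    static_assert((Align & (Align - 1)) == 0, "Align must be a power of two");

    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Align>&) noexcept {}

    // Needed because the default rebind cannot see through a non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Align>; };

    T* allocate(std::size_t n) {
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_array_new_length();
        // C++17 aligned form of the global allocation function.
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Align}));
    }

    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Align});
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

using AlignedDoubles = std::vector<double, AlignedAllocator<double, 32>>;

// Read-only consumer that accepts data from either vector type.
double sum(const double* data, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

bool data_is_aligned(const AlignedDoubles& v) {
    return reinterpret_cast<std::uintptr_t>(v.data()) % 32 == 0;
}
```

Elements stay packed at sizeof(double) apart; only the start of the buffer is over-aligned, and it stays that way across reallocation because every allocation goes through the same allocator.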

Cost of misalignment

In many cases misalignment costs only a couple of percent for AVX/AVX2 loops over an array when the data is coming from L3 cache or RAM (on recent Intel CPUs). Only with 64-byte vectors do you get a significantly bigger penalty, around 15% or so, even when memory bandwidth is still the bottleneck. You'd hope that the CPU core would have time to deal with the misalignment and keep the same number of outstanding off-core transactions in flight, but it doesn't.

For data hot in L1d, misalignment could hurt more even with 32-byte vectors.

In x86-64 code, alignof(max_align_t) is 16 on mainstream C++ implementations, so in practice even a vector<double> will end up aligned by at least 16, because the underlying allocator used by new always aligns at least that much. But the address is very often an odd multiple of 16, at least on GNU/Linux. Glibc's allocator (also used by malloc) uses mmap for large allocations to get a whole range of pages, but it reserves the first 16 bytes for bookkeeping info. This is unfortunate for AVX and AVX-512 because it means your arrays are always misaligned unless you use aligned allocations. (How to solve the 32-byte-alignment issue for AVX load/store operations?)
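To check what alignment a given buffer actually got, a small helper like the following can be used. This is only a sketch; pointer_alignment is a hypothetical name, and it reports the largest power of two that divides the address.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the largest power of two that divides the address, or 0 for a null pointer.
std::size_t pointer_alignment(const void* p) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    if (addr == 0) return 0;  // a null pointer has no meaningful alignment
    // addr & -addr isolates the lowest set bit (written with ~ to stay in unsigned arithmetic).
    return static_cast<std::size_t>(addr & (~addr + 1));
}
```

Calling pointer_alignment(v.data()) on a large std::vector<double> shows whether the buffer landed on a 16-, 32- or 64-byte boundary. The result depends on the allocator and platform, so it is best measured rather than assumed.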

Mainstream std::vector implementations are also inefficient when they have to grow: C++ doesn't provide a realloc equivalent that's compatible with new/delete, so a vector always has to allocate new space and copy its contents there. It never even tries to allocate more space contiguous with the existing mapping (which would be safe even for non-trivially-copyable types), and it doesn't use implementation-specific tricks like Linux mremap to map the same physical pages to a different virtual address without copying all those megabytes or gigabytes. The fact that C++ allows code to redefine operator new means library implementations of std::vector can't just use a better allocator, either. All of this is a non-problem if you .reserve the size you're going to need, but it is pretty dumb.
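The effect of .reserve can be observed with a short sketch like this one. count_reallocations is a hypothetical helper that counts how often the buffer address changes while elements are appended.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times the vector's buffer moved while appending n elements.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<double> v;
    if (reserve_first) v.reserve(n);  // one allocation up front

    std::size_t moves = 0;
    const double* last = v.data();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<double>(i));
        if (v.data() != last) {  // the buffer was reallocated and the elements copied
            ++moves;
            last = v.data();
        }
    }
    return moves;
}
```

With reserve_first set, the standard guarantees that no reallocation happens as long as the size stays within the reserved capacity, so the count is zero. Without it, the count depends on the implementation's growth factor.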

Is it possible to have a std::vector<char> allocate memory with a chosen memory alignment

I could solve my issue with a custom allocator.

Example with boost::alignment::aligned_allocator

#include <vector>
#include <boost/align/aligned_allocator.hpp>

template <typename T>
using aligned_vector = std::vector<T, boost::alignment::aligned_allocator<T, 16>>;
// 16 bytes aligned allocation

See also How is a vector's data aligned?.

How is a vector's data aligned?

The C++ standard requires allocation functions (malloc() and operator new()) to allocate memory suitably aligned for any standard type. Before C++17 these functions didn't receive the alignment requirement as an argument, so in practice the alignment for all allocations is the same: that of the standard type with the largest alignment requirement, which is often long double and/or long long (see the boost max_align union).

The aligned load and store instructions of vector extensions such as SSE and AVX have stronger alignment requirements (16-byte alignment for 128-bit access and 32-byte alignment for 256-bit access) than the standard C++ allocation functions provide. posix_memalign() or memalign() can be used to satisfy allocations with these stronger alignment requirements.
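A minimal sketch of the posix_memalign() route, assuming a POSIX system. FreeDeleter, AlignedDoubles and make_aligned_doubles are hypothetical names. Memory from posix_memalign() must be released with free(), so the sketch wraps it in a std::unique_ptr with a matching deleter.

```cpp
#include <stdlib.h>  // posix_memalign, free (POSIX)

#include <cstddef>
#include <cstdint>
#include <memory>

// Deleter that releases memory obtained from posix_memalign().
struct FreeDeleter {
    void operator()(void* p) const noexcept { free(p); }
};

using AlignedDoubles = std::unique_ptr<double[], FreeDeleter>;

// Allocates count uninitialized doubles on an alignment-byte boundary.
// alignment must be a power of two and a multiple of sizeof(void*).
// Returns an empty pointer on failure.
AlignedDoubles make_aligned_doubles(std::size_t count, std::size_t alignment) {
    void* raw = nullptr;
    if (posix_memalign(&raw, alignment, count * sizeof(double)) != 0) {
        return AlignedDoubles{};
    }
    return AlignedDoubles{static_cast<double*>(raw)};
}
```

This gives a raw aligned array rather than a std::vector, so there is no size tracking or growth. It fits best when the buffer size is known up front.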

In C++17 the allocation functions accept an additional argument of type std::align_val_t.

You can make use of it like:

#include <immintrin.h>
#include <memory>
#include <new>

int main() {
    std::unique_ptr<__m256i[]> arr{new(std::align_val_t{alignof(__m256i)}) __m256i[32]};
}

Moreover, in C++17 the standard allocators have been updated to respect the type's alignment, so you can simply do:

#include <immintrin.h>
#include <vector>

int main() {
    std::vector<__m256i> arr2(32);
}

Or (no heap allocation involved and supported in C++11):

#include <immintrin.h>
#include <array>

int main() {
    std::array<__m256i, 32> arr3;
}

Aligned allocation of elements in vector

In practice, I see that it is true in Visual Studio 2019, and in gcc 8+. But can I be absolutely sure, or is it just a coincidence and some custom allocator in std::vector (like boost::alignment::aligned_allocator) is necessary?

There is no reason to expect that it is a coincidence, provided there are no bugs in the implementation of the respective compiler (which can be checked at the assembly level, if required).

Since C++11, the alignas specifier lets you enforce alignment in a standardized way. Since C++17, the standard allocator forwards that alignment information to operator new when allocator::allocate() is called, according to the documentation. Thus, the standard allocator already respects alignment needs, if specified. However, if the global operator new is replaced by a custom implementation, no such guarantee can be made.
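One way to convince yourself on a given toolchain is to check the element addresses directly. The following is a sketch that assumes C++17 and an un-replaced global operator new. CacheLine and all_elements_aligned are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical over-aligned element type: 64 bytes, aligned to 64 bytes.
struct alignas(64) CacheLine {
    float lanes[16];
};

// Grows a vector one element at a time (forcing reallocations) and
// checks that every element ends up on a 64-byte boundary.
bool all_elements_aligned(std::size_t n) {
    std::vector<CacheLine> v;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(CacheLine{});
    }
    for (const CacheLine& element : v) {
        if (reinterpret_cast<std::uintptr_t>(&element) % alignof(CacheLine) != 0) {
            return false;
        }
    }
    return true;
}
```

Because sizeof(CacheLine) is a multiple of its alignment, it is enough for the start of the buffer to be aligned; the loop over all elements just makes that visible.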

How to make tr1::array allocate aligned memory?

tr1::array (and std::array and boost::array) are aggregates that hold their elements directly (and are POD when the element type is), so the memory occupied by the contents is coincident with the memory of the array object. So, allocate the storage however you need to, and construct the array in it with placement new.

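// aligned_allocation() is a placeholder for whatever aligned allocator you use.
typedef std::tr1::array< MyClass, ary_sz > AryT;
void *array_storage = aligned_allocation( sizeof( AryT ) );
// initial_value is assumed here to be an existing AryT to copy from.
AryT *ary = new( array_storage ) AryT( initial_value );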
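For comparison, a self-contained C++17 version of the same idea might look like the sketch below. It swaps in std::array and the aligned form of operator new for the placeholders above. Array, kAlign and placement_array_is_aligned are hypothetical names, and the check relies on data() pointing at the start of the array object, which holds on mainstream implementations.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <new>

using Array = std::array<double, 8>;
constexpr std::size_t kAlign = 64;  // assumed target alignment for this sketch

// Builds a std::array in 64-byte-aligned storage, checks it, and cleans up.
bool placement_array_is_aligned() {
    // Raw storage from the C++17 aligned allocation function.
    void* storage = ::operator new(sizeof(Array), std::align_val_t{kAlign});

    // Construct the array in that storage; {} zero-initializes the elements.
    Array* arr = new (storage) Array{};

    const bool ok =
        reinterpret_cast<std::uintptr_t>(arr->data()) % kAlign == 0 &&
        (*arr)[0] == 0.0;

    // Placement new needs manual cleanup: destroy, then release with the matching delete.
    arr->~Array();
    ::operator delete(storage, std::align_val_t{kAlign});
    return ok;
}
```

The destructor call and the aligned operator delete are the part that is easy to forget; wrapping them in a small RAII type would be the next step in real code.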

How to create std::vector of char/std::byte where first byte is aligned to 16 bytes, but there is no padding?

Aligning the data in a vector ain't provided by default. Not even for aligned classes.

The best way of doing alignment is with the aligned_allocator of boost.

Unfortunately, it doesn't prevent padding; it even overallocates so it can adjust the pointer to the alignment. From C++17, it can use aligned new (see the std::align_val_t overloads). However, all implementations I've seen actually use the same trick.

An alternative is allocating a whole page at once and doing your own memory management with a custom allocator. You can do it, though it might take a lot of time to do correctly.
