How Is a Vector's Data Aligned

How is a vector's data aligned?

The C++ standard requires allocation functions (malloc() and operator new()) to return memory suitably aligned for any standard type. Because these functions don't receive the alignment requirement as an argument, in practice every allocation gets the same alignment: that of the standard type with the largest alignment requirement, which is often long double and/or long long (see boost max_align union).

Vector instructions, such as SSE and AVX, have stronger alignment requirements (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access) than that provided by the standard C++ allocation functions. posix_memalign() or memalign() can be used to satisfy such allocations with stronger alignment requirements.
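
A minimal sketch of the posix_memalign() route, assuming a POSIX system. The helper names alloc_aligned_floats and is_aligned are made up for illustration; only posix_memalign() and free() are real library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX, not ISO C++)

// Hypothetical helper: returns `count` floats whose first byte sits on an
// `alignment`-byte boundary, or nullptr on failure. posix_memalign requires
// the alignment to be a power of two and a multiple of sizeof(void*).
float* alloc_aligned_floats(std::size_t count, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void* p = nullptr;
    if (posix_memalign(&p, alignment, count * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float*>(p);
}

// Hypothetical helper: true if p is a multiple of `alignment`.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Memory obtained this way is released with free(), not delete. A 32-byte alignment is what a 256-bit aligned load expects.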

In C++17 the allocation functions accept an additional argument of type std::align_val_t.

You can make use of it like:

#include <immintrin.h>
#include <memory>
#include <new>

int main() {
    std::unique_ptr<__m256i[]> arr{new(std::align_val_t{alignof(__m256i)}) __m256i[32]};
}

Moreover, in C++17 the standard allocators have been updated to respect type's alignment, so you can simply do:

#include <immintrin.h>
#include <vector>

int main() {
    std::vector<__m256i> arr2(32);
}
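
To check the C++17 behaviour without needing AVX support, a sketch along these lines could be used. Lane8 is a hypothetical stand-in for __m256 (same size and alignment), and data_is_aligned is an illustrative helper, not a standard function. It assumes the code is compiled as C++17 or later.

```cpp
#include <cassert>   // for the assert-based usage described below
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for __m256: 32 bytes, 32-byte aligned.
struct alignas(32) Lane8 {
    float v[8];
};

// True if the vector's buffer starts on an alignof(T) boundary
// (an empty vector with a null data() pointer also counts as aligned).
template <class T>
bool data_is_aligned(const std::vector<T>& vec) {
    return reinterpret_cast<std::uintptr_t>(vec.data()) % alignof(T) == 0;
}
```

With a C++17 standard library, assert(data_is_aligned(arr)) should hold for a std::vector<Lane8> arr(32). Before C++17 the result depended on what the default operator new happened to return.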

Or (no heap allocation involved and supported in C++11):

#include <immintrin.h>
#include <array>

int main() {
    std::array<__m256i, 32> arr3;
}

How is a Vector of Vector aligned in memory?

The size of the vector<int> object stored in ref is constant. Common implementations make it three pointers, or around 12 bytes on 32-bit architectures and 24 bytes on shiny new 64-bit architectures.

So ref manages roughly ref.capacity() * sizeof(vector<int>) bytes of contiguous storage (ref.capacity() * 12 on a 32-bit architecture).

Each element/vector<int> in ref manages its own integers independent of the elements ref manages. In the artistic rendering below ref.size() == ref.capacity() for the sake of simplicity.

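[Figure: ref drawn as a top row of vector<int> elements, each pointing down to its own separately allocated column of ints]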

So your

ref.resize(i);

only affects the top row. Your

ref[i].push_back(23);

only affects the i-th column.
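
The row/column picture can be checked with a small sketch. Both function names are invented for illustration. The first checks that the outer elements sit sizeof(std::vector<int>) bytes apart, the second that growing one inner vector does not move the outer array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if consecutive outer elements are exactly sizeof(std::vector<int>)
// bytes apart, i.e. the "top row" is one packed array of small handles.
bool outer_is_packed(const std::vector<std::vector<int>>& ref) {
    for (std::size_t i = 1; i < ref.size(); ++i) {
        const char* prev = reinterpret_cast<const char*>(&ref[i - 1]);
        const char* curr = reinterpret_cast<const char*>(&ref[i]);
        if (static_cast<std::size_t>(curr - prev) != sizeof(std::vector<int>))
            return false;
    }
    return true;
}

// Pushes 1000 ints into ref[i] and reports whether the outer array stayed
// at the same address (only the i-th "column" should have been reallocated).
bool inner_growth_leaves_outer_alone(std::vector<std::vector<int>>& ref,
                                     std::size_t i) {
    assert(i < ref.size());
    const std::vector<int>* before = ref.data();
    for (int k = 0; k < 1000; ++k) ref[i].push_back(23);
    return ref.data() == before;
}
```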

Making std::vector allocate aligned memory

Starting in C++17, just use std::vector<__m256i>, or std::vector of any other over-aligned type. There's an aligned version of operator new, and std::allocator uses it for over-aligned types (as does a plain new-expression, so new __m256i[N] is also safe starting in C++17).

There's a comment by @MarcGlisse saying this, making this an answer to make it more visible.

If T is aligned, is std::vector<T> aligned too?

C++ default allocators are required to return memory properly aligned for any so-called standard type, and the padding automatically added at the end of a struct (visible via sizeof()) generally keeps every element aligned in contiguous allocations.

struct C {
  uint8_t  a; // followed by 7B of invisible padding to naturally align b
  uint64_t b;
  uint32_t c;
  uint8_t  d; // followed by 3B padding for C (natural alignment of 8B due to b)
};
// sizeof(C) = 24B, alignof(C) = 8B

struct D {
  uint8_t  a; // followed by 3B padding for b
  uint32_t b;
  uint8_t  c; // followed by 3B padding for D (natural alignment of 4B due to b)
};
// sizeof(D) = 12B, alignof(D) = 4B

struct E {
  __m256 v; // SSE/AVX intrinsics handle natural alignment properly too
  char v2;
};
// sizeof(E) = 64B, alignof(E) = 32B
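
The padding described in those comments can be measured rather than counted by hand. The following is a sketch using offsetof. The tail_padding_* names are illustrative, and the concrete numbers assume a typical 64-bit ABI where uint64_t is 8-byte aligned.

```cpp
#include <cassert>   // for the assert-based usage described below
#include <cstddef>   // offsetof, std::size_t
#include <cstdint>

struct C {
    std::uint8_t  a;
    std::uint64_t b;
    std::uint32_t c;
    std::uint8_t  d;
};

struct D {
    std::uint8_t  a;
    std::uint32_t b;
    std::uint8_t  c;
};

// Tail padding = total size minus the end of the last member.
constexpr std::size_t tail_padding_C =
    sizeof(C) - (offsetof(C, d) + sizeof(std::uint8_t));
constexpr std::size_t tail_padding_D =
    sizeof(D) - (offsetof(D, c) + sizeof(std::uint8_t));

// Holds on every implementation: the size is a multiple of the alignment,
// which is what keeps element [1] of an array as well aligned as element [0].
static_assert(sizeof(C) % alignof(C) == 0, "size must be a multiple of alignment");
static_assert(sizeof(D) % alignof(D) == 0, "size must be a multiple of alignment");
```

On such an ABI, offsetof(C, b) is 8 and both tail paddings come out as 3, matching the comments in the structs above.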

For most cases this is adequate, but if you are doing fancy casting tricks or need 64B cache-line alignment, etc., you can use alignas(), provided you are using C++11 or newer. This works partly by padding the end of the structure too:

#include <iostream>

struct alignas(64) F {
  double stuff[3];
};
// sizeof(F) = 64B, alignof(F) = 64B

void foo() {
  F f[4];
  // these addresses are separated by (and are even multiples of) 0x40 bytes:
  std::cout << &f[0] << " " << &f[1] << " " << &f[2] << std::endl;
}

Use std::aligned_storage<Len, Align> (deprecated as of C++23; an alignas-qualified array of unsigned char does the same job) if you need a large block aligned to, e.g., 4 KiB page boundaries. But then you're on your own with placement new in general and lose the convenience of std::vector<> doing everything for you.
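
A rough sketch of what "on your own with placement new" looks like. It swaps in a plain alignas byte array instead of std::aligned_storage. Sample, kPageSize, page and construct_in_page are hypothetical names, and the 4096-byte page size is an assumption. Support for such a large (extended) alignment is implementation-defined, though mainstream compilers accept it for static storage.

```cpp
#include <cassert>   // for the assert-based usage described below
#include <cstddef>
#include <cstdint>
#include <new>       // placement new

// Hypothetical payload type; trivially destructible, so no manual
// destructor call is needed before the storage is reused.
struct Sample {
    double x;
    int id;
};

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Raw, page-aligned storage with static lifetime.
alignas(kPageSize) static unsigned char page[kPageSize];

// Constructs a Sample at the start of the page and returns a pointer to it.
Sample* construct_in_page(double x, int id) {
    static_assert(sizeof(Sample) <= kPageSize, "Sample must fit in the page");
    return new (page) Sample{x, id};
}
```

The returned pointer should be a multiple of 4096. For a type with a non-trivial destructor you would also have to call the destructor explicitly.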

C++ struct alignment and STL vectors

The standard requires you to be able to create an array of a struct type. When you do so, the array is required to be contiguous. That means, whatever size is allocated for the struct, it has to be one that allows you to create an array of them. To ensure that, the compiler can allocate extra space inside the structure, but cannot require any extra space between the structs.

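The space for the data in a vector is (normally) allocated with ::operator new (via an Allocator class), and ::operator new is required to allocate space that's properly aligned to store any type with fundamental alignment (since C++17, over-aligned types are handled by the std::align_val_t overloads).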

You could supply your own Allocator and/or overload ::operator new -- but if you do, your version is still required to meet the same requirements, so it won't change anything in this respect.

In other words, exactly what you want is required to work as long as the data in the file was created in essentially the same way you're trying to read it back in. If it was created on another machine or with a different compiler (or even the same compiler with different flags) you have a fair number of potential problems -- you might get differences in endianness, padding in the struct, and so on.

Edit: Given that you don't know whether the structs have been written out in the format expected by the compiler, you not only need to read the structs one at a time -- you really need to read the items in the structs one at a time, then put each into a temporary struct, and finally add that filled-in struct to your collection.

Fortunately, you can overload operator>> to automate most of this. It doesn't improve speed (for example), but it can keep your code cleaner:

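#include <algorithm>
#include <cassert>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

struct whatever {
    int x, y, z;
    char stuff[672-3*sizeof(int)];

    // Reads one record field by field. The formatted >> extraction assumes
    // x, y and z are stored as text; stuff is read as raw bytes.
    friend std::istream &operator>>(std::istream &is, whatever &w) {
       is >> w.x >> w.y >> w.z;
       return is.read(w.stuff, sizeof(w.stuff));
    }
};

int main(int argc, char **argv) {
    std::vector<whatever> data;

    assert(argc>1);

    std::ifstream infile(argv[1]);

    // Each dereference of the istream_iterator calls operator>> above.
    std::copy(std::istream_iterator<whatever>(infile),
              std::istream_iterator<whatever>(),
              std::back_inserter(data));
    return 0;
}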

What is the byte alignment of the elements in a std::vector<char>?

The elements of the container have at least the alignment required for them in that implementation: if int is 4-aligned in your implementation, then each element of a vector<int> is an int and therefore is 4-aligned. I say "if" because there's a difference between size and alignment requirements: just because int has size 4 doesn't necessarily mean that it must be 4-aligned, as far as the standard is concerned. It's very common, though, since int is usually the word size of the machine, and most machines have advantages for memory access on word boundaries. So it makes sense to align int even if it's not strictly necessary. On x86, for example, you can do unaligned word-sized memory access, but it can be slower than aligned access. On some ARM cores (older ones in particular), unaligned word operations are not allowed and typically fault.

vector guarantees contiguous storage, so there won't be any "padding" in between the first and second element of a vector<char>, if that's what you're concerned about. The specific requirement for std::vector is that for 0 <= n < vec.size(), &vec[n] == &vec[0] + n.
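
The contiguity requirement is easy to spot-check. In this sketch is_contiguous is an illustrative helper name, not a standard function.

```cpp
#include <cassert>   // for the assert-based usage described below
#include <cstddef>
#include <vector>

// True if every element sits exactly n elements after the first one,
// i.e. &vec[n] == &vec[0] + n for all valid n.
bool is_contiguous(const std::vector<char>& vec) {
    for (std::size_t n = 0; n < vec.size(); ++n) {
        if (&vec[n] != &vec[0] + n) return false;
    }
    return true;
}
```

For a conforming std::vector<char> this always returns true, so consecutive chars are exactly one byte apart.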

[Edit: this bit is now irrelevant, the questioner has disambiguated: The container itself will usually have whatever alignment is required for a pointer, regardless of what the value_type is. That's because the vector itself would not normally incorporate any elements, but will have a pointer to some dynamically-allocated memory with the elements in that. This isn't explicitly required, but it's a predictable implementation detail.]

Every object in C++ is 1-aligned; the only things that aren't are bitfields and the elements of the borderline-crazy special case that is vector<bool>. So you can rest assured that your hope for std::vector<char> is well-founded. Both the vector and its first element will probably also be 4-aligned ;-)

As for how they get aligned - the same way anything in C++ gets aligned. When memory is allocated from the heap, it is required to be aligned sufficiently for any object that can fit into the allocation. When objects are placed on the stack, the compiler is responsible for designing the stack layout. The calling convention will specify the alignment of the stack pointer on function entry, then the compiler knows the size and alignment requirement of each object it lays down, so it knows whether the stack needs any padding to bring the next object to the correct alignment.

Memory alignment of Armadillo vectors vec/fvec

The Armadillo documentation does not seem to address this point, so it is left unspecified. Thus, vector data are probably not guaranteed to be 32-byte aligned.

However, you do not need vector data to be aligned to load them in AVX registers: you can use the unaligned load intrinsic _mm256_loadu_ps. AFAIK, the performance of _mm256_load_ps and _mm256_loadu_ps is about the same on relatively-new x86 processors.
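
As an illustration, here is a sketch of a sum over a float buffer of unknown alignment using _mm256_loadu_ps. The function name sum_floats is made up. The AVX path is only compiled when the compiler defines __AVX__ (e.g. with -mavx); otherwise the plain scalar loop does all the work. The pointer would come from wherever the data lives (for Armadillo, presumably its raw-memory accessor).

```cpp
#include <cassert>   // for the assert-based usage described below
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Sums n floats starting at p; p needs no particular alignment.
float sum_floats(const float* p, std::size_t n) {
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        // Unaligned load: valid for any address.
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    }
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);  // lanes is 32-byte aligned, so store is fine
    for (float x : lanes) total += x;
#endif
    for (; i < n; ++i) total += p[i];  // scalar tail (or whole array without AVX)
    return total;
}
```

Passing a deliberately offset pointer such as data + 1 exercises the unaligned case.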

How to create std::vector of char/std::byte where first byte is aligned to 16 bytes, but there is no padding?

Before C++17, aligning the data in a vector isn't provided by default, not even for aligned classes.

The best way of doing alignment is with the aligned_allocator of boost.

Unfortunately, it doesn't prevent padding; it even over-allocates so that it can adjust the pointer to the alignment. From C++17, it can use aligned new (see the std::align_val_t overloads). However, all implementations I've seen actually use the same trick.

An alternative is allocating a whole page at once and doing your own memory management with a custom allocator. You can do it, though it might take a lot of time to do correctly.

Usage of alignas in template argument of std::vector

If alignas(32)double compiled, it would require that each element separately had 32-byte alignment, i.e. pad each double out to 32 bytes, completely defeating SIMD. (I don't think it will compile, but similar things with GNU C typedef double da __attribute__((aligned(32))) do compile that way, with sizeof(da) == 32.)

See Modern approach to making std::vector allocate aligned memory for working code.

As of C++17, std::vector<__m256d> would work, but is usually not what you want because it makes scalar access a pain.

C++ sucks for this in my experience, although there might be a standard (or Boost) allocator that takes an over-alignment you can use as the second (usually defaulted) template param.

std::vector<double, some_aligned_allocator<32> > still isn't type-compatible with normal std::vector, which makes sense because any function that might reallocate it has to maintain alignment. But unfortunately that makes it not type-compatible even for passing to functions that only want read-only access to a std::vector of double elements.
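
For illustration, a minimal sketch of what such an allocator could look like, assuming C++17 and its aligned operator new. AlignedAllocator is a hypothetical name, not a standard or Boost class, and the sketch skips overflow checking of n * sizeof(T).

```cpp
#include <cassert>   // for the assert-based usage described below
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator: every block it hands out is Align-byte aligned.
template <class T, std::size_t Align>
struct AlignedAllocator {
    static_assert((Align & (Align - 1)) == 0, "Align must be a power of two");
    using value_type = T;

    // An explicit rebind is needed because Align is a non-type parameter,
    // which the default rebind in std::allocator_traits cannot handle.
    template <class U>
    struct rebind { using other = AlignedAllocator<U, Align>; };

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Align>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Align}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Align});
    }
};

// Stateless, so any two instances are interchangeable.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept {
    return true;
}
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept {
    return false;
}
```

A std::vector<double, AlignedAllocator<double, 32>> then keeps scalar element access while its data() pointer stays 32-byte aligned across reallocations. As noted above, it is still a different type from std::vector<double>.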

Cost of misalignment

For a lot of cases the misalignment is only a couple percent worse than aligned, for AVX/AVX2 loops over an array if data's coming from L3 cache or RAM (on recent Intel CPUs); only with 64-byte vectors do you get a significantly bigger penalty (like 15% or so even when memory bandwidth is still the bottleneck.) You'd hope that the CPU core would have time to deal with it and keep the same number of outstanding off-core transactions in flight. But it doesn't.

For data hot in L1d, misalignment could hurt more even with 32-byte vectors.

In x86-64 code, alignof(max_align_t) is 16 on mainstream C++ implementations, so in practice even a vector<double> will end up aligned by 16 at least, because the underlying allocator used by new always aligns at least that much. But that's very often an odd multiple of 16, at least on GNU/Linux. Glibc's allocator (also used by malloc) for large allocations uses mmap to get a whole range of pages, but it reserves the first 16 bytes for bookkeeping info. This is unfortunate for AVX and AVX-512 because it means your arrays are always misaligned unless you use aligned allocations. (See: How to solve the 32-byte-alignment issue for AVX load/store operations?)
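
To see what alignment a given buffer actually ended up with, a small helper like the following can be used. address_alignment is an illustrative name. It returns the largest power of two that divides the address.

```cpp
#include <cassert>   // for the assert-based usage described below
#include <cstddef>
#include <cstdint>

// Largest power of two dividing the address of p (0 for a null pointer).
// An "odd multiple of 16" shows up as exactly 16.
std::size_t address_alignment(const void* p) {
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
    if (a == 0) return 0;
    return static_cast<std::size_t>(a & (~a + 1));  // isolate lowest set bit
}
```

Calling it on v.data() of a large std::vector<double> is one way to observe the behaviour described above. The result depends on the allocator in use.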

Mainstream std::vector implementations are also inefficient when they have to grow: C++ doesn't provide a realloc equivalent that's compatible with new/delete, so it always has to allocate more space and copy to the start. Never even trying to allocate more space contiguous with the existing mapping (which would be safe even for non-trivially-copyable types), and not using implementation-specific tricks like Linux mremap to map the same physical pages to a different virtual address without having to copy all those mega/gigabytes. The fact that C++ allows code to redefine operator new means library implementations of std::vector can't just use a better allocator, either. All of this is a non-problem if you .reserve the size you're going to need, but it is pretty dumb.
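
The effect of .reserve can be made visible by counting capacity changes while pushing elements. count_reallocations is a hypothetical helper written for this sketch.

```cpp
#include <cassert>   // for the assert-based usage described below
#include <cstddef>
#include <vector>

// Pushes n ints and counts how often the capacity changed, which happens
// exactly when the vector allocated a new buffer and moved its elements.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set the count is zero. Without it, the exact number depends on the implementation's growth factor.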
