How to solve the 32-byte-alignment issue for AVX load/store operations?
Yes, you can use _mm256_loadu_ps
/ storeu
for unaligned loads/stores (AVX: data alignment: store crash, storeu, load, loadu doesn't). If the compiler doesn't do a bad job (cough GCC default tuning), AVX _mm256_loadu
/storeu
on data that happens to be aligned is just as fast as alignment-required load/store, so aligning data when convenient still gives you the best of both worlds for functions that normally run on aligned data but let hardware handle the rare cases where they don't. (Instead of always running extra instructions to check stuff).
Alignment is especially important for 512-bit AVX-512 vectors, like 15 to 20% speed on SKX even over large arrays where you'd expect L3 / DRAM bandwidth to be the bottleneck, vs. a few percent with AVX2 CPUs for large arrays. (It can still matter significantly with AVX2 on modern CPUs if your data is hot in L2 or especially L1d cache, especially if you can come close to maxing out 2 loads and/or 1 store per clock. Cache-line splits cost about twice the throughput resources, plus needing a line-split buffer temporarily.)
The standard allocators normally only align to alignof(max_align_t)
, which is often 16B, e.g. long double
in the x86-64 System V ABI. But in some 32-bit ABIs it's only 8B, so it's not even sufficient for dynamic allocation of aligned __m128
vectors and you'll need to go beyond simply calling new
or malloc
.
Static and automatic storage are easy: use alignas(32) float arr[N];
C++17 provides aligned new
for aligned dynamic allocation. If alignof
for a type is greater than the standard alignment, then aligned operator new
/operator delete
are used. So new __m256[N]
just works in C++17 (if compiler supports this C++17 feature; check __cpp_aligned_new
feature macro). In practice, GCC / clang / MSVC / ICX support it, ICC 2021 doesn't.
Without that C++17 feature, even stuff like std::vector<__m256>
will break, not just std::vector<int>
, unless you get lucky and it happens to be aligned by 32.
Plain-delete
compatible allocation of a float
/ int
array:
Unfortunately, auto* arr = new alignas(32) float[numSteps]
does not work for all compilers, as alignas
is applicable to a variable, a member, or a class declaration, but not as type modifier. (GCC accepts using vfloat = alignas(32) float;
, so this does give you an aligned new that's compatible with ordinary delete
on GCC).
Workarounds are either wrapping in a structure (struct alignas(32) s { float v; }; new s[numSteps];
) or passing alignment as placement parameter (new (std::align_val_t(32)) float[numSteps];
), in later case be sure to call matching aligned operator delete
.
See documentation for new
/new[]
and std::align_val_t
Other options, incompatible with new
/delete
Other options for dynamic allocation are mostly compatible with malloc
/free
, not new
/delete
:
std::aligned_alloc
: ISO C++17. major downside: size must be a multiple of alignment. This braindead requirement makes it inappropriate for allocating a 64B cache-line aligned array of an unknown number offloat
s, for example. Or especially a 2M-aligned array to take advantage of transparent hugepages.The C version of
aligned_alloc
was added in ISO C11. It's available in some but not all C++ compilers. As noted on the cppreference page, the C11 version wasn't required to fail when size isn't a multiple of alignment (it's undefined behaviour), so many implementations provided the obvious desired behaviour as an "extension". Discussion is underway to fix this, but for now I can't really recommendaligned_alloc
as a portable way to allocate arbitrary-sized arrays. In practice some implementations work fine in the UB / required-to-fail cases so it can be a good non-portable option.Also, commenters report it's unavailable in MSVC++. See best cross-platform method to get aligned memory for a viable
#ifdef
for Windows. But AFAIK there are no Windows aligned-allocation functions that produce pointers compatible with standardfree
.posix_memalign
: Part of POSIX 2001, not any ISO C or C++ standard. Clunky prototype/interface compared toaligned_alloc
. I've seen gcc generate reloads of the pointer because it wasn't sure that stores into the buffer didn't modify the pointer. (posix_memalign
is passed the address of the pointer, defeating escape analysis.) So if you use this, copy the pointer into another C++ variable that hasn't had its address passed outside the function.
#include <stdlib.h>
int posix_memalign(void **memptr, size_t alignment, size_t size); // POSIX 2001
void *aligned_alloc(size_t alignment, size_t size); // C11 (and ISO C++17)
_mm_malloc
: Available on any platform where_mm_whatever_ps
is available, but you can't pass pointers from it tofree
. On many C and C++ implementations_mm_free
andfree
are compatible, but it's not guaranteed to be portable. (And unlike the other two, it will fail at run-time, not compile time.) On MSVC on Windows,_mm_malloc
uses_aligned_malloc
, which is not compatible withfree
; it crashes in practice.Directly use system calls like
mmap
orVirtualAlloc
. Appropriate for large allocations, and the memory you get is by definition page-aligned (4k, and perhaps even 2M largepage). Not compatible withfree
; you of course have to usemunmap
orVirtualFree
which need the size as well as address. (For large allocations you usually want to hand memory back to the OS when you're done, rather than manage a free-list; glibc malloc uses mmap/munmap directly for malloc/free of blocks over a certain size threshold.)Major advantage: you don't have to deal with C++'s and C's braindead refusal provide grow/shrink facilities for aligned allocators. If you want space for another 1MiB after your allocation, you can even use Linux's
mremap(MREMAP_MAYMOVE)
to let it pick a different place in virtual address space (if needed) for the same physical pages, without having to copy anything. Or if it doesn't have to move, the TLB entries for the currently in use part stay valid.And since you're using OS system calls anyway (and know you're working with whole pages), you can use
madvise(MADV_HUGEPAGE)
to hint that transparent hugepages are preferred, or that they're not, for this range of anonymous pages. You can also use allocation hints withmmap
e.g. for the OS to prefault the zero pages, or if mapping a file on hugetlbfs, to use 2M or 1G pages. (If that kernel mechanism still works).And with
madvise(MADV_FREE)
, you can keep it mapped, but let the kernel reclaim the pages as memory pressure occurs, making it like lazilly allocated zero-backed pages if that happens. So if you do reuse it soon, you may not suffer fresh page faults. But if you don't, you're not hogging it, and when you do read it, it's like a freshly mmapped region.
alignas()
with arrays / structs
In C++11 and later: use alignas(32) float avx_array[1234]
as the first member of a struct/class member (or on a plain array directly) so static and automatic storage objects of that type will have 32B alignment. std::aligned_storage
documentation has an example of this technique to explain what std::aligned_storage
does.
This doesn't actually work until C++17 for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>
), see Making std::vector allocate aligned memory.
Starting in C++17, the compiler will pick aligned new
for types with alignment enforced by alignas
on the whole type or its member, also std::allocator
will pick aligned new
for such type, so nothing to worry about when creating std::vector
of such types.
And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and do p+=31; p&=~31ULL
with appropriate casting. Too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that support Intel _mm256_...
intrinsics. But there are even library functions that will help you do this, IIRC, if you insist.
The requirement to use _mm_free
instead of free
probably exists in part for the possibility of implementing _mm_malloc
on top of a plain old malloc
using this technique. Or for an aligned allocator using an alternate free-list.
AVX: data alignment: store crash, storeu, load, loadu doesn't
TL:DR: in optimized code, loads will fold into memory operands for other operations, which don't have alignment requirements in AVX. Stores won't.
Your sample code doesn't compile by itself, so I can't easily check what instruction _mm256_load_ps
compiles to.
I tried a small experiment with gcc 4.9, and it doesn't generate a vmovaps
at all for _mm256_load_ps
, since I only used the result of the load as an input to one other instruction. It generates that instruction with a memory operand. AVX instructions have no alignment requirements for their memory operands. (There is a performance hit for crossing a cache line, and a bigger hit for crossing a page boundary, but your code still works.)
The store, on the other hand, does generate a vmov...
instruction. Since you used the alignment-required version, it faults on unaligned addresses. Simply use the unaligned version; it'll be just as fast when the address is aligned, and still work when it isn't.
I didn't check your code carefully to see if all the accesses SHOULD be aligned. I assume not, from the way you phrased it to just ask why you weren't also getting faults for unaligned loads. Like I said, probably your code just didn't compile to any vmovaps
load instructions, or else even "aligned" AVX loads don't fault on unaligned addresses.
Are you running AVX (without AVX2 or FMA?) on a Sandy/Ivybridge CPU? I assume that's why your FMA instrinsics are commented out.
Usage of alignas in template argument of std::vector
If alignas(32)double
compiled, it would require that each element separately had 32-byte alignment, i.e. pad each double out to 32 bytes, completely defeating SIMD. (I don't think it will compile, but similar things with GNU C typedef double da __attribute__((aligned(32)))
do compile that way, with sizeof(da) == 32
.)
See Modern approach to making std::vector allocate aligned memory for working code.
As of C++17, std::vector<__m256d>
would work, but is usually not what you want because it makes scalar access a pain.
C++ sucks for this in my experience, although there might be a standard (or Boost) allocator that takes an over-alignment you can use as the second (usually defaulted) template param.
std::vector<double, some_aligned_allocator<32> >
still isn't type-compatible with normal std::vector
, which makes sense because any function that might reallocated it has to maintain alignment. But unfortunately that makes it not type-compatible even for passing to functions that only want read-only access to a std::vector
of double
elements.
Cost of misalignment
For a lot of cases the misalignment is only a couple percent worse than aligned, for AVX/AVX2 loops over an array if data's coming from L3 cache or RAM (on recent Intel CPUs); only with 64-byte vectors do you get a significantly bigger penalty (like 15% or so even when memory bandwidth is still the bottleneck.) You'd hope that the CPU core would have time to deal with it and keep the same number of outstanding off-core transactions in flight. But it doesn't.
For data hot in L1d, misalignment could hurt more even with 32-byte vectors.
In x86-64 code, alignof(max_align_t)
is 16 on mainstream C++ implementations, so in practice even a vector<double>
will end up aligned by 16 at least because the underlying allocator used by new
always aligns at least that much. But that's very often an odd multiple of 16, at least on GNU/Linux. Glibc's allocator (also used by malloc) for large allocations uses mmap
to get a whole range of pages, but it reserves the first 16 bytes for bookkeeping info. This is unfortunate for AVX and AVX-512 because it means your arrays are always misaligned unless you used aligned allocations. (How to solve the 32-byte-alignment issue for AVX load/store operations?)
Mainstream std::vector
implementations are also inefficient when they have to grow: C++ doesn't provide a realloc
equivalent that's compatible with new/delete, so it always has to allocate more space and copy to the start. Never even trying to allocate more space contiguous with the existing mapping (which would be safe even for non-trivially-copyable types), and not using implementation-specific tricks like Linux mremap
to map the same physical pages to a different virtual address without having to copy all those mega/gigabytes. The fact that C++ allows code to redefine operator new
means library implementations of std::vector can't just use a better allocator, either. All of this is a non-problem if you .reserve
the size you're going to need, but it is pretty dumb.
MinGW64 Is Incapable of 32 Byte Stack Alignment (Required for AVX on Windows x64), Easy Work Around or Switch Compilers?
You can solve this problem by switching to Microsoft's 64-bit C/C++ compiler. The problem is not intrinsic to 64-bit Windows. Despite what Kai Tietz said in the bug report you linked, Microsoft's x64 ABI does allow a compiler to give variables a greater than 16-byte alignment on the stack.
Also Cygwin's 64-bit version of GCC 4.9.2 can give variables 32-byte alignment on the stack.
Clang for Windows also makes working executables with AVX, and is a good choice in terms of optimizing well.
Related Topics
What Is a Converting Constructor in C++ ? What Is It For
Gcc Optimization Flag -O3 Makes Code Slower Than -O2
Why Does Printf Not Print Out Just One Byte When Printing Hex
C++ Multiple Definitions of a Variable
How Exactly Does _Attribute_((Constructor)) Work
Are There Benefits of Passing by Pointer Over Passing by Reference in C++
Static Constant String (Class Member)
Do You Use Null or 0 (Zero) For Pointers in C++
C/C++ Macro String Concatenation
C++ "Virtual" Keyword For Functions in Derived Classes. Is It Necessary
What Happens If You Call Erase() on a Map Element While Iterating from Begin to End
How to Get Main Window Handle from Process Id
Undefined Reference to Static Constexpr Char[]
How to Use the Pi Constant in C++