Different Memory Alignment for Different Buffer Sizes

The x86-64 System V ABI requires 16-byte alignment for local or global arrays that are 16 bytes or larger, and for all C99 VLAs (which are always local).

An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes. [4]

[4] The alignment requirement allows the use of SSE instructions when operating on the array. The compiler cannot in general calculate the size of a variable-length array (VLA), but it is expected that most VLAs will require at least 16 bytes, so it is logical to mandate that VLAs have at least a 16-byte alignment.

Fixed-size arrays smaller than one SIMD vector (16 bytes) don't have this requirement, so they can pack efficiently in the stack layout.

Note that this doesn't apply to arrays inside structs, only to locals and globals.

(For dynamic storage: a malloc return value must be aligned enough to hold any object up to that size, and since x86-64 SysV has a 16-byte max_align_t, malloc must return 16-byte aligned pointers when the requested size is 16 or higher. For smaller allocations, it could return an only 8-byte-aligned pointer for an 8-byte allocation if it wanted to.)
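A quick way to observe both guarantees is to print addresses modulo 16. A minimal sketch, assuming only the x86-64 SysV behavior described above:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main()
{
    char small_buf[8];  // under 16 bytes: no 16-byte requirement, may pack tightly
    char big_buf[32];   // 16 bytes or larger: the ABI requires 16-byte alignment
    void* heap = std::malloc(32);  // >= 16 bytes: 16-byte aligned under x86-64 SysV

    std::printf("small_buf %% 16 = %ju\n", (std::uintmax_t)((std::uintptr_t)small_buf % 16));
    std::printf("big_buf   %% 16 = %ju\n", (std::uintmax_t)((std::uintptr_t)big_buf % 16));  // 0
    std::printf("heap      %% 16 = %ju\n", (std::uintmax_t)((std::uintptr_t)heap % 16));     // 0

    std::free(heap);
    return 0;
}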


The requirement for local arrays makes it safe to write code that passes their address to a function that requires 16-byte alignment, but this is mostly not something the ABI itself really needs to specify.

It's not something that different compilers have to agree on in order to link their code together, the way struct layout or the calling convention is (which registers are call-clobbered or used for arg-passing, and so on). The compiler essentially owns the stack layout for the function it's compiling, and other functions can't assume or depend on anything about it. They would only get pointers to your local vars if you pass pointers as function args, or store pointers into globals.


Specifying it for globals is useful, though: it makes it safe for compiler-generated auto-vectorized code to assume alignment for global arrays, even when it's an extern int[] in an object file compiled by another compiler.

How do I ensure buffer memory is aligned?

Yes, your buffer IS aligned on 64 bits. It's ALSO aligned on a 4 KByte boundary (hence the 0x1000). If you don't want the 4 KB alignment, then pass 0x8 instead of 0x1000.

Edit: I would also note that usually when writing DMA chains you are writing them through uncached memory or through some kind of non-cache-based write queue. If this is the case, you want to align your DMA chains to the cache line size as well, to prevent a cache write-back from overwriting the start or end of your DMA chain.
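On a POSIX system, posix_memalign is the usual way to get such a buffer. A minimal sketch, assuming a 64-byte cache line (the real line size is platform-specific and should be queried, e.g. via sysconf):

#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main()
{
    const std::size_t kCacheLine = 64;  // assumed line size for this sketch
    void* dma_chain = nullptr;

    // Alignment must be a power of two and a multiple of sizeof(void*).
    if (posix_memalign(&dma_chain, kCacheLine, 4096) != 0) {
        std::perror("posix_memalign");
        return 1;
    }

    std::printf("DMA chain at %p (cache-line aligned)\n", dma_chain);
    std::free(dma_chain);
    return 0;
}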

Allocating memory-aligned buffers for SIMD; how does |16 give an odd multiple of 16, and why do it?

Disclaimer

Based on the comment referring to Altivec, this is specific to the Power architecture, which I'm not familiar with. Also, the code is incomplete, but it looks like the allocated memory is organized in one or multiple adjacent buffers, and the size adjustment only works when there are multiple buffers. We don't know how data is accessed in these buffers. There will be a lot of assumptions in this answer, to the point that it may be totally incorrect. I'm posting it mostly because it's too large for a comment.

Answer (sort of)

I can see one possible advantage of the size modification. First, let's recall some details of the Power architecture:

  • Altivec vector size is 16 bytes (128 bits)
  • Cache line size is 128 bytes

Now, let's take an example where AllocateBuffers allocates memory for 4 buffers (i.e. mABL.mNumberBuffers is 4) and nBytes is 256. Let's see how these buffers are laid out in memory:

| Buffer 1: 256+16=272 bytes | Buffer 2: 272 bytes | Buffer 3: 272 bytes | Buffer 4: 272 bytes |
^                            ^                     ^                     ^
offset 0                     offset 272            offset 544            offset 816

Notice the offset values and compare them against cache line boundaries. For simplicity, let's assume the memory is allocated at a cache line boundary. It doesn't really matter, as will be shown below.

  • Buffer 1 starts at offset 0, which is the beginning of a cache line.
  • Buffer 2 starts 16 bytes past the cache line boundary (which is at offset 2*128=256).
  • Buffer 3 starts 32 bytes past the cache line boundary (which is at offset 4*128=512).
  • Buffer 4 starts 48 bytes past the cache line boundary (which is at offset 6*128=768).

Note how the offset from the nearest cache line boundary increments by 16 bytes with each buffer. Now, if we assume that the data in each buffer is accessed in 16-byte chunks, in the forward direction, in a loop, then the cache lines are fetched from memory in a rather specific order. Let's consider the middle of the loop (since at the beginning the CPU will have to fetch a cache line for the start of every buffer):

  • Iteration 5

    • Load from Buffer 1 at offset 5*16=80, we are still using the cache line that was fetched on previous iterations.
    • Load from Buffer 2 at offset 352, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 256, we're at its offset 96.
    • Load from Buffer 3 at offset 624, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 512, we're at its offset 112.
    • Load from Buffer 4 at offset 896, we hit a new cache line boundary and fetch a new cache line from memory.
  • Iteration 6

    • Load from Buffer 1 at offset 6*16=96, we are still using the cache line that was fetched on previous iterations.
    • Load from Buffer 2 at offset 368, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 256, we're at its offset 112.
    • Load from Buffer 3 at offset 640, we hit a new cache line boundary and fetch a new cache line from memory.
    • Load from Buffer 4 at offset 912, we are still using the cache line that was fetched on the last iteration. The cache line boundary is at offset 896, we're at its offset 16.
  • Iteration 7

    • Load from Buffer 1 at offset 7*16=112, we are still using the cache line that was fetched on previous iterations.
    • Load from Buffer 2 at offset 384, we hit a new cache line boundary and fetch a new cache line from memory.
    • Load from Buffer 3 at offset 656, we are still using the cache line that was fetched on the last iteration. The cache line boundary is at offset 640, we're at its offset 16.
    • Load from Buffer 4 at offset 928, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 896, we're at its offset 32.
  • Iteration 8

    • Load from Buffer 1 at offset 8*16=128, we hit a new cache line boundary and fetch a new cache line from memory.
    • Load from Buffer 2 at offset 400, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 384, we're at its offset 16.
    • Load from Buffer 3 at offset 672, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 640, we're at its offset 32.
    • Load from Buffer 4 at offset 944, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 896, we're at its offset 48.

Note that the order in which new cache lines are fetched from memory does not depend on the order in which the buffers are accessed within each loop iteration. It also does not depend on whether the whole memory allocation was aligned to a cache line boundary. And note that if the buffer contents were accessed in reverse, the cache lines would be fetched in forward buffer order instead, but still strictly one at a time, in order.

This ordered cache line fetching may aid the hardware prefetcher in the CPU, so that when the next loop iteration executes, the required cache line is already prefetched. Without the size adjustment, every 8th iteration of the loop would require 4 new cache lines at once, in whatever order the buffers are accessed by the program, which could look like random access to memory and hamper the prefetcher. Depending on the loop complexity, this 4-cache-line fetch may not be hidden by out-of-order execution and may introduce a stall. This is less likely to happen when you fetch at most 1 new cache line per iteration.

Another possible benefit is avoiding address aliasing. I don't know the cache organization of Power, but if nBytes is a multiple of the page size, using multiple page-aligned buffers at once could result in lots of false dependencies and hamper store-to-load forwarding. However, the code makes the adjustment even when nBytes is not a multiple of the page size, so aliasing was probably not the main concern.
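To make the stagger concrete, here is a minimal sketch (the names nBytes, kStride and kNumBuffers are illustrative, not from the original code) that reproduces the offsets computed above:

#include <cstdio>

int main()
{
    const int kNumBuffers = 4;
    const int nBytes = 256;           // requested bytes per buffer
    const int kStride = nBytes + 16;  // the "+16" adjustment from the question
    const int kCacheLine = 128;       // Power cache line size

    for (int i = 0; i < kNumBuffers; ++i) {
        int start = i * kStride;
        // Prints 0/0, 272/16, 544/32, 816/48: each buffer starts 16 bytes
        // further past a cache line boundary than the previous one.
        std::printf("buffer %d: offset %4d, %3d bytes past a line boundary\n",
                    i + 1, start, start % kCacheLine);
    }
    return 0;
}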


  1. Am I right thinking that the above function will only work correctly based on the assumption that the new operator will return at least 16-byte aligned memory? In C++ the new operator is defined as returning a pointer to storage with alignment suitable for any object with a fundamental alignment requirement, which might not necessarily be 16 bytes.

Yes. C++ does not guarantee any particular alignment, other than that it is suitable for storing any object of a fundamental type. C++17 adds support for dynamic allocation of over-aligned types.

However, even with older C++ versions, every compiler also adheres to the target system's ABI specification, which may specify alignment for memory allocations. In practice, on many systems malloc returns at least 16-byte aligned pointers, and operator new obtains its memory from malloc or a similar lower-level API.

It's not portable, though, and therefore not a recommended practice. If you require a particular alignment, either make sure you're compiling for C++17 or use specialized APIs like posix_memalign.
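For illustration, a minimal sketch of both routes, assuming C++17 for the aligned operator new and a POSIX system for posix_memalign (the OverAligned type is hypothetical):

#include <cstdio>
#include <cstdlib>

struct alignas(64) OverAligned {  // stricter than any fundamental alignment
    float data[16];
};

int main()
{
    // C++17: operator new honors the extended alignment of OverAligned.
    OverAligned* a = new OverAligned;

    // POSIX: request the alignment explicitly (power of two, multiple of sizeof(void*)).
    void* p = nullptr;
    if (posix_memalign(&p, 64, sizeof(OverAligned)) != 0)
        return 1;

    std::printf("new: %p, posix_memalign: %p\n", static_cast<void*>(a), p);

    delete a;
    std::free(p);
    return 0;
}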

How does this code find the memory-aligned size of a struct in Swift? Why does it need binary operations?

Let's talk first about why you'd want aligned buffers, then we can talk about the bitwise arithmetic.

Our goal is to allocate a Metal buffer that can store three (triple-buffered) copies of our uniforms (so that we can write to one part of the buffer while the GPU reads from another). In order to read from each of these three copies, we supply an offset when binding the buffer, something like currentBufferIndex * uniformsSize. Certain Metal devices require these offsets to be multiples of 256, so we instead need to use something like currentBufferIndex * alignedUniformsSize as our offset.

How do we "round up" an integer to the next highest multiple of 256? We can do it by dropping the lowest 8 bits of the "unaligned" size, effectively rounding down, then adding 256, which gets us the next highest multiple. The rounding down is achieved by bitwise ANDing with the one's complement (~) of 255, which (in 32 bits) is 0xFFFFFF00. The rounding up is done by simply adding 0x100, which is 256.

Interestingly, if the base size is already aligned, this technique spuriously rounds up anyway (e.g., from 256 to 512). For the cost of an integer divide, you can avoid this waste:

let alignedUniformsSize = ((MemoryLayout<Uniforms>.size + 255) / 256) * 256
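For comparison, here is a minimal sketch of both computations, written in C++ for concreteness (the operators behave the same way in Swift; the size value is illustrative):

#include <cassert>
#include <cstddef>

int main()
{
    const std::size_t size = 256;  // stand-in for MemoryLayout<Uniforms>.size

    // Bitwise version: round down to a multiple of 256, then add 256.
    // Bumps an already-aligned size to the next multiple (256 -> 512).
    std::size_t bitwise = (size & ~std::size_t(0xFF)) + 0x100;

    // Divide version: rounds up, but leaves exact multiples unchanged.
    std::size_t divide = ((size + 255) / 256) * 256;

    assert(bitwise == 512);
    assert(divide == 256);
    return 0;
}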

Memory alignment today and 20 years ago

What has changed is SSE, which requires 16-byte alignment. This is covered in this older gcc document for -mpreferred-stack-boundary=num, which says (emphasis mine):

On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 suffers similar penalties if it is not 16 byte aligned.

This is also backed up by the paper Smashing The Modern Stack For Fun And Profit, which covers this and other modern changes that break Smashing the Stack for Fun and Profit.

C++: Alignment when casting byte buffer to another type

Your example is a violation of the strict aliasing rule.
So int64_view will point to the first byte either way, but the access may be unaligned. Some platforms allow that, some don't. Either way, in C++ it's UB.

For example:

#include <cstdint>
#include <cstddef>
#include <iostream>
#include <iomanip>

#define COUNT 8

struct alignas(1) S
{
    char _pad; // shifts buf off any natural int64_t boundary
    char buf[COUNT * sizeof(int64_t)];
};

int main()
{
    S s;
    // Reinterpret the misaligned char buffer as int64_t: strict-aliasing UB.
    int64_t* int64_view = static_cast<int64_t*>(static_cast<void*>(s.buf));

    std::cout << std::hex << "s._pad at " << (void*)(&s._pad) << " aligned as " << alignof(decltype(s._pad)) << std::endl;
    std::cout << std::hex << "s.buf at " << (void*)(s.buf) << " aligned as " << alignof(decltype(s.buf)) << std::endl;
    std::cout << std::hex << "int64_view at " << int64_view << " aligned as " << alignof(int64_t) << std::endl;

    for(std::size_t i = 0; i < COUNT; ++i)
    {
        int64_view[i] = i; // misaligned store
    }

    for(std::size_t i = 0; i < COUNT; ++i)
    {
        std::cout << std::dec << std::setw(2) << i << std::hex << " " << int64_view + i << " : " << int64_view[i] << std::endl; // misaligned load
    }
}

Now compile and run it with -fsanitize=undefined:

$ g++ -fsanitize=undefined -Wall -Wextra -std=c++20 test.cpp -o test

$ ./test
s._pad at 0x7ffffeb42300 aligned as 1
s.buf at 0x7ffffeb42301 aligned as 1
int64_view at 0x7ffffeb42301 aligned as 8
test.cpp:26:23: runtime error: store to misaligned address 0x7ffffeb42301 for type 'int64_t', which requires 8 byte alignment
0x7ffffeb42301: note: pointer points here
7f 00 00 bf 11 00 00 00 00 00 00 ff ff 00 00 01 00 00 00 20 23 b4 fe ff 7f 00 00 7c a4 9d 2b 98
^
test.cpp:31:113: runtime error: load of misaligned address 0x7ffffeb42301 for type 'int64_t', which requires 8 byte alignment
0x7ffffeb42301: note: pointer points here
7f 00 00 bf 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 03 00 00 00
^
0 0x7ffffeb42301 : 0
1 0x7ffffeb42309 : 1
2 0x7ffffeb42311 : 2
3 0x7ffffeb42319 : 3
4 0x7ffffeb42321 : 4
5 0x7ffffeb42329 : 5
6 0x7ffffeb42331 : 6
7 0x7ffffeb42339 : 7

It works on x86_64, but it is undefined behavior, and you pay for it in execution speed.

This example on godbolt

In C++20 there is std::bit_cast. It will not help with the unaligned access in this example, but it can resolve some of the aliasing issues.
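What does help with unaligned access is std::memcpy, which is well-defined regardless of alignment and which mainstream compilers optimize into a single unaligned load or store on x86_64. A minimal sketch:

#include <cstdint>
#include <cstring>

// Well-defined even if p is not 8-byte aligned.
int64_t load_i64(const char* p)
{
    int64_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

void store_i64(char* p, int64_t v)
{
    std::memcpy(p, &v, sizeof v);
}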

UPDATE:
There are instructions on x86_64 that require aligned access. For example, some SSE instructions (such as movaps) require 16-byte alignment. If you try to use these instructions with an unaligned address, the application will crash with a "general protection fault".

Memory alignment

Blanket statements blaming DMA for large buffer alignment restrictions are wrong.

Hardware DMA transfers are usually aligned on 4- or 8-byte boundaries, since the PCI bus can physically transfer 32 or 64 bits at a time. Beyond this basic alignment, hardware DMA transfers are designed to work with any address provided.

However, the hardware deals with physical addresses, while the OS deals with virtual memory addresses (a protected-mode construct on x86 CPUs). This means that a buffer that is contiguous in process address space may not be contiguous in physical RAM. Unless care is taken to create physically contiguous buffers, the DMA transfer needs to be broken up at VM page boundaries (typically 4K, possibly 2M).

As for buffers needing to be aligned to the disk sector size, this is completely untrue; the DMA hardware is completely oblivious to the physical sector size of a hard drive.

Under Linux 2.4, O_DIRECT required 4K alignment; under 2.6 it has been relaxed to 512B. In either case, it was probably a design decision to prevent single-sector updates from crossing VM page boundaries and thereby requiring split DMA transfers. (An arbitrarily placed 512B buffer has roughly a 1 in 8 chance of crossing a 4K page boundary.)

So, while the OS is to blame rather than the hardware, we can see why page aligned buffers are more efficient.

Edit: Of course, if we're writing large buffers anyway (100KB), then the number of VM page boundaries crossed will be practically the same whether we've aligned to 512B or not.
So the main case being optimized by 512B alignment is single-sector transfers.
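As a minimal sketch of what this means in practice for O_DIRECT (Linux-specific; the file name is illustrative, and 4096 is used for buffer alignment, size, and transfer length so it satisfies either kernel's rule):

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    const std::size_t kAlign = 4096;  // satisfies both the 4K and 512B requirements
    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, kAlign) != 0)
        return 1;

    // O_DIRECT requires an aligned buffer, size, and file offset.
    // (g++ on Linux defines _GNU_SOURCE, which exposes O_DIRECT.)
    int fd = open("datafile", O_RDONLY | O_DIRECT);
    if (fd < 0) { std::perror("open"); return 1; }

    ssize_t n = read(fd, buf, kAlign);  // one aligned, page-sized transfer
    std::printf("read %zd bytes\n", n);

    close(fd);
    std::free(buf);
    return 0;
}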


