How and When to Align to Cache Line Size

How and when to align to cache line size?

It's done this way so that different cores modifying different fields won't have to bounce the cache line containing both of them between their caches. In general, for a processor to access some data in memory, the entire cache line containing it must be in that processor's local cache. If it's modifying that data, that cache entry usually must be the only copy in any cache in the system (Exclusive mode in the MESI/MOESI-style cache coherence protocols). When separate cores try to modify different data that happens to live on the same cache line, and thus waste time moving that whole line back and forth, that's known as false sharing.

In the particular example you give, one core can be enqueueing an entry (reading (shared) buffer_ and writing (exclusive) only enqueue_pos_) while another dequeues (shared buffer_ and exclusive dequeue_pos_) without either core stalling on a cache line owned by the other.

The padding at the beginning means that buffer_ and buffer_mask_ end up on the same cache line, rather than split across two lines and thus requiring double the memory traffic to access.
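
For concreteness, here is a minimal sketch of the kind of layout being discussed, assuming a 64-byte cache line; the field names follow the bounded queue from the question, but the exact types are assumptions:

    #include <atomic>
    #include <cstddef>

    typedef char cacheline_pad_t[64];   // assumes a 64-byte cache line

    template <typename T>
    struct bounded_queue_layout {
        cacheline_pad_t pad0_;              // the "padding at the beginning"
        T* buffer_;                         // read by producers and consumers
        size_t buffer_mask_;                // shares a line with buffer_; both
                                            // are effectively read-only
        cacheline_pad_t pad1_;
        std::atomic<size_t> enqueue_pos_;   // written only when enqueueing
        cacheline_pad_t pad2_;
        std::atomic<size_t> dequeue_pos_;   // written only when dequeueing
        cacheline_pad_t pad3_;
    };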

I'm unsure whether the technique is entirely portable. The assumption is that each cacheline_pad_t will itself be aligned to a cache line boundary (64 bytes, its own size), so that whatever follows it starts on the next cache line. As far as I know, the C and C++ language standards only require this of whole structures, so that they can live in arrays nicely without violating the alignment requirements of any of their members.

The attribute approach would be more compiler specific, but might cut the size of this structure in half, since the padding would be limited to rounding up each element to a full cache line. That could be quite beneficial if one had a lot of these.
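
As a hedged sketch of that alternative (assuming a 64-byte line and a typical 64-bit target): with alignas, or GCC's __attribute__((aligned(64))), on each hot member, the compiler rounds each member up to its own line, so no explicit pad members are needed on either side:

    #include <atomic>
    #include <cstddef>

    // Sketch only: each alignas(64) member starts its own 64-byte line and
    // the struct is tail-padded, giving roughly half the footprint of the
    // version with explicit pad members on both sides of each field.
    struct queue_positions {
        alignas(64) std::atomic<size_t> enqueue_pos_;
        alignas(64) std::atomic<size_t> dequeue_pos_;
    };
    static_assert(sizeof(queue_positions) == 128, "one line per member");
    static_assert(alignof(queue_positions) == 64, "struct inherits alignment");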

The same concept applies in C as well as C++.

Align struct while minimizing cache line waste

To address (2), it is unclear whether the extra overhead of using packed structs (e.g., unaligned 64-bit accesses) and the extra math to access array elements will be worth it. But if you want to try it, you can create a new struct to pack your struct elements appropriately, then create a small wrapper class to access the elements like you would an array:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    using namespace std;

    // Packs as many T's as fit into one BlockAlignment-sized block, then
    // pads out the remainder so each block is exactly BlockAlignment bytes.
    template <typename T, size_t BlockAlignment>
    struct __attribute__((packed)) Packer
    {
        static constexpr size_t NUM_ELEMS = BlockAlignment / sizeof(T);
        static_assert( NUM_ELEMS > 0, "BlockAlignment too small for one object." );

        T &operator[]( size_t index ) { return packed[index]; }

        T packed[ NUM_ELEMS ];
        uint8_t padding[ BlockAlignment - sizeof(T)*NUM_ELEMS ];
    };

    // An array of packed blocks, aligned as a whole to BlockAlignment.
    // operator[] maps a flat index to (block, element within block).
    // Note: the +1 rounds up, so the last block may be partially unused.
    template <typename T, size_t NumElements, size_t BlockAlignment>
    struct alignas(BlockAlignment) PackedAlignedArray
    {
        typedef Packer<T, BlockAlignment> PackerType;
        std::array< PackerType, NumElements / PackerType::NUM_ELEMS + 1 > packers;

        T &operator[]( size_t index )
        {
            return packers[ index / PackerType::NUM_ELEMS ][ index % PackerType::NUM_ELEMS ];
        }
    };

    struct __attribute__((packed)) Foo
    {
        uint64_t a;
        uint64_t b;
        uint8_t c;
    };

    int main()
    {
        static_assert( sizeof(Foo) == 17, "Struct not packed for test" );
        constexpr size_t NUM_ELEMENTS = 10;
        constexpr size_t BLOCK_ALIGNMENT = 64;

        PackedAlignedArray<Foo, NUM_ELEMENTS, BLOCK_ALIGNMENT> theArray;

        for ( size_t i = 0; i < NUM_ELEMENTS; ++i )
        {
            // Display the memory offset between the current
            // element and the start of the array
            cout << reinterpret_cast<uintptr_t>(&theArray[i]) -
                    reinterpret_cast<uintptr_t>(&theArray[0]) << std::endl;
        }

        return 0;
    }

The output of the program shows the byte offsets of the 17-byte elements from the start of the array. Since only three 17-byte elements fit in each 64-byte block, the offset jumps to the next multiple of 64 after every third element:

    0
    17
    34
    64
    81
    98
    128
    145
    162
    192

x86 Memory Alignment of struct vs. cache line?

I think the part you might be missing is the alignment requirement that the compiler imposes for various types.

Integer types are generally aligned to a multiple of their own size (e.g. a 64-bit integer will be aligned to 8 bytes); so-called "natural alignment". This is not a strict architectural requirement of x86; unaligned loads and stores still work, but since they are less efficient, the compiler prefers to avoid them.

An aggregate, like a struct, is aligned according to the highest alignment requirement of its members, and padding will be inserted between members if needed to ensure that each one is properly aligned. Padding will also be added at the end so that the overall size of the struct is a multiple of its required alignment.
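
These rules are easy to verify with static_assert; here is an illustrative struct (not the one from your question), assuming a typical 64-bit ABI where uint64_t requires 8-byte alignment:

    #include <cstddef>
    #include <cstdint>

    struct Example {
        uint64_t a;   // offset 0; needs 8-byte alignment
        uint8_t  b;   // offset 8
                      // 7 bytes of padding so c is 8-byte aligned
        uint64_t c;   // offset 16
        uint8_t  d;   // offset 24
                      // 7 bytes of tail padding
    };
    static_assert(alignof(Example) == 8, "alignment of the widest member");
    static_assert(offsetof(Example, c) == 16, "padding inserted after b");
    static_assert(sizeof(Example) == 32, "size rounded to a multiple of 8");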

So in your example, struct Obj has alignment 8, and its size will be rounded up to 48 (with 6 bytes of padding at the end). So there is no need for 24 bytes of padding to be inserted after c[4] (I think you meant to write the padding at addresses 40-63); your obj can be placed at address 40. d can then be placed at address 88.

Note that none of this has anything to do with the cache line size. Objects are not by default aligned to cache lines, though "natural alignment" will ensure that no integer load or store ever has to cross a cache line.

Aligning to cache line and knowing the cache line size

To know the sizes, you need to look them up in the documentation for the processor; as far as I know there is no portable programmatic way to do it. On the plus side, most cache lines are of a standard size: on x86, cache lines are 64 bytes. To prevent false sharing, you need to follow the guidelines of the processor you are targeting (Intel has some special notes on its NetBurst-based processors); generally you need to align to 64 bytes for this (Intel also states that you should avoid crossing 16-byte boundaries).

To do this in C or C++ requires that you use the standard aligned_alloc function or one of the compiler-specific specifiers such as __attribute__((aligned(64))) or __declspec(align(64)). To pad between members in a struct so as to split them onto different cache lines, you need to insert a member big enough to align the next member to the next 64-byte boundary.
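
A brief sketch of those options, assuming a 64-byte line (the type and variable names here are invented for illustration):

    #include <atomic>
    #include <cstdlib>

    // Compiler-specific attribute: the whole object gets its own line.
    struct __attribute__((aligned(64))) LineAligned {
        std::atomic<int> value;
    };

    // Manual padding: force a and b onto different lines. The struct itself
    // must also be 64-byte aligned for the split to be guaranteed.
    struct alignas(64) PaddedPair {
        std::atomic<int> a;
        char pad[64 - sizeof(std::atomic<int>)];  // fills out a's line
        std::atomic<int> b;                       // starts the next line
    };

    int main() {
        // Standard C11/C++17: the size must be a multiple of the alignment.
        void* p = std::aligned_alloc(64, 4096);
        std::free(p);
    }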

What does cacheline aligned mean?

CPU caches transfer data from and to main memory in chunks called cache lines; a typical size for this seems to be 64 bytes.

Data that are located closer to each other than this may end up on the same cache line.

If these data are needed by different cores, the system has to work hard to keep the data consistent between the copies residing in the cores' caches. Essentially, while one thread modifies the data, the other core's copy is invalidated, and that core has to wait to reacquire the line, much as if it were blocked by a lock.

The article you reference talks about one such problem that was found in PostgreSQL in a data structure in shared memory that is frequently updated by different processes. By introducing padding into the structure to inflate it to 64 bytes, it is guaranteed that no two such data structures end up in the same cache line, and the processes that access them are not blocked more than absolutely necessary.

This is only relevant if your program parallelizes execution and accesses a shared memory region, either by multithreading or by multiprocessing with shared memory. In this case you can benefit by making sure that data that are frequently accessed by different execution threads are not located close enough in memory that they can end up in the same cache line.

The typical way to do that is by adding “dead” padding space at the end of a data structure.
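
A sketch of that idea, assuming a 64-byte line (the field names are invented):

    #include <cstdint>

    // Inflate each slot to a full cache line with "dead" tail padding, so
    // slots stored back to back never share a line.
    struct Slot {
        int64_t head;
        int64_t tail;
        char    pad[64 - 2 * sizeof(int64_t)];  // 48 bytes of dead space
    };
    static_assert(sizeof(Slot) == 64, "exactly one slot per cache line");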

I found some interesting articles on the topic that you may want to read:

http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273?pgno=3

http://www.drdobbs.com/tools/memory-constraints-on-thread-performance/231300494

http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206

Should the cache padding size of x86-64 be 128 bytes?

Intel's optimization manual does describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines when there's spare memory bandwidth (off-core request tracking slots) at the time the first line is getting pulled in.

Your microbenchmark doesn't show any significant time difference between 64 vs. 128 byte separation. Without any actual false sharing (within the same 64 byte line), after some initial chaos, it quickly reaches a state where each core has exclusive ownership of the cache line it's modifying. That means no further L1d misses, thus no requests to L2 which would trigger the L2 spatial prefetcher.

Contrast that with, for example, two pairs of threads contending over separate atomic<int> variables in adjacent (or not-so-adjacent) cache lines, or actually false-sharing them. Then L2 spatial prefetching could couple the contention together, so all 4 threads are contending with each other instead of 2 independent pairs. In basically any case where cache lines actually are bouncing back and forth between cores, L2 spatial prefetching can make it worse if you aren't careful.

(The L2 prefetcher doesn't keep trying indefinitely to complete pairs of lines of every valid line it's caching; that would hurt cases like this where different cores are repeatedly touching adjacent lines, more than it helps anything.)

Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size includes an answer with a longer benchmark; I haven't looked at it recently, but I think it's supposed to demo destructive interference at 64 bytes but not 128. The answer there unfortunately doesn't mention L2 spatial prefetching as one of the effects that can cause some destructive interference (although not as much as a 128-byte line size in an outer level of cache, especially not if it's an inclusive cache).
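
If you do decide to pad to 128 bytes on x86-64, a hard-coded constant is the usual spelling; note that std::hardware_destructive_interference_size is 64 on common x86-64 toolchains, so it does not capture the adjacent-line prefetch effect. A minimal sketch, with an assumed struct name:

    #include <atomic>

    // One counter per 128-byte pair of lines, so the L2 spatial prefetcher
    // cannot couple two counters' lines together; alignas pads the struct
    // out to 128 bytes as well.
    struct alignas(128) PaddedCounter {
        std::atomic<int> value;
    };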



Perf counters reveal a difference even with your benchmark

There is more initial chaos that we can measure with performance counters for your benchmark. On my i7-6700k (quad-core Skylake with Hyperthreading; 4c8t, running Linux 5.16), I improved the source so I could compile with optimization without the memory accesses being optimized away, and added a CPP macro so I could set the stride (in bytes) from the compiler command line. Note the ~500 memory-order mis-speculation pipeline nukes (machine_clears.memory_ordering) when we use adjacent lines. The actual number is quite variable, from 200 to 850, but there is still negligible effect on the overall time.

Adjacent lines, 500 +- 300 machine clears

    $ g++ -DSIZE=64 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out

    Performance counter stats for './a.out' (25 runs):

            560.22 msec task-clock                      #    3.958 CPUs utilized        ( +- 0.12% )
                 0      context-switches                #    0.000 /sec
                 0      cpu-migrations                  #    0.000 /sec
               126      page-faults                     #  224.752 /sec                 ( +- 0.35% )
     2,180,391,747      cycles                          #    3.889 GHz                  ( +- 0.12% )
     2,003,039,378      instructions                    #    0.92  insn per cycle       ( +- 0.00% )
     1,604,118,661      uops_issued.any                 #    2.861 G/sec                ( +- 0.00% )
     2,003,739,959      uops_executed.thread            #    3.574 G/sec                ( +- 0.00% )
               494      machine_clears.memory_ordering  #  881.172 /sec                 ( +- 9.00% )

          0.141534 +- 0.000342 seconds time elapsed  ( +- 0.24% )

vs with 128-byte separation, only a very few machine clears

    $ g++ -DSIZE=128 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out

    Performance counter stats for './a.out' (25 runs):

            560.01 msec task-clock                      #    3.957 CPUs utilized        ( +- 0.13% )
                 0      context-switches                #    0.000 /sec
                 0      cpu-migrations                  #    0.000 /sec
               124      page-faults                     #  221.203 /sec                 ( +- 0.16% )
     2,180,048,243      cycles                          #    3.889 GHz                  ( +- 0.13% )
     2,003,038,553      instructions                    #    0.92  insn per cycle       ( +- 0.00% )
     1,604,084,990      uops_issued.any                 #    2.862 G/sec                ( +- 0.00% )
     2,003,707,895      uops_executed.thread            #    3.574 G/sec                ( +- 0.00% )
                22      machine_clears.memory_ordering  #   39.246 /sec                 ( +- 9.68% )

          0.141506 +- 0.000342 seconds time elapsed  ( +- 0.24% )

Presumably there's some dependence on how Linux schedules threads to logical cores on this 4c8t machine. Related:

  • What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? - actual false sharing within one line is much worse for logical cores sharing a physical core, but for adjacent lines there's probably no effect: the L2 is the same per physical core, and both lines will stay hot in L1d.
  • Why flush the pipeline for Memory Order Violation caused by other logical processors?

vs. actual false-sharing within one line: 10M machine clears

The store buffer (and store forwarding) get a bunch of increments done for every false-sharing machine clear, so it's not as bad as one might expect. (It would be far worse with atomic RMWs, like std::atomic<int> fetch_add, since there every single increment needs direct access to L1d cache as it executes.) See Why does false sharing still affect non atomics, but much less than atomics?
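
For reference, a hypothetical RMW variant of the benchmark's update loop (the plain version appears in the improved benchmark below); per the reasoning above, expect it to be much slower under false sharing:

    #include <atomic>
    #include <cstddef>

    std::atomic<int> counter_rmw[1024];

    // Each fetch_add must own the cache line in L1d as it executes; unlike
    // the plain volatile increments, the work cannot hide in the store buffer.
    void update_rmw(std::size_t idx) {
        for (int j = 0; j < 100000000; j++)
            counter_rmw[idx].fetch_add(1, std::memory_order_relaxed);
    }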

    $ g++ -DSIZE=4 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out

    Performance counter stats for './a.out' (25 runs):

            809.98 msec task-clock                      #    3.835 CPUs utilized        ( +- 0.42% )
                 0      context-switches                #    0.000 /sec
                 0      cpu-migrations                  #    0.000 /sec
               122      page-faults                     #  152.953 /sec                 ( +- 0.22% )
     3,152,973,230      cycles                          #    3.953 GHz                  ( +- 0.42% )
     2,003,038,681      instructions                    #    0.65  insn per cycle       ( +- 0.00% )
     2,868,628,070      uops_issued.any                 #    3.596 G/sec                ( +- 0.41% )
     2,934,059,729      uops_executed.thread            #    3.678 G/sec                ( +- 0.30% )
        10,810,169      machine_clears.memory_ordering  #   13.553 M/sec                ( +- 0.90% )

          0.21123 +- 0.00124 seconds time elapsed  ( +- 0.59% )


Improved benchmark: align the array, and use volatile to allow optimization

I used volatile so I could enable optimization. I assume you compiled with optimization disabled, so int j was also getting stored/reloaded inside the loop.

And I used alignas(128) on counter[] so we'd be sure the array started at a 128-byte boundary, putting the tested counters in two pairs of 128-byte lines rather than spread across three.

    #include <thread>

    // alignas(128): start the array at a 128-byte boundary so the four
    // counters span exactly two 128-byte line pairs rather than three.
    // volatile: keeps the loads/stores in the loop even at -O2.
    alignas(128) volatile int counter[1024]{};

    void update(int idx) {
        for (int j = 0; j < 100000000; j++) ++counter[idx];
    }

    // SIZE is the per-thread separation in bytes, set on the compiler
    // command line, e.g. g++ -DSIZE=64 ...
    static const int stride = SIZE / sizeof(counter[0]);

    int main() {
        std::thread t1(update, 0*stride);
        std::thread t2(update, 1*stride);
        std::thread t3(update, 2*stride);
        std::thread t4(update, 3*stride);
        t1.join();
        t2.join();
        t3.join();
        t4.join();
    }

Will the cache line aligned memory allocation pay off?

Most of the discussions on cache line alignment deal with high-performance computing working with many threads, and keeping scalability as close to linear as possible. In those discussions the reason for cache line alignment is to prevent a write to one data variable invalidating the cache line that also contains another variable used by a different thread.

So, unless you are trying to write code that will scale to a very high number of processor cores, cache line alignment probably won't matter much to you. But again: test it and see.

Can multiple aligned attributes in gcc be used to guarantee cache line separation?

I have a struct with 2 members, which I want on different cache lines
where a cache line is 64 Bytes. I assume the following is not good
enough because it will only guarantee the alignment of a single
member:

    struct alignTo64ByteCacheLine_BAD {
        int _onCacheLine1;
        int _onCacheLine2 __attribute__((aligned(64)));
    };

You are mistaken: the above code will result in both members being aligned on 64-byte boundaries. This follows from the facts that

  • A structure's alignment requirement has to be a multiple of the alignment requirement of each member (else at least one member's alignment cannot be guaranteed), and

  • There cannot be padding before the first member.

With the struct having only two members, therefore, specifying an alignment for the second ensures that the first will have at least as strict an alignment. Thus, this alternative has the same practical effect as each of your other two.

Moreover, that ensures that each member is at the beginning of its cache line, which is a stronger requirement than you expressed. If you indeed need only that the members be on different cache lines then aligning only the second would be sufficient anyway, because the first member has to be laid out ahead of the second in memory.

Note also, however, that this appears extremely inefficient. The alignment requirements will require the compiler to pad this structure to an overall size of (at least) 128 bytes, of which only 8 are used (since we're talking about GCC, we know we have 4-byte ints). Each member will be the only thing in its cache line. You didn't say why you want this, but it seems likely to impact your cache hit rate pretty hard.

Also, will the aligned(64) attribute on the struct itself pad the
struct out to a multiple of 64-bytes for use in arrays, or does that
have to be done manually?

The struct size will be padded to a multiple of its alignment requirement, exactly so that arrays of the type do not force any elements to be misaligned.
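
You can check both properties directly; here is a sketch with an assumed struct name, relying on GCC's 4-byte int:

    #include <cstddef>

    struct CacheLineSplit {
        int first;                                // offset 0
        int second __attribute__((aligned(64)));  // offset 64
    };
    static_assert(alignof(CacheLineSplit) == 64, "struct inherits member alignment");
    static_assert(offsetof(CacheLineSplit, second) == 64, "second starts its own line");
    static_assert(sizeof(CacheLineSplit) == 128, "padded to a multiple of 64 for arrays");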


