Prohibit Unaligned Memory Accesses on X86/X86_64

Prohibit unaligned memory accesses on x86/x86_64

It's tricky and I haven't done it personally, but I think you can do it in the following way:

x86_64 CPUs (specifically, I've checked an Intel Core i7, but I assume others as well) have a performance counter, MISALIGN_MEM_REF, which counts misaligned memory references.

So first of all, you can run your program under the "perf" tool on Linux to get a count of the number of misaligned accesses your code has performed.
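For example, something along these lines (the exact event names vary by CPU and perf version - run "perf list" to find them; the names below are the ones quoted further down this page from a Sandy Bridge machine, and ./your_program is just a stand-in for your binary):

$ perf stat -e misalign_mem_ref.loads,misalign_mem_ref.stores ./your_program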

A more tricky and interesting hack would be to write a kernel module that programs the performance counter to generate an interrupt on overflow, and arranges for it to overflow on the first unaligned load/store. Respond to this interrupt in your kernel module by sending a signal to your process.

This will, in effect, turn the x86_64 into a core that doesn't support unaligned access.

This won't be simple, though: besides your code, the system libraries also use unaligned accesses, so it will be tricky to separate them from your own code.
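A related user-space approximation of that idea (a rough, untested sketch, not the kernel-module route described above): perf_event_open(2) can create a sampling event with a period of 1 and deliver a signal on every overflow, i.e. on every counted misaligned reference. The raw event encoding below is a placeholder - you would have to look up the MISALIGN_MEM_REF encoding for your exact CPU.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <csignal>
#include <cstring>
#include <cstdio>

static void on_overflow(int)
{
    // Called once per counted misaligned access (the kernel may throttle
    // very high signal rates).
    const char msg[] = "misaligned memory reference\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
}

int main()
{
    signal(SIGIO, on_overflow);

    perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x0105;      // PLACEHOLDER: raw code for MISALIGN_MEM_REF on your CPU
    attr.sample_period = 1;    // overflow (and therefore a signal) on every event
    attr.exclude_kernel = 1;
    attr.disabled = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0 /*this process*/, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC); // deliver SIGIO on overflow
    fcntl(fd, F_SETOWN, getpid());
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... run the code you want to police here ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    close(fd);
    return 0;
}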

Intel x86: why do aligned access and non-aligned access have the same performance?

On most modern x86 cores, the performance of aligned and misaligned accesses is the same only if the access does not cross a specific internal boundary.

The exact size of the internal boundary varies based on the core architecture of the relevant CPU, but on Intel CPUs from the last decade, the relevant boundary is the 64-byte cache line. That is, accesses which fall entirely within a 64-byte cache line perform the same regardless of whether they are aligned or not.

However, if a (necessarily misaligned) access crosses a cache line boundary on an Intel chip, a penalty of about 2x is paid in both latency and throughput. The bottom-line impact of this penalty depends on the surrounding code and will often be much less than 2x, sometimes close to zero. This modest penalty may be much larger if a 4K page boundary is also crossed.

Aligned accesses never cross these boundaries, so cannot suffer this penalty.

The broad picture is similar for AMD chips, though the relevant boundary has been smaller than 64 bytes on some recent chips, and the boundary is different for loads and stores.

I have included additional details in the load throughput and store throughput sections of a blog post I wrote.

Testing It

Your test wasn't able to show the effect for several reasons:

  • The test didn't allocate aligned memory; you can't reliably cross a cache line by using an offset from a region of unknown alignment.
  • You iterated 8 bytes at a time, so the majority of the writes (7 out of 8) fall entirely within a cache line and have no penalty, leaving a small signal that will only be detectable if the rest of the benchmark is very clean.
  • You used a large buffer size, which doesn't fit in any level of the cache. The split-line effect is only really obvious at the L1, or when splitting lines means you bring in twice the number of lines (e.g., random access). Since every line is accessed linearly in either scenario, the loop is limited by throughput from DRAM to the core regardless of splits: the split writes have plenty of time to complete while waiting for main memory.
  • You used a local volatile auto tmp and tmp++, which creates a volatile on the stack and a lot of loads and stores to preserve volatile semantics: these are all aligned and wash out the effect you are trying to measure.

Here is my modification of your test: it operates only in an L1-sized region and advances 64 bytes at a time, so every store is a split if any is:

#include <iostream>
#include <stdint.h>
#include <stdlib.h>   // atoi, rand
#include <chrono>
#include <string.h>
#include <iomanip>

using namespace std;

static inline int64_t get_time_ns()
{
    auto now = std::chrono::high_resolution_clock::now().time_since_epoch();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(now).count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage: ./test <offset>" << endl;
        cout << "0 - aligned, 1-63 - offset within the 64-byte cache line" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    const uint64_t BUFFER_SIZE = 10000;        // ~10 KB, fits comfortably in L1
    alignas(64) uint8_t data_ptr[BUFFER_SIZE]; // cache-line-aligned buffer
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 1000000;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        uint64_t src = rand();
        // Advance 64 bytes per write: every store has the same in-line offset,
        // so every store is a split if any is (offsets 57-63 cross a line).
        for (uint64_t j = offset; j + 64 <= BUFFER_SIZE; j += 64) {
            memcpy(data_ptr + j, &src, 8);
        }
    }
    auto end = get_time_ns();
    cout << "time elapsed " << std::setprecision(2)
         << (end - start) / ((double)LOOP_CNT * BUFFER_SIZE / 64)
         << "ns per write (rand:" << (int)data_ptr[rand() % BUFFER_SIZE] << ")" << endl;
    return 0;
}

Running this for all offsets from 0 to 64, I get:

$ g++ test.cpp -O2 && for off in {0..64}; do printf "%2d :" $off && ./a.out $off; done
0 :time elapsed 0.56ns per write (rand:0)
1 :time elapsed 0.57ns per write (rand:0)
2 :time elapsed 0.57ns per write (rand:0)
3 :time elapsed 0.56ns per write (rand:0)
4 :time elapsed 0.56ns per write (rand:0)
5 :time elapsed 0.56ns per write (rand:0)
6 :time elapsed 0.57ns per write (rand:0)
7 :time elapsed 0.56ns per write (rand:0)
8 :time elapsed 0.57ns per write (rand:0)
9 :time elapsed 0.57ns per write (rand:0)
10 :time elapsed 0.57ns per write (rand:0)
11 :time elapsed 0.56ns per write (rand:0)
12 :time elapsed 0.56ns per write (rand:0)
13 :time elapsed 0.56ns per write (rand:0)
14 :time elapsed 0.56ns per write (rand:0)
15 :time elapsed 0.57ns per write (rand:0)
16 :time elapsed 0.56ns per write (rand:0)
17 :time elapsed 0.56ns per write (rand:0)
18 :time elapsed 0.56ns per write (rand:0)
19 :time elapsed 0.56ns per write (rand:0)
20 :time elapsed 0.56ns per write (rand:0)
21 :time elapsed 0.56ns per write (rand:0)
22 :time elapsed 0.56ns per write (rand:0)
23 :time elapsed 0.56ns per write (rand:0)
24 :time elapsed 0.56ns per write (rand:0)
25 :time elapsed 0.56ns per write (rand:0)
26 :time elapsed 0.56ns per write (rand:0)
27 :time elapsed 0.56ns per write (rand:0)
28 :time elapsed 0.57ns per write (rand:0)
29 :time elapsed 0.56ns per write (rand:0)
30 :time elapsed 0.57ns per write (rand:25)
31 :time elapsed 0.56ns per write (rand:151)
32 :time elapsed 0.56ns per write (rand:123)
33 :time elapsed 0.56ns per write (rand:29)
34 :time elapsed 0.55ns per write (rand:0)
35 :time elapsed 0.56ns per write (rand:0)
36 :time elapsed 0.57ns per write (rand:0)
37 :time elapsed 0.56ns per write (rand:0)
38 :time elapsed 0.56ns per write (rand:0)
39 :time elapsed 0.56ns per write (rand:0)
40 :time elapsed 0.56ns per write (rand:0)
41 :time elapsed 0.56ns per write (rand:0)
42 :time elapsed 0.57ns per write (rand:0)
43 :time elapsed 0.56ns per write (rand:0)
44 :time elapsed 0.56ns per write (rand:0)
45 :time elapsed 0.56ns per write (rand:0)
46 :time elapsed 0.57ns per write (rand:0)
47 :time elapsed 0.57ns per write (rand:0)
48 :time elapsed 0.56ns per write (rand:0)
49 :time elapsed 0.56ns per write (rand:0)
50 :time elapsed 0.57ns per write (rand:0)
51 :time elapsed 0.56ns per write (rand:0)
52 :time elapsed 0.56ns per write (rand:0)
53 :time elapsed 0.56ns per write (rand:0)
54 :time elapsed 0.55ns per write (rand:0)
55 :time elapsed 0.56ns per write (rand:0)
56 :time elapsed 0.56ns per write (rand:0)
57 :time elapsed 1.1ns per write (rand:0)
58 :time elapsed 1.1ns per write (rand:0)
59 :time elapsed 1.1ns per write (rand:0)
60 :time elapsed 1.1ns per write (rand:0)
61 :time elapsed 1.1ns per write (rand:0)
62 :time elapsed 1.1ns per write (rand:0)
63 :time elapsed 1ns per write (rand:0)
64 :time elapsed 0.56ns per write (rand:0)

Note that offsets 57 through 63 all take about 2x as long per write, and those are exactly the offsets at which an 8-byte write crosses a 64-byte (cache line) boundary: a write starting at in-line offset 57 covers bytes 57 through 64, spilling one byte into the next line, while a write at offset 56 still fits entirely within bytes 56-63.

Any way to stop unaligned access from the C++ standard library on x86_64?

Like I commented on the question, that asm isn't safe because it steps on the red zone. Instead, use:

asm volatile ("add $-128, %rsp\n\t"
"pushf\n\t"
"orl $0x40000, (%rsp)\n\t"
"popf\n\t"
"sub $-128, %rsp\n\t"
);

(-128 fits in a sign-extended 8-bit immediate, but +128 doesn't, hence using add $-128 to subtract 128.)

Or in this case, there are dedicated instructions for toggling that bit, like there are for the carry and direction flags:

asm("stac");   // Set AC flag
asm("clac"); // Clear AC flag

It's a good idea to know when your code uses unaligned memory. It's not necessarily a good idea to change your code to avoid it in every case: sometimes the better locality from packing data closer together is more valuable.

Given that you shouldn't necessarily aim to eliminate all unaligned accesses anyway, I don't think this is the easiest way to find the ones you do have.

Modern x86 hardware has fast hardware support for unaligned loads/stores. When they don't span a cache-line boundary or lead to store-forwarding stalls, there's literally no penalty.

What you might try is looking at performance counters for some of these events:

  misalign_mem_ref.loads   [Speculative cache line split load uops dispatched to L1 cache]
  misalign_mem_ref.stores  [Speculative cache line split STA uops dispatched to L1 cache]

  ld_blocks.store_forward  [This event counts loads that followed a store to the same address,
  where the data could not be forwarded inside the pipeline from the store to the load.
  The most common reason why store forwarding would be blocked is when a load's address range
  overlaps with a preceding smaller uncompleted store. See the table of not-supported store
  forwards in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
  The penalty for blocked store forwarding is that the load must wait for the store to
  complete before it can be issued.]

(from ocperf.py list output on my Sandy Bridge CPU).

There are probably other ways to detect unaligned memory access. Maybe valgrind? I searched for "valgrind detect unaligned" and found this mailing-list discussion from 13 years ago. Probably still not implemented.


The hand-optimized library functions do use unaligned accesses because it's the fastest way for them to get their job done. For example, copying bytes 6 to 13 of a string somewhere else can and should be done with just a single 8-byte load/store.

So yes, you would need special slow&safe versions of library functions.
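As a concrete picture of that single load/store idiom (a sketch, not glibc's actual implementation): a fixed-size 8-byte memcpy typically compiles to one unaligned 8-byte load and one unaligned 8-byte store on x86-64, with no alignment requirement on either pointer.

#include <cstdint>
#include <cstring>

// Sketch of the small-copy idiom described above. Compilers lower these
// fixed-size memcpy calls to a single 8-byte mov each on x86-64.
static inline void copy8(void* dst, const void* src)
{
    uint64_t tmp;
    memcpy(&tmp, src, sizeof tmp);  // one (possibly unaligned) 8-byte load
    memcpy(dst, &tmp, sizeof tmp);  // one (possibly unaligned) 8-byte store
}

// e.g. copying bytes 6..13 of a string s: copy8(out, s + 6);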


If your code would have to execute extra instructions to avoid using unaligned loads, it's often not worth it. Especially if the input is usually aligned, having a loop that handles the first up-to-alignment-boundary elements before starting the main loop may just slow things down. In the aligned case, everything works optimally, with no overhead of checking alignment. In the unaligned case, things might work a few percent slower, but as long as the unaligned cases are rare, it's not worth avoiding them.

Especially if it's not SSE code, since non-AVX legacy SSE can only fold loads into memory operands for ALU instructions when alignment is guaranteed.
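A small illustration of that folding point, as a sketch using SSE2 intrinsics (the function names here are made up for the example): when the pointer is known to be 16-byte aligned, non-AVX codegen can use the load directly as a memory operand of paddd, while the unaligned variant needs a separate movdqu first.

#include <immintrin.h>
#include <cstddef>

// Aligned input: the load can fold into the ALU op (paddd xmm0, [mem]).
__m128i sum_aligned(const __m128i* p, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; ++i)
        acc = _mm_add_epi32(acc, _mm_load_si128(p + i));   // 16-byte alignment required
    return acc;
}

// Possibly unaligned input: a separate movdqu load is required without AVX.
__m128i sum_unaligned(const int* p, size_t n)   // n = number of 4-int groups
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; ++i)
        acc = _mm_add_epi32(acc,
              _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + 4 * i)));
    return acc;
}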

The benefit of having good-enough hardware support for unaligned memory ops is that software can be faster in the aligned case. It can leave alignment-handling to hardware, instead of running extra instructions to handle pointers that are probably aligned. (Linus Torvalds had some interesting posts about this on the http://realworldtech.com/ forums, but they're not searchable, so I can't find them.)

Does unaligned memory access always cause bus errors?


  1. It may be significantly slower to access unaligned memory (as in, several times slower).

  2. Not all platforms even support unaligned access - x86 and x64 do, but ia64 (Itanium) does not, for example.

  3. A compiler can emulate unaligned access (VC++ does that for pointers declared as __unaligned on ia64, for example) by inserting additional checks to detect the unaligned case and by loading/storing the parts of the object that straddle the alignment boundary separately. That is even slower than unaligned access on platforms which natively support it, however.
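Conceptually, the emulation described in point 3 boils down to something like this sketch (the compiler-generated version also carries the run-time alignment check in front of it):

#include <cstdint>

// Assemble a 32-bit little-endian value from individual byte loads so that no
// single memory access is misaligned. This is the kind of slow path a compiler
// falls back to on targets without native unaligned-access support.
static inline uint32_t load_u32_unaligned(const uint8_t* p)
{
    return  static_cast<uint32_t>(p[0])
         | (static_cast<uint32_t>(p[1]) << 8)
         | (static_cast<uint32_t>(p[2]) << 16)
         | (static_cast<uint32_t>(p[3]) << 24);
}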


