SSE, Intrinsics, and Alignment

SSE, intrinsics, and alignment

First of all, you have to take care of two types of memory allocation:

  • Static allocation. For automatic variables to be properly aligned, your type needs a proper alignment specification (e.g. __declspec(align(16)), __attribute__((aligned(16))), or your _MM_ALIGN16). But fortunately you only need this if the alignment requirements given by the type's members (if any) are not sufficient. So you don't need it for your Sphere, given that your Vector3 is already aligned properly. And if your Vector3 contains an __m128 member (which is pretty likely, and otherwise I would suggest adding one), then you don't even need it for Vector3. So you usually don't have to mess with compiler-specific alignment attributes.

  • Dynamic allocation. So much for the easy part. The problem is that, at the lowest level, C++ uses a rather type-agnostic memory allocation function for all dynamic memory. It only guarantees proper alignment for the standard types, which might happen to mean 16 bytes but isn't guaranteed to.

    To compensate for this, you have to overload the built-in operator new/delete to implement your own memory allocation, using an aligned allocation function under the hood instead of good old malloc. Overloading operator new/delete is a topic of its own, but it isn't as difficult as it might seem at first (though your example is not enough), and you can read about it in this excellent FAQ question.

    Unfortunately you have to do this for each type that has any member needing non-standard alignment, in your case both Sphere and Vector3. But what you can do to make it a bit easier is make an empty base class with proper overloads of those operators and then just derive all necessary classes from this base class (a minimal sketch follows this list).

    What people sometimes tend to forget is that the standard allocator std::allocator uses the global operator new for all memory allocation, so your types won't work with standard containers (and a std::vector<Vector3> isn't that rare a use case). What you need to do is make your own standard-conformant allocator and use it. But for convenience and safety it is actually better to just specialize std::allocator for your type (maybe deriving it from your custom allocator) so that it is always used and you don't need to worry about picking the proper allocator each time you use a std::vector. Unfortunately in this case you have to specialize it for each aligned type, but a small evil macro helps with that.

    Additionally you have to look out for other things that use the global operator new/delete instead of your custom one, like std::get_temporary_buffer and std::return_temporary_buffer, and take care of those if necessary.
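
As a rough sketch of that base-class approach (not the original poster's code; the class name and the choice of _mm_malloc/_mm_free from the intrinsics headers are illustrative):

// aligned_new.h -- minimal sketch, assuming plain 16-byte SSE alignment
#include <xmmintrin.h>   // __m128, _mm_malloc, _mm_free
#include <new>           // std::bad_alloc
#include <cstddef>       // std::size_t

struct Aligned16          // empty base class providing 16-byte-aligned new/delete
{
    static void* operator new(std::size_t size)
    {
        if (void* p = _mm_malloc(size, 16))
            return p;
        throw std::bad_alloc();
    }
    static void operator delete(void* p) { _mm_free(p); }

    // array forms forward to the aligned versions above
    static void* operator new[](std::size_t size) { return operator new(size); }
    static void operator delete[](void* p) { operator delete(p); }
};

// every type containing an SSE member then just derives from it:
struct Vector3 : Aligned16 { __m128 data; };
struct Sphere  : Aligned16 { Vector3 center; float radius; };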

Unfortunately there isn't a much better approach to these problems yet, I think, unless you are on a platform that natively aligns to 16 bytes and you know about it. Alternatively you could just overload the global operator new/delete to always align each memory block to 16 bytes and be freed from caring about the alignment of each and every class containing an SSE member, but I don't know the implications of that approach. In the worst case it should just result in wasted memory, but then again you usually don't allocate many small objects dynamically in C++ (though std::list and std::map might think differently about this).

So to sum up:

  • Take care of proper alignment of static memory using things like __declspec(align(16)), but only if it is not already taken care of by a member, which is usually the case.

  • Overload operator new/delete for each and every type having a member with non-standard alignment requirements.

  • Make a custom standard-conformant allocator to use in standard containers of aligned types, or better yet, specialize std::allocator for each and every aligned type.
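
For the custom-allocator option, a minimal C++11-style sketch might look like this (the name aligned_allocator is made up; std::allocator_traits supplies the members not written out here):

#include <xmmintrin.h>   // _mm_malloc, _mm_free
#include <cstddef>       // std::size_t
#include <new>           // std::bad_alloc

template <typename T>
struct aligned_allocator
{
    typedef T value_type;

    aligned_allocator() {}
    template <typename U>
    aligned_allocator(const aligned_allocator<U>&) {}

    T* allocate(std::size_t n)
    {
        if (void* p = _mm_malloc(n * sizeof(T), 16))   // 16-byte-aligned blocks
            return static_cast<T*>(p);
        throw std::bad_alloc();
    }
    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

template <typename T, typename U>
bool operator==(const aligned_allocator<T>&, const aligned_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const aligned_allocator<T>&, const aligned_allocator<U>&) { return false; }

// usage: std::vector<Vector3, aligned_allocator<Vector3> > spheres;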


Finally, some general advice. Often you only profit from SSE in computation-heavy blocks that perform many vector operations. To sidestep all these alignment problems, especially the problem of caring about the alignment of each and every type containing a Vector3, it might be a good approach to make a special SSE vector type and use it only inside lengthy computations, with a normal non-SSE vector for storage and member variables.
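
To illustrate that last suggestion, a hedged sketch with made-up names: a plain struct for storage, plus conversions used only around computation-heavy loops:

#include <xmmintrin.h>

struct Vec3 { float x, y, z; };              // plain storage type: no alignment fuss

inline __m128 load_vec3(const Vec3& v)       // convert on entry to a hot loop
{
    return _mm_set_ps(0.0f, v.z, v.y, v.x);  // lanes: [x, y, z, unused]
}

inline Vec3 store_vec3(__m128 v)             // convert back on exit
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);                   // unaligned store, so tmp needs no alignment
    Vec3 r = { tmp[0], tmp[1], tmp[2] };
    return r;
}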

Alignment and SSE strange behaviour

TL;DR: Loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands for other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for the specifically-aligned load/store instructions like vmovdqa.


In the legacy SSE encoding of vector instructions (like pxor xmm0, [src1]), unaligned 128-bit memory operands will fault, except with the special unaligned load/store instructions (like movdqu / movups).

The VEX encoding of vector instructions (like vpxor xmm1, xmm0, [src1]) doesn't fault on unaligned memory, except with the alignment-required load/store instructions (like vmovdqa or vmovntdq).


The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a stand-alone load instruction (or anything at all if it already has the data in a register, just like when dereferencing a scalar pointer).

The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that doesn't introduce the risk of a fault. This is advantageous for code-density reasons, and it also means fewer uops for parts of the CPU to track, thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at -O0, so an unoptimized build of your code probably would have faulted with unaligned src1.

(Conversely, this means _mm_loadu_* can only fold into a memory operand with AVX, not with SSE. So even on CPUs where movdqu is as fast as movdqa when the pointer does happen to be aligned, _mm_loadu can hurt performance, because movdqu xmm1, [rsi] / pxor xmm0, xmm1 is 2 fused-domain uops for the front-end to issue while pxor xmm0, [rsi] is only 1, and doesn't need a scratch register. See also Micro fusion and addressing modes.)

The interpretation of the as-if rule in this case is that it's OK for the program not to fault in some cases where the naive translation into asm would have faulted (or for the same code to fault in an un-optimized build but not in an optimized build).

This is opposite from the rules for floating-point exceptions, where the compiler-generated code must still raise any and all exceptions that would have occurred on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.


Note that since stores can't fold into memory operands for ALU instructions, store (not storeu) intrinsics will compile into code that faults with unaligned pointers even when compiling for an AVX target.


To be specific: consider this code fragment:

// aligned version:
__m128i y = ...;                      // assume it's already in xmm1
__m128i x = _mm_load_si128(Aptr);     // Aptr is an aligned __m128i*
__m128i res = _mm_or_si128(y, x);

// unaligned version: the same thing with _mm_loadu_si128(Uptr)

When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version has to use movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might do that too, if the old value of y is still needed after the OR.

When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128-bit ones) will use the VEX encoding. So you get different asm from the same _mm_... intrinsics. Both versions can compile into vpor xmm0, xmm1, [ptr]. (And the 3-operand non-destructive feature means this actually happens, except when the original value that was loaded is used multiple times.)

Only one operand to ALU instructions can be a memory operand, so in your case one has to be loaded separately. Your code faults when the first pointer isn't aligned, but doesn't care about alignment for the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than vice-versa.

You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in al and then tests it, instead of just branching on the flags from vptest :( clang-3.8 does a significantly better job.

.L36:
        add     rdi, 32
        add     rsi, 32
        cmp     rdi, rcx
        je      .L9
.L10:
        vmovdqa xmm0, XMMWORD PTR [rdi]          # first arg: loads that will fault on unaligned
        xor     eax, eax
        vpxor   xmm1, xmm0, XMMWORD PTR [rsi]    # second arg: loads that don't care about alignment
        vmovdqa xmm0, XMMWORD PTR [rdi+16]       # first arg
        vpxor   xmm0, xmm0, XMMWORD PTR [rsi+16] # second arg
        vpor    xmm0, xmm1, xmm0
        vptest  xmm0, xmm0
        sete    al                               # generate a boolean in a reg
        test    eax, eax
        jne     .L36                             # then test&branch on it. /facepalm

Note that your is_equal is essentially memcmp. I think glibc's memcmp will do better than your implementation in many cases, since it has hand-written asm versions for SSE4.1 and other ISA levels which handle various cases of the buffers being misaligned relative to each other (e.g. one aligned, one not). Note that glibc's code is LGPLed, so you might not be able to just copy it. If your use-case mostly involves smaller buffers that are typically aligned, your implementation is probably fine. Not needing a VZEROUPPER before calling it from other AVX code is also nice.

The compiler-generated byte loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter that you re-compare some bytes you've already checked.
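
A hedged sketch of that overlapping-final-vector cleanup (the function name and the SSE4.1 PTEST at the end are illustrative choices, not the original is_equal):

#include <immintrin.h>
#include <cstddef>

// assumes size >= 16; whole 16-byte chunks plus one overlapping tail load
static bool is_equal_sse(const char* a, const char* b, std::size_t size)
{
    __m128i diff = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 16 <= size; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        diff = _mm_or_si128(diff, _mm_xor_si128(va, vb));   // accumulate differences
    }
    if (i < size) {   // overlapping final vector instead of a byte loop
        __m128i va = _mm_loadu_si128((const __m128i*)(a + size - 16));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + size - 16));
        diff = _mm_or_si128(diff, _mm_xor_si128(va, vb));
    }
    return _mm_testz_si128(diff, diff) != 0;   // SSE4.1 ptest: all-zero => equal
}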

Anyway, definitely benchmark your code with the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it can inline code for.

Why does Clang complain about alignment on SSE intrinsic unaligned loads

alignof(__m128i) == 16. That cast happens before the __m128i* is passed as an argument to _mm_loadu_si128, which casts it again internally; the __m128i* itself is never actually dereferenced.

As @chtz points out, you could work around this for clang by casting to __m128i_u const * instead. GCC/clang define those types with __attribute__((may_alias,aligned(1),vector_size(16))), unlike the standard __m128i type, which doesn't override the alignment requirement. But I don't think MSVC defines a __m128i_u, so that wouldn't be portable.
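
A minimal sketch of that workaround, assuming GCC/clang headers that provide __m128i_u (the function and parameter names are made up):

#include <emmintrin.h>
#include <cstdint>

__m128i load_16_bytes(const std::uint8_t* buf)
{
    // the aligned(1) vector type keeps clang's alignment warning quiet,
    // and _mm_loadu_si128 doesn't require an aligned pointer anyway
    return _mm_loadu_si128(reinterpret_cast<const __m128i_u*>(buf));
}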


You're right that there is no actual problem here, just an artifact of Intel's poor design for their intrinsics API, where even the unaligned-load intrinsics take a pointer type that wouldn't be safe to dereference on its own. (For AVX-512, the new intrinsics take void* instead, which also avoids the need for stupid casting, but they didn't retroactively change the old intrinsics to take void*.)

If clang's warning checker followed the chain of usages of that pointer value, it would see that it's not dereferenced. But it doesn't do that, instead it warns you on the spot about having created a pointer that might not be safe to deref. That's normally not something you want to do, but as I said you're forced to do it by Intel's clunky API.

Related: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? discusses the behaviour that compilers must define as part of supporting the intrinsics API, including creating misaligned pointers. It's ISO C UB to simply create a misaligned int*, even without dereferencing it, but obviously the intrinsics API requires you to create misaligned __m128i* pointers to use loadu / storeu. (And potentially a misaligned float* to use _mm_loadu_ps on bytes that weren't a valid aligned float object; but the intrinsic doesn't deref the float*, instead it casts to __m128_u*.)

How to align 16-bit ints for use with SSE intrinsics

SSE needs data to be aligned on a 16-byte boundary, not 16 bits; that's your problem.

What you're looking for to align your static arrays is compiler-dependent.

If you're using MSVC, you'll have to use __declspec(align(16)); with GCC, this would be __attribute__((aligned(16))).
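
For example (a sketch only; alignas(16) would be the portable C++11 spelling if it's available to you):

#ifdef _MSC_VER
__declspec(align(16)) short data[128];           // MSVC spelling
#else
short data[128] __attribute__((aligned(16)));    // GCC / clang spelling
#endif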

Is there a way to force visual studio to generate aligned instructions from SSE intrinsics?

MSVC and ICC only use instructions that do alignment checking when they fold a load into a memory source operand without AVX enabled, like addps xmm0, [rax]. SSE memory source operands require alignment, unlike AVX. But you can't reliably control when this happens, and in debug builds it generally doesn't.

As Mysticial points out in Visual Studio 2017: _mm_load_ps often compiled to movups, another case where an alignment-checking instruction is used is NT load/store, because there is no unaligned version of those.


If your code is compatible with clang-cl, have Visual Studio use it instead of MSVC. It's a modified version of clang that tries to act more like MSVC. But like GCC, clang uses aligned load and store instructions for aligned intrinsics.

Either disable optimization, or make sure AVX is not enabled; otherwise it could fold a _mm_load_ps into a memory source operand like vaddps xmm0, [rax], which doesn't require alignment because it's the AVX version. This may be a problem if your code also uses AVX intrinsics in the same file, because clang requires that you enable ISA extensions for intrinsics you want to use: the compiler won't emit asm instructions for an extension that isn't enabled, even via intrinsics, unlike MSVC and ICC.

A debug build should work even with AVX enabled, especially if you _mm_load_ps or _mm256_load_ps into a separate variable in a separate statement, not v = _mm_add_ps(v, _mm_load_ps(ptr));
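
Illustrative only, with made-up names: loading into its own variable keeps the aligned load as a stand-alone movaps/vmovaps that can actually fault in a debug build:

#include <xmmintrin.h>

__m128 add_from(__m128 v, const float* ptr)
{
    __m128 tmp = _mm_load_ps(ptr);   // stand-alone aligned load: faults if ptr is misaligned
    return _mm_add_ps(v, tmp);       // instead of letting the load fold into the add
}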


With MSVC itself, for debugging purposes only (usually very big speed penalty for stores), you could substitute normal loads/stores with NT. Since they're special, the compiler won't fold loads into memory source operands for ALU instructions, so this can maybe work even with AVX with optimization enabled.

// alignment_debug.h      (untested)
// #include this *after* immintrin.h
#ifdef DEBUG_SIMD_ALIGNMENT
#warn "using slow alignment-debug SIMD instructions to work around MSVC/ICC limitations"
// SSE4.1 MOVNTDQA doesn't do anything special on normal WB memory, only WC
// On WB, it's just a slower MOVDQA, wasting an ALU uop.
#define _mm_load_si128 _mm_stream_load_si128
#define _mm_load_ps(ptr) _mm_castsi128_ps(_mm_stream_load_si128((const __m128i*)(ptr)))
#define _mm_load_pd(ptr) _mm_castsi128_pd(_mm_stream_load_si128((const __m128i*)(ptr)))

// SSE1/2 MOVNTPS / PD / MOVNTDQ evict data from cache if it was hot, and bypass cache
#define _mm_store_ps _mm_stream_ps // SSE1 movntps
#define _mm_store_pd _mm_stream_pd // SSE2 movntpd is a waste of space vs. the ps encoding, but whatever
#define _mm_store_si128 _mm_stream_si128 // SSE2 movntdq

// and repeat for _mm256_... versions with _mm256_castsi256_ps
// and _mm512_... versions
// edit welcome if anyone tests this and adds those versions
#endif

Related: for auto-vectorization with MSVC (and gcc/clang), see Alex's answer on Alignment attribute to force aligned load/store in auto-vectorization of GCC/CLang

Choice between aligned vs. unaligned x86 SIMD instructions

  • Unaligned access: Only movups/vmovups can be used. The same penalties discussed in the aligned access case (see next) apply here too. In addition, accesses that cross a cache line or virtual page boundary always incur penalty on all processors.
  • Aligned access:

    • On Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later: after predecoding, they are executed in exactly the same way for the same operands. This includes support for move elimination. For the fetch and predecode stages, they consume exactly the same resources for the same operands.
    • On pre-Nehalem, Bonnell, and pre-Bulldozer: they get decoded into different fused-domain and unfused-domain uops. movups/vmovups consumes more resources (up to twice as many) in the frontend and the backend of the pipeline. In other words, movups/vmovups can be up to twice as slow as movaps/vmovaps in terms of latency and/or throughput.

Therefore, if you don't care about the older microarchitectures, the two are technically equivalent. Although if you know or expect your data to be aligned, you should use the aligned instructions, to ensure that the data is indeed aligned without having to add explicit checks in the code.
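
To illustrate the trade-off (a sketch with made-up names, not from the original answer): the aligned load acts as a free alignment check when it's emitted as a stand-alone movaps/vmovaps, and the assert is the explicit check you would otherwise write by hand:

#include <xmmintrin.h>
#include <cassert>
#include <cstdint>

__m128 load_checked(const float* p)
{
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);  // the explicit check
    return _mm_load_ps(p);   // movaps / vmovaps: faults anyway if p is misaligned
}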

SSE alignment of 3D vector

Actually we use SIMD at work, and maybe I can give you my feedback on it.
Alignment is something you have to take care of when dealing with SIMD; this is to ensure cache-line alignment.
However I am not sure whether it will still cause a crash when the data is not aligned, or whether the CPU is able to manage anyway (like unaligned scalar types in the old days: it used to crash, now the CPU handles it but it slows performance down).
Maybe you can look here: SSE, intrinsics, and alignment.
It seems to have good answers for the alignment part of the question.

As for the fact that you are using it as a 3D vector even though it's physically a 4D vector: that's not really good practice, because you don't get the full performance benefit of the SIMD instructions. The best way to make the data fit is to use a Structure of Arrays (SoA) layout.

Note: I am assuming 128-bit SIMD registers holding 4 scalar values (int or float).

For example, if you have 4 3D points (or vectors), following your way you end up with 4 4D vectors, ignoring the 4th component of each point.
In total you use 4 * 4 = 16 values, of which only 12 are meaningful.

By using SoA, you will store the same points in 3 128-bit SIMD registers (12 values), laid out in the following way:

  • r1: x x x x
  • r2: y y y y
  • r3: z z z z

This way you fill the entire SIMD registers and thus get the maximum benefit from SIMD. The other thing is that many of the calculations you will have to make (for example, adding 2 groups of 4 vectors) will take only 3 SIMD instructions, as in the sketch below. It's a bit tricky to use and understand, but when you do, the gain is great.
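
A hedged sketch of that SoA addition (the struct and function names are made up for the example):

#include <xmmintrin.h>

struct Vec3x4 {            // structure of arrays: 4 points per component register
    __m128 x, y, z;
};

inline Vec3x4 add(const Vec3x4& a, const Vec3x4& b)
{
    Vec3x4 r;
    r.x = _mm_add_ps(a.x, b.x);   // one addps per component:
    r.y = _mm_add_ps(a.y, b.y);   // 3 instructions add 4 whole 3D vectors
    r.z = _mm_add_ps(a.z, b.z);
    return r;
}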

Of course you won't be able to use it this way in all cases so you will fall back to the original solution of ignoring the last value.

G++ SSE memory alignment on the stack

The simplest way is std::aligned_storage, which takes alignment as a second parameter.

If you don't have it yet, you might want to check Boost's version.

Then you can build your union:

union vector {
    __m128 simd;
    std::aligned_storage<16, 16>::type alignment_only;   // note: ::type is the actual storage (needs <type_traits>)
};

Finally, if it does not work, you can always create your own little class:

#include <cstddef>   // std::size_t
#include <cstdint>   // std::uintptr_t

template <typename Type, std::size_t Align> // Align must be a power of 2
class RawStorage
{
public:
    Type* operator->() {
        return reinterpret_cast<Type*>(aligned());
    }

    Type const* operator->() const {
        return reinterpret_cast<Type const*>(aligned());
    }

    Type& operator*() { return *(operator->()); }
    Type const& operator*() const { return *(operator->()); }

private:
    unsigned char* aligned() {
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(data);
        std::uintptr_t a = (p + Align - 1) & ~(Align - 1);  // round up to the next multiple of Align
        return data + (a - p);
    }

    unsigned char const* aligned() const {
        return const_cast<RawStorage*>(this)->aligned();
    }

    unsigned char data[sizeof(Type) + Align - 1];
};

It will allocate a bit more storage than necessary, but this way alignment is guaranteed.

int main(int argc, char* argv[])
{
    RawStorage<__m128, 16> simd;
    *simd = /* ... */;

    return 0;
}

With luck, the compiler might be able to optimize away the pointer-alignment arithmetic if it can prove that the storage is already aligned correctly.


