Is `reinterpret_cast`ing Between a Hardware SIMD Vector Pointer and the Corresponding Type Undefined Behavior?

Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

ISO C++ doesn't define __m256, so we need to look at what does define their behaviour on the implementations that support them.

Intel's intrinsics define vector-pointers like __m256* as being allowed to alias anything else, the same way ISO C++ defines char* as being allowed to alias.

So yes, it's safe to dereference a __m256* instead of using a _mm256_load_ps() aligned-load intrinsic.

But especially for float/double, it's often easier to use the intrinsics because they take care of casting from float*, too. For integers, the AVX-512 load/store intrinsics are defined as taking void*, but before that you need an extra (__m256i*) cast, which is just a lot of clutter.
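For example, a minimal sketch of the same point using SSE (so it compiles without extra target flags; the AVX versions work identically, and the helper names are my own):

```cpp
#include <emmintrin.h>
#include <cassert>

// The ps load intrinsic takes float* directly:
inline __m128 load_ps_way(const float* p) { return _mm_load_ps(p); }

// Dereferencing a vector pointer is equivalent, because __m128 is a
// may_alias type (p must be 16-byte aligned either way):
inline __m128 deref_way(const float* p) { return *(const __m128*)p; }

// Integer loads (pre-AVX-512) need the extra (__m128i*) cast clutter:
inline __m128i int_load(const int* p) {
    return _mm_load_si128((const __m128i*)p);
}
```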


In gcc, this is implemented by defining __m256 with a may_alias attribute: from gcc7.3's avxintrin.h (one of the headers that <immintrin.h> includes):

/* The Intel API is flexible enough that we must allow aliasing with other
   vector types, and their scalar components.  */
typedef float __m256 __attribute__ ((__vector_size__ (32),
                                     __may_alias__));
typedef long long __m256i __attribute__ ((__vector_size__ (32),
                                          __may_alias__));
typedef double __m256d __attribute__ ((__vector_size__ (32),
                                       __may_alias__));

/* Unaligned version of the same types.  */
typedef float __m256_u __attribute__ ((__vector_size__ (32),
                                       __may_alias__,
                                       __aligned__ (1)));
typedef long long __m256i_u __attribute__ ((__vector_size__ (32),
                                            __may_alias__,
                                            __aligned__ (1)));
typedef double __m256d_u __attribute__ ((__vector_size__ (32),
                                         __may_alias__,
                                         __aligned__ (1)));

(In case you were wondering, this is why dereferencing a __m256* is like _mm256_store_ps, not storeu.)

GNU C native vectors without may_alias are allowed to alias their scalar type, e.g. even without the may_alias, you could safely cast between float* and a hypothetical v8sf type. But may_alias makes it safe to load from an array of int[], char[], or whatever.
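A hedged sketch of what may_alias buys you (helper name is hypothetical; SSE used so it compiles everywhere):

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cstdint>

// Load 16 bytes from an int32_t array through a vector pointer. Legal
// because __m128i (like __m256i) is a may_alias type; the reverse
// direction (pointing int32_t* into a __m128i object) is not.
inline __m128i load_from_ints(const int32_t* p) {  // p must be 16-byte aligned
    return _mm_load_si128((const __m128i*)p);
}
```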

I'm talking about how GCC implements Intel's intrinsics only because that's what I'm familiar with. I've heard from gcc developers that they chose that implementation because it was required for compatibility with Intel.



Other behaviour Intel's intrinsics require to be defined

Using Intel's API for _mm_storeu_si128( (__m128i*)&arr[i], vec); requires you to create potentially-unaligned pointers which would fault if you dereferenced them. And _mm_storeu_ps to a location that isn't 4-byte aligned requires creating an under-aligned float*.
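A sketch of that pattern (helper name is my own): the (__m128i*)&arr[i] cast may produce a misaligned pointer, which Intel's API requires the implementation to tolerate as long as only the *u* intrinsics dereference it.

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Store a vector of 8 shorts at an arbitrary (possibly odd) element offset.
inline void store8(int16_t* arr, size_t i, __m128i vec) {
    _mm_storeu_si128((__m128i*)&arr[i], vec);  // unaligned store of 8 shorts
}
```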

Just creating unaligned pointers, or pointers outside an object, is UB in ISO C++, even if you don't dereference them. I guess this allows implementations on exotic hardware which do some kinds of checks on pointers when creating them (possibly instead of when dereferencing), or maybe which can't store the low bits of pointers. (I have no idea if any specific hardware exists where more efficient code is possible because of this UB.)

But implementations which support Intel's intrinsics must define the behaviour, at least for the __m* types and float*/double*. This is trivial for compilers targeting any normal modern CPU, including x86 with a flat memory model (no segmentation); pointers in asm are just integers kept in the same registers as data. (m68k has address vs. data registers, but it never faults from keeping bit-patterns that aren't valid addresses in A registers, as long as you don't deref them.)



Going the other way: element access of a vector.

Note that may_alias, like the char* aliasing rule, only goes one way: it is not guaranteed to be safe to use int32_t* to read a __m256. It might not even be safe to use float* to read a __m256. Just like it's not safe to do char buf[1024]; int *p = (int*)buf;.

See GCC AVX _m256i cast to int array leads to wrong values for a real-world example of GCC breaking code that points an int* into a __m256i vec; object. (Not a dereferenced __m256i*; that would be safe if the only __m256i accesses were via __m256i*.) Because __m256i is a may_alias type, the compiler can't infer that the underlying object is a __m256i; that's the whole point, and why it's safe to point a __m256i* at an int arr[] or whatever.

Reading/writing through a char* can alias anything, but when you have a char object, strict-aliasing does make it UB to read it through other types. (I'm not sure if the major implementations on x86 do define that behaviour, but you don't need to rely on it because they optimize away memcpy of 4 bytes into an int32_t. You can and should use memcpy to express an unaligned load from a char[] buffer, because auto-vectorization with a wider type is allowed to assume 2-byte alignment for int16_t*, and make code that fails if it's not: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?)
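The memcpy idiom mentioned above, as a minimal sketch (helper name is mine):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The standard-blessed unaligned load from a byte buffer: mainstream
// compilers optimize this memcpy into a single load instruction.
inline int32_t load_i32(const char* p) {
    int32_t x;
    std::memcpy(&x, p, sizeof(x));
    return x;
}
```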

A char arr[] may not be a great analogy because arr[i] is defined in terms of *(arr+i), so there actually is a char* deref involved in accessing the array as char objects. Perhaps some char members of a struct would be a better example, then.


To insert/extract vector elements, use shuffle intrinsics, SSE2 _mm_insert_epi16 / _mm_extract_epi16, or SSE4.1 _mm_insert_epi8/32/64 / _mm_extract_epi8/32/64. For float, there are no insert/extract intrinsics that you should use with scalar float.

Or store to an array and read the array. (print a __m128i variable). This does actually optimize away to vector extract instructions.
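A sketch of the store-and-index approach (helper name is mine):

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cstdint>

// Element access by bouncing through an array; compilers typically turn
// this into a shuffle/extract instead of an actual store and reload.
inline int32_t get_element(__m128i v, int i) {
    alignas(16) int32_t tmp[4];
    _mm_store_si128((__m128i*)tmp, v);
    return tmp[i];
}
```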

GNU C vector syntax provides the [] operator for vectors, like __m256 v = ...; v[3] = 1.25;. MSVC defines vector types as a union with a .m128_f32[] member for per-element access.

There are wrapper libraries like Agner Fog's (GPL licensed) Vector Class Library which provide portable operator[] overloads for their vector types, and operator + / - / * / << and so on. It's quite nice, especially for integer types, where having different types for different element widths makes v1 + v2 work with the right size. (GNU C native vector syntax does that for float/double vectors, and defines __m128i as a vector of signed int64_t, but MSVC doesn't provide operators on the base __m128 types.)


You can also use union type-punning between a vector and an array of some type, which is safe in ISO C99, and in GNU C++, but not in ISO C++. I think it's officially safe in MSVC, too, because of the way they define __m128 as a normal union.
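A minimal sketch of that union punning (the union type and helper are my own; this relies on the C99/GNU C++/MSVC guarantee, not ISO C++):

```cpp
#include <emmintrin.h>
#include <cassert>

// Union punning between a vector and its elements.
union V4 {
    __m128 v;
    float f[4];
};

inline float third_element(__m128 x) {
    V4 u;
    u.v = x;
    return u.f[2];  // read a different union member than was written
}
```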

There's no guarantee you'll get efficient code from any of these element-access methods, though. Don't use them inside inner loops, and have a look at the resulting asm if performance matters.

Is casting to simd-type undefined behaviour in C++?

Intel's intrinsics API does define the behaviour of casting to __m128* and dereferencing: it's identical to _mm_load_ps on the same pointer.

For float* and double*, the load/store intrinsics basically exist to wrap this reinterpret cast and communicate alignment info to the compiler.

If _mm_load_ps() is supported, the implementation must also define the behaviour of the code in the question.


I don't know if this is actually documented anywhere; maybe in an Intel tutorial or whitepaper, but it's the agreed-upon behaviour of all compilers and I think most people would agree that a compiler that didn't define this behaviour didn't fully support Intel's intrinsics API.

__m128 types are defined as may_alias (footnote 1), so like char* you can point a __m128* at anything, including int[] or an arbitrary struct, and load or store through it without violating strict-aliasing. (As long as it's aligned by 16, otherwise you do need _mm_loadu_ps, or a custom vector type declared with something like GNU C's aligned(1) attribute).


Footnote 1: __attribute__((vector_size(16), may_alias)) in GNU C, and MSVC doesn't do type-based alias analysis.

Is reinterpret_cast safe or undefined on sse/avx types?

No, it's not portable and the behavior is undefined: __m128 is for float and __m128i is for integer types; these are not compatible types.

In fact, it doesn't even compile in MSVC 2017:

error C2440: 'reinterpret_cast': cannot convert from '__m128' to '__m128i'

Use the cast intrinsic:

__m128 a = something;
__m128i b = _mm_castps_si128(a);
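A quick sketch showing that the cast intrinsics are pure reinterpretation (zero asm instructions, bit pattern unchanged); the helper name is mine:

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cstdint>

// Get the bit pattern of a float via a cast intrinsic.
inline int32_t float_bits(float f) {
    __m128 v = _mm_set_ss(f);          // f in the low element
    __m128i iv = _mm_castps_si128(v);  // reinterpret, don't convert
    return _mm_cvtsi128_si32(iv);      // low 32 bits as an integer
}
```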

GCC misoptimises SSE function

The problem is that you're using short* to access the elements of a __m128i* object. That violates the strict-aliasing rule. It's only safe to go the other way, using __m128i* dereference or more normally _mm_load_si128( (const __m128i*)ptr ).

__m128i* is exactly like char* - you can point it at anything, but not vice versa: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?


The only standard-blessed way to do type punning is with memcpy:

memcpy(v00, lows, its * sizeof(short));
memcpy(v10, highs, its * sizeof(short));
memcpy(reinterpret_cast<short*>(v00) + its, lows + its - 1, sizeof(short));
memcpy(reinterpret_cast<short*>(v10) + its, highs + its - 1, sizeof(short));

https://godbolt.org/z/f63q7x

I prefer just using aligned memory of the correct type directly:

alignas(16) short v00[16];
alignas(16) short v10[16];
auto mv00 = reinterpret_cast<__m128i*>(v00);
auto mv10 = reinterpret_cast<__m128i*>(v10);
_mm_store_si128(mv00, _mm_setzero_si128());
_mm_store_si128(mv10, _mm_setzero_si128());
_mm_store_si128(mv00 + 1, _mm_setzero_si128());
_mm_store_si128(mv10 + 1, _mm_setzero_si128());

for (int i = 0; i < its; ++i) {
    v00[i] = lows[i];
    v10[i] = highs[i];
}

v00[its] = v00[its - 1];
v10[its] = v10[its - 1];
https://godbolt.org/z/bfanne

I'm not positive that this setup is actually standard-blessed (it definitely is for _mm_load_ps since you can do it without type punning at all) but it does seem to also fix the issue. I'd guess that any reasonable implementation of the load/store intrinsics is going to have to provide the same sort of aliasing guarantees that memcpy does, since it's more or less the kosher way to go from straight-line to vectorized code in x86.

As you mentioned in your question, you can also force the alignment with a union, and I've used that too in pre-C++11 contexts. Even in that case though, I still personally always write the loads and stores explicitly (even if they're just going to/from aligned memory) because issues like this tend to pop up if you don't.

Cast array of wrapper structs to SIMD vector

This is fully safe

You're not directly dereffing the float*, only passing it to _mm256_load_ps which does an aliasing-safe load. In terms of language-lawyering, you can look at _mm256_load_ps / _mm256_store_ps as doing a memcpy (to a private local variable), except it's UB if the pointer isn't 32-byte aligned.

Interconvertibility between Wrapper* and float* isn't really relevant; you're not derefing a float*.

If you'd been using _mm_load_ss(arr) on a buggy GCC version that implements it as _mm_set_ss( *ptr ) instead of using a may_alias typedef for float, then that would matter. (Unfortunately even current GCC still has that bug; _mm_loadu_si32 was fixed in GCC11.3 but not the older _ss and _sd loads.) But that is a compiler bug, IMO. _mm_load_ps is aliasing-safe, so it makes no sense that _mm_load_ss wouldn't be, when they both take float*. If you wanted a load with normal C aliasing/alignment semantics to promise more to the optimizer, you'd just deref yourself, using _mm_set_ss( *foo ).


The exact aliasing semantics of Intel Intrinsics are not AFAIK documented anywhere. A lot of x86-specific code has been developed with MSVC, which doesn't enforce strict aliasing at all, i.e. it's like gcc -fno-strict-aliasing, defining the behaviour of stuff like *(int*)my_float and even encouraging it for type-punning.

Not sure about Intel's compiler historically, but I'm guessing it also didn't do type-based aliasing optimizations, otherwise they hopefully would have defined better intrinsics for movd 32-bit integer loads/stores much earlier than _mm_loadu_si32 in the last few years. You can tell from the void* arg that it's recent: Intel previously did insane stuff like _mm_loadl_epi64(__m128i*) for a movq load, taking a pointer to a 16-byte object but only loading the low 8 bytes (with no alignment requirement).

So a lot of Intel intrinsics stuff seemed pretty casual about C and C++ safety rules, like it was designed by people who thought of C as a portable assembler. Or at least that their intrinsics were supposed to work that way.


As I pointed out in my answer you linked in the question (Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?), Intel's intrinsics API effectively requires compilers to support creating misaligned pointers as long as you don't deref them yourself. Including misaligned float* for _mm_loadu_ps, which supports any alignment, not just multiples of 4.

You could probably argue that supporting Intel's intrinsics API (in a way that's compatible with the examples Intel's published) might not require supporting arbitrary casting between pointer types (without deref), but in practice all x86 compilers do, because they target a flat memory model with byte-addressable memory.

With the existence of intrinsics for gather and scatter, use-cases like using a 0 base with pointer elements for _mm256_i64gather_epi64 (e.g. to walk 4 linked lists in parallel) require that a C++ implementation use a sane object-representation for pointers if they want to support that.

As usual with Intel intrinsics, I don't think there's documentation that 100% nails down proof that it would be safe to use _mm_load_ps on a struct { int a; float b[3]; };, but I think everyone working with intrinsics expects that to be the case. And nobody would want to use a compiler that broke it for cases where memcpy with the same source pointer would be safe.

But in your case, you don't even need to depend on any de-facto guarantees here, beyond the fact that _mm256_load_ps itself is an aliasing-safe load. You've correctly shown that it's 100% safe to create that float* in ISO C, and pass it to an opaque function.


And yes, deref of an __m256* is exactly equivalent to _mm256_load_ps, and is in fact how most compilers implement _mm256_load_ps.

(By comparison, _mm256_loadu_ps would cast to a pointer to a less-aligned 32-byte vector type which isn't part of the documented API, like GCC's __m256_u*. Or maybe pass it to a builtin function. But however the compiler makes it happen, it's equivalent to a memcpy, including the lack of alignment requirement.)

Memory alignment of Armadillo vectors vec/fvec

The Armadillo documentation does not seem to address this point, so it is left unspecified. Thus, vector data is likely not guaranteed to be 32-byte aligned.

However, you do not need vector data to be aligned to load them in AVX registers: you can use the unaligned load intrinsic _mm256_loadu_ps. AFAIK, the performance of _mm256_load_ps and _mm256_loadu_ps is about the same on relatively-new x86 processors.
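A minimal SSE sketch of the same advice (the SSE analog is used so it compiles without -mavx; the helper name is mine): when alignment is unknown, use the unaligned-load intrinsic, which has no alignment requirement at all.

```cpp
#include <emmintrin.h>
#include <cassert>

// Load 4 floats from a pointer of any alignment.
inline __m128 load4(const float* p) {
    return _mm_loadu_ps(p);  // no alignment requirement
}
```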

AVX-512: _mm512_load vs. standard pointer casting?

I am assuming the only difference between the standard pointer casting and the intrinsic is that the intrinsic will immediately load the data onto a 64 byte register while the pointer casting will wait for a further instruction to do so.

Nope, not at all. They're exactly identical, no diff in generated asm. On most compilers, _mm512_load_pd is just a plain inline function that does something like return *(__m512d *) __P; - that's an exact copy-paste from GCC's headers. So a load intrinsic is literally already doing this.

__m512d is not fundamentally different from double or int in terms of how the compiler does register allocation, and decides when to actually load C objects that happen to be in memory. The compiler can fold a load into a later ALU instruction (or optimize it away) regardless of how you write it. (And with AVX-512, may be able to fold _mm512_set1_pd(x) broadcast-loads for instructions with a matching element width.)

The _mm*_load[u]_* intrinsics may look like you're asking for a separate load instruction at that point, but that's not really what happens. That just makes your C look more like asm if you want it to.

Just like memcpy between two int objects can be optimized away or done when it's convenient (as long as the result is as-if it were done in source order), so can store/load intrinsics depending on how you use them. And just like a + operator doesn't have to compile to an add instruction, _mm_add_ps doesn't necessarily have to compile to addps with those exact operands, or to addps at all.

Load/store intrinsics basically exist to communicate alignment guarantees to the compiler (via loadu/storeu), and to take care of types for you (at least for ps and pd load[u]/store[u]; integer still requires casting the pointer). Also for AVX-512, to allow masked loads and masked stores.

Is the code above considered dangerous in any way?

No. Plain dereference is still strict-aliasing safe because __m* types are special. See

Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

VLD2 structure load of a stricter alignment type

The code must be portable and strictly follow the C90 standard.

... plus everything implied by the presence of ARM NEON intrinsics! (Although that may not help a static analyzer). (Related: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? discusses that for x86, but your case is a bit different; your pointers are aligned).


In C, it's safe to cast between pointer types (without dereferencing) as long as you never create a pointer with insufficient alignment for its type. You don't need a compile-time-visible guarantee of alignment, you just need to not ever actually create a uint16_t* that doesn't have alignof(uint16_t) alignment.

(This makes it unlikely for a static analyzer to complain even if that wasn't the case, unless it could see something like (uint16_t*)(1 + (char*)&something_aligned) where you take an aligned address and offset it by an odd number, which would be guaranteed to produce a misaligned address.)
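If you wanted to verify the alignment precondition at runtime before casting, a hypothetical helper (my own, not part of any intrinsics API) could look like this, assuming alignof(uint16_t) == 2 as on mainstream targets:

```cpp
#include <cassert>
#include <cstdint>

// Check a pointer's alignment before casting it to uint16_t*. The rule
// above is that the misaligned uint16_t* must never be created at all,
// so strict code would verify alignment first.
inline bool aligned_for_u16(const void* p) {
    return reinterpret_cast<uintptr_t>(p) % alignof(uint16_t) == 0;
}
```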

And in practice, compilers targeting byte-addressable machines do more or less define the behaviour even for creating misaligned pointers. (For example, Intel intrinsics for unaligned loads depend on creating an unaligned __m128i*.) As long as you don't deref them, which is unsafe even in practice on targets that allow unaligned loads; see my answer on this Q&A for an example and the blog links that cover other examples.

So you're 100% fine: your code never creates a misaligned uint16_t*, and doesn't directly dereference it.

If ARM has unaligned-load intrinsics, it would even be safe to form a misaligned uint16_t* and pass it to the function; the existence/design of the intrinsics API implies that it's safe to use it that way.


Other things that are undefined behaviour but which you aren't doing:

  • It's technically UB to form a pointer that isn't pointing inside an object, or one-past-end, but in practice mainstream implementations allow that as well.

  • It's strict-aliasing UB to dereference a uint16_t* that doesn't point to uint16_t objects. But any dereferencing only happens inside intrinsic "functions", so you don't have to worry about the strict-aliasing rule. (Which may pointer-cast to some special type and deref, or may pass the pointer on to a __builtin_arm_whatever() compiler built-in.)

I assume that ARM load/store intrinsics are defined similar to memcpy, being able to read/write the bytes of any object. So e.g. you could vld2q_u16 on an array of int, double, or char. Intel intrinsics are defined that way (e.g. GCC/clang use __attribute__((may_alias)).) If not, it wouldn't be safe.

And BTW, the char*-can-alias-anything rule only works one way. Yes it's safe to point a char* at a uint16_t, but if you have an actual array of char buf[100], those objects are definitely char objects, and it's UB to access them through a uint16_t*. However, if you only have char*, and only one other pointer-type other than char* is used, then you can look at the memory as having whatever the other type is, and every char* access aliasing that.


