How to Implement "_Mm_Storeu_Epi64" Without Aliasing Problems

How to implement _mm_storeu_epi64 without aliasing problems?

SSE intrinsics is one of those niche corner cases where you have to push the rules a bit.

Since these intrinsics are compiler extensions (somewhat standardized by Intel), they are already outside the specification of the C and C++ language standards. So it's somewhat self-defeating to try to be "standard compliant" while using a feature that clearly is not.

Despite the fact that the SSE intrinsic libraries try to act like normal 3rd party libraries, underneath, they are all specially handled by the compiler.


The Intent:

The SSE intrinsics were likely designed from the beginning to allow aliasing between the vector and scalar types - since a vector really is just an aggregate of the scalar type.

But whoever designed the SSE intrinsics probably wasn't a language pedant.
(That's not too surprising. Hard-core low-level performance programmers and language lawyering enthusiasts tend to be very different groups of people who don't always get along.)

We can see evidence of this in the load/store intrinsics:

  • __m128i _mm_stream_load_si128(__m128i* mem_addr) - A load intrinsic that takes a non-const pointer?
  • void _mm_storeu_pd(double* mem_addr, __m128d a) - What if I want to store to __m128i*?

The strict aliasing problems are a direct result of these poor prototypes.

Starting from AVX512, the intrinsics have all been converted to void* to address this problem:

  • __m512d _mm512_load_pd(void const* mem_addr)
  • void _mm512_store_epi64 (void* mem_addr, __m512i a)

Compiler Specifics:

  • Visual Studio defines each of the SSE/AVX types as a union of the scalar types. This by itself allows strict-aliasing. Furthermore, Visual Studio doesn't do strict-aliasing so the point is moot:

  • The Intel Compiler has never failed me with all sorts of aliasing. It probably doesn't do strict-aliasing either - though I've never found any reliable source for this.

  • GCC does do strict-aliasing, but from my experience, not across function boundaries. It has never failed me to cast pointers which are passed in (on any type). GCC also declares SSE types as __may_alias__ thereby explicitly allowing it to alias other types.


My Recommendation:

  • For function parameters that are of the wrong pointer type, just cast it.
  • For variables declared and aliased on the stack, use a union. That union will already be aligned so you can read/write to them directly without intrinsics. (But be aware of store-forwarding issues that come with interleaving vector/scalar accesses.)
  • If you need to access a vector both as a whole and by its scalar components, consider using insert/extract intrinsics instead of aliasing.
  • When using GCC, turn on -Wall or -Wstrict-aliasing. It will tell you about strict-aliasing violations.

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128

1 that’s not an oversight. That instruction indeed shifts by bytes, i.e. multiples of 8 bits.

2 doesn’t matter, _mm_slli_si128 and _mm_bslli_si128 are equivalents, both compile into pslldq SSE2 instruction.

As for the emulation, I’d do it like that, assuming you have C++/17. If you’re writing C++/14, replace if constexpr with normal if, also add a message to the static_assert.

template<int i>
inline __m128i shiftLeftBits( __m128i vec )
{
static_assert( i >= 0 && i < 128 );
// Handle couple trivial cases
if constexpr( 0 == i )
return vec;
if constexpr( 0 == ( i % 8 ) )
return _mm_slli_si128( vec, i / 8 );

if constexpr( i > 64 )
{
// Shifting by more than 8 bytes, the lowest half will be all zeros
vec = _mm_slli_si128( vec, 8 );
return _mm_slli_epi64( vec, i - 64 );
}
else
{
// Shifting by less than 8 bytes.
// Need to propagate a few bits across 64-bit lanes.
__m128i low = _mm_slli_si128( vec, 8 );
__m128i high = _mm_slli_epi64( vec, i );
low = _mm_srli_epi64( low, 64 - i );
return _mm_or_si128( low, high );
}
}


Related Topics



Leave a reply



Submit