Using a Union (Encapsulated in a Struct) to Bypass Conversions for Neon Data Types

Since the initially proposed method has undefined behaviour in C++, I have implemented something like this:

#include <cstring>                     // memcpy
#include <boost/static_assert.hpp>     // BOOST_STATIC_ASSERT_MSG

template <typename T>
struct NeonVectorType {

private:
    T data;

public:
    // Read the stored value back as any NEON type of the same size.
    template <typename U>
    operator U () {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),
            "Trying to convert to data type of different size");
        U u;
        memcpy( &u, &data, sizeof u );
        return u;
    }

    // Store any NEON type of the same size.
    template <typename U>
    NeonVectorType<T>& operator =(const U& in) {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),
            "Trying to copy from data type of different size");
        memcpy( &data, &in, sizeof data );
        return *this;
    }

};

Then:

typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
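As a quick illustration, here is a minimal usage sketch under these typedefs (example is a hypothetical function name, not part of the original code): the same 128-bit variable is written as one NEON type and read back as another, with the memcpy inside NeonVectorType doing the reinterpretation instead of a pointer cast.

#include <arm_neon.h>

void example(const uint8_t* in, uint8_t* out) {
    uint_128bit_t x;            // one 128-bit value, several views
    x = vld1q_u8(in);           // stored through operator= as uint8x16_t
    uint32x4_t w = x;           // read back through operator U as uint32x4_t
    w = vshrq_n_u32(w, 8);      // work on the 32-bit lanes
    x = w;                      // store the uint32x4_t result again
    uint8x16_t bytes = x;       // and read it back as bytes
    vst1q_u8(out, bytes);
}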

The use of memcpy is discussed here (and here); it avoids breaking the strict aliasing rule, and in general the call gets optimized away.

If you look at the edit history, you will see that I had originally implemented a custom version with combine operators for vectors of vectors (e.g. uint8x8x2_t). The problem with that approach was mentioned here. However, since those data types are declared as arrays (see the guide, section 12.2.2) and are therefore located in consecutive memory locations, the compiler is bound to handle the memcpy correctly.
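As a small check of that claim (a minimal sketch; the assertion below is my addition, not part of the original code), the array-based type occupies exactly as many bytes as the plain 128-bit vector type, so the size guard in NeonVectorType accepts it:

#include <arm_neon.h>
#include <boost/static_assert.hpp>

// uint8x8x2_t holds an array of two uint8x8_t (val[2]), i.e. 16 contiguous bytes,
// which is the same size as uint8x16_t.
BOOST_STATIC_ASSERT_MSG(sizeof(uint8x8x2_t) == sizeof(uint8x16_t),
                        "Both types occupy 128 bits");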

Finally, to print the contents of the variable one could use a function like this.
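For example, a minimal sketch of such a helper (hypothetical, not the one linked above; it relies on the byte view that the character-type exception of the aliasing rule always permits):

#include <cstdio>
#include <cstring>
#include <arm_neon.h>

template <typename T>
void print_bytes(const NeonVectorType<T>& v) {
    unsigned char bytes[sizeof(T)];
    memcpy(bytes, &v, sizeof bytes);          // byte-wise view is always permitted
    for (unsigned i = 0; i < sizeof bytes; ++i)
        printf("%u ", (unsigned) bytes[i]);
    printf("\n");
}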

ARM Neon in C: How to combine different 128bit data types while using intrinsics?

For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.

In some situations, you might want to treat a vector as having a
different type, without changing its value. A set of intrinsics is
provided to perform this type of conversion.

So, assuming a and b are declared as:

uint8x16_t a, b;

Your point 4 can be written as (*):

b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8) );

However, note that unfortunately this does not address data types using an array of vector types; see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?



(*) It should be said that this is much more cumbersome than the equivalent (in this specific context) SSE code, since SSE has only one 128-bit integer data type (namely __m128i):

__m128i b = _mm_srli_si128(a,1);

ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?

Based on your comments, it seems you want to perform a bona fide conversion -- that is, to produce a distinct, new, separate value of a different type. This is a very different thing than a reinterpretation, such as the lead-in to your question suggests you wanted. In particular, you posit variables declared like this:

uint8x16_t  a;
uint8x8x2_t b;

// code to set the value of a ...

and you want to know how to set the value of b so that it is in some sense equivalent to the value of a.

Speaking to the C language:

The strict aliasing rule (C2011 6.5/7) says,

An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:

  • a type compatible with the effective type of the object, [...]
  • an aggregate or union type that includes one of the aforementioned types among its members [...], or
  • a character type.

(Emphasis added. Other enumerated options involve differently-qualified and differently-signed versions of the effective type of the object or of compatible types; these are not relevant here.)

Note that these provisions never interfere with accessing a's value, including the member value, via variable a, and similarly for b. But don't overlook the usage of the term "effective type" -- this is where things can get bollixed up under slightly different circumstances. More on that later.

Using a union

C certainly permits you to perform a conversion via an intermediate union, or you could rely on b being a union member in the first place so as to remove the "intermediate" part:

union {
    uint8x16_t  x1;
    uint8x8x2_t x2;
} temp;
temp.x1 = a;
b = temp.x2;

Using a typecast pointer (to produce UB)

However, although it's not so uncommon to see it, C does not permit you to type-pun via a pointer:

// UNDEFINED BEHAVIOR - strict-aliasing violation
b = *(uint8x8x2_t *)&a;
// DON'T DO THAT

There, you are accessing the value of a, whose effective type is uint8x16_t, via an lvalue of type uint8x8x2_t. Note that it is not the cast that is forbidden, nor even, I'd argue, the dereferencing -- it is reading the dereferenced value so as to apply the side effect of the = operator.

Using memcpy()

Now, what about memcpy()? This is where it gets interesting. C permits the stored values of a and b to be accessed via lvalues of character type, and although its arguments are declared to have type void *, this is the only plausible interpretation of how memcpy() works. Certainly its description characterizes it as copying characters. There is therefore nothing wrong with performing a

memcpy(&b, &a, sizeof a);

Having done so, you may freely access the value of b via variable b, as already mentioned. There are aspects of doing so that could be problematic in a more general context, but there's no UB here.

However, contrast this with the superficially similar situation in which you want to put the converted value into dynamically-allocated space:

uint8x8x2_t *c = malloc(sizeof(*c));
memcpy(c, &a, sizeof a);

What could be wrong with that? Nothing is wrong with it, as far as it goes, but here you have UB if you afterward try to access the value of *c. Why? Because the memory to which c points does not have a declared type, so its effective type is the effective type of whatever was last stored in it (if that has an effective type), including if that value was copied into it via memcpy() (C2011 6.5/6). As a result, the object to which c points has effective type uint8x16_t after the copy, whereas the expression *c has type uint8x8x2_t; the strict aliasing rule says that accessing that object via that lvalue produces UB.
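A minimal sketch of one way to avoid that trap (convert_to_heap is a hypothetical helper of mine, not from the question): copy from an object whose declared type is already uint8x8x2_t, so that the allocated storage picks up that effective type.

#include <stdlib.h>
#include <string.h>
#include <arm_neon.h>

uint8x8x2_t *convert_to_heap(uint8x16_t a) {
    uint8x8x2_t b;
    memcpy(&b, &a, sizeof a);               /* fine: b has a declared type        */

    uint8x8x2_t *c = (uint8x8x2_t *) malloc(sizeof *c);
    if (c)
        memcpy(c, &b, sizeof *c);           /* *c's effective type: uint8x8x2_t   */
    return c;                               /* reading *c later is well defined   */
}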

Translating SSE to Neon: How to pack and then extract 32bit result

I found this excellent guide.
Working from that, it seems that my operation could be done with one VTBL instruction (a look-up table), but I will implement it with 2 deinterleaving operations because for the moment that looks simpler.

uint8x8x2_t   vuzp_u8(uint8x8_t a, uint8x8_t b);

So something like:

uint8x16_t a;
uint8_t* out;
[...]

//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0

a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0

a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0

vst1q_lane_u32(out,a,0);

The last line does not give a warning when using __attribute__((optimize("lax-vector-conversions"))).

But, because of the data conversion, the 2 assignments are not possible. One workaround is like this (Edit: this breaks the strict aliasing rules! The compiler could assume that a does not change when *d is written to.):

uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);

I have implemented a more general workaround through a flexible data type:

NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);

Edit:

Here is the version with a shuffle mask/look-up table. It does indeed make my inner loop a little bit faster. Again, I have used the data type described here.

static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);

Why does shift right in practice shift left (and vice versa) in Neon and SSE?

You say "is this because of endianess" but it's more a case of type abuse. You're making assumptions about the bit ordering of the machine across byte/word boundaries and your non-byte instructions that impose local endianess on an operation (you're using an _u32 instruction which expects values that are unsigned 32 bit values, not arrays of 8 bit values).

As you say, you are trying to shift a series of unsigned char values by asking the machine to shift values in 32-bit units.

Unfortunately, you are going to need to put them in architecture order if you want to be able to do an architecture shift on them.

Otherwise you may want to look for a blit or move instruction, but you can't artificially coerce machine types into machine registers without paying architectural costs. Endianness will be just one of your headaches (alignment, padding, etc.).

--- Late Edit ---

Fundamentally, you are confusing byte and bit shifts; we consider the most significant bits to be "left":

bit number
87654321

hex
8421
00008421

00000001 = 0x01 (small, less significant)
10000000 = 0x80 (large, more significant)

But the values you are shifting are 32-bit words; on a little-endian machine that means each subsequent address holds a more significant byte of the value. For a 32-bit word:

bit numbers
1111111111111111
87654321fedcba0987654321fedcba09

To represent the 32-bit value 0x0001

                1111111111111111
87654321fedcba0987654321fedcba09

00000001000000000000000000000000

To shift it left by 2 positions

00000001000000000000000000000000
v<
00000100000000000000000000000000

To shift it left by another 8 positions we have to wrap it to the next address:

00000100000000000000000000000000
>>>>>>>v
00000000000001000000000000000000

This looks like a right shift if you are thinking in bytes. But we told this little-endian CPU that we were working on a uint32, so that means:

                1111111111111111
87654321fedcba0987654321fedcba09
word01 word02 word03 word04
00000001000000000000000000000000 = 0x0001
00000100000000000000000000000000 = 0x0004
00000000000001000000000000000000 = 0x0400

The problem is that this is a different order from the one you expect for a local array of 8-bit values, but you told the CPU the values were _u32, so it used its native endianness for the operation.
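A minimal sketch that shows this effect (it assumes a little-endian ARM/AArch64 target; the byte values are just illustrative): a right shift of each u32 lane moves the non-zero bytes toward lower addresses, which looks like a "left" move when the result is read back byte by byte.

#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    uint8_t in[16]  = {0, 138, 0, 0,  0, 140, 0, 0,  0, 146, 0, 0,  0, 147, 0, 0};
    uint8_t out[16];

    uint8x16_t bytes = vld1q_u8(in);
    uint32x4_t words = vreinterpretq_u32_u8(bytes);  // same bits, viewed as u32 lanes
    words = vshrq_n_u32(words, 8);                   // shift each 32-bit value right
    vst1q_u8(out, vreinterpretq_u8_u32(words));

    for (int i = 0; i < 16; ++i)
        printf("%d ", out[i]);                       // 138 0 0 0 140 0 0 0 146 ...
    printf("\n");
    return 0;
}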

Using pointer conversions to store/cast values: Am I breaking the strict aliasing rule?

*((U*) &data) will violate strict aliasing if this is a reinterpret_cast and the type U is not permitted to alias the type T. The permitted types appear in this list.

The rule refers to both reading and writing.

Here is a good article that explains some of the rationale behind the rules.

As noted in the main strict aliasing thread, you can use memcpy as a workaround, for example:

U u;
memcpy( &u, &data, sizeof u );
return u;

and in the other function

memcpy( &data, &in, sizeof data );

Note that raw byte copies of class types are subject to some restrictions (I think the classes have to be POD, and you'd better be sure they have the same layout).
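A minimal sketch of how those restrictions could be made explicit at compile time (this assumes C++11; bit_copy is a hypothetical helper name, and the trait check is my addition to the answer's suggestion):

#include <cstring>
#include <type_traits>

template <typename U, typename T>
U bit_copy(const T& data) {
    static_assert(sizeof(U) == sizeof(T),
                  "Trying to convert to data type of different size");
    static_assert(std::is_trivially_copyable<T>::value &&
                  std::is_trivially_copyable<U>::value,
                  "memcpy-based conversion requires trivially copyable types");
    U u;
    std::memcpy(&u, &data, sizeof u);
    return u;
}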


