Emulating shifts on 32 bytes with AVX

From various answers, I gathered the following solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8.

_mm256_slli_si256(A, N):

  0 < N < 16:  _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)
  N = 16:      _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))
  16 < N < 32: _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), N - 16)

_mm256_srli_si256(A, N):

  0 < N < 16:  _mm256_alignr_epi8(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), A, N)
  N = 16:      _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1))
  16 < N < 32: _mm256_srli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), N - 16)
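For convenience, these cases can be wrapped into small compile-time helpers. This is only a sketch of how the table above could be packaged (the names mm256_slli_si256_emu / mm256_srli_si256_emu are made up here, and it assumes C++17 for if constexpr):

#include <immintrin.h>

template <int N>
__m256i mm256_slli_si256_emu(__m256i A)
{
    static_assert(0 <= N && N <= 32, "byte shift must be in [0, 32]");
    // low lane of A moved into the high lane, low lane zeroed
    __m256i lo = _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0));
    if constexpr      (N == 0)  return A;
    else if constexpr (N < 16)  return _mm256_alignr_epi8(A, lo, 16 - N);
    else if constexpr (N == 16) return lo;
    else if constexpr (N < 32)  return _mm256_slli_si256(lo, N - 16);
    else                        return _mm256_setzero_si256();
}

template <int N>
__m256i mm256_srli_si256_emu(__m256i A)
{
    static_assert(0 <= N && N <= 32, "byte shift must be in [0, 32]");
    // high lane of A moved into the low lane, high lane zeroed
    __m256i hi = _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1));
    if constexpr      (N == 0)  return A;
    else if constexpr (N < 16)  return _mm256_alignr_epi8(hi, A, N);
    else if constexpr (N == 16) return hi;
    else if constexpr (N < 32)  return _mm256_srli_si256(hi, N - 16);
    else                        return _mm256_setzero_si256();
}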

Emulating shifts on 64 bytes with AVX-512

Here is a working solution using a temporary array:

__m512i _mm512_srli_si512(__m512i a, size_t imm8)
{
    // set up temporary array and set upper half to zero
    // (this needs to happen outside any critical loop)
    alignas(64) char temp[128];
    _mm512_store_si512(temp + 64, _mm512_setzero_si512());

    // store input into lower half
    _mm512_store_si512(temp, a);

    // load shifted register
    return _mm512_loadu_si512(temp + imm8);
}

__m512i _mm512_slli_si512(__m512i a, size_t imm8)
{
    // set up temporary array and set lower half to zero
    // (this needs to happen outside any critical loop)
    alignas(64) char temp[128];
    _mm512_store_si512(temp, _mm512_setzero_si512());

    // store input into upper half
    _mm512_store_si512(temp + 64, a);

    // load shifted register
    return _mm512_loadu_si512(temp + (64 - imm8));
}

This also works if imm8 is not known at compile time, but it does not do any out-of-bounds checks.
You could actually use a 3*64-byte temporary and share it between the left- and right-shift methods (and both would then work for negative inputs as well).

Of course, if you share a temporary outside the function body, you must make sure that it is not accessed by multiple threads at once.
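A rough sketch of what that shared 3*64-byte variant could look like (the function name mm512_shift_bytes and the sign convention are made up here; positive counts shift right, negative counts shift left, |count| <= 64, still without any bounds checks):

#include <immintrin.h>
#include <stddef.h>

static inline __m512i mm512_shift_bytes(__m512i a, ptrdiff_t count)
{
    // layout: 64 zero bytes | a | 64 zero bytes
    // (the two zeroing stores could be hoisted out of a hot loop, as noted above)
    alignas(64) char temp[3 * 64];
    _mm512_store_si512(temp, _mm512_setzero_si512());
    _mm512_store_si512(temp + 64, a);
    _mm512_store_si512(temp + 128, _mm512_setzero_si512());

    // count > 0 shifts towards lower byte indices (like _mm512_srli_si512 above),
    // count < 0 shifts towards higher byte indices (like _mm512_slli_si512 above)
    return _mm512_loadu_si512(temp + 64 + count);
}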

Godbolt-Link with usage demonstration: https://godbolt.org/z/LSgeWZ


As Peter noted, this store/reload trick will cause a store-forwarding stall on all CPUs with AVX-512. The most efficient forwarding case (~6 cycle latency) only works when all the load bytes come from one store. If the load goes outside the most recent store that overlaps it at all, it has extra latency (~16 cycles) to scan the store buffer and, if needed, merge in bytes from L1d cache. See "Can modern x86 implementations store-forward from more than one prior store?" and Agner Fog's microarch guide for more details. This extra-scanning process can probably happen for multiple loads in parallel, and at least doesn't stall other things (like normal store-forwarding or the rest of the pipeline), so it may not be a throughput problem.

If you want many shift offsets of the same data, one store and multiple reloads at different alignments should be good.
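For example, a sketch of that pattern (the function name and the particular offsets are just examples):

#include <immintrin.h>

static inline void three_right_shifts(__m512i a, __m512i out[3])
{
    // one 64-byte store of the data plus 64 bytes of zeros to shift in
    alignas(64) char buf[2 * 64];
    _mm512_store_si512(buf + 64, _mm512_setzero_si512());
    _mm512_store_si512(buf, a);

    // several reloads at different offsets reuse the same store
    out[0] = _mm512_loadu_si512(buf + 3);   // a shifted right by 3 bytes
    out[1] = _mm512_loadu_si512(buf + 17);  // a shifted right by 17 bytes
    out[2] = _mm512_loadu_si512(buf + 40);  // a shifted right by 40 bytes
}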

But if latency is your primary issue, you should try a solution based on valignd (also, if you want to shift by a multiple of 4 bytes, that is obviously the easier solution). Or, for constant shift counts, a vector control for vpermw could work.


For completeness, here is a version based on valignd and valignr that works for shifts from 0 to 64 bytes, known at compile time (it uses C++17, but you can easily avoid the if constexpr; it is only there because of the static_assert). Instead of shifting in zeros you can pass a second register (i.e., it behaves the way valignr would behave if it aligned across lanes).

template<int N>
__m512i shift_right(__m512i a, __m512i carry = _mm512_setzero_si512())
{
    static_assert(0 <= N && N <= 64);
    if constexpr (N == 0) return a;
    if constexpr (N == 64) return carry;
    if constexpr (N % 4 == 0) return _mm512_alignr_epi32(carry, a, N / 4);
    else
    {
        __m512i a0 = shift_right<(N/16 + 1)*16>(a, carry); // 16, 32, 48, 64
        __m512i a1 = shift_right<(N/16    )*16>(a, carry); //  0, 16, 32, 48
        return _mm512_alignr_epi8(a0, a1, N % 16);
    }
}

template<int N>
__m512i shift_left(__m512i a, __m512i carry = _mm512_setzero_si512())
{
    return shift_right<64-N>(carry, a);
}

Here is a godbolt-link with some example assembly as well as output for every possible shift_right operation: https://godbolt.org/z/xmKJvA

GCC faithfully translates this into valignd and valignr instructions, but may emit an unnecessary vpxor instruction (e.g. in the shiftleft_49 example). Clang does some crazy substitutions (not sure whether they actually make a difference, though).

The code could be extended to shift an arbitrary sequence of registers (always carrying bytes from the previous register).
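One possible shape of that extension, sketched here on top of the shift_right template above (the function name and the array-based interface are made up; the K registers are treated as one little-endian 64*K-byte value):

#include <immintrin.h>
#include <cstddef>

template <int N, std::size_t K>
void shift_right_sequence(const __m512i (&in)[K], __m512i (&out)[K])
{
    // each output register takes its high bytes from the next register in the sequence
    for (std::size_t i = 0; i + 1 < K; ++i)
        out[i] = shift_right<N>(in[i], in[i + 1]);
    // the last register shifts in zeros
    out[K - 1] = shift_right<N>(in[K - 1]);
}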

8 bit shift operation in AVX2 with shifting in zeros

Okay, I implemented a function that can shift left by up to 16 bytes.

template <unsigned int N> __m256i _mm256_shift_left(__m256i a)
{
    __m256i mask = _mm256_srli_si256(
        _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(0,0,3,0)),
        16 - N);
    return _mm256_or_si256(_mm256_slli_si256(a, N), mask);
}

Example:

int main(int argc, char* argv[]) {
    __m256i reg = _mm256_set_epi8(32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,
                                  16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1);

    __m256i result = _mm256_shift_left<1>(reg);
    for(int i = 0; i < 32; i++)
        printf("%2d ", ((unsigned char *)&result)[i]);
    printf("\n");
}

The output is

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Edit: new version using the alignr instruction.
Thanks for the hint, @Evgeny Kluev.

template <unsigned int N> __m256i _mm256_shift_left(__m256i a)
{
    __m256i mask = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(0,0,3,0));
    return _mm256_alignr_epi8(a, mask, 16 - N);
}
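For symmetry, here is a sketch of the corresponding right shift by 0 < N < 16 bytes, using the same permute2x128 + alignr pattern (the name _mm256_shift_right is made up to match the naming above):

template <unsigned int N> __m256i _mm256_shift_right(__m256i a)
{
    // high lane of a moved into the low lane, high lane zeroed
    __m256i mask = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(2,0,0,1));
    return _mm256_alignr_epi8(mask, a, N);
}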

AVX2 VPSHUFB emulation in AVX

As @MaratDukhan has noticed, _mm256_shuffle_epi8 (i.e. VPSHUFB for ymm registers) does not perform a full 32-byte shuffle. In my opinion, that is quite a pity...

That's why, in order to emulate it without AVX2, you can simply split each register into two halves, shuffle each half, then combine them together:

//AVX only
__m256i _emu_mm256_shuffle_epi8(__m256i reg, __m256i shuf) {
    __m128i reg0 = _mm256_castsi256_si128(reg);
    __m128i reg1 = _mm256_extractf128_si256(reg, 1);
    __m128i shuf0 = _mm256_castsi256_si128(shuf);
    __m128i shuf1 = _mm256_extractf128_si256(shuf, 1);
    __m128i res0 = _mm_shuffle_epi8(reg0, shuf0);
    __m128i res1 = _mm_shuffle_epi8(reg1, shuf1);
    __m256i res = _mm256_setr_m128i(res0, res1);
    return res;
}

If you really want to fully shuffle the 32-byte register, you can follow the approach from this paper: shuffle each half with each half, then blend the results together. Without AVX2 it would look something like this:

//AVX only
__m256i _emu_mm256_shuffle32_epi8(__m256i reg, __m256i shuf) {
    __m128i reg0 = _mm256_castsi256_si128(reg);
    __m128i reg1 = _mm256_extractf128_si256(reg, 1);
    __m128i shuf0 = _mm256_castsi256_si128(shuf);
    __m128i shuf1 = _mm256_extractf128_si256(shuf, 1);
    __m128i res00 = _mm_shuffle_epi8(reg0, shuf0);
    __m128i res01 = _mm_shuffle_epi8(reg0, shuf1);
    __m128i res10 = _mm_shuffle_epi8(reg1, shuf0);
    __m128i res11 = _mm_shuffle_epi8(reg1, shuf1);
    __m128i res0 = _mm_blendv_epi8(res10, res00, _mm_cmplt_epi8(shuf0, _mm_set1_epi8(16)));
    __m128i res1 = _mm_blendv_epi8(res11, res01, _mm_cmplt_epi8(shuf1, _mm_set1_epi8(16)));
    __m256i res = _mm256_setr_m128i(res0, res1);
    return res;
}

If you know for sure that only the lower half of reg is used, then you can remove the lines for reg1, res10, and res11, and drop the comparison and blending. Indeed, if you have no AVX2, it might be more efficient to stick with SSE and 128-bit registers altogether.

The general 32-byte shuffling can be significantly optimized with AVX2:

//Uses AVX2
__m256i _ext_mm256_shuffle32_epi8(__m256i reg, __m256i shuf) {
    __m256i regAll0 = _mm256_permute2x128_si256(reg, reg, 0x00);
    __m256i regAll1 = _mm256_permute2x128_si256(reg, reg, 0x11);
    __m256i resR0 = _mm256_shuffle_epi8(regAll0, shuf);
    __m256i resR1 = _mm256_shuffle_epi8(regAll1, shuf);
    __m256i res = _mm256_blendv_epi8(resR1, resR0, _mm256_cmpgt_epi8(_mm256_set1_epi8(16), shuf));
    return res;
}

Beware: code not tested!
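For a quick sanity check, something like the following could be used (AVX2 assumed; a reversing index vector applied to the bytes 0..31 should print 31 down to 0 for both _emu_mm256_shuffle32_epi8 and _ext_mm256_shuffle32_epi8):

#include <immintrin.h>
#include <stdio.h>

int main() {
    char src[32], idx[32], out[32];
    for (int i = 0; i < 32; i++) { src[i] = (char)i; idx[i] = (char)(31 - i); }

    __m256i reg  = _mm256_loadu_si256((const __m256i*)src);
    __m256i shuf = _mm256_loadu_si256((const __m256i*)idx);
    __m256i res  = _ext_mm256_shuffle32_epi8(reg, shuf);   // or _emu_mm256_shuffle32_epi8

    _mm256_storeu_si256((__m256i*)out, res);
    for (int i = 0; i < 32; i++)
        printf("%d ", out[i]);
    printf("\n");
}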

Fastest way to unpack 32 bits to a 32 byte SIMD vector

To "broadcast" the 32 bits of a 32-bit integer x to 32 bytes of a 256-bit YMM register z or 16 bytes of a two 128-bit XMM registers z_low and z_high you can do the following.

With AVX2:

__m256i y = _mm256_set1_epi32(x);
__m256i z = _mm256_shuffle_epi8(y,mask1);
z = _mm256_and_si256(z,mask2);

Without AVX2 it's best to do this with SSE:

__m128i y = _mm_set1_epi32(x);      
__m128i z_low = _mm_shuffle_epi8(y,mask_low);
__m128i z_high = _mm_shuffle_epi8(y,mask_high);
z_low = _mm_and_si128(z_low ,mask2);
z_high = _mm_and_si128(z_high,mask2);

The masks and a working example are shown below. If you plan to do this several times you should probably
define the masks outside of the main loop.

#include <immintrin.h>
#include <stdio.h>

int main() {
    int x = 0x87654321;

    static const char mask1a[32] = {
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x01, 0x01, 0x01, 0x01,
        0x01, 0x01, 0x01, 0x01,
        0x02, 0x02, 0x02, 0x02,
        0x02, 0x02, 0x02, 0x02,
        0x03, 0x03, 0x03, 0x03,
        0x03, 0x03, 0x03, 0x03
    };

    static const char mask2a[32] = {
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
    };

    char out[32];

#if defined ( __AVX2__ )
    __m256i mask2 = _mm256_loadu_si256((__m256i*)mask2a);
    __m256i mask1 = _mm256_loadu_si256((__m256i*)mask1a);

    __m256i y = _mm256_set1_epi32(x);
    __m256i z = _mm256_shuffle_epi8(y, mask1);
    z = _mm256_and_si256(z, mask2);

    _mm256_storeu_si256((__m256i*)out, z);
#else
    __m128i mask2     = _mm_loadu_si128((__m128i*)mask2a);
    __m128i mask_low  = _mm_loadu_si128((__m128i*)&mask1a[ 0]);
    __m128i mask_high = _mm_loadu_si128((__m128i*)&mask1a[16]);

    __m128i y = _mm_set1_epi32(x);
    __m128i z_low  = _mm_shuffle_epi8(y, mask_low);
    __m128i z_high = _mm_shuffle_epi8(y, mask_high);
    z_low  = _mm_and_si128(z_low, mask2);
    z_high = _mm_and_si128(z_high, mask2);

    _mm_storeu_si128((__m128i*)&out[ 0], z_low);
    _mm_storeu_si128((__m128i*)&out[16], z_high);
#endif

    for(int i = 0; i < 8; i++) {
        for(int j = 0; j < 4; j++) {
            printf("%x ", out[4*i+j]);
        }
        printf("\n");
    }
    printf("\n");
}

To get 0 or -1 in each vector element:

It takes one extra step: _mm256_cmpeq_epi8 against all-zeros. Any non-zero byte turns into 0, and zero turns into -1. Since we don't want this inversion, the code below uses andnot instead of and, which inverts its first operand.

__m256i expand_bits_to_bytes(uint32_t x)
{
    __m256i xbcast = _mm256_set1_epi32(x);  // we only use the low 32 bits of each lane, but this is fine with AVX2

    // Each byte gets the source byte containing the corresponding bit
    __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);

    __m256i andmask = _mm256_set1_epi64x(0x8040201008040201);  // every 8 bits -> 8 bytes, pattern repeats
    __m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);

    // this is the extra step: compare each byte == 0 to produce 0 or -1
    return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
    // alternative: compare against the AND mask to get 0 or -1,
    // avoiding the need for a vector zero constant.
}

See it on the Godbolt Compiler Explorer.

Also see "Is there an inverse instruction to the movemask instruction in intel avx2?" for other element sizes.
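As an illustration of one such other element size, here is a sketch (names are mine, AVX2 assumed) that turns the low 8 bits of an integer into eight 32-bit elements of 0 or -1:

#include <immintrin.h>
#include <stdint.h>

__m256i expand_bits_to_dwords(uint32_t mask8)
{
    __m256i bcast = _mm256_set1_epi32((int)mask8);
    __m256i bits  = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);  // one selector bit per element
    __m256i sel   = _mm256_and_si256(bcast, bits);
    return _mm256_cmpeq_epi32(sel, bits);  // -1 where the corresponding bit was set, 0 otherwise
}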

AVX divide __m256i packed 32-bit integers by two (no AVX2)

Assuming you know what you’re doing, here’s that function.

inline __m256i div2_epi32( __m256i vec )
{
    // Split the 32-byte vector into 16-byte ones
    __m128i low = _mm256_castsi256_si128( vec );
    __m128i high = _mm256_extractf128_si256( vec, 1 );
    // Shift the lanes within each piece; replace with _mm_srli_epi32 for unsigned version
    low = _mm_srai_epi32( low, 1 );
    high = _mm_srai_epi32( high, 1 );
    // Combine back into 32-byte vector
    vec = _mm256_castsi128_si256( low );
    return _mm256_insertf128_si256( vec, high, 1 );
}

However, doing that is not necessarily faster than dealing with 16-byte vectors. On most CPUs the performance of these insert/extract instructions ain't great, except maybe on AMD Zen 1 CPUs.
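If the data ultimately lives in memory anyway, a sketch of the 16-byte alternative could look like this (the function name and the assumption that the length is a multiple of 4 are mine; the same caveat applies, since the arithmetic shift rounds negative odd values toward negative infinity rather than toward zero):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void div2_epi32_array( int32_t* p, size_t n )
{
    // n assumed to be a multiple of 4; plain SSE2, no 256-bit insert/extract needed
    for( size_t i = 0; i < n; i += 4 )
    {
        __m128i v = _mm_loadu_si128( (const __m128i*)( p + i ) );
        v = _mm_srai_epi32( v, 1 );  // use _mm_srli_epi32 for the unsigned version
        _mm_storeu_si128( (__m128i*)( p + i ), v );
    }
}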

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128

1. That's not an oversight. That instruction indeed shifts by bytes, i.e. by multiples of 8 bits.

2. It doesn't matter: _mm_slli_si128 and _mm_bslli_si128 are equivalent; both compile into the pslldq SSE2 instruction.

As for the emulation, I'd do it like this, assuming you have C++17. If you're writing C++14, replace if constexpr with a normal if, and add a message to the static_assert.

template<int i>
inline __m128i shiftLeftBits( __m128i vec )
{
    static_assert( i >= 0 && i < 128 );
    // Handle a couple of trivial cases
    if constexpr( 0 == i )
        return vec;
    if constexpr( 0 == ( i % 8 ) )
        return _mm_slli_si128( vec, i / 8 );

    if constexpr( i > 64 )
    {
        // Shifting by more than 8 bytes, the lowest half will be all zeros
        vec = _mm_slli_si128( vec, 8 );
        return _mm_slli_epi64( vec, i - 64 );
    }
    else
    {
        // Shifting by less than 8 bytes.
        // Need to propagate a few bits across 64-bit lanes.
        __m128i low = _mm_slli_si128( vec, 8 );
        __m128i high = _mm_slli_epi64( vec, i );
        low = _mm_srli_epi64( low, 64 - i );
        return _mm_or_si128( low, high );
    }
}
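A small usage sketch (test values are mine): shifting a vector holding the 128-bit value 1 left by 65 bits should leave bit 1 of the upper 64-bit half set.

#include <immintrin.h>
#include <stdio.h>

int main()
{
    __m128i v = _mm_set_epi64x( 0, 1 );  // 128-bit value 1
    __m128i r = shiftLeftBits<65>( v );
    alignas(16) unsigned long long out[2];
    _mm_store_si128( (__m128i*)out, r );
    // expect: 0000000000000002 0000000000000000
    printf( "%016llx %016llx\n", out[1], out[0] );
}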

