AVX2 What Is the Most Efficient Way to Pack Left Based on a Mask

AVX2 what is the most efficient way to pack left based on a mask?

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)

We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.

We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.

Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).

AMD before Zen 2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)


For integer vectors with 32-bit or wider elements: either 1) use _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)) to get the bitmask.

Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
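
For concreteness, here is a sketch of option 1 for 32-bit integer elements (my addition, not from the original answer, and untested): the same pdep/pext chain as the float version below, with vpermd (_mm256_permutevar8x32_epi32) doing the shuffle.

__m256i compress256_epi32(__m256i src, __m256i compare_mask)
{
    // option 1: movemask_ps on the cast vector gives one bit per dword element
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(compare_mask));
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);   // unpack each bit to a byte
    expanded_mask *= 0xFF;                                          // replicate each bit to fill its byte

    const uint64_t identity_indices = 0x0706050403020100;           // identity shuffle, one index per byte
    uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

    __m256i shufmask = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(wanted_indices));
    return _mm256_permutevar8x32_epi32(src, shufmask);              // vpermd instead of vpermps
}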

For 64-bit integer or double elements, everything still Just Works: the compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
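
For example, a minimal sketch of the double case (my addition, untested), reusing the compress256 function shown further down:

__m256d compress256_pd(__m256d src, __m256d compare_mask)
{
    // movmskps on the cast vector yields pairs of identical bits, one pair per double
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castpd_ps(compare_mask));
    return _mm256_castps_pd(compress256(_mm256_castpd_ps(src), mask));
}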

For 16-bit elements, you might be able to adapt this with 128-bit vectors.

For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.



The algorithm:

Start with a constant of packed 3-bit indices, with each position holding its own index, i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide: 0b111'110'101'...'010'001'000.

Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
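
As a standalone sanity check of that pext step (my addition; assumes gcc/clang with -mbmi2, and binary literals as an extension):

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

int main(void) {
    uint32_t identity = 0b111110101100011010001000;  // [ 7 6 5 4 3 2 1 0 ], 3 bits per group
    uint32_t selector = 0b000000000000000111000111;  // 1-bits over groups 0 and 2
    printf("%o\n", _pext_u32(identity, selector));   // prints 20 (octal), i.e. 0b010'000 = [ ... 2 0 ]
    return 0;
}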

See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.

Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.

By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)

We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.

See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.

#include <stdint.h>
#include <immintrin.h>

// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
    expanded_mask *= 0xFF;  // mask |= mask<<1 | mask<<2 | ... | mask<<7;
    // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

    const uint64_t identity_indices = 0x0706050403020100;  // the identity shuffle for vpermps, packed to one index per byte
    uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

    __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
    __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

    return _mm256_permutevar8x32_ps(src, shufmask);
}

This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).

    # clang 3.7.1 -std=gnu++14 -O3 -march=haswell
    mov       eax, edi                # just to zero extend: goes away when inlining
    movabs    rcx, 72340172838076673  # The constants are hoisted after inlining into a loop
    pdep      rax, rax, rcx           # ABC -> 0000000A0000000B....
    imul      rax, rax, 255           # 0000000A0000000B.. -> AAAAAAAA..
    movabs    rcx, 506097522914230528
    pext      rax, rcx, rax
    vmovq     xmm1, rax
    vpmovzxbd ymm1, xmm1              # 3c latency since this is lane-crossing
    vpermps   ymm0, ymm1, ymm0
    ret

(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)

So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.

This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.

e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.

clang 6 and earlier created a loop-carried dependency through popcnt's false dependency on its output, so those versions bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).

gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
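
In C terms the shift/sub form is just the identity x * 0xFF == (x << 8) - x, so that one line of compress256 could be written as (my addition, equivalent for uint64_t):

expanded_mask = (expanded_mask << 8) - expanded_mask;  // same result as *= 0xFF, without an imul competing for port 1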


Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.

If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.

You can unpack LUT entries with the same integer sequence I used, but @Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).
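
As a hedged sketch of that broadcast / variable-shift unpack (my reconstruction, not @Froglegs's actual code; I'm assuming a LUT whose 32-bit entries hold eight packed 4-bit indices):

__m256i unpack_lut_entry(uint32_t packed_indices /* e.g. lut[mask] */)
{
    __m256i v = _mm256_set1_epi32((int)packed_indices);              // 32-bit broadcast: a free load on Intel if it comes straight from memory
    __m256i counts = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
    v = _mm256_srlv_epi32(v, counts);                                // vpsrlvd: bring each nibble to the bottom of its element
    return _mm256_and_si256(v, _mm256_set1_epi32(7));                // vpand: keep the 3-bit shuffle index
}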

Best way to mask a single bit in AVX2?

Here is an approach using variable shifts (just creating the mask):

__m256i create_mask(unsigned i) {
    __m256i ii = _mm256_set1_epi32(i);
    ii = _mm256_sub_epi32(ii, _mm256_setr_epi32(0, 32, 64, 96, 128, 160, 192, 224));
    __m256i mask = _mm256_sllv_epi32(_mm256_set1_epi32(1), ii);
    return mask;
}

_mm256_sllv_epi32 (vpsllvd) was introduced by AVX2; it shifts each 32-bit element by a variable number of bits. If the (unsigned) shift amount is bigger than 31 (which is also the case for signed negative values), the corresponding result is 0.
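
A small usage sketch (my addition): masking off everything except bit i, or testing that bit.

__m256i isolate_bit(__m256i v, unsigned i) {
    return _mm256_and_si256(v, create_mask(i));      // zero everywhere except (possibly) bit i
}

int test_bit(__m256i v, unsigned i) {
    return !_mm256_testz_si256(v, create_mask(i));   // 1 if bit i of v is set, else 0
}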

Godbolt link with small test code: https://godbolt.org/z/a5xfqTcGs

Efficient sse shuffle mask generation for left-packing byte elements

Assuming:

change1 = _mm_movemask_epi8(bytemask);
offset = popcnt(change1);

On large buffers, using two shuffles and a 1 KiB table is only ~10% slower than using 1 shuffle and a 1 MiB table. My attempts at generating the shuffle mask via prefix sums and bit twiddling are about half the speed of the table-based methods
(solutions using pext/pdep were not explored).

Reducing table size: Use two lookups into a 2 KiB table instead of 1 lookup into a 1 MiB table. Always keep the top-most byte - if that byte is to be discarded then it doesn't matter what byte is at that position (down to 7-bit indices, or a 1 KiB table). Further reduce the possible combinations by manually packing the two bytes in each 16-bit lane (down to a 216-byte table).

The following example strips whitespace from text using SSE4.1. If only SSSE3 is available then blendv can be emulated. The 64-bit halves are re-combined by overlapping writes to memory, but they could be re-combined in the xmm register (as seen in the AVX2 example).
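
For reference, a minimal emulation sketch (my addition): the mask here comes from pcmpeqb, so every byte is 0x00 or 0xFF and blendv can be replaced by and/andnot/or.

static inline __m128i blendv_epi8_emul(__m128i a, __m128i b, __m128i mask)
{
    // requires mask bytes to be 0x00 or 0xFF: returns b where mask is set, a elsewhere
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}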

#include <stdint.h>
#include <smmintrin.h> // SSE4.1

size_t despacer (void* dst_void, void* src_void, size_t length)
{
uint8_t* src = (uint8_t*)src_void;
uint8_t* dst = (uint8_t*)dst_void;

if (length >= 16) {
// table of control characters (space, tab, newline, carriage return)
const __m128i lut_cntrl = _mm_setr_epi8(' ', 0, 0, 0, 0, 0, 0, 0, 0, '\t', '\n', 0, 0, '\r', 0, 0);

// bits[4:0] = index -> ((trit_d * 0) + (trit_c * 9) + (trit_b * 3) + (trit_a * 1))
// bits[15:7] = popcnt
const __m128i sadmask = _mm_set1_epi64x(0x8080898983838181);

// adding 8 to each shuffle index is cheaper than extracting the high qword
const __m128i offset = _mm_cvtsi64_si128(0x0808080808080808);

// shuffle control indices
static const uint64_t table[27] = {
0x0000000000000706, 0x0000000000070600, 0x0000000007060100, 0x0000000000070602,
0x0000000007060200, 0x0000000706020100, 0x0000000007060302, 0x0000000706030200,
0x0000070603020100, 0x0000000000070604, 0x0000000007060400, 0x0000000706040100,
0x0000000007060402, 0x0000000706040200, 0x0000070604020100, 0x0000000706040302,
0x0000070604030200, 0x0007060403020100, 0x0000000007060504, 0x0000000706050400,
0x0000070605040100, 0x0000000706050402, 0x0000070605040200, 0x0007060504020100,
0x0000070605040302, 0x0007060504030200, 0x0706050403020100
};

const uint8_t* end = &src[length & ~15];
do {
__m128i v = _mm_loadu_si128((__m128i*)src);
src += 16;

// detect spaces
__m128i mask = _mm_cmpeq_epi8(_mm_shuffle_epi8(lut_cntrl, v), v);

// shift w/blend: each word now only has 3 states instead of 4
// which reduces the possibilities per qword from 128 to 27
v = _mm_blendv_epi8(v, _mm_srli_epi16(v, 8), mask);

// extract bitfields describing each qword: index, popcnt
__m128i desc = _mm_sad_epu8(_mm_and_si128(mask, sadmask), sadmask);
size_t lo_desc = (size_t)_mm_cvtsi128_si32(desc);
size_t hi_desc = (size_t)_mm_extract_epi16(desc, 4);

// load shuffle control indices from pre-computed table
__m128i lo_shuf = _mm_loadl_epi64((__m128i*)&table[lo_desc & 0x1F]);
__m128i hi_shuf = _mm_or_si128(_mm_loadl_epi64((__m128i*)&table[hi_desc & 0x1F]), offset);

// store an entire qword then advance the pointer by however
// many of those bytes are actually wanted. Any trailing
// garbage will be overwritten by the next store.
// note: little endian byte memory order
_mm_storel_epi64((__m128i*)dst, _mm_shuffle_epi8(v, lo_shuf));
dst += (lo_desc >> 7);
_mm_storel_epi64((__m128i*)dst, _mm_shuffle_epi8(v, hi_shuf));
dst += (hi_desc >> 7);
} while (src != end);
}

// tail loop
length &= 15;
if (length != 0) {
const uint64_t bitmap = 0xFFFFFFFEFFFFC1FF;
do {
uint64_t c = *src++;
*dst = (uint8_t)c;
dst += ((bitmap >> c) & 1) | ((c + 0xC0) >> 8);
} while (--length);
}

// return pointer to the location after the last element in dst
return (size_t)(dst - ((uint8_t*)dst_void));
}

Whether the tail loop should be vectorized or use cmov is left as an exercise for the reader. Writing each byte unconditionally/branchlessly is fast when the input is unpredictable.


Generating the shuffle control mask with AVX2 and an in-register table is only slightly slower than using large precomputed tables.

#include <stdint.h>
#include <immintrin.h>

// probably needs improvement...
size_t despace_avx2_vpermd(const char* src_void, char* dst_void, size_t length)
{
uint8_t* src = (uint8_t*)src_void;
uint8_t* dst = (uint8_t*)dst_void;

const __m256i lut_cntrl2 = _mm256_broadcastsi128_si256(_mm_setr_epi8(' ', 0, 0, 0, 0, 0, 0, 0, 0, '\t', '\n', 0, 0, '\r', 0, 0));
const __m256i permutation_mask = _mm256_set1_epi64x( 0x0020100884828180 );
const __m256i invert_mask = _mm256_set1_epi64x( 0x0020100880808080 );
const __m256i zero = _mm256_setzero_si256();
const __m256i fixup = _mm256_set_epi32(
0x08080808, 0x0F0F0F0F, 0x00000000, 0x07070707,
0x08080808, 0x0F0F0F0F, 0x00000000, 0x07070707
);
const __m256i lut = _mm256_set_epi32(
0x04050607, // 0x03020100', 0x000000'07
0x04050704, // 0x030200'00, 0x0000'0704
0x04060705, // 0x030100'00, 0x0000'0705
0x04070504, // 0x0300'0000, 0x00'070504
0x05060706, // 0x020100'00, 0x0000'0706
0x05070604, // 0x0200'0000, 0x00'070604
0x06070605, // 0x0100'0000, 0x00'070605
0x07060504 // 0x00'000000, 0x'07060504
);

// hi bits are ignored by pshufb, used to reject movement of low qword bytes
const __m256i shuffle_a = _mm256_set_epi8(
0x7F, 0x7E, 0x7D, 0x7C, 0x7B, 0x7A, 0x79, 0x78, 0x07, 0x16, 0x25, 0x34, 0x43, 0x52, 0x61, 0x70,
0x7F, 0x7E, 0x7D, 0x7C, 0x7B, 0x7A, 0x79, 0x78, 0x07, 0x16, 0x25, 0x34, 0x43, 0x52, 0x61, 0x70
);

// broadcast 0x08 then blendd...
const __m256i shuffle_b = _mm256_set_epi32(
0x08080808, 0x08080808, 0x00000000, 0x00000000,
0x08080808, 0x08080808, 0x00000000, 0x00000000
);

for( uint8_t* end = &src[(length & ~31)]; src != end; src += 32){
__m256i r0,r1,r2,r3,r4;
unsigned int s0,s1;

r0 = _mm256_loadu_si256((__m256i *)src); // asrc

// detect spaces
r1 = _mm256_cmpeq_epi8(_mm256_shuffle_epi8(lut_cntrl2, r0), r0);

r2 = _mm256_sad_epu8(zero, r1);
s0 = (unsigned)_mm256_movemask_epi8(r1);
r1 = _mm256_andnot_si256(r1, permutation_mask);

r1 = _mm256_sad_epu8(r1, invert_mask); // index_bitmap[0:5], low32_spaces_count[7:15]

r2 = _mm256_shuffle_epi8(r2, zero);

r2 = _mm256_sub_epi8(shuffle_a, r2); // add space cnt of low qword
s0 = ~s0;

r3 = _mm256_slli_epi64(r1, 29); // move top part of index_bitmap to high dword
r4 = _mm256_srli_epi64(r1, 7); // number of spaces in low dword

r4 = _mm256_shuffle_epi8(r4, shuffle_b);
r1 = _mm256_or_si256(r1, r3);

r1 = _mm256_permutevar8x32_epi32(lut, r1);
s1 = _mm_popcnt_u32(s0);
r4 = _mm256_add_epi8(r4, shuffle_a);
s0 = s0 & 0xFFFF; // isolate low oword

r2 = _mm256_shuffle_epi8(r4, r2);
s0 = _mm_popcnt_u32(s0);

r2 = _mm256_max_epu8(r2, r4); // pin low qword bytes

r1 = _mm256_xor_si256(r1, fixup);

r1 = _mm256_shuffle_epi8(r1, r2); // complete shuffle mask

r0 = _mm256_shuffle_epi8(r0, r1); // despace!

_mm_storeu_si128((__m128i*)dst, _mm256_castsi256_si128(r0));
_mm_storeu_si128((__m128i*)&dst[s0], _mm256_extracti128_si256(r0,1));
dst += s1;
}
// tail loop
length &= 31;
if (length != 0) {
const uint64_t bitmap = 0xFFFFFFFEFFFFC1FF;
do {
uint64_t c = *src++;
*dst = (uint8_t)c;
dst += ((bitmap >> c) & 1) | ((c + 0xC0) >> 8);
} while (--length);
}
return (size_t)(dst - ((uint8_t*)dst_void));
}

For posterity, the 1 KiB version (generating the table is left as an exercise for the reader).

static const uint64_t table[128] __attribute__((aligned(64))) = {
0x0706050403020100, 0x0007060504030201, ..., 0x0605040302010700, 0x0605040302010007
};
const __m128i mask_01 = _mm_set1_epi8( 0x01 );

__m128i vector0 = _mm_loadu_si128((__m128i*)src);
__m128i vector1 = _mm_shuffle_epi32( vector0, 0x0E );

__m128i bytemask0 = _mm_cmpeq_epi8( ???, vector0); // detect bytes to omit

uint32_t bitmask0 = _mm_movemask_epi8(bytemask0) & 0x7F7F;
__m128i hsum = _mm_sad_epu8(_mm_add_epi8(bytemask0, mask_01), _mm_setzero_si128());

vector0 = _mm_shuffle_epi8(vector0, _mm_loadl_epi64((__m128i*) &table[(uint8_t)bitmask0]));
_mm_storel_epi64((__m128i*)dst, vector0);
dst += (uint32_t)_mm_cvtsi128_si32(hsum);

vector1 = _mm_shuffle_epi8(vector1, _mm_loadl_epi64((__m128i*) &table[bitmask0 >> 8]));
_mm_storel_epi64((__m128i*)dst, vector1);
dst += (uint32_t)_mm_cvtsi128_si32(_mm_unpackhi_epi64(hsum, hsum));
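
For the table-generation exercise, here is one possible sketch (my reconstruction from the sample entries above, so treat the layout as an assumption): a clear bit i in the 7-bit mask means "keep byte i", byte 7 is always kept, and the discarded indices are appended as don't-care padding.

static void build_table(uint64_t table[128])
{
    for (unsigned m = 0; m < 128; m++) {
        uint64_t entry = 0;
        int pos = 0;
        for (int i = 0; i < 8; i++)                  // kept bytes first (byte 7 has no mask bit, so it is always kept)
            if (!(m & (1u << i)))
                entry |= (uint64_t)i << (8 * pos++);
        for (int i = 0; i < 7; i++)                  // then the discarded bytes as don't-care padding
            if (m & (1u << i))
                entry |= (uint64_t)i << (8 * pos++);
        table[m] = entry;
    }
}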

https://github.com/InstLatx64/AVX512_VPCOMPRESSB_Emu has some benchmarks.

Compact AVX2 register so selected integers are contiguous according to mask [duplicate]

The first thing to do is find a fast scalar function. Here is a version which does not use a branch.

inline int compact(int *x, int *y, const int n) {
int cnt = 0;
for(int i=0; i<n; i++) {
int cut = x[i]!=0;
y[cnt] = cut*x[i];
cnt += cut;
}
return cnt;
}

The best result with SIMD probably depends on the distribution of zeros. The following code should work well for distributions that are sparse or dense, for example long runs of zeros and non-zeros. If the distribution is more even, I don't know if this code will have any benefit, but it will give the correct result anyway.

Here is an AVX2 version I tested.

int compact_AVX2(int *x, int *y, int n) {
int i =0, cnt = 0;
for(i=0; i<n-8; i+=8) {
__m256i x4 = _mm256_loadu_si256((__m256i*)&x[i]);
__m256i cmp = _mm256_cmpeq_epi32(x4, _mm256_setzero_si256());
int mask = _mm256_movemask_epi8(cmp);
if(mask == -1) continue; //all zeros
if(mask) {
cnt += compact(&x[i],&y[cnt], 8);
}
else {
_mm256_storeu_si256((__m256i*)&y[cnt], x4);
cnt +=8;
}
}
cnt += compact(&x[i], &y[cnt], n-i); // cleanup for n not a multiple of 8
return cnt;
}

Here is the SSE2 version I tested.

int compact_SSE2(int *x, int *y, int n) {
int i =0, cnt = 0;
for(i=0; i<n-4; i+=4) {
__m128i x4 = _mm_loadu_si128((__m128i*)&x[i]);
__m128i cmp = _mm_cmpeq_epi32(x4, _mm_setzero_si128());
int mask = _mm_movemask_epi8(cmp);
if(mask == 0xffff) continue; //all zeroes
if(mask) {
cnt += compact(&x[i],&y[cnt], 4);
}
else {
_mm_storeu_si128((__m128i*)&y[cnt], x4);
cnt +=4;
}
}
cnt += compact(&x[i], &y[cnt], n-i); // cleanup for n not a multiple of 4
return cnt;
}

Here is a full test

#include <stdio.h>
#include <stdlib.h>
#if defined (__GNUC__) && ! defined (__INTEL_COMPILER)
#include <x86intrin.h>
#else
#include <immintrin.h>
#endif

#define N 50

inline int compact(int *x, int *y, const int n) {
int cnt = 0;
for(int i=0; i<n; i++) {
int cut = x[i]!=0;
y[cnt] = cut*x[i];
cnt += cut;
}
return cnt;
}

int compact_SSE2(int *x, int *y, int n) {
int i =0, cnt = 0;
for(i=0; i<n-4; i+=4) {
__m128i x4 = _mm_loadu_si128((__m128i*)&x[i]);
__m128i cmp = _mm_cmpeq_epi32(x4, _mm_setzero_si128());
int mask = _mm_movemask_epi8(cmp);
if(mask == 0xffff) continue; //all zeroes
if(mask) {
cnt += compact(&x[i],&y[cnt], 4);
}
else {
_mm_storeu_si128((__m128i*)&y[cnt], x4);
cnt +=4;
}
}
cnt += compact(&x[i], &y[cnt], n-i); // cleanup for n not a multiple of 4
return cnt;
}

int compact_AVX2(int *x, int *y, int n) {
int i =0, cnt = 0;
for(i=0; i<n-8; i+=8) {
__m256i x4 = _mm256_loadu_si256((__m256i*)&x[i]);
__m256i cmp = _mm256_cmpeq_epi32(x4, _mm256_setzero_si256());
int mask = _mm256_movemask_epi8(cmp);
if(mask == -1) continue; //all zeros
if(mask) {
cnt += compact(&x[i],&y[cnt], 8);
}
else {
_mm256_storeu_si256((__m256i*)&y[cnt], x4);
cnt +=8;
}
}
cnt += compact(&x[i], &y[cnt], n-i); // cleanup for n not a multiple of 8
return cnt;
}
