Loading 8 Chars from Memory into an __m256 Variable as Packed Single Precision Floats


If you're using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.

; rsi = new_image
VPMOVZXBD ymm0, [rsi] ; or SX to sign-extend (Byte to DWord)
VCVTDQ2PS ymm0, ymm0 ; convert to packed float

This is a good strategy even if you want to do this for multiple vectors, but even better might be a 128-bit broadcast load to feed vpmovzxbd ymm,xmm and vpshufb ymm (_mm256_shuffle_epi8) for the high 64 bits, because Intel SnB-family CPUs don't micro-fuse a vpmovzx ymm,mem, only vpmovzx xmm,mem (https://agner.org/optimize/). Broadcast loads are a single uop with no ALU port required, running purely in a load port. So that's 3 total uops: bcast-load + vpmovzx + vpshufb.

(TODO: write an intrinsics version of that. It also sidesteps the problem of missed optimizations for _mm_loadl_epi64 -> _mm256_cvtepu8_epi32.)

Of course this requires a shuffle control vector in another register, so it's only worth it if you can use that multiple times.

vpshufb is usable because the data needed for each lane is there from the broadcast, and the high bit of the shuffle-control will zero the corresponding element.

This broadcast + shuffle strategy might be good on Ryzen; Agner Fog doesn't list uop counts for vpmovsx/zx ymm on it.
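Here's a rough, untested intrinsics sketch of that broadcast + vpmovzxbd + vpshufb idea (the function name, the exact shuffle control, and the load-then-broadcast intrinsic pairing are mine; whether the compiler actually folds the load into vbroadcasti128 is up to it):

#include <immintrin.h>
#include <stdint.h>

// Untested sketch: one 16-byte broadcast load feeds two __m256 results
// (floats from bytes 0..7 and from bytes 8..15).
static inline void load_16_bytes_to_2x_m256(const uint8_t *p, __m256 *lo, __m256 *hi)
{
    // vbroadcasti128: both 128-bit lanes hold bytes 0..15
    __m256i bytes = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i*)p));

    // vpmovzxbd ymm, xmm: zero-extend bytes 0..7 to dwords
    __m256i lo_dwords = _mm256_cvtepu8_epi32(_mm256_castsi256_si128(bytes));

    // vpshufb (in-lane): the low lane picks bytes 8..11, the high lane bytes 12..15;
    // a -1 (high bit set) in the control zeroes that byte of the result.
    const __m256i shuf = _mm256_setr_epi8(
         8,-1,-1,-1,  9,-1,-1,-1, 10,-1,-1,-1, 11,-1,-1,-1,
        12,-1,-1,-1, 13,-1,-1,-1, 14,-1,-1,-1, 15,-1,-1,-1);
    __m256i hi_dwords = _mm256_shuffle_epi8(bytes, shuf);

    *lo = _mm256_cvtepi32_ps(lo_dwords);   // vcvtdq2ps
    *hi = _mm256_cvtepi32_ps(hi_dwords);   // vcvtdq2ps
}

In a loop, the shuffle control stays in a register, so the per-16-bytes cost is the bcast-load + vpmovzxbd + vpshufb (plus the two converts).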


Do not do something like a 128-bit or 256-bit load and then shuffle that to feed further vpmovzx instructions. Total shuffle throughput will probably already be a bottleneck because vpmovzx is a shuffle. Intel Haswell/Skylake (the most common AVX2 uarches) have 1-per-clock shuffles but 2-per-clock loads. Using extra shuffle instructions instead of folding separate memory operands into vpmovzxbd is terrible. Only if you can reduce total uop count like I suggested with broadcast-load + vpmovzxbd + vpshufb is it a win.


My answer on "Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?" may be relevant for converting back to uint8_t. The pack-back-to-bytes afterward part is semi-tricky if doing it with AVX2 packssdw/packuswb, because they work in-lane, unlike vpmovzx.
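For a single __m256 of floats, one simple (not maximally efficient) way to sidestep the in-lane pack behaviour is to extract the high half and use 128-bit packs; a rough untested sketch (the function name is mine, and it assumes the values fit in 0..255 so the unsigned saturation is a no-op):

#include <immintrin.h>
#include <stdint.h>

// Untested sketch: 8 floats -> 8 uint8_t, using 128-bit packs after extracting
// the high half so the element order comes out right without a vpermq fixup.
static inline void m256_to_8_bytes(__m256 v, uint8_t *dst)
{
    __m256i d  = _mm256_cvtps_epi32(v);                 // vcvtps2dq (rounds)
    __m128i lo = _mm256_castsi256_si128(d);
    __m128i hi = _mm256_extracti128_si256(d, 1);        // vextracti128
    __m128i w  = _mm_packs_epi32(lo, hi);               // 8 words, in order
    __m128i b  = _mm_packus_epi16(w, w);                // 8 bytes in the low half
    _mm_storel_epi64((__m128i*)dst, b);                 // movq store
}

With two __m256 results at a time, the 256-bit vpackssdw / vpackuswb route is cheaper per element but needs a lane-fixup shuffle (e.g. vpermq) at the end; that's the semi-tricky part mentioned above.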


With only AVX1, not AVX2, you should do:

VPMOVZXBD   xmm0, [rsi]
VPMOVZXBD   xmm1, [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1   ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS   ymm0, ymm0            ; convert to packed float. Yes, works without AVX2

You of course never need an array of float, just __m256 vectors.


GCC / MSVC missed optimizations for VPMOVZXBD ymm,[mem] with intrinsics

GCC and MSVC are bad at folding a _mm_loadl_epi64 into a memory operand for vpmovzx*. (But at least there is a load intrinsic of the right width, unlike for pmovzxbq xmm, word [mem].)

We get a vmovq load and then a separate vpmovzx with an XMM input. (With ICC and clang3.6+ we get safe + optimal code from using _mm_loadl_epi64, like from gcc9+)

But gcc8.3 and earlier can fold a _mm_loadu_si128 16-byte load intrinsic into an 8-byte memory operand. This gives optimal asm at -O3 on GCC, but is unsafe at -O0 where it compiles to an actual vmovdqu load that touches more data than we actually load, and could go off the end of a page.

Two gcc bugs submitted because of this answer:

  • SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx (fixed for gcc9, but the fix breaks load folding for a 128-bit load so the workaround hack for old GCC makes gcc9 do worse.)
  • No intrinsic for x86 MOVQ m64, %xmm in 32bit mode. (TODO: report this for clang/LLVM as well?)

There's no intrinsic to use SSE4.1 pmovsx / pmovzx as a load, only with a __m128i source operand. But the asm instructions only read the amount of data they actually use, not a 16-byte __m128i memory source operand. Unlike punpck*, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).

So here's the evil solution I've come up with. Don't use this; #ifdef __OPTIMIZE__ is Bad, making it possible to create bugs that only happen in the debug build or only in the optimized build!

#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif

__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef USE_MOVQ // compiles to an actual movq then vpmovzx ymm, xmm with gcc8.3 -O3
    __m128i small_load = _mm_loadl_epi64( (const __m128i*)p );
#else  // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
    __m128i small_load = _mm_loadu_si128( (const __m128i*)p );
#endif

    __m256i intvec = _mm256_cvtepu8_epi32( small_load );
    //__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p ); // compiles to an aligned load with -O0
    return _mm256_cvtepi32_ps(intvec);
}

With USE_MOVQ enabled, gcc -O3 (v5.3.0) emits this (and so does MSVC):

load_bytes_to_m256(unsigned char*):
vmovq xmm0, QWORD PTR [rdi]
vpmovzxbd ymm0, xmm0
vcvtdq2ps ymm0, ymm0
ret

The stupid vmovq is what we want to avoid. If you let it use the unsafe loadu_si128 version, it will make good optimized code.

GCC9, clang, and ICC emit:

load_bytes_to_m256(unsigned char*): 
vpmovzxbd ymm0, qword ptr [rdi] # ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
vcvtdq2ps ymm0, ymm0
ret

Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for "instructions", not "intrinsics", and this is one place where there's a gap in the intrinsics. Having to use _mm_cvtsi64_si128 to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don't want is pretty dumb.
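That said, if you do want to attempt it, here's a rough, untested sketch of what the AVX1-only version might look like (the function name and the use of _mm_loadu_si32 are my choices; _mm_loadu_si32 needs a reasonably recent compiler, and whether the 4-byte loads get folded into vpmovzxbd memory operands is again up to the compiler):

#include <immintrin.h>
#include <stdint.h>

// Untested AVX1-only sketch: two 4-byte chunks zero-extended with vpmovzxbd xmm,
// combined with vinsertf128, then converted to packed float.
__m256 load_bytes_to_m256_avx1(const uint8_t *p)
{
    __m128i lo = _mm_cvtepu8_epi32(_mm_loadu_si32(p));      // vpmovzxbd xmm0, [p]
    __m128i hi = _mm_cvtepu8_epi32(_mm_loadu_si32(p + 4));  // vpmovzxbd xmm1, [p+4]
    __m256i iv = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1); // vinsertf128
    return _mm256_cvtepi32_ps(iv);                          // vcvtdq2ps ymm
}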


Also note that if you're looking in the Intel insn ref manual, there are two separate entries for movq:

  • movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E (or VEX.128.66.0F.W1 6E) for (V)MOVQ xmm, r/m64). That's where you'll find the intrinsic that can accept a 64-bit integer, _mm_cvtsi64_si128. (Some compilers don't define it in 32-bit mode.)

  • movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode is F3 0F 7E (or VEX.128.F3.0F.WIG 7E) for MOVQ xmm, xmm/m64.

    The asm ISA ref manual only lists the __m128i _mm_mov_epi64(__m128i a) intrinsic for zeroing the high 64b of a vector while copying it. But the intrinsics guide does list _mm_loadl_epi64(__m128i const* mem_addr), which has a stupid prototype (pointer to a 16-byte __m128i type when it really only loads 8 bytes). It is available on all 4 of the major x86 compilers, and should actually be safe. Note that the __m128i* is just passed to this opaque intrinsic, not actually dereferenced.

    The more sane _mm_loadu_si64 (void const* mem_addr) is also listed, but gcc is missing that one.
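    Where _mm_loadu_si64 is available, it expresses the 8-byte load without the misleading __m128i* prototype; a minimal sketch (same codegen caveats as above, and the function name is mine):

#include <immintrin.h>
#include <stdint.h>

__m256 load_bytes_to_m256_si64(const uint8_t *p)
{
    __m128i small_load = _mm_loadu_si64(p);             // 8-byte load (vmovq), hopefully folded
    __m256i intvec = _mm256_cvtepu8_epi32(small_load);  // vpmovzxbd ymm, xmm (or ymm, [mem])
    return _mm256_cvtepi32_ps(intvec);                  // vcvtdq2ps
}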

Load and duplicate 4 single precision float numbers into a packed __m256 variable with fewest instructions

If your data was the result of another vector calculation (and in a __m128), you'd want AVX2 vpermps (_mm256_permutevar8x32_ps) with a control vector of _mm256_set_epi32(3,3, 2,2, 1,1, 0,0).

vpermps ymm is 1 uop on Intel, but 2 uops on Zen2 (with 2 cycle throughput). And 3 uops on Zen1 with one per 4 clock throughput. (https://uops.info/)
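A minimal sketch of that, assuming the data is already in a __m128 (the function name is mine):

#include <immintrin.h>

__m256 dup4floats_from_m128(__m128 v)
{
    __m256 vv = _mm256_castps128_ps256(v);          // upper lane contents don't matter here
    const __m256i ctrl = _mm256_set_epi32(3,3, 2,2, 1,1, 0,0);
    return _mm256_permutevar8x32_ps(vv, ctrl);      // vpermps: [v0,v0,v1,v1,v2,v2,v3,v3]
}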

If it was the result of separate scalar calculations, you might want to shuffle them together with _mm_set_ps(d,d, c,c) (1x vshufps) to set up for a vinsertf128.


But with data in memory, I think your best bet is a 128-bit broadcast-load, then an in-lane shuffle. It only requires AVX1, and on modern CPUs it's 1 load + 1 shuffle uop on Zen2 and Haswell and later. It's also efficient on Zen1: the only lane-crossing shuffle being the 128-bit broadcast-load.

Using an in-lane shuffle is lower-latency than lane-crossing on both Intel and Zen2 (256-bit shuffle execution units). This still requires a 32-byte shuffle control vector constant, but if you need to do this frequently it will typically / hopefully stay hot in cache.

__m256  duplicate4floats(void *p) {
    __m256 v = _mm256_broadcast_ps((const __m128 *) p);                 // vbroadcastf128
    v = _mm256_permutevar_ps(v, _mm256_set_epi32(3,3, 2,2, 1,1, 0,0));  // vpermilps
    return v;
}

Modern CPUs handle broadcast-loads right in the load port, no shuffle uop needed. (Sandybridge does need a port 5 shuffle uop for vbroadcastf128, unlike narrower broadcasts, but Haswell and later are purely port 2/3. But SnB doesn't support AVX2 so a lane-crossing shuffle with granularity less than 128-bit wasn't an option.)

So even if AVX2 is available, I think AVX1 instructions are more efficient here. On Zen1, vbroadcastf128 is 2 uops, vs. 1 for a 128-bit vmovups, but vpermps (lane-crossing) is 3 uops vs. 2 for vpermilps.

Unfortunately, clang pessimizes this into a vmovups load and a vpermps ymm, but GCC compiles it as written. (Godbolt)


If you wanted to avoid using a shuffle-control vector constant, vpmovzxdq ymm, [mem] (2 uops on Intel) could get the elements set up for vmovsldup (a 1-uop in-lane shuffle). Or broadcast-load and vunpckl/hps then blend?
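A quick untested sketch of that vpmovzxdq + vmovsldup idea (the function name is mine; whether the 16-byte load folds into the vpmovzxdq memory operand depends on the compiler):

#include <immintrin.h>

__m256 duplicate4floats_no_constant(const float *p)
{
    // vpmovzxdq ymm, [mem]: each float's bit pattern lands in the even dword of a qword
    __m256i spread = _mm256_cvtepu32_epi64(_mm_loadu_si128((const __m128i*)p));
    // vmovsldup: duplicate the even elements -> [p0,p0, p1,p1, p2,p2, p3,p3]
    return _mm256_moveldup_ps(_mm256_castsi256_ps(spread));
}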



I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions.

Get a better compiler, then! (Or remember to enable optimization.)

__m256  duplicate4floats_naive(const float *p) {
    return _mm256_set_ps(p[3],p[3], p[2],p[2], p[1],p[1], p[0],p[0]);
}

compiles with gcc (https://godbolt.org/z/dMzh3fezE) into

duplicate4floats_naive(float const*):
vmovups xmm1, XMMWORD PTR [rdi]
vpermilps xmm0, xmm1, 80
vpermilps xmm1, xmm1, 250
vinsertf128 ymm0, ymm0, xmm1, 0x1
ret

So 3 shuffle uops, not great. And it could have used vshufps instead of vpermilps to save code-size and let it run on more ports on Ice Lake. But still vastly better than 8 instructions.

clang's shuffle optimizer makes the same asm as with my optimized intrinsics, because that's how clang is. It's pretty decent optimization, just not quite optimal.

duplicate4floats_naive(float const*):
vmovups xmm0, xmmword ptr [rdi]
vmovaps ymm1, ymmword ptr [rip + .LCPI1_0] # ymm1 = [0,0,1,1,2,2,3,3]
vpermps ymm0, ymm1, ymm0
ret

Proper use of _mm256_maskload_ps for loading less than 8 floats into __m256

A doubleword is 32 bits, not 64. Word = 16, doubleword = 32, quadword = 64. The first two elements get selected because -1 is all-ones across all 64 bits, so when the maskload treats it as two 32-bit values instead of one 64-bit value, the highest bit of both elements will be set. 0xFFFFFFFF, OTOH, has the least significant 32 bits set and the most significant 32 bits unset. Since x86 is little-endian, the least significant bits come first, which is why you end up with the first element selected but not the second.

The documentation in the intrinsics guide is much better here.

Note that on GCC/clang, __m256i is implemented using vector extensions. MSVC, however, does not support vector extensions, so your code won't work there. Also, GCC and clang both define __m256i as a vector of 64-bit elements (the same __m256i type is used for all integer vectors), so you'll probably want to use _mm256_set_epi32, _mm256_setr_epi32 or _mm256_load_si256 to create your _load_mask anyway.
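For example, a minimal sketch of loading just the first 5 floats of an array (the count and function name are mine): elements whose mask dword has its most-significant bit set are loaded; the rest come back as 0.0f and aren't read at all, so they can't fault.

#include <immintrin.h>

__m256 load_first_5_floats(const float *p)
{
    const __m256i load_mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, 0, 0, 0);
    return _mm256_maskload_ps(p, load_mask);    // vmaskmovps
}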

Oh, names starting with an underscore are reserved in both C and C++. Don't do that. You can use a trailing underscore if you really need to convey that it's an internal variable or something, but I don't really see a reason to do that in the code you've posted above.

SSE load unsigned char to short

Not completely clear what you want.

But if you want an SSE register with one short value per input byte, then you probably need this (untested):

__declspec( align( 16 ) ) unsigned char foo1[ 16 ];
// Fill your array with data

const __m128i src = _mm_load_si128( ( __m128i* )foo1 );
const __m128i zero = _mm_setzero_si128();
const __m128i lower = _mm_unpacklo_epi8( src, zero ); // First 8 short values
const __m128i higher = _mm_unpackhi_epi8( src, zero ); // Last 8 short values
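If SSE4.1 is available, pmovzxbw is another way to widen the bytes, continuing from the snippet above (a hedged alternative; the high 8 bytes still need a byte-shift or unpackhi first):

const __m128i lower_sse41  = _mm_cvtepu8_epi16( src );                       // pmovzxbw: bytes 0..7 -> shorts
const __m128i higher_sse41 = _mm_cvtepu8_epi16( _mm_srli_si128( src, 8 ) );  // bytes 8..15 -> shorts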

SIMD - how to add corresponding values from 2 vectors of different element widths (char or uint8_t adding to int)

Perhaps I need each byte of the vector my_char_mask_my_m128i - how to transform it into 4 bytes?

You're looking for the SSE4.1 intrinsic _mm_cvtepi8_epi32(), which takes the first 4 (signed) 8-bit integers in the SSE vector and sign-extends them into 32-bit integers. Combine that with some shifting to move the next 4 into place for the next extension, and you get something like:

#include <iostream>
#include <cstdint>
#include <emmintrin.h>
#include <smmintrin.h>

void print_int4(__m128i vec) {
    alignas(16) std::int32_t ints[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(ints), vec);
    std::cout << '[' << ints[0] << ", " << ints[1] << ", " << ints[2] << ", "
              << ints[3] << ']';
}

int main(void) {
    alignas(16) std::int32_t
        my_int_sequence[16] = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };
    alignas(16) std::int8_t
        my_char_mask[16] = { 1,0,1,1,0,1,0,1,1,1,0,1,0,1,0,1 };

    __m128i char_mask = _mm_load_si128(reinterpret_cast<__m128i*>(my_char_mask));

    // Loop through the 32-bit int array 4 at a time
    for (int n = 0; n < 16; n += 4) {
        // Load the next 4 ints
        __m128i vec =
            _mm_load_si128(reinterpret_cast<__m128i*>(my_int_sequence + n));
        // Convert the next 4 chars to ints
        __m128i chars_to_add = _mm_cvtepi8_epi32(char_mask);
        // Shift out those 4 chars
        char_mask = _mm_srli_si128(char_mask, 4);
        // And add together
        __m128i sum = _mm_add_epi32(vec, chars_to_add);

        print_int4(vec);
        std::cout << " + ";
        print_int4(chars_to_add);
        std::cout << " = ";
        print_int4(sum);
        std::cout << '\n';
    }
}

Example (Note that you usually have to tell your compiler to generate SSE 4.1 instructions - with g++ and clang++ use the appropriate -march=XXXX option or -msse4.1):

$ g++ -O -Wall -Wextra -std=gnu++11 -msse4.1 demo.cc
$ ./a.out
[0, 1, 2, 3] + [1, 0, 1, 1] = [1, 1, 3, 4]
[4, 5, 6, 7] + [0, 1, 0, 1] = [4, 6, 6, 8]
[8, 9, 10, 11] + [1, 1, 0, 1] = [9, 10, 10, 12]
[12, 13, 14, 15] + [0, 1, 0, 1] = [12, 14, 14, 16]

Alternative version suggested by Peter Cordes if your compiler is new enough to have _mm_loadu_si32():

// Loop through the 32-bit int array 4 at a time
for (int n = 0; n < 16; n += 4) {
    // Load the next 4 ints
    __m128i vec =
        _mm_load_si128(reinterpret_cast<__m128i*>(my_int_sequence + n));
    // Load the next 4 chars
    __m128i char_mask = _mm_loadu_si32(my_char_mask + n);
    // Convert them to ints
    __m128i chars_to_add = _mm_cvtepi8_epi32(char_mask);
    // And add together
    __m128i sum = _mm_add_epi32(vec, chars_to_add);

    // Do more stuff
}

