SSE Integer Division

How do I use integer division SSE instructions?

Your code compiles fine with a recent version of Intel's ICC compiler.
The function _mm_idiv_epi32 is an SVML function, not a hardware instruction (SSE/AVX have no packed integer-divide instruction); the SVML library comes bundled with the Intel ICC compiler. If you don't have access to ICC or can't use it, one way to obtain a linkable SVML might be to install and link against OpenCL.
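
If SVML is available, usage is just a call to the intrinsic. A minimal sketch, assuming ICC (or another toolchain with a linkable SVML) and that the intrinsic is exposed via immintrin.h:

#include <immintrin.h>

// Element-wise signed 32-bit division via the SVML routine behind _mm_idiv_epi32.
// Each lane of the result is the truncated quotient a[i] / b[i].
__m128i div_epi32(__m128i a, __m128i b)
{
    return _mm_idiv_epi32(a, b);
}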

SSE division by integer

You should take some time to study the instruction set reference, so you at least get a rough idea what kind of possibilities you have. Also, you should read the appropriate ABI docs for the calling convention.

That said, the answer to your first question is that float return values are passed back in xmm0, and you can convert from integer to float using CVTSI2SS (or CVTSI2SD for double precision).

Also note you should be using the proper scalar/packed and float/double versions. divpd is packed double, while you need scalar single, so you really want divss.

PS: your question is specifically not about FPU or MMX. Rather, it is about SSE.
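
For illustration, here is a minimal intrinsics sketch of the same operation; compilers typically lower it to CVTSI2SS plus DIVSS, and the float result comes back in xmm0 per the usual x86-64 calling convention:

#include <xmmintrin.h>

// Convert two ints to single-precision floats and divide them (scalar SSE).
float div_int_as_float(int a, int b)
{
    __m128 fa = _mm_cvtsi32_ss(_mm_setzero_ps(), a);   // CVTSI2SS
    __m128 fb = _mm_cvtsi32_ss(_mm_setzero_ps(), b);   // CVTSI2SS
    return _mm_cvtss_f32(_mm_div_ss(fa, fb));          // DIVSS; result returned in xmm0
}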

How to divide 16-bit integers by 255 using SSE?

There is an integer approximation of division by 255:

inline int DivideBy255(int value)
{
    return (value + 1 + (value >> 8)) >> 8;
}
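
A quick brute-force check of the approximation against exact division over the typical input domain, i.e. products of two 8-bit values:

#include <assert.h>

// Verify DivideBy255(a * b) == (a * b) / 255 for all 8-bit a and b.
void CheckDivideBy255()
{
    for (int a = 0; a <= 255; ++a)
        for (int b = 0; b <= 255; ++b)
            assert(DivideBy255(a * b) == (a * b) / 255);
}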

So, using SSE2, it looks like this:

#include <emmintrin.h>

inline __m128i DivideI16By255(__m128i value)
{
    return _mm_srli_epi16(_mm_add_epi16(
        _mm_add_epi16(value, _mm_set1_epi16(1)), _mm_srli_epi16(value, 8)), 8);
}

For AVX2:

#include <immintrin.h>

inline __m256i DivideI16By255(__m256i value)
{
    return _mm256_srli_epi16(_mm256_add_epi16(
        _mm256_add_epi16(value, _mm256_set1_epi16(1)), _mm256_srli_epi16(value, 8)), 8);
}

For Altivec (Power):

#include <altivec.h>
#include <stdint.h>

typedef __vector int16_t  v128_s16;
typedef __vector uint16_t v128_u16;

const v128_s16 K16_0001 = {1, 1, 1, 1, 1, 1, 1, 1};
const v128_u16 K16_0008 = {8, 8, 8, 8, 8, 8, 8, 8};   // shift counts must be an unsigned vector

inline v128_s16 DivideBy255(v128_s16 value)
{
    return vec_sr(vec_add(vec_add(value, K16_0001), vec_sr(value, K16_0008)), K16_0008);
}

For NEON (ARM):

#include <arm_neon.h>

inline int16x8_t DivideI16By255(int16x8_t value)
{
    return vshrq_n_s16(vaddq_s16(
        vaddq_s16(value, vdupq_n_s16(1)), vshrq_n_s16(value, 8)), 8);
}
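
Typical usage (shown here with the SSE2 version) is to stream over an array of 16-bit products and scale them back into byte range; this sketch assumes the length is a multiple of eight and the pointers are 16-byte aligned:

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

// Apply DivideI16By255 (SSE2 version above) to every group of eight 16-bit values.
void ScaleBy255(const uint16_t* src, uint16_t* dst, size_t size)
{
    for (size_t i = 0; i < size; i += 8)
    {
        __m128i value = _mm_load_si128((const __m128i*)(src + i));
        _mm_store_si128((__m128i*)(dst + i), DivideI16By255(value));
    }
}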

How to divide a __m256i vector by an integer variable?

You can do this with _mm256_mulhrs_epi16. This does a fixed-point multiply, so you just set the multiplicand vector to 32768 / b:

#include <immintrin.h>

inline __m256i _mm256_div_epi16 (const __m256i va, const int b)
{
    __m256i vb = _mm256_set1_epi16(32768 / b);
    return _mm256_mulhrs_epi16(va, vb);
}

Note that this assumes b > 1 (for b = 1, the constant 32768 doesn't fit in a signed 16-bit element), and that the result is only an approximation: _mm256_mulhrs_epi16 rounds, so some quotients differ from C's truncating integer division.
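
A small usage sketch of the function above, illustrating the rounding behaviour:

#include <immintrin.h>
#include <stdio.h>

int main()
{
    __m256i v = _mm256_set1_epi16(1000);
    __m256i q = _mm256_div_epi16(v, 7);        // each lane holds ~1000 / 7
    short out[16];
    _mm256_storeu_si256((__m256i*)out, q);
    printf("%d\n", out[0]);                    // prints 143 (rounded), not 142 (truncated)
    return 0;
}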

Divide 8-bit integers by 4 (or shift) using SSE

Unfortunately there are no SSE shift instructions for 8-bit elements. If the elements are 8-bit unsigned then you can use a 16-bit shift and mask out the unwanted high bits, e.g.

v = _mm_srli_epi16(v, 2);
v = _mm_and_si128(v, _mm_set1_epi8(0x3f));

For 8-bit signed elements it's a little fiddlier, but still possible, although it might just be easier to unpack to 16 bits, do the shifts, then pack back to 8 bits, as sketched below.
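
A rough SSE2-only sketch of that unpack/shift/pack route for an arithmetic shift right by 2:

#include <emmintrin.h>

// Arithmetic right shift of signed 8-bit elements by 2.
static inline __m128i ShiftRightI8By2(__m128i v)
{
    // Interleave each byte with itself so the original byte lands in the high
    // byte of a 16-bit lane; a 16-bit arithmetic shift by 8 + 2 then yields the
    // sign-extended value shifted by 2.
    __m128i lo = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8 + 2);
    __m128i hi = _mm_srai_epi16(_mm_unpackhi_epi8(v, v), 8 + 2);
    // The results fit in [-32, 31], so the saturating pack back to 8 bits is exact.
    return _mm_packs_epi16(lo, hi);
}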

Fastest method of vectorized integer division by non-constant divisor

Dividing all elements of a vector by the same scalar can be done with an integer multiply and shift. libdivide (C/C++, zlib license) provides inline functions to do this for scalars (e.g. int), and for dividing vectors by scalars. Also see SSE integer division? (as you mention in your question) for a similar technique giving approximate results. It's most efficient when the same scalar is applied to lots of vectors. libdivide doesn't say anything about the results being inexact, but I haven't investigated.
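
To illustrate the precompute-once idea, here is a sketch that reuses the approximate mulhrs technique shown earlier on this page (not libdivide's method): the scalar division and broadcast happen once, and each vector then costs a single multiply.

#include <immintrin.h>
#include <stddef.h>

// Divide many vectors of signed 16-bit elements by the same runtime divisor b > 1.
void DivideManyI16(const __m256i* src, __m256i* dst, size_t count, int b)
{
    const __m256i vb = _mm256_set1_epi16(32768 / b);    // one scalar division, done once
    for (size_t i = 0; i < count; ++i)
        dst[i] = _mm256_mulhrs_epi16(src[i], vb);       // cheap multiply per vector
}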

re: your code:
You have to be careful about checking what the compiler actually produces, when giving it a trivial loop like that. e.g. is it actually loading/storing back to RAM every iteration? Or is it keeping variables live in registers, and only storing at the end?

Your benchmark is skewed in favour of the integer-division loop, because the vector divider isn't kept 100% occupied in the vector loop, but the integer divider is kept 100% occupied in the int loop. (These paragraphs were added after the discussion in comments. The previous answer didn't explain as much about keeping the dividers fed, and dependency chains.)

You only have a single dependency chain in your vector loop, so the vector divider sits idle for several cycles every iteration after producing the 2nd result, while the chain of convert fp->si, pack, unpack, convert si->fp happens. You've set things up so your throughput is limited by the length of the entire loop-carried dependency chain, rather than the throughput of the FP dividers. If the data each iteration were independent (or there were at least several independent values, like how you have 8 array elements for the int loop), then the unpack/convert and convert/pack of one set of values would overlap with the divps execution time for another vector. The vector divider is only partially pipelined, but everything else is fully pipelined.

This is the difference between throughput and latency, and why it matters for a pipelined out-of-order execution CPU.

Other stuff in your code:

You have __m128 xmm1 = _mm_set1_ps(i); in the inner loop. _set1 with an arg that isn't a compile-time constant is usually at least 2 instructions: movd and pshufd. And in this case, an int-to-float conversion, too. Keeping a float-vector version of your loop counter, which you increment by adding a vector of 1.0, would be better. (Although this probably isn't throwing off your speed test any further, because this excess computation can overlap with other stuff.)
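
A sketch of that suggestion, keeping the counter as a float vector so the per-iteration cost is a single addps instead of an int-to-float conversion plus a broadcast (loop body elided):

#include <xmmintrin.h>

void Loop(int n)
{
    __m128 vi = _mm_setzero_ps();              // holds (float)i in every lane
    const __m128 vone = _mm_set1_ps(1.0f);
    for (int i = 0; i < n; ++i)
    {
        // ... use vi wherever _mm_set1_ps(i) appeared ...
        vi = _mm_add_ps(vi, vone);             // increment every lane by 1.0
    }
}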

Unpacking with zero works fine. SSE4.1 __m128i _mm_cvtepi16_epi32 (__m128i a) is another way. pmovsxwd is the same speed, but doesn't need a zeroed register.

If you're going to convert to FP for divide, have you considered just keeping your data as FP for a while? Depends on your algorithm how you need rounding to happen.

performance on recent Intel CPUs

divps (packed single float) is 10-13 cycle latency, with a throughput of one per 7 cycles, on recent Intel designs. div / idiv r16 ((unsigned) integer divide in GP reg) is 23-26 cycle latency, with one per 9 or 8 cycle throughput. div is 11 uops, so it even gets in the way of other things issuing / executing for some of the time it's going through the pipeline. (divps is a single uop.) So, Intel CPUs are not really designed to be fast at integer division, but make an effort for FP division.

So just for the division alone, a single integer division is slower than a vector FP division. You're going to come out ahead even with the conversion to/from float, and the unpack/pack.

If you can do the other integer ops in vector regs, that would be ideal. Otherwise you have to get the integers into / out of vector regs. If the ints are in RAM, a vector load is fine. If you're generating them one at a time, PINSRW is an option, but it's possible that just storing to memory to set up for a vector load would be a faster way to load a full vector. Similar for getting the data back out, with PEXTRW or by storing to RAM. If you want the values in GP registers, skip the pack after converting back to int, and just MOVD / PEXTRD from whichever of the two vector regs your value is in. insert/extract instructions take two uops on Intel, which means they take up two "slots", compared to most instructions taking only one fused-domain uop.
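
For example, pulling individual 32-bit results back into GP registers without a pack step might look like this:

#include <smmintrin.h>   // SSE4.1 for _mm_extract_epi32

// Read element 0 with MOVD and an arbitrary element with PEXTRD.
static inline void ExtractTwo(__m128i result, int* r0, int* r2)
{
    *r0 = _mm_cvtsi128_si32(result);       // MOVD
    *r2 = _mm_extract_epi32(result, 2);    // PEXTRD
}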

Your timing results, showing that the scalar code doesn't improve with compiler optimizations, come about because the CPU can overlap the verbose non-optimized load/store instructions for other elements while the divide unit is the bottleneck. The vector loop on the other hand only has one or two dependency chains, with every iteration dependent on the previous, so extra instructions adding latency can't be overlapped with anything. Testing with -O0 is pretty much never useful.

AVX divide __m256i packed 32-bit integers by two (no AVX2)

Assuming you know what you’re doing, here’s that function.

#include <immintrin.h>

inline __m256i div2_epi32( __m256i vec )
{
    // Split the 32-byte vector into 16-byte ones
    __m128i low = _mm256_castsi256_si128( vec );
    __m128i high = _mm256_extractf128_si256( vec, 1 );
    // Shift the lanes within each piece; replace with _mm_srli_epi32 for unsigned version
    low = _mm_srai_epi32( low, 1 );
    high = _mm_srai_epi32( high, 1 );
    // Combine back into 32-byte vector
    vec = _mm256_castsi128_si256( low );
    return _mm256_insertf128_si256( vec, high, 1 );
}

However, doing that is not necessarily faster than dealing with 16-byte vectors directly. On most CPUs, the performance of these insert/extract instructions ain’t great, except maybe on AMD Zen 1.
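
For comparison, the plain 16-byte-vector version of the same operation is just a shift per __m128i chunk:

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

// Halve signed 32-bit integers in place, four at a time (count assumed to be a multiple of 4).
void Div2All(int32_t* data, size_t count)
{
    for (size_t i = 0; i < count; i += 4)
    {
        __m128i v = _mm_loadu_si128((__m128i*)(data + i));
        v = _mm_srai_epi32(v, 1);                     // use _mm_srli_epi32 for unsigned
        _mm_storeu_si128((__m128i*)(data + i), v);
    }
}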


