How to Efficiently Perform Double/Int64 Conversions With SSE/AVX

How to efficiently perform double/int64 conversions with SSE/AVX?

There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (Also support for conversion to/from 32-bit unsigned). See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.

If you only have AVX2 or less, you'll need tricks like below for packed-conversion. (For scalar, x86-64 has scalar int64_t <-> double or float from SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)
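For example, the scalar 32-bit unsigned case is just a cast chain; a minimal sketch (the helper name is made up):

#include <stdint.h>

// uint32_t -> double with no unsigned-conversion instruction needed:
// zero-extend to a signed 64-bit value, then use the signed int64 -> double
// conversion that SSE2 already provides (a single cvtsi2sd on x86-64).
static inline double u32_to_double(uint32_t x)
{
    return (double)(int64_t)x;
}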


If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions:

  • If you don't care about infinity or NaN.
  • For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
  • For double <-> uint64_t, you only care about values in the range [0, 2^52).

double -> uint64_t

//  Only works for inputs in the range: [0, 2^52)
__m128i double_to_uint64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

double -> int64_t

//  Only works for inputs in the range: [-2^51, 2^51]
__m128i double_to_int64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
    return _mm_sub_epi64(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
    );
}

uint64_t -> double

//  Only works for inputs in the range: [0, 2^52)
__m128d uint64_to_double(__m128i x){
    x = _mm_or_si128(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0010000000000000));
}

int64_t -> double

//  Only works for inputs in the range: [-2^51, 2^51]
__m128d int64_to_double(__m128i x){
    x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));
}
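
Here is a quick usage sketch of the helpers above (SSE2 only; the test values are arbitrary but stay inside the documented ranges):

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d d  = _mm_set_pd(123456.75, -3.25);      // {high, low}
    __m128i i  = double_to_int64(d);                // low = -3, high = 123457 (round to nearest)
    __m128d rt = int64_to_double(i);                // back to {-3.0, 123457.0}
    long long lo = _mm_cvtsi128_si64(i);
    long long hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(i, i));
    printf("%lld %lld\n", lo, hi);
    printf("%f %f\n", _mm_cvtsd_f64(rt), _mm_cvtsd_f64(_mm_unpackhi_pd(rt, rt)));
}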

Rounding Behavior:

  • For the double -> uint64_t conversion, rounding works correctly, following the current rounding mode (which is usually round-to-nearest-even).
  • For the double -> int64_t conversion, rounding follows the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.

How does it work?

Despite this trick being only 2 instructions, it's not entirely self-explanatory.

The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.

To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.

Now all that's left is to remove the upper 12 bits. This is easily done by masking them out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has more throughput.
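
Here's a scalar illustration of that bit pattern (just to show the trick, not part of the SIMD code):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    double M = 4503599627370496.0;      // 2^52, bit pattern 0x4330000000000000
    double y = 5.75 + M;                // rounds to 2^52 + 6 under round-to-nearest
    uint64_t ybits, mbits;
    memcpy(&ybits, &y, sizeof ybits);
    memcpy(&mbits, &M, sizeof mbits);
    printf("0x%016llx\n", (unsigned long long)ybits);        // 0x4330000000000006
    printf("%llu\n", (unsigned long long)(ybits ^ mbits));   // 6: the upper 12 bits cancel
}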

Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.

The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.

Related: A fast method to round a double to a 32-bit int explained


Full Range int64 -> double:

After many years, I finally had a need for this.

  • 5 instructions for uint64_t -> double
  • 6 instructions for int64_t -> double

uint64_t -> double

__m128d uint64_to_double_full(__m128i x){
    __m128i xH = _mm_srli_epi64(x, 32);
    xH = _mm_or_si128(xH, _mm_castpd_si128(_mm_set1_pd(19342813113834066795298816.)));          //  2^84
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0xcc);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(19342813118337666422669312.));     //  2^84 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

int64_t -> double

__m128d int64_to_double_full(__m128i x){
    __m128i xH = _mm_srai_epi32(x, 16);
    xH = _mm_blend_epi16(xH, _mm_setzero_si128(), 0x33);
    xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(442721857769029238784.)));              //  3*2^67
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0x88);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(442726361368656609280.));          //  3*2^67 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

These work for the entire 64-bit range and are correctly rounded according to the current rounding mode.

These are similar to wim's answer, but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.
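
If you want a quick sanity check, here's a sketch that spot-checks uint64_to_double_full against the compiler's own scalar conversion (needs SSE4.1 for the _mm_blend_epi16 used above):

#include <smmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t tests[] = { 0, 1, (1ULL << 52) + 3, 0xFFFFFFFFFFFFFFFFULL };
    for (int k = 0; k < 4; k++) {
        double simd = _mm_cvtsd_f64(uint64_to_double_full(_mm_set1_epi64x((int64_t)tests[k])));
        double ref  = (double)tests[k];               // compiler's scalar conversion
        printf("%llu -> %a vs %a %s\n", (unsigned long long)tests[k],
               simd, ref, simd == ref ? "ok" : "MISMATCH");
    }
}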

How to efficiently perform int8/int64 conversion with SSE?

The unpacking intrinsics are used here in a funny way. They "duplicate" the data, instead of adding sign-extension, as one would expect. For example, before the first iteration you have in your register the following

x x x x x x x x x x x x x x a b

If you convert a and b to 16 bits, you should get this:

x x x x x x x x x x x x A a B b

Here A and B are sign-extensions of a and b, that is, both of them are either 0 or -1.

Instead of this, your code gives

x x x x x x x x x x x x a a b b

And then you convert it to the proper result by shifting right.

However, you are not obliged to use the same operand twice in the "unpack" intrinsics. You could get the desired result if you "unpacked" the following two registers:

x x x x x x x x x x x x x x a b
x x x x x x x x x x x x x x A B

That is:

a = _mm_unpacklo_epi8(a, _mm_srai_epi8(a, 8));

(if that _mm_srai_epi8 intrinsic actually existed)


You can apply the same idea to the last stage of your conversion. You want to "unpack" the following two registers:

x x x x x x x x A A A a B B B b
x x x x x x x x A A A A B B B B

To get them, right-shift the 32-bit data:

_mm_srai_epi32(a, 24)
_mm_srai_epi32(a, 32)

So the last "unpack" is

_mm_unpacklo_epi32(_mm_srai_epi32(a, 24), _mm_srai_epi32(a, 32));
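
Putting the pieces together, a sign extension of the two lowest bytes to int64 with only SSE2 might look like this sketch (it uses the duplicate-and-shift idiom for the first two stages and the unpack-with-a-sign-vector trick from above for the final 32 -> 64 step; the function name is made up):

#include <emmintrin.h>

// Sign-extend the two lowest int8 lanes of v to two int64 lanes, SSE2 only.
static inline __m128i sext_epi8_epi64_sse2(__m128i v)
{
    __m128i w16 = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8);        // 8  -> 16
    __m128i w32 = _mm_srai_epi32(_mm_unpacklo_epi16(w16, w16), 16);  // 16 -> 32
    return _mm_unpacklo_epi32(w32, _mm_srai_epi32(w32, 31));         // 32 -> 64: pair each value with its sign word
}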

Efficiently load/compute/pack 64 double comparison results in uint64_t bitmask

Here’s an example. It packs the comparison results into bytes with whatever instructions were the most efficient, shuffles them into the correct order once per 32 numbers, and uses _mm256_movemask_epi8 to produce 32 bits at once.

// Compare 4 numbers, return 32 bytes with results of 4 comparisons:
// 00000000 11111111 22222222 33333333
inline __m256d compare4( const double* a, const double* b )
{
    return _mm256_cmp_pd( _mm256_load_pd( a ), _mm256_load_pd( b ), _CMP_LT_OQ );
}

// Compare 8 numbers, combine 8 results into the following bytes:
// 0000 4444 1111 5555 2222 6666 3333 7777
inline __m256i compare8( const double* a, const double* b )
{
    __m256 c0 = _mm256_castpd_ps( compare4( a, b ) );
    __m256 c1 = _mm256_castpd_ps( compare4( a + 4, b + 4 ) );
    return _mm256_castps_si256( _mm256_blend_ps( c0, c1, 0b10101010 ) );
}

// Compare 16 numbers, combine 16 results into following bytes:
// 00 44 11 55 88 CC 99 DD 22 66 33 77 AA EE BB FF
inline __m256i compare16( const double* a, const double* b )
{
    __m256i c0 = compare8( a, b );
    __m256i c1 = compare8( a + 8, b + 8 );
    return _mm256_packs_epi32( c0, c1 );
}

inline uint32_t compare32( const double* a, const double* b )
{
    // Compare 32 numbers and merge them into a single vector
    __m256i c0 = compare16( a, b );
    __m256i c1 = compare16( a + 16, b + 16 );
    __m256i src = _mm256_packs_epi16( c0, c1 );

    // We got the 32 bytes, but the byte order is screwed up in that vector:
    // 0 4 1 5 8 12 9 13 16 20 17 21 24 28 25 29
    // 2 6 3 7 10 14 11 15 18 22 19 23 26 30 27 31
    // The following 2 instructions fix the order

    // Shuffle 8-byte pieces across the complete vector
    // That instruction is relatively expensive on most CPUs, but we're only doing it once per 32 numbers
    src = _mm256_permute4x64_epi64( src, _MM_SHUFFLE( 3, 1, 2, 0 ) );

    // The order of bytes in the vector is still wrong:
    // 0 4 1 5 8 12 9 13 2 6 3 7 10 14 11 15
    // 16 20 17 21 24 28 25 29 18 22 19 23 26 30 27 31
    // Need one more shuffle instruction

    const __m128i perm16 = _mm_setr_epi8(
        0, 2, 8, 10, 1, 3, 9, 11, 4, 6, 12, 14, 5, 7, 13, 15 );
    // If you're calling this in a loop and everything is inlined,
    // the shuffling vector should be pre-loaded outside of the loop.
    const __m256i perm = _mm256_broadcastsi128_si256( perm16 );

    // Shuffle the bytes
    src = _mm256_shuffle_epi8( src, perm );

    // The order of bytes is now correct, so we can use _mm256_movemask_epi8 to make 32 bits of the result
    return (uint32_t)_mm256_movemask_epi8( src );
}


uint64_t compareAndPack64( const double* a, const double* b )
{
    uint64_t low = compare32( a, b );
    uint64_t high = compare32( a + 32, b + 32 );
    return low | ( high << 32 );
}
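
A usage sketch (the loads above are aligned, so the input arrays must be 32-byte aligned; compile with AVX2 enabled):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main()
{
    alignas(32) double a[64], b[64];
    for (int i = 0; i < 64; i++) { a[i] = i; b[i] = 32.0; }   // a[i] < b[i] only for i < 32
    uint64_t mask = compareAndPack64(a, b);
    printf("0x%016llx\n", (unsigned long long)mask);          // expect 0x00000000ffffffff
}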

How to convert int64 to int32 with AVX (but without AVX-512)

So just truncation, not signed (or unsigned) saturation? (I asked because AVX-512 provides signed and unsigned saturation versions, as well as truncation. The non-AVX512 packs like _mm_packus_epi32 (packusdw) you were using always do saturation; you have to use plain shuffle instructions if you want packing with truncation before AVX-512. But if either is fine because the upper half is known zero, then yeah, the pack instructions can be useful.)



Single vector __m256i -> __m128i

For a single vector, producing a narrower output, you could use vextracti128 with vshufps to pack a single __m256i into a __m128i. Before AVX-512, vshufps is one of the only 2-input shuffles that has any control input, not just a fixed interleave for example.

In C with intrinsics, you'd need _mm_castsi128_ps and back to keep the compiler happy using _mm_shuffle_ps on integer vectors, but modern CPUs don't have bypass delays for using FP shuffles between integer SIMD instructions. Or if you're just going to store it, you can leave the result as __m128 and use _mm_store_ps((float*)p, vec); (And yes it's still strict-aliasing safe to cast integer pointers to float* because the deref happens inside the intrinsic, not in pure C).

#include <immintrin.h>

__m128 cvtepi64_epi32_avx(__m256i v)
{
    __m256 vf = _mm256_castsi256_ps( v );       // free
    __m128 hi = _mm256_extractf128_ps(vf, 1);   // vextractf128
    __m128 lo = _mm256_castps256_ps128( vf );   // also free
    // take the bottom 32 bits of each 64-bit chunk in lo and hi
    __m128 packed = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));  // shufps
    //return _mm_castps_si128(packed);  // if you want
    return packed;
}
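
If you're just going to store the narrowed result, the float* cast trick described above might look like this (a sketch that relies on cvtepi64_epi32_avx() above; store_low32s is a made-up name):

#include <stdint.h>

// Narrow one __m256i of int64 elements to int32 and store 16 bytes.
void store_low32s(int32_t* dst, __m256i v)
{
    __m128 packed = cvtepi64_epi32_avx(v);
    _mm_storeu_ps((float*)dst, packed);   // the deref happens inside the intrinsic, so this is aliasing-safe
}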

The function above costs 2 shuffles per 128 bits of output data. We can do better: 2 shuffles per 256 bits of output data. (Or even just 1 if we can have our input arranged nicely.)



2x __m256i inputs producing a __m256i output

Fortunately clang spotted a better optimization than I did. I thought 2x vpermd + vpblendd could do it, shuffling the low 32 bits of each element in one vector to the bottom lane, or top lane in the other. (With a set_epi32(6,4,2,0, 6,4,2,0) shuffle control).

But clang optimized that into vshufps to get all the elements we want into one vector, then vpermpd (equivalent to vpermq) to get them into the correct order. (This is generally a good strategy, and I should have thought of that myself. :P Again, it's taking advantage of vshufps as a 2-input shuffle.)

Translating that back into intrinsics, we get code that will compile to that efficient asm for GCC or other compilers (Godbolt compiler explorer for this and the original):

// 2x 256 -> 1x 256-bit result
__m256i pack64to32(__m256i a, __m256i b)
{
    // grab the 32-bit low halves of 64-bit elements into one vector
    __m256 combined = _mm256_shuffle_ps(_mm256_castsi256_ps(a),
                                        _mm256_castsi256_ps(b), _MM_SHUFFLE(2,0,2,0));
    // {b3,b2, a3,a2 | b1,b0, a1,a0} from high to low

    // re-arrange pairs of 32-bit elements with vpermpd (or vpermq if you want)
    __m256d ordered = _mm256_permute4x64_pd(_mm256_castps_pd(combined), _MM_SHUFFLE(3,1,2,0));
    return _mm256_castpd_si256(ordered);
}

It compiles to just 2 instructions with immediate shuffle controls, no vector constants. The source looks verbose but it's mostly just casts to keep the compiler happy about types.

# clang -O3 -march=haswell
pack64to32:                             # @pack64to32
        vshufps ymm0, ymm0, ymm1, 136   # ymm0 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]
        vpermpd ymm0, ymm0, 216         # ymm0 = ymm0[0,2,1,3]
        ret


With input reordering to avoid lane-crossing: one vshufps

If you can arrange pairs of input vectors so they have 64-bit elements in {a0, a1 | a4, a5} and {a2, a3 | a6, a7} order, you only need an in-lane shuffle: the low 4x 32-bit elements come from the low halves of each 256-bit input, etc. You can get the job done with one _mm256_shuffle_ps. (Exactly as above, not needing the _mm256_permute4x64_pd). Credit to @chtz in comments under the question for this suggestion.

Mandelbrot doesn't need an interaction between elements, so you could probably use pairs of __m256i vectors with that arrangement of 64-bit elements.

If you're starting with an unrolled loop with something like {0,1,2,3} and {4,5,6,7} and using _mm256_add_epi64 with set1_epi64(8) to increment, you could instead start with {0,1,4,5} and {2,3,6,7} and everything should work identically. (Unless you're doing something else where it matters what order your elements are in?)
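
A sketch of that counter setup (ctr0/ctr1/step/res0/res1 are illustrative names; this is a fragment meant to live inside your loop's enclosing function):

// Counters laid out as {0,1 | 4,5} and {2,3 | 6,7}, so one vshufps packs the
// matching 64-bit results into 32-bit elements 0..7 in order.
__m256i ctr0 = _mm256_setr_epi64x(0, 1, 4, 5);
__m256i ctr1 = _mm256_setr_epi64x(2, 3, 6, 7);
const __m256i step = _mm256_set1_epi64x(8);
// ... per iteration, with 64-bit results res0/res1 computed in the same layout:
// __m256 packed = _mm256_shuffle_ps(_mm256_castsi256_ps(res0),
//                                   _mm256_castsi256_ps(res1), _MM_SHUFFLE(2,0,2,0));
// then advance both counters:
// ctr0 = _mm256_add_epi64(ctr0, step);
// ctr1 = _mm256_add_epi64(ctr1, step);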



Asm vs. intrinsic names

Consult Intel's intrinsics guide to search by asm mnemonic; they're shorter to type and easier to think about in terms of what the machine can actually do, and asm mnemonics are needed for looking up performance in https://uops.info/table.html / https://agner.org/optimize/

How to convert scalar code of the double version of VDT's Pade Exp fast_ex() approx into SSE2?

Yup, dividing two polynomials can often give you a better tradeoff between speed and precision than one huge polynomial. As long as there's enough work to hide the divpd throughput. (The latest x86 CPUs have pretty decent FP divide throughput. Still bad vs. multiply, but it's only 1 uop so it doesn't stall the pipeline if you use it rarely enough, i.e. mixed with lots of multiplies. Including in the surrounding code that uses exp)


However, _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); won't work with SSE2. Packed-conversion intrinsics to/from 64-bit integers require AVX512DQ.

To do packed rounding to the nearest integer, ideally you'd use SSE4.1 _mm_round_pd(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) (or truncation towards zero, or floor or ceil towards -/+Inf).

But we don't actually need that.

The scalar code ends up with int n and double px both representing the same numeric value. It uses the bad/buggy floor(val+0.5) idiom instead of rint(val) or nearbyint(val) to round to nearest, and then converts that already-integer double to an int (with C++'s truncation semantics, but that doesn't matter because the double value's already an exact integer.)

With SIMD intrinsics, it appears to be easiest to just convert to 32-bit integer and back.

__m128i n  = _mm_cvtpd_epi32( _mm_mul_pd(log2e, x) );   // round to nearest
__m128d px = _mm_cvtepi32_pd( n );

Rounding to int with the desired mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.)

The cvtpd2dq and cvtdq2pd instructions are 2 uops each, and they shuffle the 32-bit integers to/from packed position in the low 64 bits of a vector. So to set up for 64-bit integer shifts to stuff the bits into a double again, you'll need to shuffle. The top 64 bits of n will be zeros, so we can use that to create 64-bit integer n lined up with the doubles:

n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3,1,2,0));   // 64-bit integers

But with just SSE2, there are workarounds. Converting to 32-bit integer and back is one option: you don't care about inputs too small or too large. Packed conversion between double and int costs at least 2 uops on Intel CPUs each way, so a total of 4. But only 2 of those uops need the FMA ports; the other 2 are shuffles that run on port 5, which your code probably doesn't bottleneck on with all those multiplies and adds.

Or add a very large number and subtract it again: large enough that the sum lands where the spacing between representable doubles is exactly 1, so normal FP rounding does what you want. (This works even for inputs that don't fit in 32 bits, but not for magnitudes above 2^52, so either approach would work here.) Also see How to efficiently perform double/int64 conversions with SSE/AVX? which uses that trick. I couldn't find an example of that on SO, though.
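
A minimal sketch of that add/subtract rounding trick with plain SSE2 (valid for |x| up to about 2^51; note that -ffast-math could let the compiler optimize the add/sub pair away):

#include <emmintrin.h>

// Round each double to the nearest integer-valued double, following the current rounding mode.
static inline __m128d round_nearest_sse2(__m128d x)
{
    const __m128d magic = _mm_set1_pd(6755399441055744.0);   // 1.5 * 2^52
    return _mm_sub_pd(_mm_add_pd(x, magic), magic);
}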


Related:

  • Fastest Implementation of Exponential Function Using AVX and Fastest Implementation of Exponential Function Using SSE have versions with other speed / precision tradeoffs, for _ps (packed single-precision float).

  • Fast SSE low precision exponential using double precision operations is at the other end of the spectrum, but still for double.

  • How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU? discusses some existing libraries like SVML, and Agner Fog's VCL (GPL licensed). And glibc's libmvec.

Then of course I need to check the whole thing, since I'm not quite sure I've converted it correctly.

Iterating over all 2^64 double bit-patterns is impractical, unlike for float where there are only 4 billion, but maybe iterating over all doubles that have the low 32 bits of their mantissa all zero would be a good start, i.e. check in a loop with:

bitpatterns = _mm_add_epi64(bitpatterns, _mm_set1_epi64x( 1ULL << 32 ));
doubles = _mm_castsi128_pd(bitpatterns);

https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
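
A sketch of that sweep, tracking worst-case relative error in the low lane (vec_exp is a placeholder for whatever packed exp you're testing, and the error metric is just one reasonable choice):

#include <emmintrin.h>
#include <math.h>
#include <stdint.h>

static double max_rel_err_low_lane(void)
{
    double worst = 0.0;
    for (uint64_t hi = 0; hi < (1ULL << 32); hi += 2) {
        __m128i bitpatterns = _mm_set_epi64x((int64_t)((hi + 1) << 32),
                                             (int64_t)(hi << 32));
        __m128d doubles = _mm_castsi128_pd(bitpatterns);
        double x   = _mm_cvtsd_f64(doubles);
        double ref = exp(x);
        if (!isfinite(ref) || ref == 0.0) continue;       // skip overflow/underflow/NaN inputs
        double got = _mm_cvtsd_f64(vec_exp(doubles));     // vec_exp: your packed exp (placeholder)
        double rel = fabs(got - ref) / ref;
        if (rel > worst) worst = rel;
        // ... check the high lane the same way ...
    }
    return worst;
}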


For those last few lines, correcting the result for out-of-range inputs:

The float version you quote just leaves out the range-check entirely. This is obviously the fastest way, if your inputs will always be in range or if you don't care about what happens for out-of-range inputs.

Alternate cheaper range-checking (maybe only for debugging) would be to turn out-of-range values into NaN by ORing the packed-compare result into the result. (An all-ones bit-pattern represents a NaN.)

__m128d out_of_bounds = _mm_cmplt_pd( limit, abs(initial_x) );  // abs = mask off the sign bit
result = _mm_or_pd(result, out_of_bounds);

In general, you can vectorize simple condition setting of a value using branchless compare + blend. Instead of if(x) y=0;, you have the SIMD equivalent of y = (condition) ? 0 : y;, on a per-element basis. SIMD compares produce a mask of all-zero / all-one elements so you can use it to blend.

e.g. in this case cmppd the input and blendvpd the output if you have SSE4.1. Or with just SSE2, and/andnot/or to blend. See SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation for a _ps version of both; _pd is identical.
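
With just SSE2 intrinsics, that and/andnot/or blend might look like the following sketch (result, initial_x, and limit stand in for your variables; INFINITY is from <math.h>):

// result = (initial_x < limit) ? result : +Inf
__m128d in_range = _mm_cmplt_pd(initial_x, limit);                  // all-ones where in range
__m128d keep     = _mm_and_pd(result, in_range);                    // result or 0
__m128d fix      = _mm_andnot_pd(in_range, _mm_set1_pd(INFINITY));  // 0 or +Inf
result           = _mm_or_pd(keep, fix);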

In asm it will look like this:

; result in xmm0  (in need of fixups for out of range inputs)
; initial_x in xmm2
; constants:
; xmm5 = limit
; xmm6 = +Inf
cmpltpd xmm2, xmm5 ; xmm2 = input_x < limit ? 0xffff... : 0
andpd xmm0, xmm2 ; result = result or 0
andnpd xmm2, xmm6 ; xmm2 = 0 or +Inf (In that order because we used ANDN)
orpd xmm0, xmm2 ; result |= 0 or +Inf
; xmm0 = (input < limit) ? result : +Inf

(In an earlier version of the answer, I thought I was maybe saving a movaps to copy a register, but this is just a bog-standard blend. It destroys initial_x, so the compiler needs to copy that register at some point while calculating result, though.)


Optimizations for this special condition

Or in this case, 0.0 is represented by an all-zero bit-pattern, so do a compare that will produce true if in-range, and AND the output with that. (To leave it unchanged or force it to +0.0). This is better than _mm_blendv_pd, which costs 2 uops on most Intel CPUs (and the AVX 128-bit version always costs 2 uops on Intel). And it's not worse on AMD or Skylake.

+-Inf is represented by a bit-pattern of significand=0, exponent=all-ones. (Any other value in the significand represents +-NaN.) Since too-large inputs will presumably still leave non-zero significands, we can't just AND the compare result and OR that into the final result. I think we need to do a regular blend, or something as expensive (3 uops and a vector constant).

It adds 2 cycles of latency to the final result; both the ANDPD and ORPD are on the critical path. The CMPPD and ANDNPD aren't; they can run in parallel with whatever you do to compute the result.

Hopefully your compiler will actually use ANDPS and so on, not PD, for everything except the CMP, because it's 1 byte shorter but otherwise identical, since they're both just bitwise ops. I wrote ANDPD just so I didn't have to explain this in comments.


You might be able to shorten the critical path latency by combining both fixups before applying to the result, so you only have one blend. But then I think you also need to combine the compare results.

Or since your upper and lower bounds are the same magnitude, maybe you can compare the absolute value? (mask off the sign bit of initial_x and do _mm_cmplt_pd(abs_initial_x, _mm_set1_pd(details::EXP_LIMIT))). But then you have to sort out whether to zero or set to +Inf.

If you had SSE4.1 for _mm_blendv_pd, you could use initial_x itself as the blend control for the fixup that might need applying, because blendv only cares about the sign bit of the blend control (unlike with the AND/ANDN/OR version where all bits need to match.)

__m128d fixup = _mm_blendv_pd( _mm_set1_pd(INFINITY), _mm_setzero_pd(), initial_x );  // fixup = (initial_x signbit) ? 0 : +Inf
// see below for generating fixup with an SSE2 integer arithmetic-shift

const __m128d signbit_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffff)); // ~ set1(-0.0)
__m128d abs_init_x = _mm_and_pd( initial_x, signbit_mask );

__m128d out_of_range = _mm_cmpgt_pd(abs_init_x, _mm_set1_pd(details::EXP_LIMIT));

// Conditionally apply the fixup to result
result = _mm_blendv_pd(result, fixup, out_of_range);

