A Fast Method to Round a Double to a 32-Bit Int Explained


A value of the double floating-point type is represented like so:

[figure: layout of an IEEE 754 double — 1 sign bit, 11 exponent bits, 52 mantissa bits]

and it can be seen as two 32-bit integers; now, the int taken in all the versions of your code (assuming it's a 32-bit int) is the one on the right in the figure, so what you are doing in the end is just taking the lowest 32 bits of the mantissa.


Now, to the magic number; as you correctly stated, 6755399441055744 is 2^51 + 2^52; adding such a number forces the double to go into the "sweet range" between 2^52 and 2^53, which, as explained by Wikipedia, has an interesting property:

Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992, the representable numbers are exactly the integers.

This follows from the fact that the mantissa is 52 bits wide.

The other interesting fact about adding 2^51 + 2^52 is that it affects the mantissa only in the two highest bits, which are discarded anyway, since we are taking only its lowest 32 bits.


Last but not least: the sign.

IEEE 754 floating point uses a sign-and-magnitude representation, while integers on "normal" machines use 2's complement arithmetic; how is this handled here?

We talked only about positive integers; now suppose we are dealing with a negative number in the range representable by a 32-bit int, so with absolute value at most 2^31; call it −a. Such a number is obviously made positive by adding the magic number, and the resulting value is 2^52 + 2^51 + (−a).

Now, what do we get if we interpret the mantissa in 2's complement representation? It must be the result of the 2's complement sum of (2^52 + 2^51) and (−a). Again, the first term affects only the upper two bits; what remains in bits 0–50 is the 2's complement representation of (−a) (again, minus the upper two bits).

Since reducing a 2's complement number to a smaller width is done just by cutting away the extra bits on the left, taking the lower 32 bits correctly gives us (−a) in 32-bit 2's complement arithmetic.

Need a fast method to convert large amount of double to string


To cover the full range of double, use sprintf(buf, "%a", some_double). If decimal output is required, use "%e".

Any other code will only be faster if it compromises accuracy or restricts the allowable input range somehow.

The usual approach is to convert the double x to some wide scaled integer and convert that to a string. This implies limitations on x that the OP has not clearly expressed yet.

Even if some other approach appears faster, it may not remain faster as the code evolves or is ported.


What the OP needs to post is speed-test code, for an objective performance assessment.

Is there any method that can do round(a, n) in C++?

AFAIK the standard library provides no such function, but it shouldn't be too hard to roll your own:

#include <iostream>
#include <cmath>

// fast pow for int, credit to https://stackoverflow.com/a/101613/13188071
int ipow(int base, int exp)
{
    int result = 1;
    while (true)
    {
        if (exp & 1)
            result *= base;
        exp >>= 1;
        if (exp == 0)
            break;
        base *= base;
    }

    return result;
}

double round_prec(double n, int prec)
{
    return std::round(n * ipow(10, prec)) / ipow(10, prec);
}

int main()
{
    std::cout << round_prec(3.1415, 2) << '\n';
}

Output:

3.14

This is, however, a bit of a roundabout way of doing it; there's probably a better way that I don't know of.

C: Better method for rounding down double to int

They are fundamentally different: a cast to int truncates the fractional part toward 0, while floor rounds toward −infinity.

Example:

double d = -1.234;
printf("%d %d\n", (int)d, (int)floor(d));

Yields:

-1 -2

64 vs 32 bit double parsing issue with the round-trip format specifier R

I found this interesting point from Microsoft which might explain the issue:

In some cases, Double values formatted with the "R" standard numeric
format string do not successfully round-trip if compiled using the
/platform:x64 or /platform:anycpu switches and run on 64-bit systems.

To work around this problem, you can format Double values by using the
"G17" standard numeric format string. The following example uses the
"R" format string with a Double value that does not round-trip
successfully, and also uses the "G17" format string to successfully
round-trip the original value.

The comment and an example can be found here: https://msdn.microsoft.com/en-us/library/kfsatb94(v=vs.110).aspx

Did old DOS C compilers implement double as 32-bit?

As far as I know there was no C compiler targeting MS-DOS that used a 32-bit wide double; they all used a 64-bit wide double. This was certainly the case by the early 90s. Based on a quick read of the "Floating Point for Real-Time 3D" chapter of the book, it appears that Michael Abrash thought floating-point math of any precision was too slow on anything less than a Pentium CPU. Either the floating-point code you're looking at was intended for Pentium CPUs, or it was used on a non-critical path where performance didn't matter. For performance-critical code meant for earlier CPUs, Abrash implies that he would've used fixed-point arithmetic instead.

In a lot of cases using float instead of double wouldn't have actually made much difference. There are a few reasons. First, if you didn't have an x87 FPU (floating-point unit) installed (a separate chip before the '486), using less precision wouldn't improve performance enough to make software-emulated floating-point arithmetic fast enough to be useful for a game. The second is that the performance of most x87 FPU operations wasn't actually affected by precision. On a Pentium CPU only division was faster when performed at a narrower precision. For earlier x87 FPUs I'm not sure whether precision affected division, though it could affect the performance of multiplication on the 80387. On all x87 FPUs addition would've been the same speed regardless of precision.

The third is that the specific C data type used, whether a 32-bit float, the 64-bit double, or even the 80-bit long double that many compilers supported, didn't actually affect the precision the FPU used during calculations. This is because the FPU didn't have different instructions (or encodings) for the three different precisions it supported. There was no way to tell it to perform a float addition or a double divide. Instead it performed all arithmetic at a precision set in the FPU's control register. (Or more accurately, it performed arithmetic as if using infinite precision and then rounded the result to the set precision.) While it would've been possible to change this register every time a floating-point instruction was used, that would have caused a massive decrease in performance, so compilers never did it. Instead they just set it to either 80-bit or 64-bit precision at program startup and left it that way.

Now, it was actually a common technique for 3D games to set the FPU to single precision. This meant floating-point arithmetic, whether using double or float types, would be performed using single-precision arithmetic. While this would end up only affecting the performance of floating-point divides, 3D graphics programming tends to do a lot of divisions in critical code (e.g. perspective divides), so this could yield a significant performance improvement.

There is, however, one way that using float instead of double could improve performance, and that's simply because a float takes up half the space of a double. If you have a lot of floating-point values, then having to read and write half as much memory can make a significant difference in performance. However, on Pentium or earlier PCs this wouldn't result in the huge performance difference it would today. The gap between CPU speed and RAM speed wasn't as wide back then, and floating-point performance was a fair bit slower. Still, it would be a worthwhile optimization if the extra precision isn't needed, as is usually the case in games.

Note that modern x86 C compilers don't normally use x87 FPU instructions for floating-point arithmetic; instead they use scalar SSE instructions, which, unlike the x87 instructions, do come in single- and double-precision versions (but no 80-bit extended-precision version). Except for division, this doesn't make any performance difference, but it does mean that results are always rounded to float or double precision after every operation. When doing math on the x87 FPU, this rounding would only happen when the result was written to memory. This means SSE floating-point code now has predictable results, while x87 FPU code had unpredictable results, because it was in general hard to predict when the compiler would need to spill a floating-point register to memory to make room for something else.

So basically using float instead of double wouldn't have made a big performance difference except when storing floating-point values in a big array or other large data structure in memory.

How to efficiently perform double/int64 conversions with SSE/AVX?

There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (Also support for conversion to/from 32-bit unsigned). See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.

If you only have AVX2 or less, you'll need tricks like the ones below for packed conversion. (For scalar, x86-64 has scalar int64_t <-> double or float from SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)


If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions:

  • If you don't care about infinity or NaN.
  • For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
  • For double <-> uint64_t, you only care about values in the range [0, 2^52).

double -> uint64_t

//  Only works for inputs in the range: [0, 2^52)
__m128i double_to_uint64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

double -> int64_t

//  Only works for inputs in the range: [-2^51, 2^51]
__m128i double_to_int64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
    return _mm_sub_epi64(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
    );
}

uint64_t -> double

//  Only works for inputs in the range: [0, 2^52)
__m128d uint64_to_double(__m128i x){
    x = _mm_or_si128(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0010000000000000));
}

int64_t -> double

//  Only works for inputs in the range: [-2^51, 2^51]
__m128d int64_to_double(__m128i x){
    x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));
}

Rounding Behavior:

  • For the double -> uint64_t conversion, rounding works correctly following the current rounding mode. (which is usually round-to-even)
  • For the double -> int64_t conversion, rounding will follow the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.

How does it work?

Despite this trick being only 2 instructions, it's not entirely self-explanatory.

The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.

To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.

Now all that's left is to remove the upper 12 bits. This is easily done by masking it out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has more throughput.

Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.

The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.

Related: A fast method to round a double to a 32-bit int explained


Full Range int64 -> double:

After many years, I finally had a need for this.

  • 5 instructions for uint64_t -> double
  • 6 instructions for int64_t -> double

uint64_t -> double

__m128d uint64_to_double_full(__m128i x){
    __m128i xH = _mm_srli_epi64(x, 32);
    xH = _mm_or_si128(xH, _mm_castpd_si128(_mm_set1_pd(19342813113834066795298816.)));          //  2^84
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0xcc);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(19342813118337666422669312.));     //  2^84 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

int64_t -> double

__m128d int64_to_double_full(__m128i x){
    __m128i xH = _mm_srai_epi32(x, 16);
    xH = _mm_blend_epi16(xH, _mm_setzero_si128(), 0x33);
    xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(442721857769029238784.)));              //  3*2^67
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0x88);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(442726361368656609280.));          //  3*2^67 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

These work for the entire 64-bit range and are correctly rounded according to the current rounding mode.

These are similar to wim's answer below, but with more abusive optimizations. As such, deciphering them is also left as an exercise for the reader.

How to speed up floating-point to integer number conversion?

Most of the other answers here just try to eliminate loop overhead.

Only deft_code's answer gets to the heart of what is likely the real problem: converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.

Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU

Sorry to provide only a link, but really, anything written here short of reproducing that excellent article is not going to make things clear.


