Difference in Floating Point Arithmetic Between x86 and x64

Difference in floating point arithmetic between x86 and x64

The issue hinges on this expression:

bool bLarger2 = (a*c)<b;

I looked at the code generated under VS2008, not having VS2010 to hand. For 64 bit the code is:


000000013FD51100 movss xmm1,dword ptr [a]
000000013FD51106 mulss xmm1,dword ptr [c]
000000013FD5110C movss xmm0,dword ptr [b]
000000013FD51112 comiss xmm0,xmm1

For 32 bit the code is:


00FC14DC fld dword ptr [a]
00FC14DF fmul dword ptr [c]
00FC14E2 fld dword ptr [b]
00FC14E5 fcompp

So under 32 bit the calculation is performed in the x87 unit, and under 64 bit it is performed in the SSE unit.

And the difference here is that the x87 operations are all performed to higher than single precision; by default the calculations are performed to double precision. On the other hand, the SSE instructions used here (mulss, comiss) are pure single precision calculations.

You can persuade the 32 bit unit to perform all calculations to single precision accuracy like this:

_controlfp(_PC_24, _MCW_PC);

When you add that to your 32 bit program you will find that the booleans are both set to false.
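
For reference, a minimal sketch of that experiment (the a, b and c values here are hypothetical stand-ins chosen to straddle a single-precision rounding boundary, since the original values aren't shown; _controlfp is MSVC-specific, from <float.h>):

#include <cmath>     // std::ldexp
#include <cstdio>
#include <float.h>   // _controlfp, _PC_24, _MCW_PC (MSVC-specific)

int main() {
    float eps = std::ldexp(1.0f, -23);                        // ulp of 1.0f
    volatile float a = 1.0f - eps, c = 1.0f + eps, b = 1.0f;  // a*c == 1 - 2^-46 exactly

    bool before = (a * c) < b;     // with x87 code generation and double-precision
                                   // temporaries (the Windows default), this is true
    _controlfp(_PC_24, _MCW_PC);   // force the x87 unit to 24-bit (single) precision
    bool after  = (a * c) < b;     // the product now rounds to 1.0f, so this is false

    printf("before=%d after=%d\n", before, after);
    // With SSE code generation (/arch:SSE2, or any x64 build) both are false,
    // because every operation is rounded to single precision anyway.
}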

There is a fundamental difference in the way that the x87 and SSE floating point units work. The x87 unit uses the same instructions for both single and double precision types. Data is loaded into registers in the x87 FPU stack, and those registers are always 10 byte Intel extended. You can control the precision using the floating point control word. But the instructions that the compiler writes are ignorant of that state.

On the other hand, the SSE unit uses different instructions for operations on single and double precision. Which means that the compiler can emit code that is in full control of the precision of the calculation.

So, the x87 unit is the bad guy here. You can maybe try to persuade your compiler to emit SSE instructions even for 32 bit targets. And certainly when I compiled your code under VS2013 I found that both 32 and 64 bit targets emitted SSE instructions.

float arithmetic and x86 and x64 context

As others have speculated in the comments, the difference you're observing is the result of differential precision when doing floating-point arithmetic, arising out of a difference between how the 32-bit and 64-bit builds perform these operations.

Your code is translated by the 32-bit (x86) JIT compiler into the following object code:

fld   qword ptr ds:[0E63308h]  ; Load constant 1.0e+11 onto top of FPU stack.
sub esp, 8 ; Allocate 8 bytes of stack space.
fstp qword ptr [esp] ; Pop top of FPU stack, putting 1.0e+11 into
; the allocated stack space at [esp].
call 73792C70 ; Call internal helper method that converts the
; double-precision floating-point value stored at [esp]
; into a 64-bit integer, and returns it in edx:eax.
; At this point, edx:eax == 100000000000.

Notice that the optimizer has folded your arithmetic computation ((1000f * 1000f) * 100000f) to the constant 1.0e+11. It has stored this constant in the binary's data segment, and loads it onto the top of the x87 floating-point stack (the fld instruction).

The code then allocates 8 bytes of stack space (enough for a 64-bit double-precision floating-point value) by subtracting the stack pointer (esp). The fstp instruction pops the value off the top of the x87 floating-point stack and stores it in its memory operand; in this case, it stores it into the 8 bytes that we just allocated on the stack. All of this shuffling is rather pointless: the JIT could have copied the 64-bit constant directly into the allocated stack space, bypassing the round trip through the x87 FPU, but the JIT optimizer isn't perfect.

Finally, the JIT emitted code to call an internal helper function that converts the double-precision floating-point value stored in memory (1.0e+11) into a 64-bit integer. The 64-bit integer result is returned in the register pair edx:eax, as is customary for 32-bit Windows calling conventions. When this code completes, edx:eax contains the 64-bit integer value 100000000000, or 1.0e+11, exactly as you would expect.

(Hopefully the terminology here is not too confusing. Note that there are two different "stacks". The x87 FPU has a series of registers, which are accessed like a stack. I refer to this as the FPU stack. Then, there is the stack with which you are probably familiar, the one stored in main memory and accessed via the stack pointer, esp.)



However, things are done a bit differently by the 64-bit (x86-64) JIT compiler. The big difference here is that 64-bit targets always use SSE2 instructions for floating-point operations, since all chips that support AMD64 also support SSE2, and SSE2 is more efficient and more flexible than the old x87 FPU. Specifically, the 64-bit JIT translates your code into the following:

movsd  xmm0, mmword ptr [7FFF7B1A44D8h]  ; Load constant into XMM0 register.
call 00007FFFDAC253B0 ; Call internal helper method that converts the
; floating-point value in XMM0 into a 64-bit int
; that is returned in RAX.

Things immediately go wrong here, because the constant value being loaded by the first instruction is 0x42374876E0000000, which is the binary floating-point representation of 99999997952.0. The problem is not the helper function that is doing the conversion to a 64-bit integer. Instead, it is the JIT compiler itself, specifically the optimizer routine that is pre-computing the constant.

To gain some insight into how that goes wrong, we'll turn off JIT optimization and see what the code looks like:

movss    xmm0, dword ptr [7FFF7B1A4500h]  
movss dword ptr [rbp-4], xmm0
movss xmm0, dword ptr [rbp-4]
movss xmm1, dword ptr [rbp-4]
mulss xmm0, xmm1
mulss xmm0, dword ptr [7FFF7B1A4504h]
cvtss2sd xmm0, xmm0
call 00007FFFDAC253B0

The first movss instruction loads a single-precision floating-point constant from memory into the xmm0 register. This time, however, that constant is 0x447A0000, which is the precise binary representation of 1000—the initial float value from your code.

The second movss instruction turns right around and stores this value from the xmm0 register into memory, and the third movss instruction re-loads the just-stored value from memory back into the xmm0 register. (Told you this was unoptimized code!) It also loads a second copy of that same value from memory into the xmm1 register, and then multiplies (mulss) the two single-precision values in xmm0 and xmm1 together. This is the literal translation of your val = val * val code. The result of this operation (which ends up in xmm0) is 0x49742400, or 1.0e+6, precisely as you would expect.

The second mulss instruction performs the val * 100000.0f operation. It implicitly loads the single-precision floating-point constant 1.0e+5 and multiplies it with the value in xmm0 (which, recall, is 1.0e+6). Unfortunately, the result of this operation is not what you would expect. Instead of 1.0e+11, it is actually 9.9999998e+10. Why? Because 1.0e+11 cannot be precisely represented as a single-precision floating-point value. The closest representation is 0x51BA43B7, or 9.9999998e+10.

Finally, the cvtss2sd instruction performs an in-place conversion of the (wrong!) scalar single-precision floating-point value in xmm0 to a scalar double-precision floating-point value. In a comment to the question, Neitsa suggested that this might be the source of the problem. In fact, as we have seen, the source of the problem is the previous instruction, the one that does the multiplication. The cvtss2sd just converts an already imprecise single-precision floating-point representation (0x51BA43B7) to an imprecise double-precision floating point representation: 0x42374876E0000000, or 99999997952.0.

And this is precisely the series of operations performed by the JIT compiler to produce the initial double-precision floating-point constant that is loaded into the xmm0 register in the optimized code.

Although I have been implying throughout this answer that the JIT compiler is to blame, that is not the case at all! If you had compiled the identical code in C or C++ while targeting the SSE2 instruction set, you would have gotten exactly the same imprecise result: 99999997952.0. The JIT compiler is performing just as one would expect it to—if, that is, one's expectations are correctly calibrated to the imprecision of floating-point operations!



So, what is the moral of this story? There are two of them. First, floating-point operations are tricky and there is a lot to know about them. Second, in light of this, always use the most precision that you have available when doing floating-point arithmetic!

The 32-bit code is producing the correct result because it is operating with double-precision floating-point values. With 64 bits to play with, a precise representation of 1.0e+11 is possible.

The 64-bit code is producing the incorrect result because it is using single-precision floating-point values. With only 32 bits to play with, a precise representation of 1.0e+11 is not possible.
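
You can reproduce both results in plain C++, without the JIT involved at all; on an IEEE-754 system where float expressions are evaluated in single precision (as with SSE2), a sketch like this prints the two values discussed above:

#include <cstdio>

int main() {
    float  f = (1000.0f * 1000.0f) * 100000.0f;  // rounded to single precision: 9.9999998e+10
    double d = (1000.0  * 1000.0)  * 100000.0;   // double precision: exactly 1.0e+11
    printf("float : %.1f\n", f);   // 99999997952.0
    printf("double: %.1f\n", d);   // 100000000000.0
}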

You would not have had this problem if you had used the double type to begin with:

double val = 1000.0;
val = val * val;
return (ulong)(val * 100000.0);

This ensures the correct result on all architectures, with no need for ugly, non-portable bit-manipulation hacks like those suggested in the question. (Which still cannot ensure the correct result, since it doesn't solve the root of the problem, namely that your desired result cannot be directly represented in a 32-bit single-precision float.)

Even if you have to take input as a single-precision float, convert it immediately into a double, and do all of your subsequent arithmetic manipulations in the double-precision space. That would still have solved this problem, since the initial value of 1000 can be precisely represented as a float.
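
In C++ terms, a sketch of that idea (the original code is C#, but the arithmetic is identical; the name scale is just a placeholder):

// 1000.0f is exactly representable as a float, so widening it to double before
// any arithmetic loses nothing, and every later operation gets 53 bits of precision.
unsigned long long scale(float input) {
    double val = input;                                       // widen immediately
    val = val * val;                                          // 1.0e+6, exact in double
    return static_cast<unsigned long long>(val * 100000.0);   // 1.0e+11, exact in double
}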

Why does Math.Exp give different results between 32-bit and 64-bit, with same input, same hardware

Yes, rounding errors, and it is effectively NOT the same hardware. The 32-bit version targets a different instruction set and different register sizes.

64 bit floating point porting issues

There is no inherent need for floats and doubles to behave differently between 32-bit and 64-bit code but frequently they do. The answer to your question is going to be platform and compiler specific so you need to say what platform you are porting from and what platform you are porting to.

On Intel x86 platforms, 32-bit code often uses the x87 co-processor instruction set and floating-point register stack for maximum compatibility, whereas on amd64/x86_64 platforms the SSE* instructions and xmm* registers are often used instead. These have different precision characteristics.

Post edit:

Given your platform, you might want to consider trying -mfpmath=387 (the default for i386 gcc) on your x86_64 build to see if this explains the differing results. You may also want to look at the settings for the floating-point-related -f compiler switches (-ffast-math and friends) to ensure that they match what you want in both builds.
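
If you are not sure which mode a given build ended up in, the FLT_EVAL_METHOD macro from <cfloat> is a quick sanity check; a small sketch:

#include <cfloat>
#include <cstdio>

int main() {
    // 0: float/double expressions are evaluated at their own precision (typical of SSE math)
    // 2: everything is evaluated at long double precision (typical of x87 / -mfpmath=387)
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
}
// e.g. g++ -m32 -mfpmath=387 usually prints 2, while -mfpmath=sse -msse2
// (or any x86_64 build) prints 0.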

Do 64-bit floating-point numbers behave identically on all modern PCs?

Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.

The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward positive infinity, toward negative infinity, or toward zero). In "to nearest" mode, the standard also specifies how ties are broken.

Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
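
As a small illustration of the rounding-mode part of that guarantee, the same division produces a different (but fully specified) result under each mode; this sketch uses <cfenv> and volatile operands so the division actually happens at run time:

#include <cfenv>
#include <cstdio>

int main() {
    // Strictly, #pragma STDC FENV_ACCESS ON should be in effect when changing
    // rounding modes; not every compiler honors it, but the volatile operands
    // keep the divisions from being folded at compile time.
    volatile double one = 1.0, three = 3.0;

    std::fesetround(FE_TONEAREST);
    printf("%.17g\n", one / three);   // 0.33333333333333331

    std::fesetround(FE_UPWARD);
    printf("%.17g\n", one / three);   // 0.33333333333333337 (the next representable double up)
}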

However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = b*c + d" in C, where all variables are declared double, the compiler is free to compute "b*c" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor.

Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.

Some processors have instructions to compute a multiply and an add in one instruction, so "b*c + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes b*c and then adds d.
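
For example, the fused and unfused forms of b*c + d can differ in the last bit; a sketch using std::fma (compile with contraction disabled, e.g. -ffp-contract=off, so the compiler does not fuse the "separate" line on its own):

#include <cmath>
#include <cstdio>

int main() {
    double b = 1.0 + std::ldexp(1.0, -27);   // 1 + 2^-27
    double c = 1.0 - std::ldexp(1.0, -27);   // 1 - 2^-27
    double d = -1.0;

    double separate = b * c + d;             // b*c = 1 - 2^-54 rounds to 1.0, so this is 0.0
    double fused    = std::fma(b, c, d);     // a single rounding at the end: exactly -2^-54

    printf("separate = %g\nfused    = %g\n", separate, fused);
}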

Your compiler might have switches to control behavior like this.

There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.

A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.

Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.

When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.

In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.

Does any floating point-intensive code produce bit-exact results in any x86-based architecture?

Table of contents:

  • C/C++
  • asm
  • Creating real-life software that achieves this.

In C or C++:

No, a fully ISO C11 and IEEE-conforming C implementation does not guarantee bit-identical results to other C implementations, even other implementations on the same hardware.

(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc., even though it would be legal for a C implementation on x86 to use some other format for double and implement FP math with software emulation, and define the limits in float.h. That might have been plausible when not all x86 CPUs came with an FPU, but in 2016 that's Deathstation 9000 territory.)
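
If you ever want to check that assumption from code, a sketch like this will do; on any mainstream x86 toolchain it reports an IEEE-754 binary64 double:

#include <cstdio>
#include <limits>

int main() {
    printf("double is IEC 60559 / IEEE 754: %s\n",
           std::numeric_limits<double>::is_iec559 ? "yes" : "no");
    printf("significand bits: %d\n", std::numeric_limits<double>::digits);  // 53 for binary64
}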


related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff):

Is IEEE floating-point math deterministic? Will you always get the same results from the same inputs? The answer is an unequivocal “yes”. Unfortunately the answer is also an unequivocal “no”. I’m afraid you will need to clarify your question.

If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.


First problem: Only "basic operations": + - * / and sqrt are required to return "correctly rounded" results, i.e. <= 0.5ulp of error, correctly-rounded out to the last bit of the mantissa, so the result is the closest representable value to the exact result.

Other math library functions like pow(), log(), and sin() allow implementers to make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.


But wait, it gets worse. Even code that only uses the correctly-rounded basic operations doesn't guarantee the same results.

C rules also allow some flexibility in keeping higher precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT off) to forbid the compiler from e.g. turning a*b + c into an FMA with no rounding of the a*b temporary before the add.

On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard specifies that rounding still happens on every assignment, but real compilers like gcc don't actually do extra store/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within evaluation of one expression.)

So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... Or does it? Not always, because the C implementation's startup code (or a Direct3D library!!!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to a 53- or 24-bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this is from Bruce Dawson's article about intermediate FP precision.
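
Here is a small sketch of that excess-precision effect; build it once with g++ -m32 -mfpmath=387 and once with -mfpmath=sse -msse2 (or as a 64-bit build) and the printed value can differ, subject to the excess-precision and precision-control caveats above:

#include <cmath>
#include <cstdio>

// The sum of three floats, written with no explicit intermediate rounding.
float sum3(float a, float b, float c) { return a + b + c; }

int main() {
    volatile float half_ulp = std::ldexp(1.0f, -25);  // 2^-25, half a ulp of 1.0f
    // Strict single precision (SSE): 1.0f + 2^-25 rounds back to 1.0f twice -> prints 1
    // x87 keeping the temporary at more than 24-bit precision: it holds
    // 1 + 2^-24 -> prints 1.00000012
    printf("%.9g\n", sum3(1.0f, half_ulp, half_ulp));
}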


In assembly:

With rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.

Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range-reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (correctly-rounded 64-bit mantissa, actually 66-bit because the next 2 bits of the exact value are zero). Intel's documentation of the worst-case-error was off by a factor of 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst-case actually was. But this can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.

I don't know if AMD implements their FSIN and other micro-coded instructions to always give bit-identical results to Intel, but I wouldn't be surprised. Some software does rely on it, I think.


Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operation exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or denormals-are-zero and/or flush-to-zero are different and you have any denormals).

SSE rsqrt (fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration, but other than that SSE/SSE2 is always bit exact in asm, assuming the MXCSR isn't set weird. So the only question is getting the compiler to generate the same code, or just using the same binaries.


In real life:

So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...

As @Yan Zhou points out, you pretty much need to control every bit of the implementation down to the asm to get bit-exact results.

However, some games really do depend on this for multi-player, but often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.

In the Spring RTS, clients checksum their gamestate to detect desync. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.

One possible reason for some games not allowing multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with a different compiler.

Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results. e.g. wrap sin/cos/tan in non-optimized function calls to force the compiler to leave them at single-precision.

Binary64 floating point addition rounding mode error and behaviors difference 32/64 bits

The FPU registers are 80 bits wide; whenever a single or double precision number is loaded with fld and its variants it is converted to double extended precision by default¹.

Thus fadd usually works with 80-bit numbers.

The SSE registers are format agnostic and the SSE extensions don't support the double extended precision.

For example, addpd works with double precision numbers.


The default rounding mode is round to nearest (even), which means the usual round to nearest, but toward the even end in case of a tie (e.g. 4.5 => 4).

To implement the IEEE 754 requirement to perform arithmetic as though with infinite precision, the hardware needs two guard bits and a sticky bit²


double

I'll write a double precision number as

<sign> <unbiased exponent in decimal> <implicit integer part> <52-bit mantissa> | <guard bits> <sticky bit>

The two numbers

2.500244140625
4503599627370496

are

+  1 1 0100000000 0010000000 0000000000 0000000000 0000000000 00
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00

The first one is shifted

+ 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1   
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00 |00 0

The sum is done

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1

Rounding to nearest (even) gives

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 11

because 0 |10 1 is closer to 1 |00 0 than 0 |00 0.

double extended

The two numbers are

+  1 1 0100000000 0010000000 0000000000 0000000000 0000000000 0000000000 000
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000

The first is shifted

+ 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000 | 00 0

The sum is done

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0

Rounding to nearest (even):

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000

as 0 | 10 0 is a tie, which is broken toward the nearest even value.

When this number is then converted from double extended precision to double precision (due to an fstp QWORD []), the rounding is repeated, using bits 52, 53 and 54 of the double extended mantissa as guard and sticky bits

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10|100

+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10

because 0|100 is again a tie, which is broken toward the nearest even value.
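
The whole walkthrough can be checked with a few lines of code; a build that rounds the sum once, directly to binary64 (SSE2, or any strict double implementation), prints ...499, while one that goes through the x87 unit at double extended precision and then stores to a double (e.g. gcc -m32 -mfpmath=387) double-rounds and prints ...498:

#include <cstdio>

int main() {
    volatile double a = 2.500244140625;      // 2 + 2^-1 + 2^-12
    volatile double b = 4503599627370496.0;  // 2^52
    double sum = a + b;
    printf("%.17g\n", sum);
    // single rounding to double        : 4503599627370499
    // x87 extended, then fstp to double: 4503599627370498
}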


¹ See Section 8.5.1.2 of the Intel Manual - Volume 1.

² The guard bits are extra precision bits retained after one of the numbers is shifted to make the exponents match. The sticky bit is the OR of the bits less significant than the least guard bit. See the "on Rounding" section of this page and Goldberg for a formal approach.


