Floating Point Arithmetic Not Producing Exact Results

Floating point arithmetic not producing exact results

If you need exact decimal values, you should use java.math.BigDecimal. Then read "What Every Computer Scientist Should Know About Floating-Point Arithmetic" for the background of why you're getting those results.

(I have a .NET-centric article which you may find easier to read - and certainly shorter. The differences between Java and .NET are mostly irrelevant for the purposes of understanding this issue.)

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.

Does any floating point-intensive code produce bit-exact results in any x86-based architecture?

Table of contents:

  • C/C++
  • asm
  • Creating real-life software that achieves this.

In C or C++:

No, a fully ISO C11 and IEEE-conforming C implementation does not guarantee bit-identical results to other C implementations, even other implementations on the same hardware.

(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc., even though it would be legal for a C implementation on x86 to use some other format for double and implement FP math with software emulation, and define the limits in float.h. That might have been plausible when not all x86 CPUs included with an FPU, but in 2016 that's Deathstation 9000 territory.)


related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff):

Is IEEE floating-point math deterministic? Will you always get the same results from the same inputs? The answer is an unequivocal “yes”. Unfortunately the answer is also an unequivocal “no”. I’m afraid you will need to clarify your question.

If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.


First problem: Only "basic operations": + - * / and sqrt are required to return "correctly rounded" results, i.e. <= 0.5ulp of error, correctly-rounded out to the last bit of the mantissa, so the results is the closest representable value to the exact result.

Other math library functions like pow(), log(), and sin() allow implementers to make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.


But wait, it gets worse. Even code that only uses the correctly-rounded basic operations doesn't guarantee the same results.

C rules also allow some flexibility in keeping higher precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT off) to forbid the compiler from e.g. turning a*b + c into an FMA with no rounding of the a*b temporary before the add.

On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard specifies that rounding still happens on every assignment, but real compilers like gcc don't actually do extra store/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within evaluation of one expression.)

So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... Or does it? Not always, because the C implementation's startup code (or a Direct3D library!!!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to 53 or 24 bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this from Bruce Dawson's article about intermediate FP precision).


In assembly:

With rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.

Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range-reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (correctly-rounded 64-bit mantissa, actually 66-bit because the next 2 bits of the exact value are zero). Intel's documentation of the worst-case-error was off by a factor of 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst-case actually was. But this can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.

I don't know if AMD implements their FSIN and other micro-coded instructions to always give bit-identical results to Intel, but I wouldn't be surprised. Some software does rely on it, I think.


Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operation exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or denormals-are-zero and/or flush-to-zero are different and you have any denormals).

SSE rsqrt (fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration, but other than that SSE/SSE2 is always bit exact in asm, assuming the MXCSR isn't set weird. So the only question is getting the compiler go generate the same code, or just using the same binaries.


In real life:

So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...

As @Yan Zhou points out, you pretty much need to control every bit of the implementation down to the asm to get bit-exact results.

However, some games really do depend on this for multi-player, but often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.

In the Spring RTS, clients checksum their gamestate to detect desync. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.

One possible reason for some games not allowing multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with a different compiler.

Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results. e.g. wrap sin/cos/tan in non-optimized function calls to force the compiler to leave them at single-precision.

c++ incorrect floating point arithmetic

There are a lot of numbers that computers cannot represent, even if you use float or double-precision float. 1/3, or .3 repeating, is one of those numbers. So it just does the best it can, which is the result you get.

See http://floating-point-gui.de/, or google float precision, there's a ton of info out there (including many SO questions) on this subject.

To answer your questions -- yes, this is an inherent limitation in both the float class and the double class. Some mathematical programs (MathCAD, probably Mathematica) can do "symbolic" math, which allows calculation of the "correct" answers. In many cases, the round-off error can be managed, even over really complex computations, such that the top 6-8 decimal places are correct. However, the opposite is true as well -- naive computations can be constructed that return wildly incorrect answers.

For small problems like division of whole numbers, you'll get a decent number of decimal place accuracy (maybe 4-6 places). If you use double precision floats, that will go up to maybe 8. If you need more... well, I'd start questioning why you want that many decimal places.

Floating point inexact result with not-too-small decimal fraction - is there a way to overcome?

… the result is … show that the first calculation is inexact while the second is exact.

This inference is incorrect. The fact that 1 - .05 - .05 == .9 is false and 1 - .1 == .9 is true shows that the rounding errors that occur in computing 1 - .05 - .05 do not produce a result that equals .9, but that the rounding errors that occur in computing 1 - .1 do produce a result that equals .9. Since the floating-point result of using .9 in source code it itself not .9, the fact that 1 - .1 equals .9 does not mean that 1 - .1 is exact. It just means it got the same rounding error as .9.

In the expression 1 - .1 == .9, there are three rounding errors:

  • When .1 is converted from a decimal numeral in the source code to a binary floating-point value, the result is not .1 but is 0.1000000000000000055511151231257827021181583404541015625.
  • When .9 is converted from a decimal numeral in the source code to a binary floating-point value, the result is 0.90000000000000002220446049250313080847263336181640625.
  • When the former number, 0.1000000000000000055511151231257827021181583404541015625, is subtracted from 1, we can see that, since it is greater than .1, the result ought to be under .9, something like .8999…. But the result is 0.90000000000000002220446049250313080847263336181640625.

It just so happens that the rounding errors produced the same result on both sides, so the test for equality evaluated to true.

In computing 1 - .05 - .05, the rounding errors are:

  • .05 in the source code is converted to 0.05000000000000000277555756156289135105907917022705078125.
  • 1 - .05 has the result 0.9499999999999999555910790149937383830547332763671875.
  • 1 - .05 - .05 has the result 0.899999999999999911182158029987476766109466552734375`.

What has happened here is that, in subtracting 1 - .05, there happened to be a nearby representable value that was slightly less than .95. Thus, when subtracting .05, which we see is slightly greater than .05, we are subtracting something a bit greater than .05 from something a bit less than .95. This combination of errors produces a mathematical result enough under .9 that it is closer to the next lower representable value, 0.899999999999999911182158029987476766109466552734375, than it is to 0.90000000000000002220446049250313080847263336181640625.

When you do similar calculations with long double, the rounding errors happen to be different. When examining floating-point calculations in these ways, long double will not be better than double in the frequency with which computed results happen to land on the same results you would get with ideal mathematics. It may have happened in your example with the numbers you chose, but using other numbers would produce effectively random results, somewhat as if each computation had a random error up or down one step. If those errors happen to have the same net effect in two different computations, then the two results are equal. If those errors happen to have different net effect, then the two results are not equal.

Floating point inaccuracy examples

There are basically two major pitfalls people stumble in with floating-point numbers.

  1. The problem of scale. Each FP number has an exponent which determines the overall “scale” of the number so you can represent either really small values or really larges ones, though the number of digits you can devote for that is limited. Adding two numbers of different scale will sometimes result in the smaller one being “eaten” since there is no way to fit it into the larger scale.

    PS> $a = 1; $b = 0.0000000000000000000000001
    PS> Write-Host a=$a b=$b
    a=1 b=1E-25
    PS> $a + $b
    1

    As an analogy for this case you could picture a large swimming pool and a teaspoon of water. Both are of very different sizes, but individually you can easily grasp how much they roughly are. Pouring the teaspoon into the swimming pool, however, will leave you still with roughly a swimming pool full of water.

    (If the people learning this have trouble with exponential notation, one can also use the values 1 and 100000000000000000000 or so.)

  2. Then there is the problem of binary vs. decimal representation. A number like 0.1 can't be represented exactly with a limited amount of binary digits. Some languages mask this, though:

    PS> "{0:N50}" -f 0.1
    0.10000000000000000000000000000000000000000000000000

    But you can “amplify” the representation error by repeatedly adding the numbers together:

    PS> $sum = 0; for ($i = 0; $i -lt 100; $i++) { $sum += 0.1 }; $sum
    9,99999999999998

    I can't think of a nice analogy to properly explain this, though. It's basically the same problem why you can represent 1/3 only approximately in decimal because to get the exact value you need to repeat the 3 indefinitely at the end of the decimal fraction.

    Similarly, binary fractions are good for representing halves, quarters, eighths, etc. but things like a tenth will yield an infinitely repeating stream of binary digits.

  3. Then there is another problem, though most people don't stumble into that, unless they're doing huge amounts of numerical stuff. But then, those already know about the problem. Since many floating-point numbers are merely approximations of the exact value this means that for a given approximation f of a real number r there can be infinitely many more real numbers r1, r2, ... which map to exactly the same approximation. Those numbers lie in a certain interval. Let's say that rmin is the minimum possible value of r that results in f and rmax the maximum possible value of r for which this holds, then you got an interval [rmin, rmax] where any number in that interval can be your actual number r.

    Now, if you perform calculations on that number—adding, subtracting, multiplying, etc.—you lose precision. Every number is just an approximation, therefore you're actually performing calculations with intervals. The result is an interval too and the approximation error only ever gets larger, thereby widening the interval. You may get back a single number from that calculation. But that's merely one number from the interval of possible results, taking into account precision of your original operands and the precision loss due to the calculation.

    That sort of thing is called Interval arithmetic and at least for me it was part of our math course at the university.

Division is incorect in java

.intValue() will trunc the frarctinal part so you can use Math.ceil(), Math.floor() or you can use Math.round() to approximate it to the nearest value

Integer result = (int) Math.round(new Double(33/(-2*1.1))); //-15
Integer result = (int) Math.floor(new Double(33/(-2*1.1))); //-15
Integer result = (int) Math.ceil(new Double(33/(-2*1.1))); //-14

You can see that Math.ceil() give us 14 because this is a negative number -14>-15 so the ceil of -14.9999 is -14 and the inverse apply on Math.floor()

Why do floats seem to add incorrectly in Java?

Floats are an approximation of the actual number in Java, due to the way they're stored. If you need exact values, use a BigDecimal instead.



Related Topics



Leave a reply



Submit