Dealing With Accuracy Problems in Floating-Point Numbers

Dealing with accuracy problems in floating-point numbers

A very simple and effective way to round a floating point number to an integer:

int rounded = (int)(f + 0.5);

Note: this only works if f is always positive. (thanks j random hacker)

How to deal with floating point number precision in JavaScript?

From the Floating-Point Guide:

What can I do to avoid this problem?
That depends on what kind of
calculations you’re doing.
If you really need your results to add up exactly, especially when you
work with money: use a special decimal
datatype.
If you just don’t want to see all those extra decimal places: simply
format your result rounded to a fixed
number of decimal places when
displaying it.
If you have no decimal datatype available, an alternative is to work
with integers, e.g. do money
calculations entirely in cents. But
this is more work and has some
drawbacks.

Note that the first point only applies if you really need specific precise decimal behaviour. Most people don't need that, they're just irritated that their programs don't work correctly with numbers like 1/10 without realizing that they wouldn't even blink at the same error if it occurred with 1/3.

If the first point really applies to you, use BigDecimal for JavaScript, which is not elegant at all, but actually solves the problem rather than providing an imperfect workaround.

Practices to limit floating point accuracy problems

Use fixed-point mathematics where you can deal with a known limited precision.

As an example, the Rockbox music player firmware uses almost entirely fixed-point media codecs.

If you must be perfectly accurate, use an infinite-length storage type like those provided by the GMP library.

If you're just trying to cut down on your errors, try to work as close to zero as possible, where the IEEE FP numbers are more precise. Reorder your operations to avoid letting your absolute values get too large.

Floating point inaccuracy examples

There are basically two major pitfalls people stumble in with floating-point numbers.

The problem of scale. Each FP number has an exponent which determines the overall “scale” of the number so you can represent either really small values or really larges ones, though the number of digits you can devote for that is limited. Adding two numbers of different scale will sometimes result in the smaller one being “eaten” since there is no way to fit it into the larger scale.
```
PS> $a = 1; $b = 0.0000000000000000000000001
PS> Write-Host a=$a b=$b
a=1 b=1E-25
PS> $a + $b
1
```
As an analogy for this case you could picture a large swimming pool and a teaspoon of water. Both are of very different sizes, but individually you can easily grasp how much they roughly are. Pouring the teaspoon into the swimming pool, however, will leave you still with roughly a swimming pool full of water.
(If the people learning this have trouble with exponential notation, one can also use the values 1 and 100000000000000000000 or so.)
Then there is the problem of binary vs. decimal representation. A number like 0.1 can't be represented exactly with a limited amount of binary digits. Some languages mask this, though:
```
PS> "{0:N50}" -f 0.1
0.10000000000000000000000000000000000000000000000000
```
But you can “amplify” the representation error by repeatedly adding the numbers together:
```
PS> $sum = 0; for ($i = 0; $i -lt 100; $i++) { $sum += 0.1 }; $sum
9,99999999999998
```
I can't think of a nice analogy to properly explain this, though. It's basically the same problem why you can represent ¹/₃ only approximately in decimal because to get the exact value you need to repeat the 3 indefinitely at the end of the decimal fraction.
Similarly, binary fractions are good for representing halves, quarters, eighths, etc. but things like a tenth will yield an infinitely repeating stream of binary digits.
Then there is another problem, though most people don't stumble into that, unless they're doing huge amounts of numerical stuff. But then, those already know about the problem. Since many floating-point numbers are merely approximations of the exact value this means that for a given approximation f of a real number r there can be infinitely many more real numbers r₁, r₂, ... which map to exactly the same approximation. Those numbers lie in a certain interval. Let's say that r_min is the minimum possible value of r that results in f and r_max the maximum possible value of r for which this holds, then you got an interval [r_min, r_max] where any number in that interval can be your actual number r.
Now, if you perform calculations on that number—adding, subtracting, multiplying, etc.—you lose precision. Every number is just an approximation, therefore you're actually performing calculations with intervals. The result is an interval too and the approximation error only ever gets larger, thereby widening the interval. You may get back a single number from that calculation. But that's merely one number from the interval of possible results, taking into account precision of your original operands and the precision loss due to the calculation.
That sort of thing is called Interval arithmetic and at least for me it was part of our math course at the university.

How to overcome inaccuracy in Java

You have to take a bit of a zen* approach to floating-point numbers: rather than eliminating the error, learn to live with it.

In practice this usually means doing things like:

when displaying the number, use String.format to specify the amount of precision to display (it'll do the appropriate rounding for you)
when comparing against an expected value, don't look for equality (==). Instead, look for a small-enough delta: Math.abs(myValue - expectedValue) <= someSmallError

EDIT: For infinity, the same principle applies, but with a tweak: you have to pick some number to be "large enough" to treat as infinity. This is again because you have to learn to live with, rather than solve, imprecise values. In the case of something like tan(90 degrees), a double can't store π/2 with infinite precision, so your input is something very close to, but not exactly, 90 degrees -- and thus, the result is something very big, but not quite infinity. You may ask "why don't they just return Double.POSITIVE_INFINITY when you pass in the closest double to π/2," but that could lead to ambiguity: what if you really wanted the tan of that number, and not 90 degrees? Or, what if (due to previous floating-point error) you had something that was slightly farther from π/2 than the closest possible value, but for your needs it's still π/2? Rather than make arbitrary decisions for you, the JDK treats your close-to-but-not-exactly π/2 number at face value, and thus gives you a big-but-not-infinity result.

For some operations, especially those relating to money, you can use BigDecimal to eliminate floating-point errors: you can really represent values like 0.1 (instead of a value really really close to 0.1, which is the best a float or double can do). But this is much slower, and doesn't help you for things like sin/cos (at least with the built-in libraries).

* this probably isn't actually zen, but in the colloquial sense

Managing floating point accuracy

Floating point precision has always generated lots of confusion. The crucial idea to remember is: when you work with doubles, there is no way to store each real number "as is", or "exactly right" -- the best you can store is the closest available approximation. So when you type (in R or any other modern language) something like x = 99.93029, you'll get this number represented by 99.930289999999999.

Now when you expect a + b to be "exactly 100", you're being inaccurate in terms. The best you can get is "100 up to N digits after the decimal point" and hope that N is big enough. In your case it would be correct to say 99.9302900 + 0.0697122 is 100 with 5 decimal points of accuracy. Naturally, by multiplying that equality by 10^k you'll lose additional k digits of accuracy.

So, there are two solutions here:

a. To get more precision in the output, provide more precision in the input.

bb <- c(99.93029, 0.06971) 
print(bb[2]/(100-bb[1])*100, digits = 20)
[1] 99.999999999999119

b. If double precision not enough (can happen in complex algorithms), use packages that provide extra numeric precision operations. For instance, package gmp.

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

0.1000000000000000055511151231257827021181583404541015625 in decimal, or
0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

0.1 in decimal, or
0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.

Dealing With Accuracy Problems in Floating-Point Numbers