Float Result Is Inaccurate

Inaccurate results for calculations using floats - simple solution

Three points:

  1. The general method proposed in the question, while it does avoid the problem in many cases, still fails in many others, even relatively simple ones.
  2. The decimal module always provides accurate answers (even where the justwork() function in the question fails to).
  3. Using the decimal module slows things down considerably, taking roughly 100 times longer. The default approach sacrifices accuracy to prioritise speed. [Whether making this the default is the right approach is debatable.]

To illustrate these three points consider the following functions, loosely based on that in the question:

def justdoesntwork(x, operator, y):
    # count the decimal places of each operand
    numx = numy = 0
    if "." in str(x):
        numx = len(str(x)) - str(x).find(".") - 1
    if "." in str(y):
        numy = len(str(y)) - str(y).find(".") - 1
    # scale both operands up to whole numbers, operate, then scale back down
    factor = 10 ** max(numx, numy)
    newx = x * factor
    newy = y * factor

    if operator == "+": myAns = (newx + newy) / factor
    elif operator == "-": myAns = (newx - newy) / factor
    elif operator == "*": myAns = (newx * newy) / (factor ** 2)
    elif operator == "/": myAns = newx / newy
    elif operator == "//": myAns = newx // newy
    elif operator == "%": myAns = (newx % newy) / factor

    return myAns

and

from decimal import Decimal

def doeswork(x, operator, y):
    if operator == "+": decAns = Decimal(str(x)) + Decimal(str(y))
    elif operator == "-": decAns = Decimal(str(x)) - Decimal(str(y))
    elif operator == "*": decAns = Decimal(str(x)) * Decimal(str(y))
    elif operator == "/": decAns = Decimal(str(x)) / Decimal(str(y))
    elif operator == "//": decAns = Decimal(str(x)) // Decimal(str(y))
    elif operator == "%": decAns = Decimal(str(x)) % Decimal(str(y))

    return decAns

and then looping through many values to find where myAns is different to decAns:

operatorlist = ["+", "-", "*", "/", "//", "%"]
for a in range(1, 1000):
    x = a / 10
    for b in range(1, 1000):
        y = b / 10
        for operator in operatorlist:
            myAns, decAns = justdoesntwork(x, operator, y), doeswork(x, operator, y)
            # only report mismatches with short answers, to keep the output readable
            if float(decAns) != myAns and len(str(decAns)) < 5:
                print(x, "\t", operator, " \t ", y, " \t= ", decAns, "\t\t{", myAns, "}")

This goes through all values to 1 d.p. from 0.1 to 99.9, and indeed finds no values where myAns differs from decAns.

However, if it is changed to give 2 d.p. (i.e. either x = a/100 or y = b/100), then many examples appear, for instance 0.1 + 1.09. This is easy to check by typing ((0.1*100) + (1.09*100)) / 100 into the console: this uses the basic method of the question, and returns 1.1900000000000002 instead of 1.19. The source of the error is 1.09*100, which returns 109.00000000000001 rather than 109. [Simply typing 0.1 + 1.09 gives the same error.] So the approach suggested in the question doesn't always work.
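A quick console session confirms this (the digits shown are what CPython's 64-bit floats produce):

>>> 1.09 * 100
109.00000000000001
>>> (0.1 * 100 + 1.09 * 100) / 100
1.1900000000000002
>>> 0.1 + 1.09
1.1900000000000002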

Using Decimal() however returns the correct answer: Decimal('0.1')+Decimal('1.09') returns Decimal('1.19').

[Note: don't forget to enclose the 0.1 and 1.09 in quotes. If you don't, Decimal(0.1) + Decimal(1.09) returns Decimal('1.190000000000000085487172896'), because it starts with the float 0.1, which is stored inaccurately, and then converts that to Decimal - GIGO. Decimal() has to be fed a string. Taking a float, converting it to a string, and from there to Decimal does work; the problem arises only when going directly from float to Decimal.]
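To see the difference in one place (a minimal session, assuming only the standard decimal module):

>>> from decimal import Decimal
>>> Decimal('0.1') + Decimal('1.09')        # from strings: exact
Decimal('1.19')
>>> Decimal(0.1) + Decimal(1.09)            # from floats: garbage in, garbage out
Decimal('1.190000000000000085487172896')
>>> Decimal(str(0.1)) + Decimal(str(1.09))  # float -> str -> Decimal also works
Decimal('1.19')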


In terms of time cost, run this:

import timeit

operatorlist = ["+", "-", "*", "/", "//", "%"]

for operator in operatorlist:
    for a in range(1, 10):
        a = a / 10
        for b in range(1, 10):
            b = b / 10

            DECtime = timeit.timeit("Decimal('" + str(a) + "') " + operator + " Decimal('" + str(b) + "')", setup="from decimal import Decimal")
            NORMtime = timeit.timeit(str(a) + operator + str(b))
            timeslonger = DECtime // NORMtime
            print("Operation: ", str(a) + operator + str(b), "\tNormal operation time: ", NORMtime, "\tDecimal operation time: ", DECtime, "\tSo Decimal operation took ", timeslonger, " times longer")

This shows that Decimal operations consistently take around 100 times longer, for all the operators tested.

[Including exponentiation in the list of operators shows that exponentiation can take 3,000-5,000 times longer. However, this is partly because Decimal() evaluates to far greater precision than normal operations: its default precision is 28 places, so Decimal("1.5")**Decimal("1.5") returns 1.837117307087383573647963056, whereas 1.5**1.5 returns 1.8371173070873836. If you limit b to whole numbers by replacing b = b/10 with b = float(b) (which prevents results with many significant figures), the Decimal calculation takes around 100 times longer, as with the other operators.]


It could still be argued that the time cost is only significant for users performing billions of calculations, and most users would prioritise getting intelligible results over a time difference which is pretty insignificant in most modest applications.

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
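The effect is easy to reproduce; for example, in Python (any IEEE-754 double implementation behaves the same way):

>>> 0.1 + 0.2 == 0.3
False
>>> f"{0.1:.17g}", f"{0.2:.17g}", f"{0.1 + 0.2:.17g}"
('0.10000000000000001', '0.20000000000000001', '0.30000000000000004')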

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001) - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
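For example, in Python (other languages offer equivalent formatting facilities):

value = 0.1 + 0.2       # stored as 0.30000000000000004
print(round(value, 2))  # 0.3
print(f"{value:.2f}")   # 0.30, rounded to two decimal places for display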

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.
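In Python terms, a minimal sketch of the same comparison (math.isclose is the standard-library helper; the tolerance here is a placeholder you would tune for your application):

import math

x = 0.1 + 0.2
y = 0.3
print(x == y)                            # False: naive equality fails

myToleranceValue = 1e-9                  # placeholder: choose for your application
print(abs(x - y) < myToleranceValue)     # True: tolerance-based comparison

print(math.isclose(x, y, rel_tol=1e-9))  # True: relative-tolerance variant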

Why am I getting the wrong result when using float?

I guess you are referring to deviations caused by floating-point arithmetic. You can read about it in the provided link.

If you really need your calculations to be 100% accurate, you can use the decimal module instead of float.
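For instance, in Python:

>>> 0.1 + 0.2
0.30000000000000004
>>> from decimal import Decimal
>>> Decimal('0.1') + Decimal('0.2')
Decimal('0.3')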

Floating point inaccuracy examples

There are basically two major pitfalls people stumble in with floating-point numbers.

  1. The problem of scale. Each FP number has an exponent which determines the overall “scale” of the number, so you can represent either really small values or really large ones, though the number of digits you can devote to that is limited. Adding two numbers of different scale will sometimes result in the smaller one being “eaten”, since there is no way to fit it into the larger scale.

    PS> $a = 1; $b = 0.0000000000000000000000001
    PS> Write-Host a=$a b=$b
    a=1 b=1E-25
    PS> $a + $b
    1

    As an analogy for this case you could picture a large swimming pool and a teaspoon of water. Both are of very different sizes, but individually you can easily grasp roughly how much water each holds. Pouring the teaspoon into the swimming pool, however, will still leave you with roughly a swimming pool full of water.

    (If the people learning this have trouble with exponential notation, one can also use the values 1 and 100000000000000000000 or so.)

  2. Then there is the problem of binary vs. decimal representation. A number like 0.1 can't be represented exactly with a limited amount of binary digits. Some languages mask this, though:

    PS> "{0:N50}" -f 0.1
    0.10000000000000000000000000000000000000000000000000

    But you can “amplify” the representation error by repeatedly adding the numbers together:

    PS> $sum = 0; for ($i = 0; $i -lt 100; $i++) { $sum += 0.1 }; $sum
    9,99999999999998

    I can't think of a nice analogy to properly explain this, though. It's basically the same reason you can represent 1/3 only approximately in decimal: to get the exact value you would need to repeat the 3 indefinitely at the end of the decimal fraction.

    Similarly, binary fractions are good for representing halves, quarters, eighths, etc. but things like a tenth will yield an infinitely repeating stream of binary digits.

  3. Then there is another problem, though most people don't stumble into that, unless they're doing huge amounts of numerical stuff. But then, those already know about the problem. Since many floating-point numbers are merely approximations of the exact value, this means that for a given approximation f of a real number r there can be infinitely many more real numbers r1, r2, ... which map to exactly the same approximation. Those numbers lie in a certain interval. Let's say that rmin is the minimum possible value of r that results in f and rmax the maximum possible value of r for which this holds; then you have an interval [rmin, rmax] where any number in that interval can be your actual number r.

    Now, if you perform calculations on that number—adding, subtracting, multiplying, etc.—you lose precision. Every number is just an approximation, therefore you're actually performing calculations with intervals. The result is an interval too and the approximation error only ever gets larger, thereby widening the interval. You may get back a single number from that calculation. But that's merely one number from the interval of possible results, taking into account precision of your original operands and the precision loss due to the calculation.

    That sort of thing is called Interval arithmetic and at least for me it was part of our math course at the university.
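The examples above use PowerShell; here is a rough Python equivalent of the first two pitfalls, plus a deliberately simplified sketch of the interval idea (a toy illustration, not a real interval-arithmetic library):

# 1. Scale: the teaspoon is swallowed by the swimming pool.
print(1.0 + 1e-25 == 1.0)        # True: 1e-25 cannot be fitted into 1.0's scale

# 2. Representation: repeated addition amplifies the error in 0.1.
total = 0.0
for _ in range(100):
    total += 0.1
print(total)                     # 9.99999999999998 rather than 10.0

# 3. Intervals (toy sketch): track a value as [lo, hi] bounds and watch
#    the uncertainty widen as operations accumulate.
lo, hi = 0.3 - 1e-16, 0.3 + 1e-16
for _ in range(10):
    lo, hi = lo + lo, hi + hi    # each addition also doubles the width
print(hi - lo)                   # the interval is now much wider than 2e-16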

How to overcome inaccuracy in Java

You have to take a bit of a zen* approach to floating-point numbers: rather than eliminating the error, learn to live with it.

In practice this usually means doing things like:

  • when displaying the number, use String.format to specify the amount of precision to display (it'll do the appropriate rounding for you)
  • when comparing against an expected value, don't look for equality (==). Instead, look for a small-enough delta: Math.abs(myValue - expectedValue) <= someSmallError

EDIT: For infinity, the same principle applies, but with a tweak: you have to pick some number to be "large enough" to treat as infinity. This is again because you have to learn to live with, rather than solve, imprecise values. In the case of something like tan(90 degrees), a double can't store π/2 with infinite precision, so your input is something very close to, but not exactly, 90 degrees -- and thus, the result is something very big, but not quite infinity. You may ask "why don't they just return Double.POSITIVE_INFINITY when you pass in the closest double to π/2," but that could lead to ambiguity: what if you really wanted the tan of that number, and not 90 degrees? Or, what if (due to previous floating-point error) you had something that was slightly farther from π/2 than the closest possible value, but for your needs it's still π/2? Rather than make arbitrary decisions for you, the JDK treats your close-to-but-not-exactly π/2 number at face value, and thus gives you a big-but-not-infinity result.

For some operations, especially those relating to money, you can use BigDecimal to eliminate floating-point errors: you can really represent values like 0.1 (instead of a value really really close to 0.1, which is the best a float or double can do). But this is much slower, and doesn't help you for things like sin/cos (at least with the built-in libraries).
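The same idea in Python's decimal module, for comparison (a sketch; Java's BigDecimal works analogously):

from decimal import Decimal

price = Decimal('19.99')   # exact decimal value, no binary rounding
total = price * 3
print(total)               # 59.97, exactly

# But there is no decimal sin/cos here either: for trigonometry you fall
# back to binary floats and tolerance-based comparisons.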

* this probably isn't actually zen, but in the colloquial sense

Difference between double and float in floating point accuracy

The Cases of 0.8−0.7

In 0.8-0.7 == 0.1, none of the literals are exactly representable in double. The nearest representable values are 0.8000000000000000444089209850062616169452667236328125 for .8, 0.6999999999999999555910790149937383830547332763671875 for .7, and 0.1000000000000000055511151231257827021181583404541015625 for .1. When the first two are subtracted, the result is 0.100000000000000088817841970012523233890533447265625. As this is not equal to the third, 0.8-0.7 == 0.1 evaluates to false.

In (float)(0.8-0.7) == (float)(0.1), the result of 0.8-0.7 and 0.1 are each converted to float. The float value nearest to the former, 0.1000000000000000055511151231257827021181583404541015625, is 0.100000001490116119384765625. The float value nearest to the latter, 0.100000000000000088817841970012523233890533447265625, is 0.100000001490116119384765625. Since these are the same, (float)(0.8-0.7) == (float)(0.1) evaluates to true.

In (double)(0.8-0.7) == (double)(0.1), the result of 0.8-0.7 and 0.1 are each converted to double. Since they are already double, there is no effect, and the result is the same as for 0.8-0.7 == 0.1.
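Python floats are the same IEEE-754 doubles, so the (float) casts can be simulated by round-tripping through binary32 with the struct module (a sketch, not C# itself):

import struct

def as_float32(x):
    # Round a double to the nearest IEEE-754 binary32 value and back.
    return struct.unpack('f', struct.pack('f', x))[0]

print(0.8 - 0.7 == 0.1)                          # False: pure double arithmetic
print(as_float32(0.8 - 0.7) == as_float32(0.1))  # True: both round to the same binary32 value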

Notes

The C# specification, version 5.0 indicates that float and double are the IEEE-754 32-bit and 64-bit floating-point types. I do not see it explicitly state they are the binary floating-point formats rather than decimal formats, but the characteristics described make this evident. The specification also states that IEEE-754 arithmetic is generally used, with round-to-nearest (presumably round-to-nearest-ties-to-even), subject to the exception below.

The C# specification allows floating-point arithmetic to be performed with more precision than the nominal type. Clause 4.1.6 says “… Floating-point operations may be performed with higher precision than the result type of the operation…” This can complicate analysis of floating-point expressions in general, but it does not concern us in the instance of 0.8-0.7 == 0.1 because the only applicable operation is the subtraction of 0.7 from 0.8, and these numbers are in the same binade (have the same power of two in the floating-point representation), so the result of the subtraction is exactly representable and additional precision will not change the result. As long as the conversion of the source texts 0.8, 0.7, and 0.1 to double does not use extra precision and the cast to float produces a float with no extra precision, the results will be as stated above. (The C# standard says in clause 6.2.1 that a conversion from double to float yields a float value, although it does not explicitly state that no extra precision may be used at this point.)

Additional Cases

In 8-0.7 == 7.3, we have 8 for 8, 7.29999999999999982236431605997495353221893310546875 for 7.3, 0.6999999999999999555910790149937383830547332763671875 for 0.7, and 7.29999999999999982236431605997495353221893310546875 for 8-0.7, so the result is true.

Note that the additional precision allowed by the C# specification could affect the result of 8-0.7. A C# implementation that used extra precision for this operation could produce false for this case, as it would get a different result for 8-0.7.

In 18.01-0.7 == 17.31, we have 18.010000000000001563194018672220408916473388671875 for 18.01, 0.6999999999999999555910790149937383830547332763671875 for 0.7, 17.309999999999998721023075631819665431976318359375 for 17.31, and 17.31000000000000227373675443232059478759765625 for 18.01-0.7, so the result is false.
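Since Python uses the same IEEE-754 doubles, both cases can be checked directly; converting a float to Decimal reveals its exact stored value:

from decimal import Decimal

print(8 - 0.7 == 7.3)        # True
print(Decimal(8 - 0.7))      # 7.29999999999999982236431605997495353221893310546875
print(18.01 - 0.7 == 17.31)  # False
print(Decimal(18.01 - 0.7))  # 17.31000000000000227373675443232059478759765625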

How is subtracting from 8 different from subtracting from 18.01, if both subtract the same floating-point number?

18.01 is larger than 8 and requires a greater power of two in its floating-point representation. Similarly, the result of 18.01-0.7 is larger than that of 8-0.7. This means the bits in their significands (the fraction portion of the floating-point representation, which is scaled by the power of two) represent greater values, causing the rounding errors in the floating-point operations to be generally greater. In general, a floating-point format has a fixed span—there is a fixed distance from the high bit retained to the low bit retained. When you change to numbers with more bits on the left (high bits), some bits on the right (low bits) are pushed out, and the results change.
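You can see the coarser spacing directly with math.ulp (available in Python 3.9+), which returns the gap between a float and the next representable one:

import math

print(math.ulp(8 - 0.7))      # ~8.9e-16: spacing of doubles near 7.3
print(math.ulp(18.01 - 0.7))  # ~3.6e-15: four times coarser near 17.31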


