Understanding Floating Point Problems

Understanding floating point problems

Start by reading What Every Computer Scientist Should Know About Floating Point:
http://docs.sun.com/source/806-3568/ncg_goldberg.html

Short answer: double-precision floats (which are what JavaScript uses for numbers) have about 16 decimal digits of precision. Rounding can vary from platform to platform. If it is absolutely essential that you consistently get the right answer, you should do rational arithmetic yourself (this doesn't need to be hard; for currency, you may be able to just multiply by 100 and store the number of cents as an integer).
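
To make the multiply-by-100 suggestion concrete, here is a minimal Python sketch (the price and tax rate are invented for illustration):

    # Keep money as an integer number of cents so addition stays exact.
    price_cents = 1999                          # $19.99
    tax_cents = round(price_cents * 7 / 100)    # 7% tax, rounded to whole cents
    total_cents = price_cents + tax_cents
    print(f"${total_cents // 100}.{total_cents % 100:02d}")   # $21.39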

But if it suffices to get the answer with a high degree of precision, floats should be good enough, especially double precision.

Understanding floating point representation errors; what's wrong with my thinking?

If your exponent is decimal (i.e. it represents 10^X), you can precisely represent 0.1 -- however, most floating point formats use binary exponents (i.e. they represent 2^X). Since there are no integers X and Y such that Y * (2 ^ X) = 0.1, you cannot precisely represent 0.1 in most floating point formats.
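
One quick way to see this is to recover the exact rational value a double actually stores for the literal 0.1, for example with Python's fractions module (illustrative sketch):

    from fractions import Fraction

    # The double nearest to 0.1 is an integer times a power of two, not 1/10:
    print(Fraction(0.1))                     # 3602879701896397/36028797018963968
    print(Fraction(0.1) == Fraction(1, 10))  # False (the denominator is 2**55, not 10)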

Some languages have types with both kinds of exponent. In C#, for example, there is a data type aptly named decimal, which is a floating-point format with a decimal exponent, so it can store a number like 0.1 exactly, although it has other uncommon properties: the decimal type can distinguish between 0.1 and 0.10, and it is always true that x + 1 != x for all values of x.

For most common purposes, though, C# also has the float and double floating point types that cannot precisely store 0.1 because they use a binary exponent (as defined in IEEE-754). The binary floating point types use less storage, are faster because they are easier to implement, and have more operations defined on them. In general decimal is only used for financial values where the exact representation of all decimal values is important and the storage, speed, and range of operations are not.
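
As a rough analogue (it is not the C# type, but it uses the same idea of a decimal exponent), Python's decimal module shows the behaviour described above:

    from decimal import Decimal

    # A decimal exponent means 0.1 is stored exactly, and 0.1 vs 0.10 is preserved:
    print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))   # True
    print(Decimal('0.10'))                                     # 0.10
    # Binary doubles behave as described above:
    print(0.1 + 0.2 == 0.3)                                    # False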

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
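
A short Python sketch reproducing the values quoted above (the exact decimal expansions are obtained by converting the stored doubles):

    from decimal import Decimal

    print(0.1 + 0.2 == 0.3)   # False
    print(0.1 + 0.2)          # 0.30000000000000004

    # Converting the stored doubles to Decimal shows their exact values:
    print(Decimal(0.1))       # 0.1000000000000000055511151231257827021181583404541015625
    print(Decimal(0.3))       # 0.299999999999999988897769753748434595763683319091796875
    print((0.1).hex())        # 0x1.999999999999ap-4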

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001) - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - it will have a lot to do with how much "wiggle room" you are prepared to allow, and with the largest number you are going to be comparing (due to loss-of-precision issues). Beware of "epsilon"-style constants in your language of choice; these are not to be used as tolerance values.
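
For example, in Python (an illustrative sketch; the tolerance value is made up and must be chosen for your own application):

    import math

    x = 0.1 + 0.2
    y = 0.3

    print(x == y)                    # False: exact equality is too strict

    tolerance = 1e-9                 # made-up value; choose it for your own application
    print(abs(x - y) < tolerance)    # True

    # math.isclose combines a relative and an absolute tolerance:
    print(math.isclose(x, y, rel_tol=1e-9, abs_tol=0.0))   # True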

Floating point inaccuracy examples

There are basically two major pitfalls people stumble into with floating-point numbers.

  1. The problem of scale. Each FP number has an exponent which determines the overall “scale” of the number, so you can represent either really small values or really large ones, though the number of digits you can devote to that is limited. Adding two numbers of different scale will sometimes result in the smaller one being “eaten” since there is no way to fit it into the larger scale.

    PS> $a = 1; $b = 0.0000000000000000000000001
    PS> Write-Host a=$a b=$b
    a=1 b=1E-25
    PS> $a + $b
    1

    As an analogy for this case you could picture a large swimming pool and a teaspoon of water. Both are of very different sizes, but individually you can easily grasp how much they roughly are. Pouring the teaspoon into the swimming pool, however, will leave you still with roughly a swimming pool full of water.

    (If the people learning this have trouble with exponential notation, one can also use the values 1 and 100000000000000000000 or so.)

  2. Then there is the problem of binary vs. decimal representation. A number like 0.1 can't be represented exactly with a limited amount of binary digits. Some languages mask this, though:

    PS> "{0:N50}" -f 0.1
    0.10000000000000000000000000000000000000000000000000

    But you can “amplify” the representation error by repeatedly adding the numbers together:

    PS> $sum = 0; for ($i = 0; $i -lt 100; $i++) { $sum += 0.1 }; $sum
    9,99999999999998

    I can't think of a nice analogy to properly explain this, though. It's basically the same reason you can represent 1/3 only approximately in decimal: to get the exact value you would need to repeat the 3 indefinitely at the end of the decimal fraction.

    Similarly, binary fractions are good for representing halves, quarters, eighths, etc. but things like a tenth will yield an infinitely repeating stream of binary digits.

  3. Then there is another problem, though most people don't stumble into that, unless they're doing huge amounts of numerical stuff. But then, those already know about the problem. Since many floating-point numbers are merely approximations of the exact value, this means that for a given approximation f of a real number r there can be infinitely many more real numbers r1, r2, ... which map to exactly the same approximation. Those numbers lie in a certain interval. Let's say that rmin is the minimum possible value of r that results in f and rmax the maximum possible value of r for which this holds; then you have an interval [rmin, rmax] where any number in that interval can be your actual number r.

    Now, if you perform calculations on that number—adding, subtracting, multiplying, etc.—you lose precision. Every number is just an approximation, therefore you're actually performing calculations with intervals. The result is an interval too and the approximation error only ever gets larger, thereby widening the interval. You may get back a single number from that calculation. But that's merely one number from the interval of possible results, taking into account precision of your original operands and the precision loss due to the calculation.

    That sort of thing is called interval arithmetic; at least for me, it was part of our math course at the university. A toy sketch of the idea follows below.
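
    Here is the toy sketch mentioned above (purely illustrative Python, not a real interval-arithmetic library): each value is carried as a lower/upper bound pair, and every operation returns bounds that cover all possible true results.

      # Toy interval arithmetic: every value is a (lo, hi) pair, and each
      # operation returns bounds guaranteed to contain the true result.
      def add(a, b):
          return (a[0] + b[0], a[1] + b[1])

      def mul(a, b):
          corners = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
          return (min(corners), max(corners))

      x = (0.0999999, 0.1000001)   # "0.1, give or take a small representation error"
      y = (0.2999999, 0.3000001)

      print(add(x, y))   # roughly (0.3999998, 0.4000002): wider than either input
      print(mul(x, y))   # roughly (0.02999996, 0.03000004)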

Understanding how these floating point numbers work?

Per IEEE 754-2008:

  • NaN: If the exponent field is all ones and the significand field is not zero, the floating-point datum is a NaN, regardless of the sign field. Preferably, a QNaN has the leading bit of the significand field set to 1 and a signaling NaN has it set to 0, but this is not required.
  • Infinite: If the exponent field is all ones and the significand field is zero, the datum is (−1)^s • ∞, where s is the sign field. (I.e., +∞ if the sign is 0 and −∞ if the sign is 1.)
  • Normal: If the exponent field is neither all zeros nor all ones, the datum is (−1)^s • (1 + f • 2^(−q)) • 2^(e − bias), where s is the sign field, f is the significand field, q is the number of bits in the significand field, e is the exponent field, and bias is the exponent bias (127 for 32-bit floating-point).
  • Subnormal: If the exponent field is all zeros, and the significand field is not, the datum is (−1)^s • (0 + f • 2^(−q)) • 2^(1 − bias). Note the two differences from normal: 0 is added to the significand instead of 1, and 1 is used for the exponent (before subtracting the bias). This means subnormals have the same exponent as the smallest normals but are decreased by reducing the significand.
  • Zero: If the exponent field is all zeros, and the significand field is also all zeros, the datum is (−1)^s • 0. (Note that IEEE 754 distinguishes +0 and −0.)

The exponent used with subnormals is 1 rather than 0 so that the numbers change from (normal) 1.000…000 • 2^(1−127) to (subnormal) 0.111…111 • 2^(1−127). If 0 were used, there would be a jump to 0.0111…1111 • 2^(1−127).

The formula for the values of subnormals works for zeros too. So zeros do not actually need to be listed separately above.
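
A small Python sketch (not part of the standard; it simply applies the field tests above to a 32-bit pattern) that classifies a value as one of these cases:

    import struct

    def classify_binary32(x):
        """Classify x after rounding it to an IEEE-754 binary32 value."""
        bits, = struct.unpack('>I', struct.pack('>f', x))
        sign = bits >> 31
        exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
        fraction = bits & 0x7FFFFF       # 23-bit significand field
        if exponent == 0xFF:
            return 'NaN' if fraction else ('-inf' if sign else '+inf')
        if exponent == 0:
            return 'zero' if fraction == 0 else 'subnormal'
        return 'normal'

    for value in (1.0, float('inf'), float('nan'), 0.0, 1e-45):
        print(value, classify_binary32(value))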

c - understanding floating point binary model

So the teacher represents 284 as 100011100 = 1.000111 x 2^8. I get that the bit sign is 0 because it's a positive number. I have no idea where the 8-bit exponent of 00001000 came from,…

The exponent of 2 in 1.000111 × 2^8 is 8. 8 in binary is 1000, or 00001000.

Later, 127 is added to the exponent. This is just a matter of how the exponent is stored. Instead of some other method of representing positive and negative exponents, the rule is simply that 127 is added to the exponent before storing it. So if the exponent is 8 (00001000), we add 127 to get 135 (10000111) and store that in the exponent field. This gives us a way of storing negative exponents. If the exponent is −1, we store −1 + 127 = 126. If the exponent is −126, we store −126 + 127 = 1.

I would also appreciate if someone could explain how the teacher went from having the values in the first row to changing them to an 8-bit exponent of 10000111 and a 23-bit mantissa of 000 1110 0000 0000 0000 0000 as shown in the 3rd row.

For normal numbers, we remove the first bit from the significand¹ and store the next 23 bits in the significand field. So, with the significand 1.000111, we remove the leading 1 to get .000111, and then we store 000111 followed by zeros. (A normal number is any representable number at or above the minimum exponent scale for the format, which is 2^(−126) for the IEEE-754 32-bit binary format. For subnormal numbers, there is no implicit leading 1; the significand bits are stored as they are, with a modification to how the exponent is handled.)

Footnote

¹ “Significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic.
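
To check the worked example above, here is a short Python sketch (illustrative only) that unpacks the 32-bit pattern of 284.0 into its three fields:

    import struct

    # 284.0 encoded as an IEEE-754 32-bit float, split into its three fields:
    bits, = struct.unpack('>I', struct.pack('>f', 284.0))
    print(bits >> 31)                          # 0          (sign)
    print(format((bits >> 23) & 0xFF, '08b'))  # 10000111   (= 135 = 8 + 127)
    print(format(bits & 0x7FFFFF, '023b'))     # 00011100000000000000000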

understanding floating point representation of 2^x

See the definition of a single-precision floating point number.

The range of -126 to 127 is the range of exponents that can be encoded in the format. Any exponent less than -126 forces the value to be denormalized, which loses precision because leading zeros eat into the fraction bits, and any exponent greater than 127 (i.e. >= 128) can't be represented at all.

There are 23 bits available in the fraction part, so denormalized values range from just under 2^(-126) down to 2^(-126 - 23) = 2^(-149), the smallest positive value.

Yes, the values change with a 64-bit floating-point number: the exponent ranges from -1022 to 1023, and there are 52 bits in the fraction.
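
A quick Python check of these limits (illustrative; the binary32 value is obtained by unpacking raw bytes, since Python's own float is binary64):

    import struct

    # Smallest positive binary32 value: only the lowest fraction bit is set.
    tiny32, = struct.unpack('>f', b'\x00\x00\x00\x01')
    print(tiny32 == 2.0 ** -149)   # True (about 1.4e-45)

    # For binary64 (Python's own float): 52 fraction bits, normal exponents down to -1022.
    print(2.0 ** -1074)            # 5e-324, the smallest positive subnormal double
    print(2.0 ** -1022)            # 2.2250738585072014e-308, the smallest positive normal double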


