Floating Point Math in Different Programming Languages

Floating point math in different programming languages

All these languages are using the system-provided floating-point format, which represents values in binary rather than in decimal. Values like 0.2 and 0.4 can't be represented exactly in that format, so instead the closest representable value is stored, resulting in a small error. For example, the numeric literal 0.2 results in a floating-point number whose exact value is 0.200000000000000011102230246251565404236316680908203125. Similarly, any given arithmetic operation on floating-point numbers may result in a value that's not exactly representable, so the true mathematical result is replaced with the closest representable value. These are the fundamental reasons for the errors you're seeing.
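
For instance, converting a stored double to an exact decimal makes the error visible. A minimal sketch in Python (the decimal module's conversion from a float is exact, so it shows the value actually stored):

from decimal import Decimal

# Converting an existing float to Decimal is exact, so this shows the value
# that the literal 0.2 actually stores.
print(Decimal(0.2))
# 0.200000000000000011102230246251565404236316680908203125

# The result of 0.2 + 0.4 is likewise rounded to the nearest representable value.
print(Decimal(0.2 + 0.4))
# 0.600000000000000088817841970012523233890533447265625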

However, this doesn't explain the differences between languages: in all of your examples, the exact same computations are being made and the exact same results are being arrived at. The difference then lies in the way that the various languages choose to display the results.

Strictly speaking, none of the answers you show is correct. Making the (fairly safe) assumption of IEEE 754 binary64 arithmetic with a round-to-nearest rounding mode, the exact value of the first sum is:

0.600000000000000088817841970012523233890533447265625

while the exact value of the second sum is:

0.59999999999999997779553950749686919152736663818359375

However, neither of those outputs is particularly user-friendly, and clearly all of the languages you tested made the sensible decision to abbreviate the output when printing. However, they don't all adopt the same strategy for formatting the output, which is why you're seeing differences.

There are many possible strategies for formatting, but three particularly common ones are:

  1. Compute and display 17 correctly-rounded significant digits, possibly stripping trailing zeros where they appear. The output of 17 digits guarantees that distinct binary64 floats will have distinct representations, so that a floating-point value can be unambiguously recovered from its representation; 17 is the smallest integer with this property. This is the strategy that Python 2.6 uses, for example.

  2. Compute and display the shortest decimal string that rounds back to the given binary64 value under the usual round-ties-to-even rounding mode. This is rather more complicated to implement than strategy 1, but preserves the property that distinct floats have distinct representations, and tends to make for pleasanter output. This appears to be the strategy that all of the languages you tested (besides R) are using.

  3. Compute and display 15 (or fewer) correctly-rounded significant digits. This has the effect of hiding the errors involved in the decimal-to-binary conversions, giving the illusion of exact decimal arithmetic. It has the drawback that distinct floats can have the same representation. This appears to be what R is doing. (Thanks to @hadley for pointing out in the comments that there's an R setting which controls the number of digits used for display; the default is to use 7 significant digits.) All three strategies are shown in the sketch after this list.
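
A rough sketch of the three strategies, using Python's printf-style and repr formatting as stand-ins (each language's runtime has its own conversion code, so treat this as illustrative only):

x = 0.2 + 0.4

print('%.17g' % x)   # strategy 1: 17 significant digits -> 0.60000000000000009
print(repr(x))       # strategy 2: shortest round-tripping string -> 0.6000000000000001
print('%.15g' % x)   # strategy 3: 15 significant digits -> 0.6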

Floating point accuracy with different languages

Generally, the representation of floating point numbers is defined by the standard IEEE 754 and my assumption is that this standard is implemented by all (major) programming languages.

Precision and rounding are known issues and may sometimes lead to unexpected results.

Aspects that may have an influence on the result of a calculation depending on the programming language or used math library:

  • different calculation methods (in your case: the cosine function
    might be implemented by numerical approximation with different approaches;
    see the sketch after this list)
  • different rounding strategies during calculation or for the final output
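
To make the first point concrete, here is a hypothetical sketch in Python: the library cosine versus a deliberately naive truncated Taylor series of my own. Both are accurate to many digits, yet their final bits can differ, which is enough to change a printed result.

import math

def cos_taylor(x, terms=10):
    # Truncated Taylor series for cos(x) around 0 (illustrative only).
    result, term = 0.0, 1.0
    for n in range(terms):
        result += term
        term *= -x * x / ((2 * n + 1) * (2 * n + 2))
    return result

x = 1.0
print(math.cos(x))    # 0.5403023058681398
print(cos_taylor(x))  # agrees to roughly 15-16 digits; may differ in the last bit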

Which programming languages have arbitrary precision floating-point literals?

Haskell has Rationals backed by arbitrary precision Integers, and it has overloaded numeric literals, so that most often your literals have exactly the type you want.
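
That is Haskell-specific, but the underlying idea (exact rationals backed by arbitrary-precision integers) can be sketched in Python with the standard fractions module; this is only an analogue, not Haskell's overloaded literals:

from fractions import Fraction

# Exact rational arithmetic: no rounding ever happens.
a = Fraction(1, 10) + Fraction(2, 10)
print(a)                     # 3/10
print(a == Fraction(3, 10))  # True, exactly

# Contrast with binary64 floats:
print(0.1 + 0.2 == 0.3)      # False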

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
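
A quick check of these claims (shown in Python, but the underlying doubles are the same in any IEEE 754 binary64 implementation):

from decimal import Decimal

print((0.1).hex())       # 0x1.999999999999ap-4, the C99 hexfloat form of the stored 0.1
print(Decimal(0.2))      # 0.2000000000000000111... (slightly above the rational 0.2)
print(Decimal(0.3))      # 0.2999999999999999888... (slightly below the rational 0.3)
print(0.1 + 0.2 > 0.3)   # True
print(0.1 + 0.2 == 0.3)  # False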

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001) - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.
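
A sketch of such a tolerance comparison in Python; the tolerance values below are purely illustrative and must be chosen for your own application, as described above. Python's math.isclose combines a relative tolerance (which scales with the operands) with an optional absolute one:

import math

x = 0.1 + 0.2
y = 0.3

print(x == y)                            # False: exact equality fails
print(abs(x - y) < 1e-9)                 # True: fixed (absolute) tolerance
print(math.isclose(x, y, rel_tol=1e-9))  # True: relative tolerance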

Why do floating point operations display different in some languages?

@ouah establishes that the languages are all behaving the same. My answer aims to explain why they appear different. The only two languages that have "different" output are C and Python.

Clearly, every language besides C and Python is just printing out the float value to as many decimal places as it can.

C is easy to explain. You use printf("%f", result), without specifying an explicit precision value. Per the C standard, the precision of the f specifier defaults to 6. Thus, exactly six decimal places are printed out, which is what you see. As @ouah notes, setting the precision to 18 will yield the "expected" output. This is lossy: doubles that differ past the 7th decimal place will be printed out identically, and so the output of %f cannot be relied on to exactly reconstruct the original float.
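
Python's printf-style formatting follows the same rules as C's printf, so the effect can be sketched without a C compiler (this is an analogue of the C code, not the code from the question):

result = 0.1 + 0.2

print('%f' % result)     # 0.300000  -> 6 decimal places by default, like C's %f
print('%.18f' % result)  # 0.300000000000000044 -> widening the precision exposes the error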

Python is a bit trickier. Python 3.1 introduced a new floating-point repr algorithm, based on work by David Gay. The Python issue corresponding to the feature is here: http://bugs.python.org/issue1580. This feature was backported to Python 2.7 as well.

The intention of this new feature was to both reduce confusion over floating point (though that is dubiously useful), and more importantly to provide more human-readable, shorter representations of floating point numbers without affecting round-trip behaviour; that is, float(repr(x)) is always equal to x, even if repr(x) is shortened due to this algorithm. So, the algorithm manages to produce a shorter floating-point representation while remaining "lossless": win-win!

The official description says this much:

The new algorithm for repr(1.1) is smarter and returns '1.1'. Effectively, it searches all equivalent string representations (ones that get stored with the same underlying float value) and returns the shortest representation.
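
A short demonstration of that round-trip property (Python 2.7 or 3.1 and later):

x = 0.1
print(repr(x))              # '0.1' -> the shortest string that maps back to this float
print(float(repr(x)) == x)  # True: shortening loses nothing

y = 0.1 + 0.2
print(repr(y))              # '0.30000000000000004' -> no shorter string round-trips
print(float(repr(y)) == y)  # True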

Strange behaviour of simple maths operations in programming languages

Why don’t my numbers, like 0.1 + 0.2 add up to a nice round 0.3, and instead I get a weird result like 0.30000000000000004?

Because internally, computers use a format (binary floating-point) that cannot accurately represent a number like 0.1, 0.2 or 0.3 at all.

When the code is compiled or interpreted, your “0.1” is already rounded to the nearest number in that format, which results in a small rounding error even before the calculation happens.

Decimal numbers cannot accurately represent a number like 1/3, so you have to round to something like 0.33 - and you don’t expect 0.33 + 0.33 + 0.33 to add up to 1, either - do you?

Computers use binary numbers because they’re faster at dealing with those, and because for most calculations, a tiny error in the 17th decimal place doesn’t matter at all since the numbers you work with aren’t round (or that precise) anyway.

http://floating-point-gui.de/basic/

This should help

Double precision is different in different languages

The differences in output are due to differences in converting the floating-point number to a numeral. (By numeral, I mean a character string or other text that represents a number. “20”, “20.0”, “2e+1”, and “2•10¹” are different numerals for the same number.)

For reference, I show the exact values of i in notes below.

In C, the %.17lf conversion specification you used requests 17 digits after the decimal point, so 17 digits after the decimal point are produced. However, the C standard allows some slack here: it only requires calculation of enough digits that the actual internal value can be distinguished.¹ The rest can be filled in with zeros (or other “incorrect” digits). It appears the C standard library you are using only fully calculates 17 significant digits and fills the rest you request with zeros. This explains why you got “2.90000000000000120” instead of “2.90000000000000124”. (Note that “2.90000000000000120” has 18 digits: 1 before the decimal point, 16 significant digits after it, and 1 non-significant “0”. “0.10000000000000001” has an aesthetic “0” before the decimal point and 17 significant digits after it. The requirement for 17 significant digits is why “0.10000000000000001” must have the “1” at the end but “2.90000000000000120” may have a “0”.)

In contrast, it appears your C++ standard library does the full calculations, or at least more (which may be due to a rule in the C++ standard²), so you get “2.90000000000000124”.

Python 3.1 added an algorithm to convert with the same result as Java (see below). Prior to that, Python was lax about the conversion for display. (To my knowledge, it is still lax about the floating-point format used and conformance to IEEE-754 in arithmetic operations; specific Python implementations may differ in behavior.)

Java requires that the default conversion from double to string produce just as many digits as are required to distinguish the number from neighboring double values. So it produces “0.2” instead of “0.20000000000000001”, because the double nearest .2 is the value that i had in that iteration. In contrast, in the next iteration, the rounding errors in arithmetic gave i a value slightly different from the double nearest .3, so Java produced “0.30000000000000004” for it. In the next iteration, the new rounding error happened to partially cancel the accumulated error, so it was back to “0.4”.
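
Python 3's repr uses the same shortest round-trip rule, so a sketch of the loop (my guess at the code in question: repeated addition of 0.1) reproduces the pattern Java shows:

# Repeatedly add 0.1 and let the shortest round-trip conversion format each value.
# The same doubles arise in any binary64 language; only the formatting rule differs.
i = 0.0
for _ in range(5):
    i += 0.1
    print(i)
# 0.1
# 0.2
# 0.30000000000000004
# 0.4
# 0.5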

Notes

The exact values of i when IEEE-754 binary64 is used are:


0
0.1000000000000000055511151231257827021181583404541015625
0.200000000000000011102230246251565404236316680908203125
0.3000000000000000444089209850062616169452667236328125
0.40000000000000002220446049250313080847263336181640625
0.5
0.59999999999999997779553950749686919152736663818359375
0.6999999999999999555910790149937383830547332763671875
0.79999999999999993338661852249060757458209991455078125
0.899999999999999911182158029987476766109466552734375
0.99999999999999988897769753748434595763683319091796875
1.0999999999999998667732370449812151491641998291015625
1.1999999999999999555910790149937383830547332763671875
1.3000000000000000444089209850062616169452667236328125
1.4000000000000001332267629550187848508358001708984375
1.5000000000000002220446049250313080847263336181640625
1.6000000000000003108624468950438313186168670654296875
1.7000000000000003996802888650563545525074005126953125
1.8000000000000004884981308350688777863979339599609375
1.9000000000000005773159728050814010202884674072265625
2.000000000000000444089209850062616169452667236328125
2.10000000000000053290705182007513940334320068359375
2.200000000000000621724893790087662637233734130859375
2.300000000000000710542735760100185871124267578125
2.400000000000000799360577730112709105014801025390625
2.50000000000000088817841970012523233890533447265625
2.600000000000000976996261670137755572795867919921875
2.7000000000000010658141036401502788066864013671875
2.800000000000001154631945610162802040576934814453125
2.90000000000000124344978758017532527446746826171875

These are not all the same values you would get by converting 0, .1, .2, .3,… 2.9 from decimal to binary64 because they are produced by arithmetic, so there are multiple rounding errors from the initial conversions and the consecutive additions.
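
One way to reproduce the exact values above, assuming the loop in question simply starts at 0 and repeatedly adds 0.1 (sketched in Python; converting a float to Decimal is exact):

from decimal import Decimal

i = 0.0
for _ in range(30):
    print(Decimal(i))  # prints the exact binary64 value of i
    i += 0.1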

Footnotes

¹ C 2018 7.21.6.1 only requires that the resulting numeral be accurate to DECIMAL_DIG digits in a specified sense. DECIMAL_DIG is the number of digits such that, for any number in any floating-point format in the implementation, converting it to a decimal number with DECIMAL_DIG significant digits and then back to floating-point yields the original value. If IEEE-754 binary64 is the most precise format your implementation supports, then its DECIMAL_DIG is at least 17.

² I do not see such a rule in the C++ standard, other than incorporation of the C standard, so it may be that your C++ library is simply using a different method from your C library as a matter of choice.

A little diversion into floating point (im)precision, part 1

It's not that the various floating point implementations disagree; it's that none of them can achieve the accuracy needed for a 100% exact answer, and none of them ever will.

π has an infinite, non-repeating string of digits that nobody has been able to denote by anything other than a symbolic representation, and e^x is the same, so the only way to get to 100% accuracy is to go symbolic.

How does Lua's floating-point handling differ from other languages?

0.1+0.2 is not 0.3 exactly. Try this code:

print((0.1+0.2)==0.3)
print(string.format("%.17g",0.1+0.2))

I assume you're using print or io.write to print those values. In this case, Lua is just not printing all digits. Internally, Lua uses full-length, native floating-point representation. The technical explanation is that print and io.write format numbers using the format in LUA_NUMBER_FMT defined in luaconf.h, which by default is "%.14g".
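
The effect of that "%.14g" default can be sketched with Python's printf-style formatting, which uses the same conversion rules as C (and as Lua's string.format):

x = 0.1 + 0.2

print('%.14g' % x)  # 0.3 -> what Lua's print shows by default
print('%.17g' % x)  # 0.30000000000000004 -> enough digits to see the difference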

Slightly different floating point math results (C to golang)

I think the algorithms are different. For example variance:

/* Do the MA calculation using tight loops. */
/* Add-up the initial periods, except for the last value. */
periodTotal1 = 0;
periodTotal2 = 0;
trailingIdx = startIdx - nbInitialElementNeeded;

i = trailingIdx;
if( optInTimePeriod > 1 )
{
   while( i < startIdx ) {
      tempReal = inReal[i++];
      periodTotal1 += tempReal;
      tempReal *= tempReal;
      periodTotal2 += tempReal;
   }
}

/* Proceed with the calculation for the requested range.
 * Note that this algorithm allows the inReal and
 * outReal to be the same buffer.
 */
outIdx = 0;
do
{
   tempReal = inReal[i++];

   /* Square and add all the deviation over
    * the same periods.
    */
   periodTotal1 += tempReal;
   tempReal *= tempReal;
   periodTotal2 += tempReal;

   /* Square and add all the deviation over
    * the same period.
    */
   meanValue1 = periodTotal1 / optInTimePeriod;
   meanValue2 = periodTotal2 / optInTimePeriod;

   tempReal = inReal[trailingIdx++];
   periodTotal1 -= tempReal;
   tempReal *= tempReal;
   periodTotal2 -= tempReal;

   outReal[outIdx++] = meanValue2 - meanValue1*meanValue1;
} while( i <= endIdx );

That doesn't look like your variance. If you were to reproduce the algorithms so that they did the exact same operations then the Go version should produce the same result. Go is just doing standard, IEEE 754 floating point arithmetic.

As to the question "does order matter?" It definitely does. Since floating point arithmetic is inexact you will lose information as you do the calculations. Most of the time it doesn't make much of a difference, but sometimes algorithms can be very susceptible to these changes. (so rearranging your formula algebraically may not lead to the same answer in real code)

You often find in libraries like these that algorithms have been designed to account for these issues and so they often don't look like the naive implementation. For example, mean is usually a trivial function, but here's how it's calculated in the GSL:

double
FUNCTION (gsl_stats, mean) (const BASE data[], const size_t stride, const size_t size)
{
  /* Compute the arithmetic mean of a dataset using the recurrence relation
     mean_(n) = mean(n-1) + (data[n] - mean(n-1))/(n+1) */

  long double mean = 0;
  size_t i;

  for (i = 0; i < size; i++)
    {
      mean += (data[i * stride] - mean) / (i + 1);
    }

  return mean;
}

So unless you match the algorithms exactly, your answers will be subtly different (which doesn't necessarily mean your program is wrong).
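
For illustration, here is a sketch of the two mean algorithms in Python (not the GSL or Go code itself): the naive sum-then-divide version and the GSL-style recurrence. Even on trivial data they can land on slightly different doubles.

def mean_naive(data):
    # Accumulate a running sum, then divide once at the end.
    total = 0.0
    for x in data:
        total += x
    return total / len(data)

def mean_recurrence(data):
    # GSL-style recurrence: mean_n = mean_(n-1) + (data[n] - mean_(n-1)) / (n + 1)
    mean = 0.0
    for n, x in enumerate(data):
        mean += (x - mean) / (n + 1)
    return mean

data = [0.1] * 10
print(mean_naive(data))       # 0.09999999999999999 (accumulated rounding error)
print(mean_recurrence(data))  # 0.1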

One solution often used for this is to do equality comparisons within a very small number (math.Abs(expected-result) < ɛ, where you define ɛ: const ɛ = 0.0000001) rather than using ==.


