Why Does Adding 0 to the End of a Float Literal Change How It Rounds (Possible GCC Bug)

Why does adding 0 to the end of a float literal change how it rounds (possible GCC bug)?

Your problem is that long double on your platform has insufficient precision to store the exact value 0.99999999999999999999. This means that the literal's value must be converted to a representable value (this conversion happens during translation of your program, not at runtime).

This conversion can produce either the nearest representable value, or the next greater or smaller representable value. The choice is implementation-defined, so your implementation should document which it uses. It seems that your implementation uses x87-style 80-bit long double and rounds to the nearest value, resulting in a value of 1.0 being stored in x.


With the assumed format for long double (with 64 mantissa bits), the highest representable number less than 1.0 is, in hexadecimal:

0x0.ffffffffffffffff

The number exactly halfway between this value and the next higher representable number (1.0) is:

0x0.ffffffffffffffff8

Your very long constant 0.9999999999999999999728949456878623891498136799780 is equal to:

0x0.ffffffffffffffff7fffffffffffffffffffffffa1eb2f0b64cf31c113a8ec...

which should obviously be rounded down if rounding to nearest, but you appear to have reached some limit of the floating point representation your compiler is using, or a rounding bug.
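
As a quick check, here is a minimal sketch (the variable name x follows the question; it assumes a C99 printf with %La support) that shows which representable value the literal actually became:

#include <stdio.h>

int main(void) {
    long double x = 0.99999999999999999999L;
    printf("%La\n", x);          // hex output shows the exact stored value
    printf("%d\n", x == 1.0L);   // 1 if the literal rounded up to exactly 1.0
    return 0;
}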

Why does adding a small float to a large float just drop the small one?

32-bit floats only have 24 bits of precision. Thus, a float cannot hold b exactly - it does the best job it can by setting some exponent and mantissa to get as close as possible¹. (That is, the nearest representable float to the constant in the source; the default FP rounding mode is "nearest".)

When you then take the floating point representations of b and a and try to add them, the addition operation will shift the small number a's mantissa downwards as it tries to match b's exponent, to the point where the value (3) falls off the end and you're left with 0. Hence, the addition operator ends up adding floating point zero to b. (This is an over-simplification; low bits can still affect rounding if there's partial overlap of the mantissas.)

In general, the infinite-precision addition result has to get rounded to the nearest float with the current FP rounding mode, and that happened to be equal to b.

See also Why adding big to small in floating point introduce more error? for a case where the number does change, but with a large rounding error; it uses decimal significant figures as a way to help understand binary float rounding. A minimal sketch of the "small addend disappears" effect follows below.
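
Here is that sketch, using the speed-of-light constant discussed in the footnote below (variable names follow the question's a, b, and sum):

#include <stdio.h>

int main(void) {
    float b = 299792458;   // nearest float is 299792448.0f; adjacent floats are 32 apart here
    float a = 3;
    float sum = a + b;
    printf("%.1f\n", sum);     // 299792448.0
    printf("%d\n", sum == b);  // 1: the small addend was entirely absorbed
    return 0;
}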


Footnote 1: For numbers that large, adjacent floats are 32 apart. Modern clang even warns about rounding of an int in the source to a float that represents a different value - unless you write it as a float or double constant already (like 299792458.0f), in which case the rounding happens without warning.

That's why the smallest a value that makes the sum round up to 299792480.0f instead of down to 299792448.0f is about 16.000001, for that b value which itself rounded to 299792448.0f. Runnable example on the Godbolt compiler explorer.

The default FP rounding mode rounds to nearest, with an even mantissa as the tie-break. 16.0 lands exactly half-way, so the sum rounds to the bit-pattern 0x4d8ef3c2 (i.e. back to b's value), not up to 0x4d8ef3c3 (https://www.h-schmidt.net/FloatConverter/IEEE754.html is handy for checking this). Anything slightly greater than 16 rounds up, because rounding cares about the infinite-precision result; the hardware doesn't actually shift out bits before adding - that was an over-simplification. The nearest float to 16.000001 has only the low bit set in its mantissa, bit-pattern 0x41800001; it's actually about 1.0000001192092896 × 2^4, or 16.0000019... Anything much smaller would round to exactly 16.0f, which is only half a ULP (unit in the last place) of b, and that tie wouldn't change b because b's mantissa is already even.


If you avoid early rounding by using double a, b, the smallest value you can add that makes float sum = a+b round up to 299792480.0f instead of down to 299792448.0f is about a = 6.0000001. That makes sense because the integer value ...58 stays as ...58.0 instead of rounding down to ...48.0f; i.e. the rounding error in float b = ...58 was -10, so a can be that much smaller.

There are two rounding steps this time, though: a+b rounds to the nearest double if that addition isn't exact, then that double rounds to a float. (Or, if FLT_EVAL_METHOD == 2, like C compiling for 80-bit x87 floating point on 32-bit x86, the + result would round to 80-bit long double, then to float.)
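
Putting those thresholds together, here is a small sketch (assuming IEEE binary32/binary64 with the default round-to-nearest-even mode; bf and bd are my names for the float and double versions of b):

#include <stdio.h>

int main(void) {
    float  bf = 299792458;     // already rounded to 299792448.0f (error of -10)
    double bd = 299792458.0;   // exact as a double

    float s1 = bf + 16.0f;      // 299792448.0f: exact tie, keeps the even mantissa
    float s2 = bf + 16.000001f; // 299792480.0f: just past the midpoint, rounds up
    float s3 = bd + 6.0000001;  // 299792480.0f: b wasn't rounded early, so a smaller addend is enough
    printf("%.1f %.1f %.1f\n", s1, s2, s3);
    return 0;
}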

Floating point rounding in C

1.5 is a double constant rather than a float, and C has automatic promotion rules. So when you perform 1.5*t, what happens is: (i) t is converted to a double; (ii) that double is multiplied by the double 1.5; and (iii) the resulting double is printed (%f is the conversion specifier for a double).

Conversely, t *= 1.5 promotes t to a double, performs a double multiplication, and then converts (rounds) the result back to a single-precision float to store it in t.

For evidence, try either:

#include <stdio.h>

int main(void) {
    float t;
    t = 5592411;
    printf("%f\n", 1.5f*t); // multiply a float by a float, for no promotion
    t *= 1.5;
    printf("%f\n", t);
    return 0;
}

Or:

#include <stdio.h>

int main(void) {
    double t; // store our intermediate results in a double
    t = 5592411;
    printf("%f\n", 1.5f*t);
    t *= 1.5;
    printf("%f\n", t);
    return 0;
}

double numbers precision in C

The best way to answer your question is by changing the line of code slightly:

printf("is %.16e <= %.16e ? %s", 0.5, 0.49999999999999999, 0.5 <= 0.49999999999999999 ? "true":"false" );

Compile and run this line, and you will see that 0.49999999999999999 is rounded to the nearest representable double, which is the same double as for 0.5 (and indeed represents the exact value 1/2). This explains why one is lower than or equal to the other: they are two different notations for the same thing.


Note: I chose the format %.16e because it has the property of never printing the same thing for two distinct double-precision floating-point numbers. Experts prefer the hexadecimal format %a. Not only is hexadecimal more compact, but using it makes it immediately obvious that 0.5 is represented exactly (it is 0x0.8p0, or equivalently 0x1.0p-1) while some other decimal numbers, such as 0.1, aren't (in hexadecimal the digits repeat and use up the entire significand). In short, using hexadecimal for inputting and printing floating-point numbers will instantly make you a pro.
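
For instance, a short sketch of what %a shows (the exact hex spelling varies between C libraries, but equal doubles always print identically):

#include <stdio.h>

int main(void) {
    // Both literals round to the same double: exactly 1/2.
    printf("%a\n", 0.5);                 // e.g. 0x1p-1
    printf("%a\n", 0.49999999999999999); // same output as the line above
    printf("%a\n", 0.1);                 // e.g. 0x1.999999999999ap-4: not exact
    return 0;
}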

Adding two floating-point numbers

For clang or gcc, -frounding-math tells the compiler that code might run with a non-default rounding mode. It's not fully safe (it assumes the same rounding mode is active the whole time), but it's better than nothing. You might still need volatile to avoid CSE (common-subexpression elimination) in some cases, or maybe the noinline wrapper trick from the other answer, which in practice may work even better if you limit it to a single operation.


As you noticed, GCC doesn't support #pragma STDC FENV_ACCESS ON. The default behaviour is like FENV_ACCESS OFF. Instead, you have to use command line options (or maybe per-function attributes) to control FP optimizations.

As described in https://gcc.gnu.org/wiki/FloatingPointMath, -frounding-math is not on by default, so GCC assumes the default rounding mode when doing constant propagation and other optimizations at compile-time.

But with gcc -O3 -frounding-math, constant propagation is blocked even if you don't call fesetround: GCC emits asm that would still be correct if the rounding mode had already been set to something else before main was even called.

But unfortunately, as the wiki notes, GCC still assumes that the same rounding mode is in effect everywhere (GCC bug #34678). That means it will CSE two calculations of the same inputs before/after a call to fesetround, because it doesn't treat fesetround as special.

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

void foo(double *restrict out) {
    out[0] = 0x1.0p0 + 0x1.0p-80;
    fesetround(FE_UPWARD);
    out[1] = 0x1.0p0 + 0x1.0p-80;
}

This compiles as follows (Godbolt) with gcc 10.2 (and essentially the same with clang 10.1). The Godbolt link also includes your main, which does produce the asm you want.

foo:
        push    rbx
        mov     rbx, rdi
        sub     rsp, 16
        movsd   xmm0, QWORD PTR .LC1[rip]
        addsd   xmm0, QWORD PTR .LC0[rip]   # runtime add
        movsd   QWORD PTR [rdi], xmm0       # store out[0]
        mov     edi, 2048
        movsd   QWORD PTR [rsp+8], xmm0     # save a local temporary for later
        call    fesetround
        movsd   xmm0, QWORD PTR [rsp+8]
        movsd   QWORD PTR [rbx+8], xmm0     # store the same value, not recalc
        add     rsp, 16
        pop     rbx
        ret

This is the same problem @Marc Glisse warned about in comments under the other answer: it would bite there too if your noinline function did the same math before and after changing the rounding mode.

(It's also partly luck that GCC chose to do the math before calling fesetround the first time, so it only has to spill the result instead of both inputs; x86-64 System V doesn't have any call-preserved XMM regs.)
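
For completeness, here is one shape the volatile workaround mentioned above can take (a sketch of mine, not code from the question; compile with something like gcc -O3 -frounding-math and link with -lm for fesetround):

#include <fenv.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON   // ignored by GCC (with a warning), but harmless

int main(void) {
    // volatile forces both operands to be re-loaded, so the two additions
    // can't be CSEd across the fesetround call.
    volatile double one = 0x1.0p0, tiny = 0x1.0p-80;
    double before = one + tiny;   // round-to-nearest: exactly 1.0
    fesetround(FE_UPWARD);
    double after = one + tiny;    // rounded upward: the next double above 1.0
    printf("%a\n%a\n", before, after);  // expect 0x1p+0 then 0x1.0000000000001p+0
    return 0;
}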

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
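
In C, the mismatch is easy to see directly (a minimal sketch):

#include <stdio.h>

int main(void) {
    // The doubles nearest to 0.1 and 0.2 are both slightly too large, so
    // their sum lands just above the double nearest to 0.3.
    printf("%.17g\n", 0.1 + 0.2);      // 0.30000000000000004
    printf("%.17g\n", 0.3);            // 0.29999999999999999
    printf("%d\n", 0.1 + 0.2 == 0.3);  // 0
    return 0;
}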

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001) - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These can be used as tolerance values but their effectiveness depends on the magnitude (size) of the numbers you're working with, since calculations with large numbers may exceed the epsilon threshold.
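
A minimal C sketch of both pieces of advice (fabs is the double-precision absolute value; the tolerance 1e-9 is an arbitrary illustration, not a recommendation):

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.1 + 0.2, y = 0.3;
    printf("%.2f\n", x);               // round for display: prints 0.30
    double tol = 1e-9;                 // must be chosen per application
    if (fabs(x - y) < tol)
        puts("close enough");
    else
        puts("different");
    return 0;
}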

gcc precision bug?

This is something that has bitten me, too.

Yes, floating point numbers should never be compared for equality because of rounding error, and you probably knew that.

But in this case, you're computing t1+t2, then computing it again. Surely that has to produce an identical result?

Here's what's probably going on. I'll bet you're running this on an x86 CPU, correct? The x86 FPU uses 80 bits for its internal registers, but values in memory are stored as 64-bit doubles.

So t1+t2 is first computed with 80 bits of precision, then -- I presume -- stored out to memory in sum_2 with 64 bits of precision -- and some rounding occurs. For the assert, it's loaded back into a floating point register, and t1+t2 is computed again, again with 80 bits of precision. So now you're comparing sum_2, which was previously rounded to a 64-bit floating point value, with t1+t2, which was computed with higher precision (80 bits) -- and that's why the values aren't exactly identical.

Edit: So why does the first test pass? In this case, the compiler probably evaluates 4.0+6.3 at compile time and stores it as a 64-bit quantity -- both for the assignment and for the assert. So identical values are being compared, and the assert passes.

Second Edit: Here's the assembly code generated for the second part of the code (gcc, x86), with comments -- it pretty much follows the scenario outlined above:

// t1 = 4.0
        fldl    LC3
        fstpl   -16(%ebp)

// t2 = 6.3
        fldl    LC4
        fstpl   -24(%ebp)

// sum_2 = t1+t2
        fldl    -16(%ebp)
        faddl   -24(%ebp)
        fstpl   -32(%ebp)

// Compute t1+t2 again
        fldl    -16(%ebp)
        faddl   -24(%ebp)

// Load sum_2 from memory and compare
        fldl    -32(%ebp)
        fxch    %st(1)
        fucompp
Interesting side note: This was compiled without optimization. When it's compiled with -O3, the compiler optimizes all of the code away.
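
For reference, a sketch of the kind of code being discussed (reconstructed from the names used in this answer, since the original source isn't shown here); whether the assert fires depends on the target and flags, e.g. 32-bit x87 code generation without -ffloat-store:

#include <assert.h>

int main(void) {
    double t1 = 4.0, t2 = 6.3;
    double sum_2 = t1 + t2;    // rounded to 64 bits when stored to memory
    assert(sum_2 == t1 + t2);  // the right-hand side may be evaluated at 80-bit precision
    return 0;
}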

round() for float in C++

std::round is available since C++11 in <cmath> (according to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf).

#include <cmath>
#include <iostream>

int main(int argc, char** argv) {
    std::cout << "round(0.5):\t" << std::round(0.5) << std::endl;
    std::cout << "round(-0.5):\t" << std::round(-0.5) << std::endl;
    std::cout << "round(1.4):\t" << std::round(1.4) << std::endl;
    std::cout << "round(-1.4):\t" << std::round(-1.4) << std::endl;
    std::cout << "round(1.6):\t" << std::round(1.6) << std::endl;
    std::cout << "round(-1.6):\t" << std::round(-1.6) << std::endl;
    return 0;
}

Output:

round(0.5):  1
round(-0.5): -1
round(1.4): 1
round(-1.4): -1
round(1.6): 2
round(-1.6): -2

