Why does adding 0 to the end of float literal change how it rounds (possible GCC bug)?
Your problem is that long double on your platform has insufficient precision to store the exact value 0.99999999999999999999. This means that the value must be converted to a representable value (this conversion happens during translation of your program, not at runtime).
This conversion can produce either the nearest representable value, or the next greater or smaller representable value. The choice is implementation-defined, so your implementation should document which it uses. It seems that your implementation uses the x87-style 80-bit long double and rounds to the nearest value, resulting in a value of 1.0 stored in x.
With the assumed format for long double (64 mantissa bits), the highest representable number less than 1.0 is, in hexadecimal:
0x0.ffffffffffffffff
The number exactly halfway between this value and the next higher representable number (1.0) is:
0x0.ffffffffffffffff8
Your very long constant 0.9999999999999999999728949456878623891498136799780 is equal to:
0x0.ffffffffffffffff7fffffffffffffffffffffffa1eb2f0b64cf31c113a8ec...
which lies just below the halfway point and so should clearly be rounded down under round-to-nearest, but you appear to have hit some limit of the floating-point representation your compiler is using, or a rounding bug.
Why does adding a small float to a large float just drop the small one?
32-bit floats have only 24 bits of precision. Thus, a float cannot hold b exactly; it does the best it can by choosing an exponent and mantissa to get as close as possible¹. (The nearest representable float to the constant in the source; the default FP rounding mode is "nearest".)
When you then consider the floating-point representations of b and a and try to add them, the addition operation shifts the small number a's mantissa downwards as it tries to match b's exponent, to the point where the value (3) falls off the end and you're left with 0. Hence, the addition operator ends up adding floating-point zero to b. (This is an over-simplification; low bits can still affect rounding if there's partial overlap of mantissas.)
In general, the infinite-precision addition result has to be rounded to the nearest float under the current FP rounding mode, and that happened to be equal to b.
See also Why adding big to small in floating point introduce more error? for cases where the number does change somewhat, but with large rounding error; it includes an example using decimal significant figures as a way to help understand binary float rounding.
Footnote 1: For numbers that large, the nearest two floats are 32 apart. Modern clang even warns about rounding of an int in the source to a float that represents a different value, unless you already write it as a float or double constant (like 299792458.0f), in which case the rounding happens without a warning.
That's why the smallest a value that will round sum up to 299792480.0f instead of down to 299792448.0f is about 16.000001, for the b value which rounded to 299792448.0f. Runnable example on the Godbolt compiler explorer.
The default FP rounding mode rounds to nearest, with an even mantissa as the tie-break. 16.0 goes exactly half-way, and thus rounds to a bit-pattern of 0x4d8ef3c2, not up to 0x4d8ef3c3 (see https://www.h-schmidt.net/FloatConverter/IEEE754.html). Anything slightly greater than 16 rounds up, because rounding cares about the infinite-precision result. (It doesn't actually shift out bits before adding; that was an over-simplification.) The nearest float to 16.000001 has only the low bit set in its mantissa, bit-pattern 0x41800001; it's actually about 1.0000001192092896 x 2^4, or 16.0000019... Anything much smaller would round to exactly 16.0f and would be <= 1 ULP (unit in the last place) of b, which wouldn't change b because b's mantissa is already even.
If you avoid early rounding by using double a,b, the smallest value you can add that rounds up to 299792480.0f instead of down to 299792448.0f when you do float sum = a+b is about a=6.0000001;. That makes sense because the integer value ...58 stays as ...58.0 instead of rounding down to ...48.0f; i.e. the rounding error in float b = ...58 was -10, so a can be that much smaller.
There are two rounding steps this time, though: a+b rounds to the nearest double if that addition isn't exact, then that double rounds to a float. (Or, if FLT_EVAL_METHOD == 2, as in C compiling for 80-bit x87 floating point on 32-bit x86, the + result would round to 80-bit long double, then to float.)
Floating point rounding in C
1.5 is a double constant rather than a float, and C has automatic promotion rules. So when you perform 1.5*t, what happens is: (i) t is converted to a double; (ii) that double is multiplied by the double 1.5; and (iii) the resulting double is printed (as %f is the formatter for a double).
Conversely, t *= 1.5 promotes t to a double, performs a double multiplication, and then converts (with rounding) the result back into a [single-precision] float to store it.
For evidence, try either:
#include <stdio.h>

int main(void) {
    float t = 5592411;
    printf("%f\n", 1.5f*t); /* multiply a float by a float, for no promotion */
    t *= 1.5;
    printf("%f\n", t);
    return 0;
}
Or:
#include <stdio.h>

int main(void) {
    double t = 5592411; /* store our intermediate results in a double */
    printf("%f\n", 1.5f*t);
    t *= 1.5;
    printf("%f\n", t);
    return 0;
}
double numbers precision in C
The best way to answer your question is by changing the line of code slightly:
printf("is %.16e <= %.16e ? %s", 0.5, 0.49999999999999999, 0.5 <= 0.49999999999999999 ? "true":"false" );
Compiling and running this line, you will see that 0.49999999999999999 is rounded to the nearest representable double, which is the same as for 0.5 (and indeed represents the exact value 1/2). This explains why one is lower than or equal to the other: they are two different notations for the same thing.
Note: I chose the format %.16e because it never prints the same thing for two distinct double-precision floating-point numbers. Experts prefer to use the hexadecimal format %a. Not only is hexadecimal more compact, but using it makes it immediately obvious that 0.5 is represented exactly (it is 0x0.8p0, or 0x1.0p-1 in normalized form) and that some other decimal numbers, such as 0.1, aren't (in hexadecimal the digits repeat and use up all of the significand). In short, using hexadecimal for inputting and printing floating-point numbers will instantly make you a pro.
Adding two floating-point numbers
clang or gcc -frounding-math tells them that the code might run with a non-default rounding mode. It's not fully safe (it assumes the same rounding mode is active the whole time), but it's better than nothing. You might still need volatile to avoid CSE in some cases, or maybe the noinline-wrapper trick from the other answer, which in practice may work even better if you limit it to a single operation.
As you noticed, GCC doesn't support #pragma STDC FENV_ACCESS ON; the default behaviour is like FENV_ACCESS OFF. Instead, you have to use command-line options (or maybe per-function attributes) to control FP optimizations.
As described in https://gcc.gnu.org/wiki/FloatingPointMath, -frounding-math is not on by default, so GCC assumes the default rounding mode when doing constant propagation and other optimizations at compile time.
But with gcc -O3 -frounding-math, constant propagation is blocked even if you don't call fesetround; what's actually happening is that GCC makes asm that's safe if the rounding mode had already been set to something else before main was even called.
Unfortunately, as the wiki notes, GCC still assumes that the same rounding mode is in effect everywhere (GCC bug #34678). That means it will CSE two calculations on the same inputs before/after a call to fesetround, because it doesn't treat fesetround as special.
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
void foo(double *restrict out){
    out[0] = 0x1.0p0 + 0x1.0p-80;
    fesetround(FE_UPWARD);
    out[1] = 0x1.0p0 + 0x1.0p-80;
}
This compiles as follows (Godbolt) with gcc10.2 (and essentially the same with clang10.1). The link also includes your main, which does make the asm you want.
foo:
push rbx
mov rbx, rdi
sub rsp, 16
movsd xmm0, QWORD PTR .LC1[rip]
addsd xmm0, QWORD PTR .LC0[rip] # runtime add
movsd QWORD PTR [rdi], xmm0 # store out[0]
mov edi, 2048
movsd QWORD PTR [rsp+8], xmm0 # save a local temporary for later
call fesetround
movsd xmm0, QWORD PTR [rsp+8]
movsd QWORD PTR [rbx+8], xmm0 # store the same value, not recalc
add rsp, 16
pop rbx
ret
This is the same problem @Marc Glisse warned about in comments under the other answer, in case your noinline function did the same math before and after changing the rounding mode.
(It's also partly luck that GCC chose not to do the math before calling fesetround the first time, so it only has to spill the result instead of both inputs; x86-64 System V doesn't have any call-preserved XMM regs.)
Is floating point math broken?
Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.
For 0.1 in the standard binary64 format, the representation can be written exactly as
0.1000000000000000055511151231257827021181583404541015625 in decimal, or
0x1.999999999999ap-4 in C99 hexfloat notation.
In contrast, the rational number 0.1, which is 1/10, can be written exactly as
0.1 in decimal, or
0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.
The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2, but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagrees with the constant in your code.
A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.
Side Note: All positional (base-N) number systems share this problem with precision
Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...
You've just stumbled on a number (3/10) that happens to be easy to represent in the decimal system but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001). If we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.
Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).
So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)
Side Side Note: Working with Floats in Programming
In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:
Do not do if (x == y) { ... }. Instead do if (abs(x - y) < myToleranceValue) { ... }, where abs is the absolute value.
myToleranceValue needs to be chosen for your particular application; it will have a lot to do with how much "wiggle room" you are prepared to allow, and with the largest number you are going to be comparing (due to loss-of-precision issues). Beware of "epsilon"-style constants in your language of choice. These can be used as tolerance values, but their effectiveness depends on the magnitude of the numbers you're working with, since calculations with large numbers may exceed the epsilon threshold.
gcc precision bug?
This is something that has bitten me, too.
Yes, floating-point numbers should never be compared for equality because of rounding error, and you probably knew that. But in this case, you're computing t1+t2, then computing it again. Surely that has to produce an identical result?
Here's what's probably going on. I'll bet you're running this on an x86 CPU, correct? The x86 FPU uses 80 bits for its internal registers, but values in memory are stored as 64-bit doubles.
So t1+t2 is first computed with 80 bits of precision, then (I presume) stored out to memory in sum_2 with 64 bits of precision, and some rounding occurs. For the assert, it's loaded back into a floating-point register, and t1+t2 is computed again, again with 80 bits of precision. So now you're comparing sum_2, which was previously rounded to a 64-bit floating-point value, with t1+t2, which was computed with higher precision (80 bits), and that's why the values aren't exactly identical.
Edit: So why does the first test pass? In this case, the compiler probably evaluates 4.0+6.3 at compile time and stores it as a 64-bit quantity, both for the assignment and for the assert. So identical values are being compared, and the assert passes.
Second edit: Here's the assembly code generated for the second part of the code (gcc, x86), with comments; it pretty much follows the scenario outlined above:
// t1 = 4.0
fldl LC3
fstpl -16(%ebp)
// t2 = 6.3
fldl LC4
fstpl -24(%ebp)
// sum_2 = t1+t2
fldl -16(%ebp)
faddl -24(%ebp)
fstpl -32(%ebp)
// Compute t1+t2 again
fldl -16(%ebp)
faddl -24(%ebp)
// Load sum_2 from memory and compare
fldl -32(%ebp)
fxch %st(1)
fucompp
Interesting side note: this was compiled without optimization. When it's compiled with -O3, the compiler optimizes all of the code away.
round() for float in C++
It's available since C++11 in <cmath> (according to the C++11 working draft N3337, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf):
#include <cmath>
#include <iostream>

int main() {
    std::cout << "round(0.5):\t" << std::round(0.5) << std::endl;
    std::cout << "round(-0.5):\t" << std::round(-0.5) << std::endl;
    std::cout << "round(1.4):\t" << std::round(1.4) << std::endl;
    std::cout << "round(-1.4):\t" << std::round(-1.4) << std::endl;
    std::cout << "round(1.6):\t" << std::round(1.6) << std::endl;
    std::cout << "round(-1.6):\t" << std::round(-1.6) << std::endl;
    return 0;
}
Output:
round(0.5): 1
round(-0.5): -1
round(1.4): 1
round(-1.4): -1
round(1.6): 2
round(-1.6): -2