C++ Handling of Excess Precision

Are there more C++-like ways to control the handling of excess precision?

The C99 standard defines FLT_EVAL_METHOD, a compiler-set macro that specifies how excess precision is handled in a C program (many C compilers still behave in a way that does not exactly conform to the most reasonable interpretation of the value of FLT_EVAL_METHOD that they define: older GCC versions generating 387 code, Clang when generating 387 code, …). Subtle points relating to the effects of FLT_EVAL_METHOD were clarified in the C11 standard.

Since the 2011 standard, C++ defers to C99 for the definition of FLT_EVAL_METHOD (header cfloat).
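Since the macro is available through <cfloat>, a C++ program can at least inspect which evaluation method the compiler claims to use. A minimal sketch:

#include <cfloat>
#include <iostream>

int main()
{
    // 0: operations are evaluated in the precision of their type
    // 1: float and double operations are evaluated in double precision
    // 2: all operations are evaluated in long double precision
    // -1: indeterminable
#ifdef FLT_EVAL_METHOD
    std::cout << "FLT_EVAL_METHOD = " << FLT_EVAL_METHOD << "\n";
#else
    std::cout << "FLT_EVAL_METHOD is not defined\n";
#endif
}

On x86-64, GCC typically prints 0 (SSE arithmetic); generating 387 code typically gives 2.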

So GCC should simply allow -fexcess-precision=standard for C++, and hopefully it eventually will. The same semantics as C's are already in the C++ standard; they only need to be implemented in C++ compilers.


I guess that for my current use case, I'll probably use -mfpmath=sse in any case, which should not incur any excess precision as far as I know.

That is the usual solution.

Be aware that C99 also defines the FP_CONTRACT pragma in math.h, which you may want to look at: it relates to the same problem of some expressions being computed at higher precision, but approaches it from a completely different side (the modern fused multiply-add instruction instead of the old 387 instruction set). The pragma decides whether the compiler is allowed to replace source-level additions and multiplications with FMA instructions (this has the effect that the multiplication is virtually computed at infinite precision, because that is how the instruction works, instead of being rounded to the precision of the type as it would be with separate multiplication and addition instructions). This pragma has apparently not been incorporated into the C++ standard (as far as I can see).

The default value for this pragma is implementation-defined, and some people argue for the default to be to allow FMA instructions to be generated (for C compilers that otherwise define FLT_EVAL_METHOD as 0). You should, in C, future-proof your code with:

#include <math.h>
#pragma STDC FP_CONTRACT OFF

And the equivalent incantation in C++ if your compiler documents one.
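For illustration, here is a small sketch of what contraction changes: with std::fma the product is not rounded before the addition. The constants are chosen so the difference is visible; compile with contraction disabled (e.g. -ffp-contract=off on GCC) to see the non-contracted result on the first line:

#include <cmath>
#include <cstdio>

int main()
{
    double e30 = std::ldexp(1.0, -30);         // 2^-30
    double a = 1.0 + e30;
    double b = 1.0 - e30;
    // The exact product a*b is 1 - 2^-60, which rounds to 1.0 in double,
    // so multiply-then-add, each rounded separately, yields 0:
    std::printf("a*b - 1       = %g\n", a * b - 1.0);
    // fma rounds only once, after the exact multiply-add:
    std::printf("fma(a, b, -1) = %g\n", std::fma(a, b, -1.0));   // -2^-60
}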


What is the g++ default behavior in the absence of any switch?

I am afraid that the answer to this question is that GCC's behavior, say, when generating 387 code, is nonsensical. See the description of the situation that motivated Joseph Myers to fix the situation for C. If g++ does not implement -fexcess-precision=standard, it probably means that 80-bit computations are randomly rounded to the precision of the type whenever the compiler happens to have to spill some floating-point registers to memory, leading the program below to print "foo" in some circumstances outside the programmer's control:

if (x == 0.0) return;
... // code that does not modify x
if (x == 0.0) printf("foo\n");

… because the code in the ellipsis caused x, which was held in an 80-bit floating-point register, to be spilt to a 64-bit slot on the stack.
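A self-contained variant of the same scenario (hypothetical: whether "foo" is actually printed depends on the target, the flags, and the compiler's register allocation; the point is that 387 code without -fexcess-precision=standard may legitimately print it):

#include <cstdio>

int main()
{
    volatile double small = 1e-300;
    double x = small * small;   // exactly 1e-600: representable in an 80-bit
                                // register, but underflows to 0 as a double
    if (x == 0.0) return 0;     // can be false while x lives in a register
    volatile double sink = x;   // stands in for code that may spill x
    (void)sink;
    if (x == 0.0) std::printf("foo\n");   // ... and can then be true
    return 0;
}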

In C++, is exactly one of <, == and > guaranteed to be true on floats?

No.

It's enough for either a or b to be NaN for each of a < b, a == b and a > b to be false.

If both a and b are non-NaN then exactly one of a < b, a == b or a > b has to be true.

As a complement, this answer tells you how you can get a NaN value in C++ (there are several NaN values, which can be distinguished by inspecting their representations; they are all different from each other because a NaN is never equal to anything) and how you can test whether a value is a NaN (the idiomatic test for whether a variable x is a NaN is x != x, and indeed std::isnan() is often implemented this way, but some programmers who will have to read your code may be confused by it).
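A minimal sketch of all three points: NaN generation, the trichotomy failure, and the x != x test:

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    double nan = std::numeric_limits<double>::quiet_NaN();
    double x = 1.0;

    std::cout << std::boolalpha;
    std::cout << (nan <  x)   << "\n";    // false
    std::cout << (nan == x)   << "\n";    // false
    std::cout << (nan >  x)   << "\n";    // false
    std::cout << (nan != nan) << "\n";    // true: the idiomatic NaN test
    std::cout << std::isnan(nan) << "\n"; // true: the readable NaN test
}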

And then, if a and b are the results of previous computations, there is the problem of excess precision. See this article for a discussion in C. The C99 standard solved the problem by making explicit rules about where excess precision can and cannot occur, but despite C++ more or less inheriting these rules by deferring to the C standard for the definition of FLT_EVAL_METHOD in cfloat, in practice C compilers take the rules more seriously than C++ compilers do. For instance, GCC implements the rules for C when compiling with -std=c99, and in that context you can rely on the trichotomy property holding, but as of this writing GCC does not implement these rules when used as a C++ compiler.

Float operation difference in C vs C++

Introduction: Given that the question is not detailed enough, I am left to speculate that this is GCC's infamous bug 323. As the low bug ID suggests, this bug has been there forever. The report has existed since June 2000, currently has 94 (!) duplicates, the last one filed only half a year ago (on 2018-08-28). The bug affects only 32-bit executables on Intel machines (e.g., under Cygwin). I assume that the OP's code uses x87 floating-point instructions, which are the default for 32-bit executables, while SSE instructions are only optional. Since 64-bit executables are more prevalent than 32-bit ones and no longer depend on x87 instructions, this bug has zero chance of ever being fixed.

Bug description: The x87 architecture has 80-bit floating-point registers. A float requires only 32 bits. The bug is that x87 floating-point operations are always done with 80-bit precision (subject to a hardware configuration flag). This extra precision makes results very flaky, because they depend on when the registers are spilled (written) to memory.

If an 80-bit register is spilled into a 32-bit variable in memory, the extra precision is lost. That would be the correct behavior if it happened after each floating-point operation (since a float is supposed to be 32 bits). However, spilling to memory slows things down, and no compiler writer wants the executable to run slowly. So by default values are not spilled to memory.

Now, sometimes a value is spilled to memory and sometimes it is not. It depends on the optimization level, on compiler heuristics, and on other seemingly random factors. Even with -O0 there can be slightly different strategies for spilling the x87 registers to memory, resulting in slightly different results. The spilling strategy is probably the difference you are experiencing between your C and C++ compilers.

Workaround:
For ways to handle this, please read C handling of excess precision. Try running your compiler with -fexcess-precision=standard and compare it with -fexcess-precision=fast. You can also try playing with -mfpmath=sse.

NOTE: According to the C++ standard, this is not really a bug. However, it is a bug according to the documentation of GCC, which claims to follow the IEEE-754 floating-point standard on Intel architectures (as it does on many other architectures). Obviously bug 323 violates the IEEE-754 standard.

NOTE 2: At some optimization levels -ffast-math is invoked, and all bets are off regarding extra precision and evaluation order.

EDIT: I have simulated the described behavior on an Intel 64-bit system and got the same results as the OP. Here is the code:

#include <cstdio>
#include <cstdint>
#include <cstring>

float hex2float(uint32_t num);
void print(const char* label, float val);
float flush(float x);

int main()
{
    float a = hex2float(0x1D9969BB);
    float b = hex2float(0x6CEDC83E);
    float c = hex2float(0xAC89452F);
    float d = hex2float(0xD2DC92B3);
    float e = hex2float(0x4FE9F23C);
    // sum carried out in double, rounded to float only at the end:
    float result = (float)((double)a + b - c + d + e);
    print("result", result);
    // rounded to float after every single operation:
    result = flush(flush(flush(flush(a + b) - c) + d) + e);
    print("result2", result);
}

The implementations of the support functions:

// Byte-swaps the constant (written in the memory byte order that print()
// below emits) and reinterprets the pattern as the bits of a float:
float hex2float(uint32_t num)
{
    uint32_t rev = (num >> 24) | ((num >> 8) & 0xff00) |
                   ((num << 8) & 0xff0000) | (num << 24);
    float f;
    memcpy(&f, &rev, 4);
    return f;
}

// Prints the value and its raw bytes in memory order:
void print(const char* label, float val)
{
    printf("%10s (%13.10f) : 0x%02X%02X%02X%02X\n", label, val,
           ((unsigned char*)&val)[0], ((unsigned char*)&val)[1],
           ((unsigned char*)&val)[2], ((unsigned char*)&val)[3]);
}

// Forces a rounding to float's precision via a round trip through memory:
float flush(float x)
{
    volatile float buf = x;
    return buf;
}

After running this I got exactly the same difference between the results:

  result ( 0.4185241461) : 0xCC48D63E
result2 ( 0.4185241759) : 0xCD48D63E

For some reason this is different from the "pure" version described in the question. At one point I was also getting the same results as the "pure" version, but the question has since changed. The original values in the original question were different. They were:

float a = hex2float(0x1D9969BB);
float b = hex2float(0x6CEDC83E);
float c = hex2float(0xD2DC92B3);
float d = hex2float(0xA61FD930);
float e = hex2float(0x4FE9F23C);

and with these values the resulting output is:

   result ( 0.4185242951) : 0xD148D63E
result2 ( 0.4185242951) : 0xD148D63E

Precision of floating-point data types in C++

"Precision" is not all there is to a floating-point value. There is also "magnitude" (I'm not sure that term is correct, though!): how big (or small) can the represented values become?

For that, try printing also the max_exponent of each type:

#include <iostream>
#include <limits>

int main()
{
    std::cout << "float: " << sizeof(float) << "\n";
    std::cout << std::numeric_limits<float>::digits << "\n";
    std::cout << std::numeric_limits<float>::max_exponent << "\n";

    std::cout << "double: " << sizeof(double) << "\n";
    std::cout << std::numeric_limits<double>::digits << "\n";
    std::cout << std::numeric_limits<double>::max_exponent << "\n";

    std::cout << "long double: " << sizeof(long double) << "\n";
    std::cout << std::numeric_limits<long double>::digits << "\n";
    std::cout << std::numeric_limits<long double>::max_exponent << "\n";
}

Output on ideone:

float: 4
24
128
double: 8
53
1024
long double: 16
64
16384

So the extra bits are not all used to represent more digits (precision); they also allow the exponent to be larger. Using the wording from IEEE 754, long double mostly increases the exponent range rather than the precision.

The format shown by my ideone sample above is (probably) the "x86 extended precision format", which assigns 1 bit to the integer part, 63 bits to the fraction part (together 64 digits) and 15 bits to the exponent (2^(15-1) = 16384; the exponent is biased rather than signed, with roughly half of its range covering negative exponents).

Note that the C++ standard only requires long double to be at least as precise as double, so long double could be a synonym for double, the x86 extended precision format shown above (most likely), or better (AFAIK only GCC on PowerPC).
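A quick way to tell which of these your long double is: add half of double's epsilon to 1.0L. The sum differs from 1.0L only if long double actually carries more significand bits than double. A sketch:

#include <iostream>
#include <limits>

int main()
{
    long double x = 1.0L + std::numeric_limits<double>::epsilon() / 2;
    // true with the x86 extended precision format (64 significand bits),
    // false if long double is merely a synonym for double:
    std::cout << std::boolalpha << (x != 1.0L) << "\n";
}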

And I also want to know how I can get better precision without defining my own data type, which is obviously going to come at the expense of performance?

You need to either write it on your own (surely a learning experience, but best not done for production code) or use a library, like GNU MPFR or Boost.Multiprecision.
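As an illustration, a sketch using Boost.Multiprecision's ~50-decimal-digit binary float (assuming Boost is available; GNU MPFR offers similar functionality through a C API):

#include <boost/multiprecision/cpp_bin_float.hpp>
#include <iomanip>
#include <iostream>

int main()
{
    using boost::multiprecision::cpp_bin_float_50;   // ~50 decimal digits
    cpp_bin_float_50 x = 1;
    x /= 3;     // 1/3 to 50 decimal digits instead of double's ~16
    std::cout << std::setprecision(50) << x << "\n";
}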

g++ floating-point behavior with -O option not strictly C++11-standard conformant?

This is the mentioned paragraph:

The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.

And the footnote:

The cast and assignment operators must still perform their specific conversions as described in 5.4, 5.2.9 and 5.17.

I don't think that GCC violates this. e is represented in greater precision, and this is allowed by the standard (there is no conversion here, so the footnote doesn't apply).

If you don't like this behavior, use the -ffloat-store option, which removes the excess precision, and your program will print inf (but this option makes your program slower).


Note that this specification is a little bit weird.

Consider this. This is the same example as yours, but using float:

#include <iostream>

int main() {
    float d = 2.0e+38f;
    float e;
    e = d * d;       // overflows float's range, but may be kept in excess precision
    e = e / 2e+38f;
    std::cout << e;
}

This prints 2e+38 instead of inf. Now, if you change d's type to double, there is a conversion at e = d*d, the footnote applies, and your program should print inf. And GCC does behave in a standard-conformant way here: it prints inf indeed (tested with gcc 5.4.1 and gcc 8.1.0).
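Concretely, the double variant just described:

#include <iostream>

int main() {
    double d = 2.0e+38;   // double now
    float e;
    e = d * d;            // conversion to float here: the footnote applies
    e = e / 2e+38f;
    std::cout << e;       // prints inf
}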

Properties of 80-bit extended precision computations starting from double precision arguments

Yes, interpol_80() is safe; let's demonstrate it.

The problem states that the inputs are 64-bit floats:

rnd64(ui) = ui

The result is exactly (assuming * and + are mathematical operations)

r = u2*(1-u1)+(u1*u3)
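For reference, here is a minimal sketch of what interpol_80() is assumed to compute (the question's actual code is not reproduced here): the formula above, evaluated in 80-bit extended precision and rounded back to double, assuming long double is the x87 extended format:

double interpol_80(double u1, double u2, double u3)
{
    long double t = (long double)u2 * (1.0L - (long double)u1)
                  + (long double)u1 * (long double)u3;
    return (double)t;   // the final rnd64(rf80)
}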

Optimal return value rounded to 64 bit float is

r64 = rnd64(r)

As we have these properties

u2 <= r <= u3

It is guaranteed (rounding is monotonic, and rnd64(u2) = u2, rnd64(u3) = u3) that

rnd64(u2) <= rnd64(r) <= rnd64(u3)
u2 <= r64 <= u3

The conversions of u1, u2, u3 to 80 bits are exact too:

rnd80(ui)=ui

Now, let's assume 0 <= u2 <= u3. Performing the computation with inexact float operations leads to at most 4 rounding errors:

rf = rnd(rnd(u2*rnd(1-u1)) + rnd(u1*u3))

Assuming round-to-nearest-even, this will be at most 2 ULP off the exact value. If the rounding is performed with 64-bit floats or with 80-bit floats:

r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)
r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)

rf64 can be off by 2 ulp, so interpol_64() is unsafe, but what about rnd64(rf80)?

We can tell that:

rnd64(r - 2 ulp80(r)) <= rnd64(rf80) <= rnd64(r + 2 ulp80(r))

Since 0 <= u2 <= u3, then

ulp80(u2) <= ulp80(r) <= ulp80(u3)
rnd64(u2 - 2 ulp80(u2)) <= rnd64(r - 2 ulp80(r)) <= rnd64(rf80)
rnd64(u3 + 2 ulp80(u3)) >= rnd64(r + 2 ulp80(r)) >= rnd64(rf80)

Fortunately, since every number in the range (u2 - ulp64(u2)/2, u2 + ulp64(u2)/2) rounds to u2 (and likewise around u3), we get

rnd64(u2 - 2 ulp80(u2)) = u2
rnd64(u3 + 2 ulp80(u3)) = u3

since ulp80(x) = ulp64(x)/2^(64-53)

We thus get the proof

u2 <= rnd64(rf80) <= u3

For u2 <= u3 <= 0, we can easily apply the same proof.

The last case to be studied is u2 <= 0 <= u3. If we subtract two big values, the result can be up to ulp(big)/2 off rather than ulp(big - big)/2…

Thus the assertion we made doesn't hold anymore:

r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)

Fortunately, u2 <= u2*(1-u1) <= 0 <= u1*u3 <= u3, and this is preserved after rounding:

u2 <= rnd(u2*rnd(1-u1)) <= 0 <= rnd(u1*u3) <= u3

Thus, since the added quantities are of opposite signs:

u2 <= rnd(u2*rnd(1-u1)) + rnd(u1*u3) <= u3

The same goes after rounding, so we can once again guarantee

u2 <= rnd64( rf80 ) <= u3

QED

To be complete, we should take care of denormal inputs (gradual underflow), but I hope you won't be that vicious with your stress tests. I won't demonstrate what happens with those…
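If you nevertheless want to stress it empirically, here is a quick, non-exhaustive randomized check of the property u2 <= rnd64(rf80) <= u3, assuming the interpol_80() sketch above and an x87 long double; it is no substitute for the proof:

#include <algorithm>
#include <cassert>
#include <random>

// The sketch from above, repeated so this file is self-contained:
double interpol_80(double u1, double u2, double u3)
{
    long double t = (long double)u2 * (1.0L - (long double)u1)
                  + (long double)u1 * (long double)u3;
    return (double)t;
}

int main()
{
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> coef(0.0, 1.0), val(-1e6, 1e6);
    for (int i = 0; i < 1000000; ++i) {
        double u1 = coef(gen);
        double a = val(gen), b = val(gen);
        double u2 = std::min(a, b), u3 = std::max(a, b);
        double r = interpol_80(u1, u2, u3);
        assert(u2 <= r && r <= u3);   // the bound proven above
    }
    return 0;
}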

EDIT:

Here is a follow-up, as the following assertion was a bit approximate and generated some comments in the case 0 <= u2 <= u3:

r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)

We can write the following inequalities:

rnd(1-u1) <= 1
rnd(1-u1) <= 1-u1+ulp(1)/4
u2*rnd(1-u1) <= u2 <= r
u2*rnd(1-u1) <= u2*(1-u1)+u2*ulp(1)/4
u2*ulp(1) < 2*ulp(u2) <= 2*ulp(r)
u2*rnd(1-u1) < u2*(1-u1)+ulp(r)/2

For the next rounding operation, we use

ulp(u2*rnd(1-u1)) <= ulp(r)
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(u2*rnd(1-u1))/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(r)/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)

For the second part of the sum, we have:

u1*u3 <= r
rnd(u1*u3) <= u1*u3 + ulp(u1*u3)/2
rnd(u1*u3) <= u1*u3 + ulp(r)/2

rnd(u2*rnd(1-u1))+rnd(u1*u3) < u2*(1-u1)+u1*u3 + 3*ulp(r)/2
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 3*ulp(r)/2 + ulp(r+3*ulp(r)/2)/2
ulp(r+3*ulp(r)/2) <= 2*ulp(r)
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 5*ulp(r)/2

I didn't quite prove the original claim, but the bound obtained is not that far off…


