Math Precision Requirements of C and C++ Standard

Math precision requirements of C and C++ standard

No: the C and C++ standards do not require math library results to be exactly (correctly) rounded, and for good reason. In general, you would need infinite precision (and infinite time) to determine the exact mathematical result. Most of the time only a few extra iterations are needed to determine enough bits for rounding, but the number of extra bits depends on the exact result (simply put: on how close the result is to 0.5 ULP, the so-called table-maker's dilemma). Even determining the number of extra iterations required is highly non-trivial. As a result, requiring exact results would be far, far slower than a pragmatic approach.
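As a rough illustration of the 0.5 ULP issue (a minimal sketch, not tied to any particular libm), one can look at the spacing of representable doubles around a computed result:

#include <cmath>
#include <cstdio>

int main()
{
    double x = 0.1;
    double y = std::sin(x);                      // library result, typically within ~1 ULP
    double up   = std::nextafter(y,  INFINITY);  // adjacent representable doubles
    double down = std::nextafter(y, -INFINITY);
    // If the (unknown) exact value of sin(0.1) falls almost exactly halfway
    // between two representable doubles, deciding which one is the correctly
    // rounded answer requires computing many extra bits.
    std::printf("down = %.17g\ny    = %.17g\nup   = %.17g\n", down, y, up);
    std::printf("1 ULP here = %.17g\n", up - y);
}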

C/C++: Are IEEE 754 float addition/multiplication/... and int-to-float conversion standardized?

No, unless the macro __STDC_IEC_559__ is defined.

Basically, the standard does not require IEEE 754-compatible floating point, so most compilers use whatever floating point support the hardware provides. If the hardware provides IEEE-compatible floating point, most compilers for that target will use it and predefine the __STDC_IEC_559__ macro.

If the macro is defined, the implementation guarantees the bit representation (but not the byte order) of float and double to be 32-bit and 64-bit IEEE 754, respectively. This in turn makes double arithmetic bit-exact; note, however, that the C standard still allows float arithmetic to be evaluated at either 32-bit or 64-bit precision (see FLT_EVAL_METHOD).
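A minimal sketch of checking for the macro at compile time:

#include <cstdio>

int main()
{
#ifdef __STDC_IEC_559__
    std::puts("__STDC_IEC_559__ defined: IEC 60559 (IEEE 754) semantics promised");
#else
    std::puts("__STDC_IEC_559__ not defined: no IEEE 754 promise from this implementation");
#endif
}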

The C standard requires that float-to-int conversion behave like the trunc function (truncation toward zero) when the result is in range for the resulting type, but unfortunately IEEE doesn't actually define the behavior of library functions, just of basic arithmetic. The C spec also allows the compiler to reorder operations in violation of IEEE 754 (which might affect precision), but most compilers that support IEEE 754 will not do that without a command-line option.
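For example (a small sketch), the conversion truncates toward zero just like trunc, and out-of-range conversions are undefined behavior:

#include <cmath>
#include <cstdio>

int main()
{
    std::printf("(int)2.9f   = %d\n", (int)2.9f);          // 2
    std::printf("(int)-2.9f  = %d\n", (int)-2.9f);         // -2, not -3
    std::printf("trunc(-2.9) = %g\n", std::trunc(-2.9));   // -2
    // (int)1e30f would be undefined behaviour: the result does not fit in int.
}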

Anecdotal evidence also suggests that some compilers do not define the macro even though they should, while other compilers define it when they should not (they do not follow all the requirements of IEEE 754 strictly). These cases should probably be considered compiler bugs.

Does the C++ standard specify anything on the representation of floating point numbers?

From N3337:

[basic.fundamental/8]: There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.

If you want to check whether your implementation uses IEEE-754, you can use std::numeric_limits<T>::is_iec559:

static_assert(std::numeric_limits<double>::is_iec559,
              "This code requires IEEE-754 doubles");

There are a number of other helper traits in this area, such as has_infinity, has_quiet_NaN, and more (along with the corresponding infinity() and quiet_NaN() members).
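For instance, a quick sketch that queries a few of these traits for double:

#include <cstdio>
#include <limits>

int main()
{
    using L = std::numeric_limits<double>;
    std::printf("is_iec559     : %d\n", L::is_iec559);
    std::printf("has_infinity  : %d\n", L::has_infinity);
    std::printf("has_quiet_NaN : %d\n", L::has_quiet_NaN);
    std::printf("digits (bits) : %d\n", L::digits);        // 53 for IEEE binary64
    std::printf("max_digits10  : %d\n", L::max_digits10);  // 17 for IEEE binary64
}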

Float operation difference in C vs C++

Introduction: Given that the question is not detailed enough, I am left to speculate that this is the infamous GCC bug 323. As the low bug ID suggests, this bug has been there forever: the report has existed since June 2000, currently has 94 (!) duplicates, and the last one was reported only half a year ago (on 2018-08-28). The bug affects only 32-bit executables on Intel computers (e.g., under Cygwin). I assume that the OP's code uses x87 floating point instructions, which are the default for 32-bit executables, while SSE instructions are only optional. Since 64-bit executables are more prevalent than 32-bit ones and no longer depend on x87 instructions, this bug has zero chance of ever being fixed.

Bug description: The x87 architecture has 80-bit floating point registers. A float requires only 32 bits. The bug is that x87 floating point operations are always done with 80-bit accuracy (subject to a hardware configuration flag). This extra accuracy makes precision very flaky, because it depends on when the registers are spilled (written) to memory.

If an 80-bit register is spilled into a 32-bit variable in memory, the extra precision is lost. This would be the correct behavior if it happened after each floating point operation (since float is supposed to be 32 bits). However, spilling to memory slows things down, and no compiler writer wants the executable to run slowly. So by default the values are not spilled to memory.

Now, sometimes the value is spilled to memory and sometimes it is not. It depends on the optimization level, on compiler heuristics, and on other seemingly random factors. Even with -O0 there can be slightly different strategies for spilling the x87 registers to memory, resulting in slightly different results. The spilling strategy is probably the difference between the C and C++ compilers that you are experiencing.

Workaround:
For ways to handle this, please read about C's handling of excess precision. Try running your compiler with -fexcess-precision=standard and compare it with -fexcess-precision=fast. You can also try playing with -mfpmath=sse.
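If changing compiler flags is not an option, a common source-level workaround (a sketch of the same idea the flush() helper below uses; the function names here are only illustrative) is to force intermediate results through memory:

// Forces a store, discarding any excess (x87 80-bit) precision.
static float round_to_float(float x)
{
    volatile float v = x;
    return v;
}

float sum3(float a, float b, float c)
{
    // Without the helper, (a + b) may stay in an 80-bit register and be
    // rounded to float only at some compiler-chosen spill point.
    return round_to_float(round_to_float(a + b) + c);
}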

NOTE: According to the C++ standard this is not really a bug. However, it is a bug according to the documentation of GCC, which claims to follow the IEEE-754 FP standard on Intel architectures (as it does on many other architectures). Obviously bug 323 violates the IEEE-754 standard.

NOTE 2: Some optimization settings (e.g. GCC's -Ofast) enable -ffast-math, and then all bets are off regarding extra precision and evaluation order.

EDIT: I have simulated the described behavior on an Intel 64-bit system and got the same results as the OP. Here is the code:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

float hex2float(uint32_t num);
void print(const char* label, float val);
float flush(float x);

int main()
{
    float a = hex2float(0x1D9969BB);
    float b = hex2float(0x6CEDC83E);
    float c = hex2float(0xAC89452F);
    float d = hex2float(0xD2DC92B3);
    float e = hex2float(0x4FE9F23C);
    /* Sum evaluated in double precision, rounded to float once at the end. */
    float result = (float)((double)a+b-c+d+e);
    print("result", result);
    /* Every intermediate result forced back to float via flush(). */
    result = flush(flush(flush(flush(a+b)-c)+d)+e);
    print("result2", result);
    return 0;
}

The implementations of the support functions:

/* Reverse the byte order of the given pattern and reinterpret it as a float. */
float hex2float(uint32_t num)
{
    uint32_t rev = (num >> 24) | ((num >> 8) & 0xff00) |
                   ((num << 8) & 0xff0000) | (num << 24);
    float f;
    memcpy(&f, &rev, 4);
    return f;
}

/* Print a float as decimal and as its raw bytes in memory order. */
void print(const char* label, float val)
{
    const unsigned char* p = (const unsigned char*)&val;
    printf("%10s (%13.10f) : 0x%02X%02X%02X%02X\n",
           label, val, p[0], p[1], p[2], p[3]);
}

/* Force the value through memory, discarding any excess precision. */
float flush(float x)
{
    volatile float buf = x;
    return buf;
}

After running this I got exactly the same difference between the results:

  result ( 0.4185241461) : 0xCC48D63E
result2 ( 0.4185241759) : 0xCD48D63E

For some reason this is different from the "pure" version described in the question. At one point I was also getting the same results as the "pure" version, but the question has since changed. The original values in the original question were different. They were:

float a = hex2float(0x1D9969BB);
float b = hex2float(0x6CEDC83E);
float c = hex2float(0xD2DC92B3);
float d = hex2float(0xA61FD930);
float e = hex2float(0x4FE9F23C);

and with these values the resulting output is:

   result ( 0.4185242951) : 0xD148D63E
result2 ( 0.4185242951) : 0xD148D63E

What is the standard way to maintain accuracy when dealing with incredibly precise floating point calculations in C++?

Maintaining precision when porting code like this is very difficult to do. Not because the languages have inherently different ideas of what a float is, but because the algorithms involved, and their assumptions about accuracy limits, differ. For example, when performing numerical integration, Scilab may use a Gaussian quadrature method, whereas you might try a trapezoidal method. Both may be working on identical IEEE 754 single-precision floating point numbers, but you will get different answers due to the convergence characteristics of the two algorithms (a short sketch at the end of this answer illustrates this). So how do you get around this?

Well, you can go through the Scilab source code and look at all of the algorithms it uses for each thing you need. You can then replicate these algorithms taking care of any pre- or post-conditioning of the data that Scilab implicitly does (if any at all). That's a lot of work. And, frankly, probably not the best way to spend your time. Rather, I would look into using the Interfacing with Other Languages section from the developer's documentation to see how you can call the Scilab functions directly from your C, C++, Java, or Fortran code.

Of course, with the second option, you have to consider how you are going to distribute your code (if you need to). Scilab has a GPL-compatible license, so you can just bundle it with your code. However, it is quite big (~180 MB), and you may want to bundle only the pieces you need (e.g., you don't need the whole interpreter system). This is more work in a different way, but guarantees numerical compatibility with your current Scilab solutions.
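As for the earlier quadrature remark, here is a minimal sketch (with a hypothetical integrand; neither routine is taken from Scilab) showing two textbook rules disagreeing in the low-order digits on identical IEEE 754 inputs:

#include <cmath>
#include <cstdio>

// Trapezoidal rule over [a, b] with n panels.
template <class F>
double trapezoid(F f, double a, double b, int n)
{
    double h = (b - a) / n, s = 0.5 * (f(a) + f(b));
    for (int i = 1; i < n; ++i) s += f(a + i * h);
    return s * h;
}

// Two-point Gauss-Legendre rule applied per panel.
template <class F>
double gauss2(F f, double a, double b, int n)
{
    double h = (b - a) / n, s = 0.0, c = 1.0 / std::sqrt(3.0);
    for (int i = 0; i < n; ++i) {
        double mid = a + (i + 0.5) * h;
        s += f(mid - 0.5 * h * c) + f(mid + 0.5 * h * c);
    }
    return 0.5 * h * s;
}

int main()
{
    auto f = [](double x) { return std::exp(-x * x); };
    std::printf("trapezoid : %.17g\n", trapezoid(f, 0.0, 1.0, 100));
    std::printf("gauss-2   : %.17g\n", gauss2(f, 0.0, 1.0, 100));
}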

How to compare two Math library implementations?

Some thoughts:

  • You can use the GNU Multiple Precision Arithmetic Library (GnuMP) to generate good reference results.
  • It is possible to test most, if not all, of the single-argument single-precision (IEEE-754 binary32) routines exhaustively; a sketch of such a test follows this list. (For some of the macOS trigonometric routines, such as sinf, we tested one implementation exhaustively, verifying that it returned faithfully rounded results, meaning the result was the mathematical value [if representable] or one of the two adjacent values [if not]. Then, when changing implementations, we compared one to the other. If a new-implementation result was identical to the old-implementation result, it passed. Otherwise, GnuMP was used to test it. Since new implementations largely coincided with old implementations, this resulted in few invocations of GnuMP, so we were able to exhaustively test a new routine implementation in about three minutes, if I recall correctly.)
  • It is not feasible to test the multiple-argument or double-precision routines exhaustively.
  • When comparing implementations, you have to choose a metric, or several metrics. A library that has a good worst-case error is good for proofs; its bound can be asserted to hold true for any argument, and that can be used to derive further bounds in subsequent computations. But a library that has a good average error may tend to produce better results for, say, physics simulations that use large arrays of data. For some applications, only the errors in a “normal” domain may be relevant (angles around −2π to +2π), so errors in reducing large arguments (up to around 10^308) may be irrelevant because those arguments are never used.
  • There are some common points where various routines should be tested. For example, for the trigonometric routines, test at various fractions of π. Aside from being mathematically interesting, these tend to be where implementations switch between approximations, internally. Also test at large numbers that are representable but happen to be very near multiples of simple fractions of π. These are the worst cases for argument reduction and can yield huge relative errors if not done correctly. They require number theory to find. Testing in any sort of scattershot approach, or even orderly approaches that fail to consider this reduction problem, will fail to find these troublesome arguments, so it would be easy to report as accurate a routine that had huge errors.
  • On the other hand, there are important points to test that cannot be known without internal knowledge of the implementation. For example, when designing a sine routine, I would use the Remez algorithm to find a minimax polynomial, aiming for it to be good from, say, –π/2 to +π/2 (quite large for this sort of thing, but just for example). Then I would look at the arithmetic and rounding errors that could occur during argument reduction. Sometimes they would produce a result a little outside that interval. So I would go back to the minimax polynomial generation and push for a slightly larger interval. And I would also look for improvements in the argument reduction. In the end, I would end up with a reduction guaranteed to produce results within a certain interval and a polynomial known to be good to a certain accuracy within that interval. To test my routine, you then need to know the endpoints of that interval, and you have to be able to find some arguments for which the argument reduction yields points near those endpoints, which means you have to have some idea of how my argument reduction is implemented—how many bits does it use, and such. Like the troublesome arguments mentioned above, these points cannot be found with a scattershot approach. But unlike those above, they cannot be found from pure mathematics; you need information about the implementation. This makes it practically impossible to know you have compared the worst potential arguments for implementations.
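As a rough sketch of the exhaustive single-precision testing mentioned above (using the double-precision libm as the reference here purely as a simplification; the real procedure compared disputed cases against GNU MP results):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    std::uint64_t suspect = 0;
    for (std::uint64_t bits = 0; bits <= 0xFFFFFFFFu; ++bits) {
        std::uint32_t b = static_cast<std::uint32_t>(bits);
        float x;
        std::memcpy(&x, &b, sizeof x);
        if (!std::isfinite(x)) continue;   // skip NaNs and infinities

        float got = std::sin(x);                                            // routine under test
        float ref = static_cast<float>(std::sin(static_cast<double>(x)));   // reference, rounded to float

        // Loose faithful-rounding check: accept the rounded reference or
        // either of its float neighbours; anything else would go to a slower,
        // higher-precision recheck (GNU MP in the real setup).
        if (got != ref &&
            got != std::nextafterf(ref, -INFINITY) &&
            got != std::nextafterf(ref,  INFINITY))
            ++suspect;
    }
    std::printf("arguments needing a high-precision recheck: %llu\n",
                static_cast<unsigned long long>(suspect));
}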

What is the minimum number of significant decimal digits in a floating point literal to represent the value as correct as possible?

What about double or even arbitrary precision? Is there a simple formula to derive the number of digits needed?

From C17 § 5.2.4.2.2 11, FLT_DECIMAL_DIG, DBL_DECIMAL_DIG, LDBL_DECIMAL_DIG:

number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,

p log10 b            if b is a power of 10
⌈1 + p log10 b⌉      otherwise
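As a quick check of the quoted definition (a sketch; the *_DECIMAL_DIG macros require C11, or C++17 for <cfloat>), the macros match ⌈1 + p log10 2⌉ for the usual binary formats:

#include <cfloat>
#include <cmath>
#include <cstdio>

int main()
{
    // For binary formats (b == 2 is not a power of 10): ceil(1 + p * log10(2))
    std::printf("float : p=%3d  formula=%2d  FLT_DECIMAL_DIG =%3d\n", FLT_MANT_DIG,
                (int)std::ceil(1 + FLT_MANT_DIG * std::log10(2.0)), FLT_DECIMAL_DIG);
    std::printf("double: p=%3d  formula=%2d  DBL_DECIMAL_DIG =%3d\n", DBL_MANT_DIG,
                (int)std::ceil(1 + DBL_MANT_DIG * std::log10(2.0)), DBL_DECIMAL_DIG);
    std::printf("ldbl  : p=%3d  formula=%2d  LDBL_DECIMAL_DIG=%3d\n", LDBL_MANT_DIG,
                (int)std::ceil(1 + LDBL_MANT_DIG * std::log10(2.0)), LDBL_DECIMAL_DIG);
}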



But I'm interested in the math behind it. How can one be sure that 9 digits is enough in this case?

Each binade of binary floating point, like [1.0 ... 2.0), [128.0 ... 256.0), [0.125 ... 0.25), contains 2^(p−1) values, uniformly distributed. E.g., with float, p = 24, so each such range holds 8,388,608 values.

Each decade of decimal text with n significant digits in exponential notation, like [1.0 ... 9.999...), [100.0 ... 999.99...), [0.001 ... 0.009999...), contains 9·10^(n−1) values, uniformly distributed.

Example: common float:

When p is 24, with its 2^24 (16,777,216) combinations, n must be at least 8 for the decimal text to offer enough distinct combinations to round-trip float to decimal text to float. But the end-points of the decimal decades above can fall well inside a binade, and just above such a decade boundary the decimal values are spaced further apart than the floats there. This necessitates a +1 decimal digit.

Example:

Consider the 2 adjacent float values

10.000009_5367431640625
10.000010_49041748046875

Both convert to the same decimal text with 8 significant digits, "10.000010". So 8 is not enough.

9 is always enough, as we do not need more than 167,772,160 decimal combinations to distinguish 16,777,216 float values.
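A short sketch reproducing the example above: the two adjacent floats collide when printed with 8 significant digits but stay distinct with 9:

#include <cmath>
#include <cstdio>

int main()
{
    float a = 10.0000095367431640625f;      // exactly representable
    float b = std::nextafterf(a, 11.0f);    // the adjacent float above: 10.00001049...
    std::printf("%.8g  %.8g\n", a, b);      // prints: 10.00001  10.00001
    std::printf("%.9g  %.9g\n", a, b);      // prints: 10.0000095  10.0000105
}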


OP also asks about 8388609.499. (Let us only consider float for simplicity.)

That value is nearly half-way between 2 float values.

8388609.0f   // Nearest lower float value
8388609.499  // OP's constant as code
8388610.0f   // Nearest upper float value

OP reports: "You can see 8388609.499 needs more than 9 digits to be most accurately converted to float."

And let us review the title "What is the minimum number of significant decimal digits in a floating point literal*1 to represent the value as correct as possible?"

This new question part emphasizes that the value in question is the value of the source code 8388609.499 and not the floating point constant it becomes in emitted code: 8388609.0f.

If we consider the value to be the value of the floating point constant, only up to 9 significant decimal digits are needed to define the floating point constant 8388609.0f: 8388609.49, as source code, is sufficient.
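A quick sketch of that claim (assuming the usual IEEE 754 float and round-to-nearest):

#include <cstdio>

int main()
{
    float full = 8388609.499f;   // more than 9 significant digits in source
    float nine = 8388609.49f;    // truncated to 9 significant digits
    std::printf("%.1f  %.1f\n", full, nine);   // both print 8388609.0
}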

But getting the closest floating point constant for some arbitrary number written as code can indeed take many digits.

Consider the typical smallest float, FLT_TRUE_MIN, with the exact decimal value of:

0.00000000000000000000000000000000000000000000140129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125

Halfway between that and 0.0 is 0.000..(~39 more zeroes)..0007006..(~100 more digits)..15625.

If that last digit were a 6 or a 4 instead, the closest float would be FLT_TRUE_MIN or 0.0f, respectively. So now we have a case where 109 significant digits are "needed" to select between 2 possible float values.

To keep us from going over the cliffs of insanity, IEEE-754 has already addressed this.

The number of significant decimal digits a translator (compiler) must examine to be compliant with that spec (not necessarily the C spec) is far more limited, even if examining the extra digits could select a different FP value.

IIRC, it is in effect FLT_DECIMAL_DIG + 3. So for a common float, as few as 9 + 3 significant decimal digits may need to be examined.

[Edit]

correct rounding is only guaranteed for the number of decimal digits required plus 3 for the largest supported binary format.


*1 C does not define the term floating point literal, but it does define floating point constant, so that term is used here.


