Comparison of Double, Long Double, Float and Float128

Comparison of double, long double, float and float128?

My platform does not have a __float128, but here's an example that prints float, double, and long double at their maximum round-trip precision:

#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>

int main()
{
    float       PI0 = std::acos(-1.0F);
    double      PI1 = std::acos(-1.0);
    long double PI2 = std::acos(-1.0L);

    // max_digits10 is the number of decimal digits needed to
    // round-trip a value of each type exactly.
    constexpr auto PI0_max_digits10 = std::numeric_limits<decltype(PI0)>::max_digits10;
    constexpr auto PI1_max_digits10 = std::numeric_limits<decltype(PI1)>::max_digits10;
    constexpr auto PI2_max_digits10 = std::numeric_limits<decltype(PI2)>::max_digits10;

    std::cout << "Float=" << std::setprecision(PI0_max_digits10) << PI0 << std::endl;
    std::cout << "Double=" << std::setprecision(PI1_max_digits10) << PI1 << std::endl;
    std::cout << "LongDouble=" << std::setprecision(PI2_max_digits10) << PI2 << std::endl;
}
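On a typical x86-64 Linux build, this prints the following (the long double line is platform-dependent; here it reflects the x87 80-bit format):

Float=3.14159274
Double=3.1415926535897931
LongDouble=3.14159265358979323851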

long double (GCC specific) and __float128

Ad 1.

Those types are designed to work with numbers of huge dynamic range. long double is implemented natively in the x87 FPU. The 128-bit double would, I suspect, be implemented in software on modern x86s, as there is no hardware support for 128-bit computations.

The funny thing is that it's quite common to do many floating-point operations in a row, with the intermediate results not actually stored in the declared variables but kept in FPU registers at full precision. That's why a comparison like:

double x = sin(0); if (x == sin(0)) printf("Equal!");

is not safe and cannot be guaranteed to work (without additional switches).
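As a minimal sketch (my own example, not from the original answer), comparing with an explicit tolerance sidesteps the excess-precision problem:

#include <cmath>
#include <cstdio>

int main()
{
    double x = std::sin(0.5);
    // Exact equality may fail when one operand still lives in an
    // 80-bit x87 register; compare against a tolerance instead.
    if (std::fabs(x - std::sin(0.5)) < 1e-12)
        std::printf("Equal (within tolerance)!\n");
}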

Ad 3.

There's an impact on speed depending on what precision you use. You can change the precision used by the FPU with:

void
set_fpu (unsigned int mode)
{
    // Load the x87 FPU control word; bits 8-9 (the precision-control
    // field) select 24-, 53- or 64-bit significand arithmetic.
    asm ("fldcw %0" : : "m" (*&mode));
}
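As a usage sketch, these are the standard x87 control-word values (only the precision-control field differs between them; everything else is the power-on default):

set_fpu(0x007F);  // 24-bit significands (float precision)
set_fpu(0x027F);  // 53-bit significands (double precision)
set_fpu(0x037F);  // 64-bit significands (extended precision, the usual default)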

Computation will be faster for shorter types and slower for longer ones. 128-bit doubles will probably be done in software, so they will be much slower.

It's not only about RAM being wasted, it's about cache being wasted. Going from a 64-bit double to an 80-bit long double wastes from 33% of each value's storage (on 32-bit targets, where it is padded to 12 bytes) to almost 50% (on 64-bit targets, where it is padded to 16 bytes), cache included.
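A quick way to see the padding on your own machine (the sizes in the comment are what GCC on x86-64 reports; 32-bit x86 typically pads long double to 12 bytes instead):

#include <cstdio>

int main()
{
    // Typical x86-64 GCC output: float: 4, double: 8, long double: 16
    std::printf("float: %zu, double: %zu, long double: %zu\n",
                sizeof(float), sizeof(double), sizeof(long double));
}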

Ad 4.

On the other hand, I understand that the long double type is mutually exclusive with -mfpmath=sse, as there is no such thing as "extended precision" in SSE. __float128, on the other hand, should work just perfectly fine with SSE math (though in the absence of quad-precision instructions certainly not on a 1:1 instruction base). Am I right in these assumptions?

The FPU and SSE units are totally separate. You can write code that uses the FPU at the same time as SSE. The question is what the compiler will generate if you constrain it to use only SSE: will it try to use the FPU anyway? I've been doing some programming with SSE, and GCC will generate only scalar (SISD) code on its own; you have to help it to use the SIMD versions. __float128 will probably work on every machine, even an 8-bit AVR microcontroller, since it's just fiddling with bits after all.

An 80-bit value written in hex is actually 20 hex digits. Maybe the bits which are not used are left over from some old operation? On my machine, I compiled your code and only 20 hex digits change in the long double case: 66b4e0d2-ec09c1d5-00007ffe-deadbeef

The 128-bit version has all the bits changing. Looking at the objdump output, it looks as if it is using software emulation; there are almost no FPU instructions.

Further, LDBL_MAX seems to work as +inf for both long double and __float128. Adding or subtracting a number like 1.0E100 or 1.0E2000 to/from LDBL_MAX results in the same bit pattern. Up to now, it was my belief that the foo_MAX constants were to hold the largest representable number that is not +inf (apparently that isn't the case?).

This seems to be strange...

I'm also not quite sure how an 80-bit number could conceivably act as +inf for a 128-bit value... maybe I'm just too tired at the end of the day and have done something wrong.

It's probably being extended: a bit pattern recognized as +inf in the 80-bit format translates to +inf in the 128-bit float too. Note, though, that LDBL_MAX itself is a perfectly finite value, and there is a simpler explanation for the observation: the gap between LDBL_MAX and its nearest representable neighbor is about 1.0E4912, so adding or subtracting 1.0E100 or even 1.0E2000 simply rounds back to LDBL_MAX and leaves the bit pattern unchanged.

What is the internal precision of numpy.float128?

It's recommended to use np.longdouble instead of np.float128, since the latter is quite a mess at the moment. Beware also that Python floats are 64-bit, so values will be cast through float64 during initialization and can lose precision before the longdouble is ever created.

Inside NumPy, float128 can be a double or a long double. It's defined in npy_common.h and depends on your platform. I don't know whether you can include that header out-of-the-box in your own source code.

If you don't need performance in this part of your algorithm, a safer way could be to export the value to a string and use strtold afterwards.
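A minimal sketch of that string route on the C/C++ side (my own example, not from the original answer):

#include <cstdio>
#include <cstdlib>

int main()
{
    // Parse the full decimal string directly at long double precision;
    // going through a 64-bit double first would discard digits.
    const char *s = "3.141592653589793238462643383279";
    long double v = std::strtold(s, nullptr);
    std::printf("%.21Lg\n", v);
}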

More Precise Floating point Data Types than double?

According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
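For the record, the 19.26 figure is just 64 · log10(2) ≈ 19.266, the number of decimal digits a 64-bit significand can carry.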

The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits (113 · log10(2) ≈ 34.02). GCC implements this as the __float128 type, and there is (if memory serves) a compiler option to set long double to it.
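A minimal sketch of using __float128 through GCC's libquadmath (GCC-specific; compile with g++ and link with -lquadmath):

#include <quadmath.h>
#include <cstdio>

int main()
{
    // acosq and quadmath_snprintf come from libquadmath;
    // Q is its printf length modifier for __float128.
    __float128 pi = acosq(-1.0);
    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.34Qg", pi);
    std::printf("pi to 34 digits: %s\n", buf);
}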

float128 and double-double arithmetic

“In this case, we use two double to store the value. So we need to make two operations at each time.”

This is not how double-double arithmetic works. You should expect one double-double operation to be implemented in anywhere from 6 to 20 double operations, depending on the actual operation being implemented, the availability of a fused multiply-add instruction, the assumption that one operand is larger than the other, …

For instance, here is one implementation of a double-double multiplication for when an FMA instruction is not available, taken from CRlibm:

#define Mul22(zh, zl, xh, xl, yh, yl)                            \
{                                                                \
    double mh, ml;                                               \
                                                                 \
    const double c = 134217729.;  /* 2^27 + 1, the split point */\
    double up, u1, u2, vp, v1, v2;                               \
                                                                 \
    /* Veltkamp split of xh and yh into exact halves */          \
    up = (xh) * c;          vp = (yh) * c;                       \
    u1 = ((xh) - up) + up;  v1 = ((yh) - vp) + vp;               \
    u2 = (xh) - u1;         v2 = (yh) - v1;                      \
                                                                 \
    /* Exact product of the high parts: mh + ml == xh * yh */    \
    mh = (xh) * (yh);                                            \
    ml = (((u1 * v1 - mh) + (u1 * v2)) + (u2 * v1)) + (u2 * v2); \
                                                                 \
    /* Add the cross terms, then renormalize */                  \
    ml += (xh) * (yl) + (xl) * (yh);                             \
    *zh = mh + ml;                                               \
    *zl = mh - (*zh) + ml;                                       \
}

The first 8 operations alone split each of the high doubles exactly into two halves, so that one half from each side can be multiplied with one half from the other side and the result obtained exactly as a double. The computations u1*v1, u1*v2, … do exactly that.

The values obtained in mh and ml can overlap, so the last 3 operations renormalize the result into the sum of two non-overlapping floating-point numbers.
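That renormalization is just Dekker's Fast2Sum step; as a standalone sketch (valid assuming |a| >= |b|, which holds here since mh carries the leading bits):

// Fast2Sum: for |a| >= |b|, s = fl(a + b) and t is the exact
// round-off, so that a + b == s + t holds exactly.
void fast_two_sum(double a, double b, double &s, double &t)
{
    s = a + b;
    t = b - (s - a);
}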

“In this case, can we have round-off errors on each double, or is there a mechanism that avoids this?”

As the comment says:

/*
* computes double-double multiplication: zh+zl = (xh+xl) * (yh+yl)
* relative error is smaller than 2^-102
*/

You can read about all the mechanisms used to achieve these results in the Handbook of Floating-Point Arithmetic.

A more accurate data type than float or double? C++

In some compilers, and on some architectures, long double will give you more precision than double. If you are on an x86 platform, the x87 FPU has an "extended" 80-bit floating-point format. GCC and Borland compilers give you an 80-bit value when you use the long double type. Note that Visual Studio does not support this (the maximum supported by MSVC is double precision, 64-bit).
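A quick, portable way to check what a given compiler actually gives you for long double (my own example):

#include <iostream>
#include <limits>

int main()
{
    // 64 significand bits => x87 80-bit extended (e.g. GCC on x86);
    // 53 => long double is just a 64-bit double (e.g. MSVC).
    std::cout << "long double significand bits: "
              << std::numeric_limits<long double>::digits << '\n';
}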

There is something called "double-double", a software technique that pairs two 64-bit doubles to approach quad-precision accuracy (about 106 significand bits). You can find libraries that implement it.

You could also investigate libraries for arbitrary precision arithmetic.

For some calculations, a 64-bit integer is a better choice than a 64-bit floating-point value (exact currency amounts in cents, for example).

But if your question is about built-in types in current C++ compilers on common platforms, then the answer is that you're limited to double (64-bit floating point), and on 64-bit platforms you have 64-bit ints. If you can stick to x86 and use the right compiler, you can also have long double (80-bit extended precision).

You might be interested in this question:

long double (GCC specific) and __float128


