Ensuring C++ Doubles Are 64 Bits

Ensuring C++ doubles are 64 bits

An improvement on the other answers (which assume a char is 8 bits; the standard does not guarantee this) would be something like this:

char a[sizeof(double) * CHAR_BIT == 64]; // ill-formed (zero-sized array) if double is not 64 bits

or

BOOST_STATIC_ASSERT(sizeof(double) * CHAR_BIT == 64);

You can find CHAR_BIT defined in <limits.h> or <climits>.
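
If C++11 is available, a plain static_assert expresses the same check without Boost or the array trick; a minimal sketch:

#include <climits>   // CHAR_BIT

static_assert(sizeof(double) * CHAR_BIT == 64,
              "double is not 64 bits on this platform");

int main() {}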

How to have both 32bit and 64bit float in C++

This might seem obvious, nevertheless:

On Intel platforms and many others, float is a 32-bit floating-point value and double is a 64-bit floating-point value. Try this approach; most likely it will work.

To be absolutely sure, check the sizeof of your types at the start of your program, or statically during compilation if your compiler allows it.

Once again, try the simple solution first.

Both float and double arithmetic are implemented in hardware on Intel, and both are fast. In any case, native arithmetic is the fastest you can get from the CPU.

IEEE 754 (http://en.wikipedia.org/wiki/IEEE_floating_point) defines not one floating-point format but several, of 4, 8, 16 bytes and so on. They differ in range and precision, but they are all still IEEE formats.
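
If your compiler supports C++11, the compile-time check mentioned above might look like this (a minimal sketch; std::numeric_limits<...>::is_iec559 additionally verifies IEEE 754 conformance):

#include <climits>
#include <limits>

static_assert(sizeof(float) * CHAR_BIT == 32, "float is not 32 bits");
static_assert(sizeof(double) * CHAR_BIT == 64, "double is not 64 bits");
static_assert(std::numeric_limits<float>::is_iec559, "float is not IEEE 754");
static_assert(std::numeric_limits<double>::is_iec559, "double is not IEEE 754");

int main() {}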

C++ ensuring arithmetic result involving literals is 64 bits not 32 bits

You can use a suffix on the first literal to promote it to the correct size. In this case you can use

const uint64_t x = 1'000'000'000ull * 60 * 5;

to make 1'000'000'000 an unsigned long long, which is at least 64 bits wide. This also has the effect of promoting 60 and 5 to unsigned long long as well when the multiplications are done.
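
For contrast, a small sketch (assuming a 32-bit int, as on most mainstream platforms) of why the suffix matters:

#include <cstdint>
#include <iostream>

int main() {
    // Without a suffix the literals are int, so 1'000'000'000 * 60 overflows a
    // 32-bit int before the result is ever widened to uint64_t:
    // const std::uint64_t bad = 1'000'000'000 * 60 * 5;   // signed overflow: undefined behaviour
    const std::uint64_t good = 1'000'000'000ull * 60 * 5;  // whole expression evaluated as unsigned long long
    std::cout << good << '\n';                             // prints 300000000000
}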

acos(double) gives different result on x64 and x32 Visual Studio

TL;DR: this is normal and you can't reasonably change it.


The 32-bit library may be using 80-bit FP values in x87 registers for its temporaries, avoiding rounding off to 64-bit double after every operation. (Unless there's a whole separate library, compiling your own code to use SSE doesn't change what's inside the library, or even the calling convention for passing data to the library. But since 32-bit passes double and float in memory on the stack, a library is free to load it with SSE2 or with x87. Still, you don't get the performance advantage of passing FP values in xmm registers unless it's impossible for non-SSE code to use the library.)

It's also possible that they're different simply because they use a different order of operations, producing different temporaries along the way. That's less plausible, unless they're separately hand-written in asm. If they're built from the same C source (without "unsafe" FP optimizations), then the compiler isn't allowed to reorder things, because of the non-associative behaviour of FP math.


glibc's libm (used on Linux) typically favours precision over speed, so it's giving you the correctly-rounded result out to the last bit of the mantissa for both 32-bit and 64-bit. The IEEE FP standard only requires the basic operations (+ - * / FMA and FP remainder) to be "correctly rounded" out to the last bit of the mantissa (i.e. a rounding error of at most 0.5 ulp). (The exact result, according to calc, is 1.047304076386807714.... Keep in mind that double (on x86 with normal compilers) is IEEE 754 binary64, so internally the mantissa and exponent are in base 2. If you print enough extra decimal digits, though, you can tell that ...7714 should round up to ...78, although really you should print more digits in case they're not zero beyond that. I'm just assuming it's ...78000.)

So Microsoft's 64-bit library implementation produces 1.0473040763868076 and there's pretty much nothing you can do about it, other than not use it. (e.g. find your own acos() implementation and use it.) But FP determinism is hard, even if you limit yourself to just x86 with SSE. See Does any floating point-intensive code produce bit-exact results in any x86-based architecture?. If you limit yourself to a single compiler, it can be possible if you avoid complicated library functions like acos().

You might be able to get the 32-bit library version to produce the same value as the 64-bit version, if it uses x87 and changing the x87 precision setting affects it. But the other way around is not possible: SSE2 has separate instructions for 64-bit double and 32-bit float, and always rounds after every instruction, so there is no setting you can change to get a higher-precision result. (You could change the SSE rounding mode, and that would change the result, but not in a good way!)
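
When comparing the two builds, it helps to print the result to full precision, or in hex, so last-bit differences are visible; a small sketch (acos(0.5) is just a stand-in for whatever input shows the difference):

#include <cmath>
#include <cstdio>

int main() {
    double r = std::acos(0.5);   // stand-in input; substitute the value you're actually comparing
    std::printf("%.17g\n", r);   // 17 significant digits: enough to round-trip a binary64 double
    std::printf("%a\n", r);      // exact hexadecimal representation of the value
}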

See also:

  • Intermediate Floating-Point Precision and the rest of Bruce Dawson's excellent series of articles about floating point (table of contents).

    The linked article describes how some versions of VC++'s CRT runtime startup set the x87 FP register precision to a 53-bit mantissa instead of the full 80-bit precision. Also, D3D9 will set it to 24-bit, so even double only has the precision of float when computed with x87.

  • https://en.wikipedia.org/wiki/Rounding#Table-maker.27s_dilemma

  • What Every Computer Scientist Should Know About Floating-Point Arithmetic

64 bit floating point porting issues

There is no inherent need for floats and doubles to behave differently between 32-bit and 64-bit code, but frequently they do. The answer to your question is going to be platform- and compiler-specific, so you need to say what platform you are porting from and what platform you are porting to.

On Intel x86 platforms, 32-bit code often uses the x87 co-processor instruction set and its floating-point register stack for maximum compatibility, whereas on amd64/x86_64 platforms the SSE* instructions and xmm* registers are often used instead. These have different precision characteristics.

Post edit:

Given your platform, you might want to consider trying the -mfpmath=387 (the default for i386 gcc) on your x86_64 build to see if this explains the differing results. You may also want to look at the settings for all the -fmath-* compiler switches to ensure that they match what you want in both builds.
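
One way to see whether a given build uses x87 excess precision at all is to inspect FLT_EVAL_METHOD; a minimal probe (the gcc flags in the comments are the usual spellings, and probe.cpp is a placeholder file name):

// Compile the same probe both ways and compare, e.g.:
//   g++ -O2 -mfpmath=387 probe.cpp -o probe_x87        (x87: long double temporaries)
//   g++ -O2 -mfpmath=sse -msse2 probe.cpp -o probe_sse (SSE2: each operation rounded to its declared type)
#include <cfloat>
#include <cstdio>

int main() {
    // 0 = evaluate in the declared type (typical of SSE code generation),
    // 2 = evaluate in long double (typical of x87 code generation)
    std::printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    std::printf("sizeof(long double) = %zu bytes\n", sizeof(long double));
}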

How to guarantee exact size of double in C?

How to guarantee exact size of double in C?

Use _Static_assert()

#include <limits.h>

int main(void) {
    _Static_assert(sizeof(double) * CHAR_BIT == 64, "Unexpected double size");
    return 0;
}

_Static_assert has been available since C11. Otherwise, code could use a run-time assert.

#include <assert.h>
#include <limits.h>

int main(void) {
    assert(sizeof(double) * CHAR_BIT == 64);
    return 0;
}

Although this will ensure the size of a double is 64 bits, it does not ensure adherence to the IEEE 754 double-precision binary floating-point format.

Code could use __STDC_IEC_559__

"An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex." (C11 Annex F, IEC 60559 floating-point arithmetic)

Yet that may be too strict. Many implementations adhere to most of that standard, yet still do not set the macro.


would there be some standard way to guarantee size of a floating point type or double?

The best guarantee is to write the FP value as its hex representation or as an exponential with sufficient decimal digits. See Printf width specifier to maintain precision of floating-point value.
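
A small sketch of that round-trip idea, assuming an IEEE 754 binary64 double; max_digits10 (17 for double) is the number of significant digits needed for an exact round-trip:

#include <cstdio>
#include <cstdlib>
#include <limits>

int main() {
    double original = 1.0 / 3.0;
    char buf[64];

    // Print max_digits10 significant digits (17 for binary64)...
    std::snprintf(buf, sizeof buf, "%.*g",
                  std::numeric_limits<double>::max_digits10, original);
    double back = std::strtod(buf, nullptr);

    // ...or use "%a" instead of "%.*g" for the exact hexadecimal form.
    std::printf("%s round-trips exactly: %s\n", buf, back == original ? "yes" : "no");
}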

char* to double and back to char* again ( 64 bit application)

On 64-bit Windows, pointers are 64-bit while int is 32-bit, which is why you're losing the upper 32 bits when casting. Instead of int, use unsigned long long to hold the intermediate result.

const char* hello = "hello";
unsigned long long hello_to_int = (unsigned long long)hello;

Make similar changes for the reverse conversion. But this is still not guaranteed to work correctly: a double can represent the entire 32-bit integer range without loss of precision, but the same is not true for the 64-bit integer range.
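
A small sketch of that limitation, assuming the usual IEEE 754 binary64 double, where 2^53 + 1 is the first integer that cannot be represented exactly:

#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t big = (1ull << 53) + 1;                // 9007199254740993
    double d = static_cast<double>(big);                 // rounds to the nearest representable double
    std::uint64_t back = static_cast<std::uint64_t>(d);  // 9007199254740992: the low bit is gone
    std::cout << big << " -> " << back << '\n';
}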

Also, this isn't going to work

unsigned int converted_int = (unsigned int)hello_to_double;

That conversion will simply truncate any digits after the decimal point in the floating-point representation. The problem exists even if you change the data type to unsigned long long; you'd need to reinterpret the double's bits (a reinterpret_cast, not a value conversion) to make it work.

Even after all that, you may still run into trouble depending on the value of the pointer. The conversion to double may produce a signalling NaN, for instance, in which case your code might throw an exception.

The simple answer is: unless you're trying this out for fun, don't do conversions like these.


