How Is Floating Point Conversion Actually Done in C++ (Double to Float or Float to Double)

How is floating point conversion actually done in C++? (double to float or float to double)

The cvtsd2ss instruction does the conversion according to the current rounding mode (for this SSE instruction, the rounding-control field of the MXCSR register). The default rounding mode is round-to-nearest-even.

In order to follow the algorithm, it helps to keep in mind the information at the IEEE 754-1985 Wikipedia page, especially the diagrams representing the layout.

First, the exponent of the target float is computed: the double type has a wider range than float, so the result may be 0.0f (or a denormal) for a very small double, or an infinite value for a very large double.
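For example (assuming IEEE 754 float and double, as the rest of this answer does), the out-of-range cases look like this:

#include <cstdio>

int main() {
    double big  = 1e60;    // far above FLT_MAX (about 3.4e38)
    double tiny = 1e-60;   // below even the smallest float denormal (about 1.4e-45)
    std::printf("%g\n", (float) big);   // inf
    std::printf("%g\n", (float) tiny);  // 0
    return 0;
}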

For the usual case of a normal double being converted to a normal float (roughly, when the unbiased exponent of the double can be represented in the 8 bits of a single-precision representation), the first 23 bits of the destination significand start out the same as the most significant 23 bits of the original number's 52-bit significand.

Then there is the problem of rounding:

  • If the left-over bits are below 10..0, then the target significand is left as-is.

  • If the left-over bits are above 10..0, then the target significand is incremented. If incrementing it makes it overflow (because it is already 1..1), then the carry is propagated into the exponent bits. This produces the correct result because of the careful way the IEEE 754 layout has been designed.

  • If the bits left over are exactly 10..0, then the double is exactly midway between two floats. Of these two choices, the one with the last bit 0 (“even”) is chosen.

After this step, the target significand corresponds to the float nearest to the original double.
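As an illustration, here is a minimal C++ sketch of the normal-to-normal case just described (the function name double_to_float_rne is mine; denormals, infinities, NaNs and out-of-range exponents are deliberately not handled):

#include <cstdint>
#include <cstdio>
#include <cstring>

float double_to_float_rne(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);

    uint32_t sign = static_cast<uint32_t>(bits >> 63);
    int      exp  = static_cast<int>((bits >> 52) & 0x7FF) - 1023;  // unbiased exponent
    uint64_t sig  = bits & ((1ULL << 52) - 1);                      // 52-bit significand

    uint32_t sig23    = static_cast<uint32_t>(sig >> 29);           // the 23 bits kept
    uint64_t leftover = sig & ((1ULL << 29) - 1);                   // the 29 bits dropped
    uint64_t halfway  = 1ULL << 28;                                 // the "10..0" pattern

    // Assemble the float first; a rounding carry out of the significand then
    // propagates into the exponent field automatically, as described above.
    uint32_t fbits = (sign << 31) | (static_cast<uint32_t>(exp + 127) << 23) | sig23;
    if (leftover > halfway || (leftover == halfway && (sig23 & 1)))
        fbits += 1;  // round up; ties go to the even significand

    float f;
    std::memcpy(&f, &fbits, sizeof f);
    return f;
}

int main() {
    double d = 0.1;
    std::printf("%a\n%a\n", double_to_float_rne(d), static_cast<float>(d));  // identical
    return 0;
}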

The directed rounding modes are, if anything, simpler. The case where the target float is a denormal is slightly more complicated (one must be careful to avoid "double-rounding").

Converting a double to a float?

If you're sure this is what you want to do, the standard way to convert from a double to a float is:

double someDouble = 32.381562;
float someFloat = (float) someDouble;

It's called casting or type conversion.

Does converting a float to a double and back to float give the same value in C++?

Here are some clues but not the answer:

4.6 A prvalue of type float can be converted to a prvalue of type double. The value is unchanged. This conversion is called floating point promotion.
...

4.8 A prvalue of floating point type can be converted to a prvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact
representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values.
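Putting the two quotes together: every float value is exactly representable as a double, so a float → double → float round trip always gets back the original value (assuming the usual IEEE 754 types):

#include <cassert>

int main() {
    float f = 0.1f;          // already rounded once, to the nearest float
    double d = f;            // 4.6: value-preserving promotion, no rounding
    float back = (float) d;  // 4.8: the source value is exactly representable
    assert(back == f);       // so the round trip never changes the value
    return 0;
}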

C: convert double to float, preserving decimal point precision

float and double don't store decimal places. They store binary places: float is (assuming IEEE 754) 24 significant bits (about 7.22 decimal digits) and double is 53 significant bits (about 15.95 decimal digits).

Converting from double to float will give you the closest possible float, so rounding won't help you. Going the other way may give you "noise" digits in the decimal representation.

#include <stdio.h>

int main(void) {
    double orig = 12345.67;
    float f = (float) orig;
    printf("%.17g\n", f); // prints 12345.669921875
    return 0;
}

To get a double approximation to the nice decimal value you intended, you can write something like:

#include <math.h>

/* round to 7 decimal places */
double round_to_decimal(float f) {
    return round(f * pow(10, 7)) / pow(10, 7);
}

Is literal double to float conversion equal to float literal?

Assuming IEEE 754, float as 32-bit binary, double as 64-bit binary.

There are decimal fractions that round differently, under IEEE 754 round-to-nearest rules, when converted directly from decimal to float than when first converted from decimal to double and then to float.

For example, consider 1.0000000596046447753906250000000000000000000000000001

1.000000059604644775390625 is exactly representable as a double and is exactly half way between 1.0 and 1.00000011920928955078125, the value of the smallest float greater than 1.0. 1.0000000596046447753906250000000000000000000000000001 rounds up to 1.00000011920928955078125 if converted directly, because it is greater than the mid point. If it is first converted to 64 bit, round to nearest takes it to the mid point 1.000000059604644775390625, and then round half even rounds down to 1.0.
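Under an implementation that rounds literals correctly to nearest (GCC and Clang do), you can watch the two paths disagree:

#include <cstdio>

int main() {
    // Decimal -> float directly: just above the midpoint, so it rounds up.
    float direct = 1.0000000596046447753906250000000000000000000000000001f;
    // Decimal -> double first (lands exactly on the midpoint), then
    // double -> float: the tie is broken to even, rounding down to 1.0.
    float via_double = (float) 1.0000000596046447753906250000000000000000000000000001;

    std::printf("%.9g\n", direct);      // 1.00000012
    std::printf("%.9g\n", via_double);  // 1
    return 0;
}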

C# float to double conversion

(Actually, when I ran the code, I got 18060.13 instead of 18060.125, but I will keep using the latter in my answer.)

Can I find the nearest double value for the given float value?

You seem to think that the nearest double value for the float 18060.125 is 18060.124145507813. This is not true. The nearest double value for the float 18060.125 is 18060.125 itself; that value can be represented exactly by both double and float.

Why does casting 18060.124145507813 to float give 18060.125 then?

Because the nearest float to the double 18060.124145507813 is 18060.125. Note that this is the other way round from your understanding. This does not imply that the nearest double to the float 18060.125 is 18060.124145507813, because there are many double values in between 2 adjacent float values.

It is impossible to go back to "the double that you got the float from", because casting to float loses information: you are converting a 64-bit value to a 32-bit one, and that information cannot be recovered.

Why does casting 125.32f work then?

Because float cannot represent the number 125.32 exactly. The cast to double is exact, so the double ends up holding the float's true value, and the extra digits you see are the float's inaccuracy being exposed. Although it might seem float can represent 125.32 100% accurately, that's just an illusion created by the ToString method, which rounds to a short representation. Always format your floating point numbers with some kind of formatting method, e.g. string.Format.
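The same effect is easy to demonstrate outside C#; here is a small C++ sketch (the digits in the comments are what an IEEE 754 float actually stores for 125.32):

#include <cstdio>

int main() {
    float f = 125.32f;
    double d = f;                // exact: the cast to double adds no error
    std::printf("%.17g\n", d);   // 125.31999969482422 -- the float's true value
    std::printf("%.2f\n", f);    // 125.32 -- short formatting hides the error
    return 0;
}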

Why are doubles preferred over floats?

In my opinion the answers so far don't really get the right point across, so here's my crack at it.

The short answer is C++ developers use doubles over floats:

  • To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" is the thought process)
  • Habit
  • Culture
  • To match library function signatures
  • To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)

It's true that a double may be as fast as a float for a single computation, because most FPUs have a wider internal representation than either the 32-bit float or the 64-bit double.

However, that's only a small piece of the picture. Nowadays, per-operation optimizations don't mean anything if you're bottlenecked on cache/memory bandwidth.

Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:

  • They fit in half the memory, which is like having all your caches be twice as large (big win!).
  • If you really care about performance you'll use SSE instructions. The SSE instructions that operate on floating point values come in separate 32-bit and 64-bit variants: the 32-bit versions fit 4 values in a 128-bit register operand, while the 64-bit versions fit only 2. In this scenario you can likely double your FLOPS by using floats over doubles, because each instruction operates on twice as much data (see the sketch after this list).
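Here is a small sketch of that last point (x86 with SSE2, using <immintrin.h>; the array names are just for the example). One 128-bit register holds four floats but only two doubles, so each single-precision instruction does twice the work:

#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(16) float  fa[4] = {1, 2, 3, 4}, fb[4] = {10, 20, 30, 40}, fr[4];
    alignas(16) double da[2] = {1, 2},       db[2] = {10, 20},         dr[2];

    __m128  vf = _mm_add_ps(_mm_load_ps(fa), _mm_load_ps(fb));  // 4 additions at once
    __m128d vd = _mm_add_pd(_mm_load_pd(da), _mm_load_pd(db));  // only 2 additions

    _mm_store_ps(fr, vf);
    _mm_store_pd(dr, vd);

    std::printf("%g %g %g %g\n", fr[0], fr[1], fr[2], fr[3]);  // 11 22 33 44
    std::printf("%g %g\n", dr[0], dr[1]);                      // 11 22
    return 0;
}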

In general, the majority of developers I've encountered have a real lack of knowledge of how floating point numbers really work, so I'm not really surprised that most of them blindly use double.

Difference between decimal, float and double in .NET?

float (the C# alias for System.Single) and double (the C# alias for System.Double) are floating binary point types. float is 32-bit; double is 64-bit. In other words, they represent a number like this:

10001.10010110011

The binary number and the location of the binary point are both encoded within the value.

decimal (the C# alias for System.Decimal) is a floating decimal point type. In other words, it represents a number like this:

12345.65789

Again, the number and the location of the decimal point are both encoded within the value – that's what makes decimal still a floating point type instead of a fixed point type.

The important thing to note is that humans are used to representing non-integers in a decimal form, and expect exact results in decimal representations; not all decimal numbers are exactly representable in binary floating point – 0.1, for example – so if you use a binary floating point value you'll actually get an approximation to 0.1. You'll still get approximations when using a floating decimal point as well – the result of dividing 1 by 3 can't be exactly represented, for example.
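For example (shown in C++, since decimal has no standard C++ counterpart), printing a binary floating point 0.1 with enough digits exposes the approximation:

#include <cstdio>

int main() {
    double d = 0.1;                      // nearest binary double to 0.1
    std::printf("%.20f\n", d);           // 0.10000000000000000555
    std::printf("%.20f\n", 0.1 + 0.2);   // 0.30000000000000004441
    return 0;
}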

As for what to use when:

  • For values which are "naturally exact decimals" it's good to use decimal. This is usually suitable for any concepts invented by humans: financial values are the most obvious example, but there are others too. Consider the score given to divers or ice skaters, for example.

  • For values which are more artefacts of nature which can't really be measured exactly anyway, float/double are more appropriate. For example, scientific data would usually be represented in this form. Here, the original values won't be "decimally accurate" to start with, so it's not important for the expected results to maintain the "decimal accuracy". Floating binary point types are much faster to work with than decimals.


