How is floating point conversion actually done in C++? (double to float or float to double)
The cvtsd2ss instruction uses the current rounding mode to do the conversion (for SSE instructions such as this one, the mode is held in the MXCSR register). The default rounding mode is round-to-nearest-even.
In order to follow the algorithm, it helps to keep in mind the information at the IEEE 754-1985 Wikipedia page, especially the diagrams representing the layout.
First, the exponent of the target float is computed: the double type has a wider range than float, so the result may be 0.0f (or a denormal) for a very small double, or an infinite value for a very large double.
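Those range cases can be observed directly. The following is a minimal sketch, assuming an IEEE 754 platform in the default round-to-nearest mode; the helper names narrow and narrowed_class and the sample values are illustrative only. Strictly speaking, the C++ standard does not define the result for a double outside the range of float, so the infinite result is what IEEE 754 hardware produces rather than a language guarantee.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: narrow a double with the built-in conversion.
float narrow(double d) {
    return static_cast<float>(d);
}

// Classify the float obtained by narrowing d:
// FP_ZERO, FP_SUBNORMAL, FP_NORMAL, FP_INFINITE or FP_NAN.
int narrowed_class(double d) {
    return std::fpclassify(narrow(d));
}

// Expected classes on an IEEE 754 platform (assumption).
void check_range_cases() {
    assert(narrowed_class(1.5)    == FP_NORMAL);    // fits comfortably
    assert(narrowed_class(1e300)  == FP_INFINITE);  // too large for float
    assert(narrowed_class(1e-40)  == FP_SUBNORMAL); // below FLT_MIN, above the smallest denormal
    assert(narrowed_class(1e-300) == FP_ZERO);      // too small even for a denormal
}
```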
For the usual case of a normal double being converted to a normal float (roughly, when the unbiased exponent of the double can be represented in the 8 bits of a single-precision representation), the first 23 bits of the destination significand start out the same as the 23 most significant bits of the original number's 52-bit significand.
Then there is the problem of rounding:
- If the left-over bits are below 10..0, then the target significand is left as-is.
- If the left-over bits are above 10..0, then the target significand is incremented. If incrementing it makes it overflow (because it is already 1..1), then the carry is propagated into the exponent bits. This produces the correct result because of the careful way the IEEE 754 layout has been designed.
- If the left-over bits are exactly 10..0, then the double is exactly midway between two floats. Of these two choices, the one with the last bit 0 (“even”) is chosen.
After this step, the target significand corresponds to the float nearest to the original double.
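The steps above can be sketched in software. This is only an illustration of the normal-to-normal case under round-to-nearest-even, assuming IEEE 754 binary32 and binary64 layouts; narrow_by_hand is a hypothetical name, and the hardware does not literally run this code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: double -> float by bit manipulation,
// normal-to-normal case only, round-to-nearest-even.
float narrow_by_hand(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);

    const std::uint32_t sign = static_cast<std::uint32_t>(bits >> 63);
    const int exp = static_cast<int>((bits >> 52) & 0x7FF) - 1023; // unbiased exponent
    const std::uint64_t frac = bits & 0xFFFFFFFFFFFFFULL;          // 52-bit significand field

    // This sketch does not handle zeros, denormals, infinities or NaNs.
    assert(exp >= -126 && exp <= 127);

    // Keep the 23 most significant significand bits.
    std::uint32_t out = (sign << 31)
                      | (static_cast<std::uint32_t>(exp + 127) << 23)
                      | static_cast<std::uint32_t>(frac >> 29);

    const std::uint64_t rest = frac & 0x1FFFFFFFULL; // the 29 left-over bits
    const std::uint64_t half = 0x10000000ULL;        // the pattern 10..0

    // Above the midpoint: round up. Exactly at the midpoint: round up only
    // if the last kept bit is 1, so that the result ends in 0 ("even").
    if (rest > half || (rest == half && (out & 1u))) {
        ++out; // a carry out of the significand lands in the exponent field
    }

    float f;
    std::memcpy(&f, &out, sizeof f);
    return f;
}
```

For in-range inputs, the result should match static_cast<float>(d); for example, 1.9999999999999998 has all-ones kept bits, so the increment carries into the exponent and gives 2.0f.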
The directed rounding modes are only simpler. The case where the target float is a denormal is slightly more complicated (one must be careful to avoid “double-rounding”).
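To see the directed modes in action, one can switch the rounding mode around the conversion. The sketch below assumes an implementation where std::fesetround affects double-to-float conversions at run time (typical on x86-64 and AArch64); narrow_with_mode is a hypothetical helper, and the volatile variables are there only to discourage the compiler from folding or moving the conversion.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical helper: narrow d under the given rounding mode
// (FE_TONEAREST, FE_UPWARD, FE_DOWNWARD or FE_TOWARDZERO),
// then restore the previous mode.
float narrow_with_mode(double d, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile double in = d;                      // force a run-time load
    volatile float out = static_cast<float>(in); // converted under `mode`
    std::fesetround(saved);
    return out;
}
```

With 0.1 as input, the upward and downward results should be the two adjacent floats that bracket 0.1, and round-toward-zero should agree with the downward result for positive inputs.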
Converting a double to a float?
If you're sure this is what you want to do, the standard way to convert from a double to a float is:
double someDouble = 32.381562;
float someFloat = (float) someDouble;
It's called casting or type conversion.
Does converting a float to a double and back to float give the same value in C++
Here are some clues but not the answer:
4.6 A prvalue of type float can be converted to a prvalue of type double. The value is unchanged. This conversion is called floating point promotion.
...
4.8 A prvalue of floating point type can be converted to a prvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values.
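Read together, the two quoted paragraphs suggest that the round trip preserves the value: promotion leaves it unchanged, and the resulting double is then exactly representable as a float. A minimal sketch of a check, assuming every float value is representable as a double (true for IEEE 754); round_trip is a hypothetical helper, and NaNs need separate treatment because they never compare equal.

```cpp
#include <cassert>

// Hypothetical helper: widen a float to double and narrow it back.
float round_trip(float f) {
    const double widened = f;           // floating point promotion: value unchanged
    return static_cast<float>(widened); // source value is exactly representable
}

void check_round_trip() {
    assert(round_trip(0.1f) == 0.1f);
    assert(round_trip(12345.67f) == 12345.67f);
    assert(round_trip(1e-40f) == 1e-40f); // a denormal float
}
```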
C: convert double to float, preserving decimal point precision
float and double don't store decimal places. They store binary places: float is (assuming IEEE 754) 24 significant bits (7.22 decimal digits) and double is 53 significant bits (15.95 decimal digits).
Converting from double to float will give you the closest possible float, so rounding won't help you. Going the other way may give you "noise" digits in the decimal representation.
#include <stdio.h>
int main(void) {
double orig = 12345.67;
float f = (float) orig;
printf("%.17g\n", f); // prints 12345.669921875
return 0;
}
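To get a double approximation to the nice decimal value you intended, you can round to a fixed number of decimal places. The count has to keep the result within the roughly 7 significant digits a float carries: the 7 places used here suit values below 1, while a value like 12345.67 needs only 2. You can write something like (this needs <math.h>):
double round_to_decimal(float f) {
// rounds to 7 digits after the decimal point
return round(f * pow(10, 7)) / pow(10, 7);
}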
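An alternative is to round to a number of significant digits rather than decimal places, which adapts to the magnitude of the value. The following is a sketch under that assumption, going through a decimal string with snprintf and strtod; round_to_significant and its default of 7 digits are illustrative choices, not something the original answer prescribes.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: the double nearest to f's value rounded to
// `digits` significant decimal digits.
double round_to_significant(float f, int digits = 7) {
    char buf[64];
    // %.*g prints `digits` significant digits and drops trailing zeros.
    std::snprintf(buf, sizeof buf, "%.*g", digits, static_cast<double>(f));
    return std::strtod(buf, nullptr);
}

void check_round_to_significant() {
    // The float nearest 12345.67 is 12345.669921875; 7 digits recover 12345.67.
    assert(round_to_significant(static_cast<float>(12345.67)) == 12345.67);
    assert(round_to_significant(0.1f) == 0.1);
}
```

This assumes the default "C" locale, where the decimal separator is a period.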
Is literal double to float conversion equal to float literal?
Assuming IEEE 754, float as 32 bit binary, double as 64 bit binary.
There are decimal fractions that, under IEEE 754 round-to-nearest rules, give a different result when converted directly from decimal to float than when first converted from decimal to double and then to float.
For example, consider 1.0000000596046447753906250000000000000000000000000001
1.000000059604644775390625 is exactly representable as a double and is exactly half way between 1.0 and 1.00000011920928955078125, the value of the smallest float greater than 1.0. 1.0000000596046447753906250000000000000000000000000001 rounds up to 1.00000011920928955078125 if converted directly, because it is greater than the mid point. If it is first converted to 64 bit, round to nearest takes it to the mid point 1.000000059604644775390625, and then round half even rounds down to 1.0.
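The example can be checked with literals. This sketch assumes a compiler that rounds floating point literals correctly to nearest (GCC and Clang are expected to); the variable names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// The same decimal value, once as a float literal (converted directly)...
const float direct = 1.0000000596046447753906250000000000000000000000000001f;

// ...and once as a double literal that is then narrowed to float.
const double as_double = 1.0000000596046447753906250000000000000000000000000001;
const float via_double = static_cast<float>(as_double);

void check_double_rounding() {
    assert(as_double == 1.000000059604644775390625); // rounded onto the midpoint
    assert(direct == std::nextafter(1.0f, 2.0f));    // direct conversion rounds up
    assert(via_double == 1.0f);                      // the tie then rounds to even
}
```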
C# float to double conversion
(Actually, when I ran the code, I got 18060.13 instead of 18060.125, but I will keep using the latter in my answer.)
Can I find the nearest double value for the given float value?
You seem to somehow think that the nearest double value for the float 18060.125 is 18060.124145507813? This is not true. The nearest double value for the float 18060.125 is 18060.125. This value can be represented by double and float equally accurately.
Why does casting 18060.124145507813 to float give 18060.125 then?
Because the nearest float to the double 18060.124145507813 is 18060.125. Note that this is the other way round from your understanding. This does not imply that the nearest double to the float 18060.125 is 18060.124145507813, because there are many double values in between 2 adjacent float values.
It is impossible to go back to "the double that you got the float from" because when you cast to float, you are losing information. You are converting from a 64-bit value to a 32-bit one. That information isn't going back.
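The same behaviour can be reproduced outside C#. Below is a small C++ sketch, assuming IEEE 754 float and double; to_float and to_double are hypothetical helpers. Several distinct doubles collapse onto the float 18060.125, and widening that float gives back 18060.125 rather than any of the originals.

```cpp
#include <cassert>

float to_float(double d) { return static_cast<float>(d); }
double to_double(float f) { return f; }

void check_many_to_one() {
    // Different doubles, same nearest float.
    assert(to_float(18060.124145507813) == 18060.125f);
    assert(to_float(18060.125) == 18060.125f);
    assert(to_float(18060.1255) == 18060.125f);

    // Widening is exact, so the original double is not recovered.
    assert(to_double(18060.125f) == 18060.125);
    assert(to_double(to_float(18060.124145507813)) != 18060.124145507813);
}
```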
Why does casting 125.32f work then?
Because float cannot represent the number 125.32 as accurately as double can. Casting to double keeps the float's value exactly, so the extra digits you then see are the float's own approximation error made visible. Although it might seem float can represent 125.32 100% accurately, that's just an illusion created by the ToString method. Always format your floating point numbers with some kind of formatting method, e.g. string.Format.
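The formatting effect is not specific to C#. As a rough C++ illustration (assuming IEEE 754 and the default "C" locale; format_digits is a hypothetical helper), printing the float nearest 125.32 with 7 significant digits hides the error, while 17 digits reveal it:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: format v with the given number of significant digits.
std::string format_digits(double v, int digits) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", digits, v);
    return buf;
}

void check_formatting() {
    const float f = 125.32f; // actually stores 125.31999969482421875
    assert(format_digits(f, 7) == "125.32");
    assert(format_digits(f, 17) == "125.31999969482422");
    assert(static_cast<double>(f) != 125.32);
}
```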
Why are double preferred over float?
In my opinion the answers so far don't really get the right point across, so here's my crack at it.
The short answer is C++ developers use doubles over floats:
- To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" is the thought process)
- Habit
- Culture
- To match library function signatures
- To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)
It's true that a double may be as fast as a float for a single computation, because most FPUs have an internal representation wider than either the 32-bit float or the 64-bit double.
However, that's only a small piece of the picture. Nowadays, operation-level optimizations don't mean anything if you're bottlenecked on cache/memory bandwidth.
Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:
- They fit in half the memory. Which is like having all your caches be twice as large. (big win!!!)
- If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating point values have different instructions for 32-bit and 64-bit floating point representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over double because each instruction operates on twice as much data.
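The arithmetic behind both points can be spelled out. This is a back-of-the-envelope sketch, assuming the usual 4-byte float and 8-byte double; kRegisterBytes, lanes and bytes_for are illustrative names, and it only counts elements, it does not measure performance.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "sketch assumes 32-bit float and 64-bit double");

constexpr std::size_t kRegisterBytes = 16; // one 128-bit SSE register

// How many values of type T fit in one 128-bit register.
template <typename T>
constexpr std::size_t lanes() { return kRegisterBytes / sizeof(T); }

// How many bytes an array of n values of type T occupies.
template <typename T>
constexpr std::size_t bytes_for(std::size_t n) { return n * sizeof(T); }

static_assert(lanes<float>() == 4, "four floats per register");
static_assert(lanes<double>() == 2, "two doubles per register");
static_assert(bytes_for<double>(1000) == 2 * bytes_for<float>(1000),
              "doubles take twice the memory");
```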
In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.
Difference between decimal, float and double in .NET?
float (the C# alias for System.Single) and double (the C# alias for System.Double) are floating binary point types. float is 32-bit; double is 64-bit. In other words, they represent a number like this:
10001.10010110011
The binary number and the location of the binary point are both encoded within the value.
decimal (the C# alias for System.Decimal) is a floating decimal point type. In other words, it represents a number like this:
12345.65789
Again, the number and the location of the decimal point are both encoded within the value – that's what makes decimal still a floating point type instead of a fixed point type.
The important thing to note is that humans are used to representing non-integers in a decimal form, and expect exact results in decimal representations; not all decimal numbers are exactly representable in binary floating point – 0.1, for example – so if you use a binary floating point value you'll actually get an approximation to 0.1. You'll still get approximations when using a floating decimal point as well – the result of dividing 1 by 3 can't be exactly represented, for example.
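The binary side of this is easy to observe in any language with IEEE 754 doubles. A minimal C++ sketch (C++ has no built-in counterpart to decimal, so only the binary behaviour is shown; adds_exactly is a hypothetical helper):

```cpp
#include <cassert>

// Hypothetical helper: does a + b, computed in double, equal expected?
bool adds_exactly(double a, double b, double expected) {
    return a + b == expected;
}

void check_binary_fractions() {
    // 0.1, 0.2 and 0.3 are all approximations in binary floating point,
    // and the approximations do not line up.
    assert(!adds_exactly(0.1, 0.2, 0.3));
    // 0.5, 0.25 and 0.75 are exact binary fractions, so this sum is exact.
    assert(adds_exactly(0.5, 0.25, 0.75));
}
```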
As for what to use when:
- For values which are "naturally exact decimals" it's good to use decimal. This is usually suitable for any concepts invented by humans: financial values are the most obvious example, but there are others too. Consider the score given to divers or ice skaters, for example.
- For values which are more artefacts of nature which can't really be measured exactly anyway, float/double are more appropriate. For example, scientific data would usually be represented in this form. Here, the original values won't be "decimally accurate" to start with, so it's not important for the expected results to maintain the "decimal accuracy". Floating binary point types are much faster to work with than decimals.