Converting float to double loses precision C#
The issue observed in this question is caused largely by Microsoft’s choice of formatting, notably that Microsoft software fails to show the exact values because it limits the number of digits used to convert to decimal even when the format string requests more digits. Furthermore, it uses fewer digits when converting float than when converting double. Thus, if a float and a double with the same value are formatted, the results may be different because the float formatting will use fewer significant digits.
Below, I go through the code statements in the question one by one. In summary, the crux of the matter is that the value 61.0099983215332 is formatted as “61.0100000000000” when it is a float and “61.0099983215332” when it is a double. This is purely Microsoft’s choice of formatting and is not caused by the nature of floating-point arithmetic.
The statement double temp3 = 61.01 initializes temp3 to exactly 61.00999999999999801048033987171947956085205078125. This change from 61.01 is necessary due to the nature of a binary floating-point format: it cannot represent exactly 61.01, so the nearest value representable in double is used.
The statement dynamic temp = 61.01f initializes temp to exactly 61.009998321533203125. As with double, the nearest representable value has been used, but, since float has less precision, the nearest value is not as close as in the double case.
The statement double temp2 = (double)Convert.ChangeType(temp, typeof(double)); converts temp to a double that has the same value as temp, so it has the value 61.009998321533203125.
The statement double newValue = temp2 - temp3; correctly subtracts the two values. Since temp2 is slightly smaller than temp3, the exact result is −0.00000167846679488548033987171947956085205078125, with no error.
The statement Console.WriteLine(String.Format(" {0:F20}", temp)); formats the float named temp. Formatting a float involves calling Single.ToString. Microsoft’s documentation is a bit vague. It says that, by default, only seven (decimal) digits of precision are returned. It says to use G or R formats to get up to nine, and F20 uses neither G nor R. So I believe only seven digits are used. When 61.009998321533203125 is rounded to seven significant decimal digits, the result is “61.01000”. The ToString method then pads this to twenty digits after the decimal point, producing “61.01000000000000000000”.
I will address your third WriteLine statement next and come back to the second one afterward.
The statement Console.WriteLine(String.Format(" {0:F20}", temp3)); formats the double named temp3. Since temp3 is a double, Double.ToString is called. This method uses 15 digits of precision (unless G or R is used). When 61.00999999999999801048033987171947956085205078125 is rounded to 15 significant decimal digits, the result is “61.0100000000000”. The ToString method then pads this to twenty digits after the decimal point, producing “61.01000000000000000000”.
The statement Console.WriteLine(String.Format(" {0:F20}", temp2)); formats the double named temp2. temp2 is a double that contains the value from the float temp, so it contains 61.009998321533203125. When this is converted to 15 significant decimal digits, the result is “61.0099983215332”. The ToString method then pads this to twenty digits after the decimal point, producing “61.00999832153320000000”.
Finally, the statement Console.WriteLine(String.Format(" {0:F20}", newValue)); formats newValue. Because temp2 is smaller than temp3, newValue is negative. Formatting −0.00000167846679488548033987171947956085205078125 to 15 significant digits produces “-0.00000167846679488548”.
How float is converted to double in java?
The issue arises because a computer works in discrete mathematics: a microprocessor natively represents only whole numbers, not fractions. Since we also need to work with fractional values, engineers devised the floating point representation decades ago, standardized as IEEE 754.
The IEEE 754 standard defines how floats and doubles are interpreted in memory. Unlike an int, which represents an exact value, floats and doubles are computed from three fields:
- sign
- exponent
- fraction
So the issue here is that when you store 1.2 as a 32-bit float, you actually store a binary approximation of it:
00111111100110011001100110011010
which is the closest value to 1.2 that can be stored as a binary fraction, but it is not exactly 1.2. As a decimal fraction, 12*10^-1 is exact, but no finite binary fraction equals that value.
(cf http://www.h-schmidt.net/FloatConverter/IEEE754.html as I'm too lazy to do it myself)
when I store 1.2 in a double variable 'y' it becomes 1.200000025443 something
Well, actually, in both the float and the double versions of y, the value is 1.2000000476837158, but because of the smaller mantissa of the float, the printed value is rounded to fewer digits, making you believe it's an exact value, whereas in memory it's not.
Convert float to double without losing precision
It's not that you're actually getting extra precision - it's that the float didn't accurately represent the number you were aiming for originally. The double is representing the original float accurately; toString is showing the "extra" data which was already present.
For example (and these numbers aren't right, I'm just making things up) suppose you had:
float f = 0.1F;
double d = f;
Then the value of f might be exactly 0.100000234523. d will have exactly the same value, but when you convert it to a string it will "trust" that it's accurate to a higher precision, so won't round off as early, and you'll see the "extra digits" which were already there, but hidden from you.
When you convert to a string and back, you're ending up with a double value which is closer to the string value than the original float was - but that's only good if you really believe that the string value is what you really wanted.
Are you sure that float/double are the appropriate types to use here instead of BigDecimal? If you're trying to use numbers which have precise decimal values (e.g. money), then BigDecimal is a more appropriate type IMO.
Does converting a float to a double and back to float give the same value in C++
Here are some clues but not the answer:
4.6 A prvalue of type float can be converted to a prvalue of type double. The value is unchanged. This conversion is called floating point promotion.
...
4.8 A prvalue of floating point type can be converted to a prvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values.
Precision loss from float to double, and from double to float?
SHOULD fv be equal to original_value exactly? Any precision may be lost?
Yes, if the value of dv did not change in between.
From section Conversion 6.3.1.5 Real Floating types in C99 specs:
- When a float is promoted to double or long double, or a double is promoted to long double, its value is unchanged.
- When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined.
For C++, from section 4.6 aka conv.fpprom (draft used: n337 and I believe similar lines are available in final specs)
A prvalue of type float can be converted to a prvalue of type double.
The value is unchanged. This conversion is called floating point
promotion.
And section 4.8 aka conv.double
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined. The conversions allowed as floating point
promotions are excluded from the set of floating point conversions
So the values should be equal exactly.
C# float to double conversion
(Actually, when I ran the code, I got 18060.13 instead of 18060.125, but I will keep using the latter in my answer.)
Can I find the nearest double value for the given float value?
You seem to somehow think that the nearest double value for the float 18060.125 is 18060.124145507813? This is not true. The nearest double value for the float 18060.125 is 18060.125. This value can be represented by double and float equally accurately.
Why does casting 18060.124145507813 to float give 18060.125 then?
Because the nearest float to the double 18060.124145507813 is 18060.125. Note that this is the other way round from your understanding. This does not imply that the nearest double to the float 18060.125 is 18060.124145507813, because there are many double values in between 2 adjacent float values.
It is impossible to go back to "the double that you got the float from" because when you cast to float, you are losing information. You are converting from a 64-bit value to a 32-bit one. That information isn't going back.
Why does casting 125.32f work then?
Because float cannot represent the number 125.32 as accurately as double can. The float already holds only an approximation of 125.32; casting to double keeps that approximation exactly and simply makes its extra digits visible. Although it might seem that float can represent 125.32 100% accurately, that's just an illusion created by the ToString method. Always format your floating point numbers with some kind of formatting method, e.g. string.Format.
When does appending an 'f' change the value of a floating constant when assigned to a `float`?
This is a self answer per Answer Your Own Question.
Appending an f makes the constant a float and sometimes makes a value difference.
Type
Type difference: double to float.
A well-enabled compiler may also emit a warning when the f is omitted.
float f = 3.1415926535897932; // May generate a warning
warning: conversion from 'double' to 'float' changes value from '3.1415926535897931e+0' to '3.14159274e+0f' [-Wfloat-conversion]
Value
To make a value difference, watch out for potential double rounding issues.
The first rounding is due to code's text being converted to the floating point type.
the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner. C17dr § 6.4.4.2 3
Given those choices, a very common implementation-defined manner is to convert the source code text to the closest double (without the f) or to the closest float (with the f suffix). Lesser quality implementations sometimes form the 2nd closest choice.
Assignment of a double FP constant to a float incurs another rounding.
If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.4 2
A very common implementation-defined manner is to convert the double to the closest float - with ties to even. (Note: compile time rounding may be affected by various compiler settings.)
Double rounding value change
Consider the case when source code uses a value very close to half-way between 2 float values.
Without an f, the rounding of code to a double may result in a value exactly half-way between 2 floats. The conversion of the double to float then could differ from "with an f".
With an f, the conversion results in the closest float.
Example:
#include <math.h>
#include <stdio.h>

int main(void) {
    float f;

    /* Two adjacent float values around 10 million. */
    f = 10000000.0f;
    printf("%.6a %.3f 10 million\n", f, f);
    f = nextafterf(f, f + f);
    printf("%.6a %.3f 10 million - next float\n", f, f);
    puts("");

    /* Just above half-way: without f (double, then float) vs. with f. */
    f = 10000000.5000000001;
    printf("%.6a %.3f 10000000.5000000001\n", f, f);
    f = 10000000.5000000001f;
    printf("%.6a %.3f 10000000.5000000001f\n", f, f);
    puts("");

    /* Just below half-way: without f (double, then float) vs. with f. */
    f = 10000001.4999999999;
    printf("%.6a %.3f 10000001.4999999999\n", f, f);
    f = 10000001.4999999999f;
    printf("%.6a %.3f 10000001.4999999999f\n", f, f);
}
Output
0x1.312d00p+23 10000000.000 10 million
0x1.312d02p+23 10000001.000 10 million - next float
// value value source code
0x1.312d00p+23 10000000.000 10000000.5000000001
0x1.312d02p+23 10000001.000 10000000.5000000001f // Different, and better
0x1.312d04p+23 10000002.000 10000001.4999999999
0x1.312d02p+23 10000001.000 10000001.4999999999f // Different, and better
Rounding mode
The double rounding issue (see footnote 1) is less likely when the rounding mode is up, down or towards zero. The issue arises when the 2nd rounding compounds the direction of the first on half-way cases.
Occurrence rate
The issue occurs when code converts inexactly to a double that is very near half-way between 2 float values - so relatively rare. The issue applies whether the code constant is in decimal or hexadecimal form. With random constants: about 1 in 2^30.
Recommendation
Rarely a major concern, yet an f suffix is better to get the best value for a float and to quiet a warning.
[Update 2022]
The issue is further complicated under 2 conditions:
- If FLT_EVAL_METHOD == 2, the constant may be evaluated using long double math.
- Evaluation of floating point constants may ignore decimal digits past a certain precision. This is allowed in C and IEEE 754. Typically this is XXX_DECIMAL_DIG + 3 digits (e.g. 20 for double).
These complications change the chance of seeing this issue. Still the conclusion remains: append f to get the best float constant.
1 Here, "double" refers to doing something twice, not the type double.