Why Converting from Float to Double Changes the Value

Converting float to double loses precision C#

The issue observed in this question is caused largely by Microsoft’s choice of formatting, notably that Microsoft software fails to show the exact values because it limits the number of digits used to convert to decimal even when the format string requests more digits. Furthermore, it uses fewer digits when converting float than when converting double. Thus, if a float and double with the same value are formatted, the results may be different because the float formatting will use fewer significant digits.

Below, I go through the code statements in the question one by one. In summary, the crux of the matter is that the value 61.0099983215332 is formatted as “61.0100000000000” when it is a float and “61.0099983215332” when it is a double. This is purely Microsoft’s choice of formatting and is not caused by the nature of floating-point arithmetic.

The statement double temp3 = 61.01 initializes temp3 to exactly 61.00999999999999801048033987171947956085205078125. This change from 61.01 is necessary due to the nature of a binary floating-point format—it cannot represent exactly 61.01, so the nearest value representable in double is used.

The statement dynamic temp = 61.01f initializes temp to exactly 61.009998321533203125. As with double, the nearest representable value has been used, but, since float has less precision, the nearest value is not as close as in the double case.

The statement double temp2 = (double)Convert.ChangeType(temp, typeof(double)); converts temp to a double that has the same value as temp, so it has the value 61.009998321533203125.

The statement double newValue = temp2 - temp3; subtracts the two values with no rounding error. The magnitude of the exact result is 0.00000167846679488548033987171947956085205078125; because temp2 is smaller than temp3, its sign is negative.

The statement Console.WriteLine(String.Format(" {0:F20}", temp)); formats the float named temp. Formatting a float involves calling Single.ToString. Microsoft’s documentation is a bit vague. It says that, by default, only seven (decimal) digits of precision are returned. It says to use G or R formats to get up to nine, and F20 uses neither G nor R. So I believe only seven digits are used. When 61.009998321533203125 is rounded to seven significant decimal digits, the result is “61.01000”. The ToString method then pads this to twenty digits after the decimal point, producing “61.01000000000000000000”.

I will address your third WriteLine statement next and come back to the second one afterward.

The statement Console.WriteLine(String.Format(" {0:F20}", temp3)); formats the double named temp3. Since temp3 is a double, Double.ToString is called. This method uses 15 digits of precision (unless G or R are used). When 61.00999999999999801048033987171947956085205078125 is rounded to 15 significant decimal digits, the result is “61.0100000000000”. The ToString method then pads this to twenty digits after the decimal point, producing “61.01000000000000000000”.

The statement Console.WriteLine(String.Format(" {0:F20}", temp2)); formats the double named temp2. temp2 is a double that contains the value from the float temp, so it contains 61.009998321533203125. When this is converted to 15 significant decimal digits, the result is “61.0099983215332”. The ToString method then pads this to twenty digits after the decimal point, producing “61.00999832153320000000”.

Finally, the statement Console.WriteLine(String.Format(" {0:F20}", newValue)); formats newValue. Formatting .00000167846679488548033987171947956085205078125 to 15 significant digits produces “0.00000167846679488548”.

How float is converted to double in java?

The reason for this issue is that a computer works only in discrete mathematics: internally, the microprocessor can represent only whole numbers, not fractions. Because we need to work with fractional numbers as well, very smart engineers invented the floating-point representation decades ago, standardized as IEEE 754.

The IEEE 754 standard defines how floats and doubles are interpreted in memory. Basically, unlike an int, which represents an exact value, a float or double is calculated from three fields:

  • sign
  • exponent
  • fraction

So the issue here is that when you store 1.2 as a float, you actually store a binary approximation of it (shown here as the 32-bit pattern):

00111111100110011001100110011010

which is the closest representation of 1.2 that can be stored as a binary fraction, but it is not exactly 1.2. As a decimal fraction, 12*10^-1 is exact; as a binary fraction, 1.2 has no exact finite representation.

(cf http://www.h-schmidt.net/FloatConverter/IEEE754.html as I'm too lazy to do it myself)

when I store 1.2 in a double variable 'y' it becomes 1.200000025443 something

Well, actually, in both the float and the double versions of y, the stored value is 1.2000000476837158. Because the float has a smaller mantissa, it is printed with fewer digits, so the displayed value is rounded to 1.2, making you believe it's an exact value, whereas in memory it's not.

Convert float to double without losing precision

It's not that you're actually getting extra precision - it's that the float didn't accurately represent the number you were aiming for originally. The double is representing the original float accurately; toString is showing the "extra" data which was already present.

For example (and these numbers aren't right, I'm just making things up) suppose you had:

float f = 0.1F;
double d = f;

Then the value of f might be exactly 0.100000234523. d will have exactly the same value, but when you convert it to a string it will "trust" that it's accurate to a higher precision, so won't round off as early, and you'll see the "extra digits" which were already there, but hidden from you.

When you convert to a string and back, you're ending up with a double value which is closer to the string value than the original float was - but that's only good if you really believe that the string value is what you really wanted.

Are you sure that float/double are the appropriate types to use here instead of BigDecimal? If you're trying to use numbers which have precise decimal values (e.g. money), then BigDecimal is a more appropriate type IMO.

Does converting a float to a double and back to float give the same value in C++

Here are some clues but not the answer:

4.6 A prvalue of type float can be converted to a prvalue of type double. The value is unchanged. This conversion is called floating point promotion.
...

4.8 A prvalue of floating point type can be converted to a prvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values.

Precision loss from float to double, and from double to float?

SHOULD fv be equal to original_value exactly? Any precision may be
lost?

Yes, if the value of dv did not change in between.

From section Conversion 6.3.1.5 Real Floating types in C99 specs:

  1. When a float is promoted to double or long double, or a double is
    promoted to long double, its value is unchanged.
  2. When a double is demoted to float, a long double is demoted to
    double or float, or a value being represented in greater precision
    and range than required by its semantic type (see 6.3.1.8) is
    explicitly converted to its semantic type, if the value being
    converted can be represented exactly in the new type, it is
    unchanged. If the value being converted is in the range of values
    that can be represented but cannot be represented exactly, the
    result is either the nearest higher or nearest lower representable
    value, chosen in an implementation-defined manner. If the value
    being converted is outside the range of values that can be
    represented, the behavior is undefined.

For C++, from section 4.6 aka conv.fpprom (draft used: n337 and I believe similar lines are available in final specs)

A prvalue of type float can be converted to a prvalue of type double.
The value is unchanged. This conversion is called floating point
promotion.

And section 4.8 aka conv.double

A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined. The conversions allowed as floating point
promotions are excluded from the set of floating point conversions.

So the values should be equal exactly.

C# float to double conversion

(Actually, when I run the code, I got 18060.13 instead of 18060.125, but I will keep using the latter in my answer.)

Can I find the nearest double value for the given float value?

You seem to somehow think that the nearest double value for the float 18060.125 is 18060.124145507813? This is not true. The nearest double value for the float 18060.125 is 18060.125. This value can be represented by double and float equally accurately.

Why does casting 18060.124145507813 to float give 18060.125 then?

Because the nearest float to the double 18060.124145507813 is 18060.125. Note that this is the other way round from your understanding. This does not imply that the nearest double to the float 18060.125 is 18060.124145507813, because there are many double values in between 2 adjacent float values.

It is impossible to go back to "the double that you got the float from" because when you cast to float, you are losing information. You are converting from a 64-bit value to a 32-bit one. That information isn't going back.

Why does casting 125.32f work then?

Because float cannot represent the number 125.32 as accurately as double can. When you cast to double, the float's approximation is carried over unchanged and becomes visible in the extra digits. Although it might seem float can represent 125.32 100% accurately, that's just an illusion created by the ToString method. Always format your floating point numbers with some kind of formatting method, e.g. string.Format.

When does appending an 'f' change the value of a floating constant when assigned to a `float`?

This is a self answer per Answer Your Own Question.

Appending an f makes the constant a float and sometimes makes a value difference.


Type

Type difference: double to float.

A well enabled compiler may emit a warning when the f is omitted too.

  float f = 3.1415926535897932;  // May generate a warning

warning: conversion from 'double' to 'float' changes value from '3.1415926535897931e+0' to '3.14159274e+0f' [-Wfloat-conversion]


Value

To make a value difference, watch out for potential double rounding issues.

The first rounding is due to code's text being converted to the floating point type.

the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner. C17dr § 6.4.4.2 3

Given those two choices, a very common implementation-defined manner is to convert the source code text to the closest double (without the f) or to the closest float with the f suffix. Lesser quality implementations sometimes form the 2nd closest choice.

Assignment of a double FP constant to a float incurs another rounding.

If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.5 1

A very common implementation-defined manner is to convert the double to the closest float - with ties to even. (Note: compile time rounding may be affected by various compiler settings.)

Double rounding value change

Consider the case when source code uses a value very close to half-way between 2 float values.

Without an f, the rounding of code to a double may result in a value exactly half-way between 2 floats. The conversion of the double to float then could differ from "with an f".

With an f, the conversion results in the closest float.

Example:

#include <math.h>
#include <stdio.h>

int main(void) {
    float f;

    // Reference points: 10 million and the next float above it.
    f = 10000000.0f;
    printf("%.6a %.3f 10 million\n", f, f);
    f = nextafterf(f, f + f);
    printf("%.6a %.3f 10 million - next float\n", f, f);
    puts("");

    // Without f: text -> double (lands on the half-way point) -> float.
    f = 10000000.5000000001;
    printf("%.6a %.3f 10000000.5000000001\n", f, f);
    // With f: text -> float in a single rounding.
    f = 10000000.5000000001f;
    printf("%.6a %.3f 10000000.5000000001f\n", f, f);
    puts("");

    // Same comparison just below the next half-way point.
    f = 10000001.4999999999;
    printf("%.6a %.3f 10000001.4999999999\n", f, f);
    f = 10000001.4999999999f;
    printf("%.6a %.3f 10000001.4999999999f\n", f, f);
}

Output

0x1.312d00p+23 10000000.000 10 million
0x1.312d02p+23 10000001.000 10 million - next float

// value value source code
0x1.312d00p+23 10000000.000 10000000.5000000001
0x1.312d02p+23 10000001.000 10000000.5000000001f // Different, and better

0x1.312d04p+23 10000002.000 10000001.4999999999
0x1.312d02p+23 10000001.000 10000001.4999999999f // Different, and better

Rounding mode

The issue about double¹ rounding is less likely when the rounding mode is up, down or towards zero. Issue arises when the 2nd rounding compounds the direction on half-way cases.

Occurrence rate

Issue occurs when code converts inexactly to a double that is very near half-way between 2 float values - so relatively rare. Issue applies even if the code constant was in decimal or hexadecimal form. With random constants: about 1 in 2^30.

Recommendation

Rarely a major concern, yet an f suffix is better to get the best value for a float and quiet a warning.

[Update 2022]

The issue is further complicated under 2 conditions:

  • FLT_EVAL_METHOD == 2, then the constant may be evaluated using long double math.

  • Evaluation of floating point constants may ignore decimal digits past a certain precision. This is allowed in C and IEEE 754. Typically this is XXX_DECIMAL_DIG + 3 digits (e.g. 20 for double).

These complications change the chance of seeing this issue. Still the conclusion remains: append f to get the best float constant.


¹ double here refers to doing something twice, not the type double.


