C# Float Expression: Strange Behavior When Casting the Result Float to Int

First of all, I assume that you know that 6.2f * 10 is not exactly 62 due to floating point rounding (it's actually the value 61.99999809265137 when expressed as a double) and that your question is only about why two seemingly identical computations result in the wrong value.

The answer is that in the case of (int)(6.2f * 10), you are taking the double value 61.99999809265137 and truncating it to an integer, which yields 61.

In the case of float f = 6.2f * 10, you are taking the double value 61.99999809265137 and rounding to the nearest float, which is 62. You then truncate that float to an integer, and the result is 62.

Exercise: Explain the results of the following sequence of operations.

double d = 6.2f * 10;
int tmp2 = (int)d;
// evaluate tmp2

Update: As noted in the comments, the expression 6.2f * 10 is formally of type float, since the second operand has an implicit conversion to float that is considered better than the implicit conversion to double.

The actual issue is that the compiler is permitted (but not required) to use an intermediate which is higher precision than the formal type (section 11.2.2). That's why you see different behavior on different systems: In the expression (int)(6.2f * 10), the compiler has the option of keeping the value 6.2f * 10 in a high precision intermediate form before converting to int. If it does, then the result is 61. If it does not, then the result is 62.

In the second example, the explicit assignment to float forces the rounding to take place before the conversion to integer.
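
To make the difference concrete, here is a minimal sketch of both cases; the first result is implementation-dependent, because the compiler may or may not keep the product in a higher-precision intermediate:

int direct = (int)(6.2f * 10);   // 61 or 62, depending on intermediate precision
float f = 6.2f * 10;             // assignment forces rounding to float: 62
int viaFloat = (int)f;           // 62

Console.WriteLine(direct);       // implementation-dependent
Console.WriteLine(viaFloat);     // 62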

C#: Wrong result when converting expression with floats to int

The compiler uses extra precision when computing some expressions. In the C# language specification, clause 9.3.7 allows an implementation to use more precision in a floating-point expression than the result type:

Floating-point operations may be performed with higher precision than the result type of the operation.

Note that the value of .05f is 0.0500000007450580596923828125. When .05f * 1000.0f is computed with float precision, the result is 50, due to rounding. However, when it is computed with double or greater precision, the result is 50.0000007450580596923828125. Then dividing 100 by that with double precision produces 1.999999970197678056393897350062616169452667236328125. When this is converted to int, the result is 1.

In float c = a / (b * 1000.0f);, the result of the division is converted to float. Even if the division is computed with double precision and produces 1.999999970197678056393897350062616169452667236328125, this value becomes 2 when rounded to float, so c is set to 2.

In int res = (int)(a / (b * 1000.0f));, the result of the division is not converted to float. If the compiler computes it with double precision, the result is 1.999999970197678056393897350062616169452667236328125, and converting that produces 1.
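
For illustration, here is a small sketch of the two statements side by side; the values a = 100f and b = .05f are assumptions reconstructed from the discussion above, not quoted from the original question:

float a = 100f;   // assumed value, for illustration
float b = .05f;   // actually 0.0500000007450580596923828125

float c = a / (b * 1000.0f);         // rounded to float on assignment: 2
int res = (int)(a / (b * 1000.0f));  // a double-precision 1.9999999701... may be truncated directly

Console.WriteLine(c);    // 2
Console.WriteLine(res);  // 1 or 2, depending on intermediate precision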

Error in simple float calculations

The number 0.7 cannot be represented exactly by a float; instead, the value of s is closer to 0.699999988079071044921875.

The int value of q will be converted to a float; since 150 can be represented exactly, it stays 150.

If you multiply the two together, you won't get 105 exactly:

q = 150
s = 0.699999988079071044921875
q * s = 104.999998211861

Now refer to the relevant part in the CLI Spec (ECMA-335) §12.1.3:

When a floating-point value whose internal representation has greater range and/or precision than its nominal type is put in a storage location, it is automatically coerced to the type of the storage location. This can involve a loss of precision or the creation of an out-of-range value (NaN, +infinity, or -infinity). However, the value might be retained in the internal representation for future use, if it is reloaded from the storage location without having been modified. It is the responsibility of the compiler to ensure that the retained value is still valid at the time of a subsequent load, taking into account the effects of aliasing and other execution threads (see memory model (§12.6)). This freedom to carry extra precision is not permitted, however, following the execution of an explicit conversion (conv.r4 or conv.r8), at which time the internal representation must be exactly representable in the associated type.

So q * s results in a value with higher precision than float can hold. When casting this directly to an int:

var z1 = (int)(q * s);

The value is never coerced to the type float, but directly cast to int and thereby truncated to 104.

In all other examples the value was cast to or stored in a float and therefore converted to the nearest possible float value, which is 105.
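
A compact sketch of the cases above; whether the truncated result is actually 104 depends on the compiler carrying the extra precision:

int q = 150;
float s = 0.7f;          // actually 0.699999988079071044921875

var z1 = (int)(q * s);   // high-precision intermediate may be truncated straight to 104
float z2 = q * s;        // storing coerces the value to float, rounding it to 105
var z3 = (int)z2;        // 105

Console.WriteLine(z1);   // 104 (or 105 if no extra precision is carried)
Console.WriteLine(z3);   // 105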

C# strange precision lost int to float and backwards

This was a comment to a now deleted answer:

The integer 28218681 can be written in binary as 1101011101001010100111001. Note that 25 digits are needed. Single precision has only 24 bits for its "mantissa" (including the implicit leading 1 bit). 24 is less than 25. Precision is lost. A single-precision representation "remembers" a number only by its leading 24 binary digits. That corresponds to roughly 7-8 decimal figures. Roughly. The integer 28218681 has just 8 figures, so the problem arises.
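
A round-trip sketch of that precision loss; 28218680 is the result of round-to-nearest-even here, since 28218681 lies exactly halfway between the two neighboring floats:

int i = 28218681;   // needs 25 significant bits
float f = i;        // float keeps only 24: rounds to 28218680
int back = (int)f;

Console.WriteLine(back);       // 28218680
Console.WriteLine(i == back);  // False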


The lesson learned is: use a type that is "wide" enough to give the desired precision. For example, a double-precision number can hold the first ~16 decimal figures of a number.

This is not related to the discussion on whether to use a binary or a decimal format. Note that if the asker had used decimal32 instead of binary32 (another name for float), he would have had the exact same issue!

Strange compiler behavior with float literals vs float variables

Your question can be simplified to asking why these two results are different:

float f = 2.0499999f;
var a = f * 100f;
var b = (int)(f * 100f);
var d = (int)a;
Console.WriteLine(b);
Console.WriteLine(d);

If you look at the code in .NET Reflector you can see that the above code is actually compiled as if it were the following code:

float f = 2.05f;
float a = f * 100f;
int b = (int) (f * 100f);
int d = (int) a;
Console.WriteLine(b);
Console.WriteLine(d);

Floating-point calculations cannot always be performed exactly. The result of 2.05f * 100f is not exactly equal to 205, but just a little less (approximately 204.9999952) due to rounding error. When this intermediate result is converted to an integer, it is truncated; when it is stored as a float, it is rounded to the nearest representable value. These two methods of rounding give different results.


Regarding your comment to my answer: when you write this:

Console.WriteLine((int) (2.0499999f * 100f));
Console.WriteLine((int)(float)(2.0499999f * 100f));

The calculations are performed entirely at compile time. The above code is equivalent to this:

Console.WriteLine(204);
Console.WriteLine(205);

Strange behavior when casting an int to float in C

In both cases, the code seeks to convert from some integer type to float and then to double. The conversion to double occurs because a float value passed to a variadic function is promoted to double.

Check your setting of FLT_EVAL_METHOD; I suspect it has a value of 1 or 2 (the OP reports 2 with at least one compiler). This allows the compiler to evaluate float operations and constants to a range and precision greater than that of float.

Your compiler optimized (float)x by going directly from int to double arithmetic. This is a performance improvement at run time.

(float)2147483647 is a compile-time cast, and the compiler performed the full int to float to double conversion for accuracy, as performance is not an issue there.


[Edit2] It is interesting that the C11 spec is more specific than the C99 spec with the addition of "Except for assignment and cast ...". This implies that C99 compilers were sometimes allowing the direct int to double conversion without first going through float, and that C11 was amended to clearly disallow skipping the cast.

With C11 formally excluding this behavior, modern compilers should not do this, but older ones, like the OP's, might - thus a bug by C11 standards. Unless some other C99 or C89 specification is found to say otherwise, this appears to be allowable compiler behavior.


[Edit] Taking together the comments by @Keith Thompson, @tmyklebu, and @Matt McNabb: the compiler, even with a non-zero FLT_EVAL_METHOD, should be expected to produce 2147483648.0.... Thus either a compiler optimization flag is explicitly overriding correct behavior, or the compiler has a corner-case bug.


C99dr §5.2.4.2.2 8 The values of operations with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type. The use of evaluation formats is characterized by the implementation-defined value of FLT_EVAL_METHOD:

-1 indeterminable;

0 evaluate all operations and constants just to the range and precision of the type;

1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;

2 evaluate all operations and constants to the range and precision of the long double type.


C11dr §5.2.4.2.2 9 Except for assignment and cast (which remove all extra range and precision), the values yielded by operators with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type. The use of evaluation formats is characterized by the implementation-defined value of FLT_EVAL_METHOD:

-1 (Same as C99)

0 (Same as C99)

1 (Same as C99)

2 (Same as C99)

Strange behaviour when comparing cast float to zero

The reason isZero and isZero2 can evaluate to different values, and isZero can be false, is that the C++ compiler is allowed to implement intermediate floating-point operations with more precision than the type of the expression would indicate, but the extra precision has to be dropped on assignment.

Typically, when generating code for the historical 387 FPU, the generated instructions work on either the 80-bit extended-precision type or, if the FPU is set to a 53-bit significand (e.g. on Windows), an unusual floating-point type with a 53-bit significand and a 15-bit exponent.

Either way, minVal/2.0f is computed exactly, because the exponent range allows it to be represented, but assigning it to nextCheck rounds it to zero.

If you are using GCC, there is the additional problem that -fexcess-precision=standard has not yet been implemented for the C++ front-end, meaning that the code generated by g++ does not implement exactly what the standard recommends.


