C++ Floating Point Precision

C floating point precision

You can only represent numbers exactly in IEEE754 (at least for the single and double precision binary formats) if they can be constructed from adding together inverted powers of two (i.e., 2^-n like 1, 1/2, 1/4, 1/65536 and so on) subject to the number of bits available for precision.

There is no combination of inverted powers of two that will get you exactly to 101.1, within the scaling provided by floats (23 stored bits of precision) or doubles (52 stored bits of precision).

If you want a quick tutorial on how this inverted-power-of-two stuff works, see this answer.

Applying the knowledge from that answer to your 101.1 number (as a single precision float):

s C floating point precision Precision in C floats How to set precision of a float Floating point precision not matching for the value what was actually assigned? precisi mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 10000101 10010100011001100110011
           |  | |   ||  ||  ||  |+- 8388608
           |  | |   ||  ||  ||  +-- 4194304
           |  | |   ||  ||  |+-----  524288
           |  | |   ||  ||  +------  262144
           |  | |   ||  |+---------   32768
           |  | |   ||  +----------   16384
           |  | |   |+-------------    2048
           |  | |   +--------------    1024
           |  | +------------------      64
           |  +--------------------      16
           +-----------------------       2

The mantissa part of that actually continues forever for 101.1:

mmmmmmmmm mmmm mmmm mmmm mm
100101000 1100 1100 1100 11|00 1100 (and so on).

Hence it's not a matter of precision: no finite number of bits will represent that number exactly in IEEE754 format.

Using the bits to calculate the actual number (the closest approximation): the sign is positive. The exponent field is 128 + 4 + 1 = 133; subtracting the bias of 127 gives 6, so the multiplier is 2⁶, or 64.

The mantissa consists of 1 (the implicit leading bit) plus the value of every set bit, where the nth bit from the left is worth 1/(2ⁿ), with n starting at 1 and increasing to the right: {1/2, 1/16, 1/64, 1/1024, 1/2048, 1/16384, 1/32768, 1/262144, 1/524288, 1/4194304, 1/8388608}.

When you add all these up, you get 1.57968747615814208984375.

When you multiply that by the multiplier previously calculated, 64, you get 101.09999847412109375.

All numbers were calculated with bc using a scale of 100 decimal digits, resulting in a lot of trailing zeros, so the numbers should be very accurate. Doubly so, since I checked the result with:

#include <stdio.h>
int main (void) {
    float f = 101.1f;
    printf ("%.50f\n", f);
    return 0;
}

which also gave me 101.09999847412109375000....

Precision in C floats

"6 digits after the decimal point" is nonsense, and your example is a good demonstration of this.

This is an exact specification of the float data type.

The precision of the float is 24 bits. There are 23 bits denoting the fraction after the binary point, plus there's also an "implicit leading bit", according to the online source. This gives 24 significant bits in total.

Hence in decimal digits this is approximately:

24 * log(2) / log(10) = 7.22

How to set precision of a float

You can't do that, since precision is determined by the data type (i.e. float or double or long double). If you want to round it for printing purposes, you can use the proper format specifiers in printf(), i.e. printf("%0.3f\n", 0.666666666).

Floating point precision not matching for the value what was actually assigned?

The most common floating-point format, IEEE 754, includes a basic 32-bit format and a basic 64-bit format, and these are commonly used for the float and double types in C. For brevity, I will call them float and double in this answer.

Neither of these types can exactly represent non-integer numbers other than those that are multiples of a power of two (such as ¼, ¾, 1/1024, 73/1048576). Every other number will be changed slightly when it is converted from decimal to float or double.

However, the float has the property that rounding any decimal numeral with at most six significant digits (such as 1.2345 or 9.87654e23) to float and back to six significant decimal digits returns the original number (provided the number is within normal bounds of the format). In C, this number of digits is reported by the value FLT_DIG, which is defined in <float.h>. Since your number 78.352361 has eight significant digits, it is not guaranteed to survive a conversion to float and back.

For double, at least 15 digits will survive a round trip, reported by DBL_DIG.

Note that this is the number of decimal digits guaranteed to survive the rounding caused by one conversion to binary floating-point and back to the original number of decimal digits. If additional arithmetic is performed in floating-point, additional roundings occur, which may accumulate more error. And, if a value is formatted with more decimal digits than the original, then the result may differ from the original number. (For example, .9f produces “.9” when converted back to one decimal digit but “0.899999976” when converted to nine decimal digits.)

Since double guarantees that 15 digits survive a round trip, your number 78.352361 would survive a conversion to double and back to eight significant digits unchanged. Additionally, there is enough precision to perform some arithmetic without accumulating so much error that it is visible in eight significant decimal digits. However, floating-point arithmetic can be tricky, and a complete error analysis depends on the operations you perform.

precision between float and double in C

They produce the same number of decimal places because 6 is the default precision for %f. You can change it, as in the edited example below. The precision can be written as a literal number, as in the first and third calls, or as * (the syntax is %.*f), in which case it is supplied as another argument, as in the second call.

#include <stdio.h>

int main(void) {
    int a = 960;
    int b = 16;

    float c = a*0.001;  
    float d = a*0.001 + b;
    double e = a*0.001 + b;

    printf("%.9f\n", c);
    printf("%.*f\n", 9, d);
    printf("%.16f\n", e);
}

Program output:


0.959999979
16.959999084
16.9600000000000009

The extra decimal places now show that none of the results is exact. One reason is that 0.001 cannot be exactly coded as a floating point value. There are other reasons too, which have been extensively covered.

One easy way to understand why is that a float can encode only about 2^32 different values, whereas there are infinitely many real numbers within the range of float, so only about 2^32 of them can be represented exactly. The fraction 1/1000 is a recurring value in binary (just as the fraction 1/3 is in decimal).

Is there any way to not lose the precision and still get the value?

float only guarantees 6 decimal digits of precision, so any computation with a float (even if the other operands are double, even if you're storing the result to a double) will only be precise to 6 digits.

If you need greater precision, then limit yourself to double or long double. The C standard itself only guarantees 10 decimal digits for those types (an IEEE 754 double gives about 15), so if you need more than they provide, you'll need to use something other than the native floating point types and library functions. You'll either need to roll your own, or use an arbitrary precision math library like GNU MP.

Why does C print float values after the decimal point different from the input value?

Your computer uses binary floating point internally. Type float has 24 bits of precision, which translates to approximately 7 decimal digits of precision.

Your number, 2118850.132, has 10 decimal digits of precision. So right away we can see that it probably won't be possible to represent this number exactly as a float.

Furthermore, due to the properties of binary numbers, no decimal fraction that ends in 1, 2, 3, 4, 6, 7, 8, or 9 (that is, numbers like 0.1 or 0.2 or 0.132) can be exactly represented in binary. So those numbers are always going to experience some conversion or roundoff error.

When you enter the number 2118850.132 as a float, it is converted internally into the binary fraction 1000000101010011000010.01. That's equivalent to the decimal fraction 2118850.25. So that's why the .132 seems to get converted to 0.25.

As I mentioned, float has only 24 bits of precision. You'll notice that 1000000101010011000010.01 is exactly 24 bits long. So we can't, for example, get closer to your original number by using something like 1000000101010011000010.001, which would be equivalent to 2118850.125, which would be closer to your 2118850.132. No, the next lower 24-bit fraction is 1000000101010011000010.00 which is equivalent to 2118850.00, and the next higher one is 1000000101010011000010.10 which is equivalent to 2118850.50, and both of those are farther away from your 2118850.132. So 2118850.25 is as close as you can get with a float.

If you used type double you could get closer. Type double has 53 bits of precision, which translates to approximately 16 decimal digits. But you still have the problem that .132 ends in 2 and so can never be exactly represented in binary. As type double, your number would be represented internally as the binary number 1000000101010011000010.0010000111001010110000001000010 (note 53 bits), which is equivalent to 2118850.132000000216066837310791015625, which is much closer to your 2118850.132, but is still not exact. (Also notice that 2118850.132000000216066837310791015625 begins to diverge from your 2118850.1320000000 after 16 digits.)

So how do you avoid this? At one level, you can't. It's a fundamental limitation of finite-precision floating-point numbers that they cannot represent all real numbers with perfect accuracy. Also, the fact that computers typically use binary floating-point internally means that they can almost never represent "exact-looking" decimal fractions like .132 exactly.

There are two things you can do:

If you need more than about 7 digits worth of precision, definitely use type double, don't try to use type float.
If you believe your data is accurate to three places past the decimal, print it out using %.3f. If you take 2118850.132 as a double, and printf it using %.3f, you'll get 2118850.132, like you want. (But if you printed it with %.12f, you'd get the misleading 2118850.132000000216.)
