When a Float Variable Goes Out of the Float Limits, What Happens

When a float variable goes out of the float limits, what happens?

Formally, the behavior is undefined. On a machine with IEEE
floating point, however, a result that still exceeds FLT_MAX
after rounding becomes Inf. Because the precision is limited,
FLT_MAX + 1 never gets that far: it rounds back to FLT_MAX.

You can see the same effect with values well under FLT_MAX.
Try something like:

float f1 = 1e20f;      // far below FLT_MAX
float f2 = f1 + 1.0f;  // the 1.0f is lost to rounding
if ( f1 == f2 ) ...

The if will evaluate to true, at least with IEEE arithmetic.
(There do exist, or at least have existed, machines where
float has enough precision for the if to evaluate to
false, but they aren't very common today.)
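
The comparison above can be wrapped in a small helper. This is only a sketch: it assumes IEEE-754 single precision and the default round-to-nearest mode, and the name is_absorbed is made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// The behaviour described here assumes IEEE-754 single precision.
static_assert(std::numeric_limits<float>::is_iec559, "IEEE-754 float assumed");

// Returns true when adding `small` to `big` leaves `big` unchanged,
// i.e. the addend is lost entirely to rounding.
bool is_absorbed(float big, float small) {
    assert(std::isfinite(big) && std::isfinite(small));
    volatile float sum = big + small;  // volatile forces a real float store
    return sum == big;
}
```

Under those assumptions, is_absorbed(1e20f, 1.0f) and is_absorbed(FLT_MAX, 1.0f) should both be true, while is_absorbed(1.0f, 1.0f) is false.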

How is float_max + 1 defined in C++?

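In the default rounding mode (round-nearest-ties-to-even), max + 1 will simply be max with an IEEE-754 single-precision float. The same holds when rounding toward zero or toward negative infinity; only rounding toward positive infinity turns the sum into infinity.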

Note that the maximum positive finite 32-bit float is:

           3  2          1         0
           1 09876543 21098765432109876543210
           S ---E8--- ----------F23----------
   Binary: 0 11111110 11111111111111111111111
      Hex: 7F7F FFFF
Precision: SP
     Sign: Positive
 Exponent: 127 (Stored: 254, Bias: 127)
Hex-float: +0x1.fffffep127
    Value: +3.4028235e38 (NORMAL)

For this number to overflow and become infinity using the default rounding mode of round-nearest-ties-to-even, you have to add at least:

           3  2          1         0
           1 09876543 21098765432109876543210
           S ---E8--- ----------F23----------
   Binary: 0 11100110 00000000000000000000000
      Hex: 7300 0000
Precision: SP
     Sign: Positive
 Exponent: 103 (Stored: 230, Bias: 127)
Hex-float: +0x1p103
    Value: +1.0141205e31 (NORMAL)

Adding anything smaller than this particular value rounds the sum back to the max value itself. Other rounding modes behave differently, but under the default mode the number you're looking for is on the order of 1e31, which is pretty darn large.

This is an excellent example of how IEEE floats get sparser and sparser as their magnitude increases.
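
The threshold can be checked directly. The sketch below assumes IEEE-754 single precision and the default round-nearest-ties-to-even mode; overflow_threshold and add_to_max are names invented for this example.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Smallest addend that pushes FLT_MAX to infinity under the default
// rounding mode: 2^103, half the gap between FLT_MAX and 2^128.
float overflow_threshold() {
    return std::ldexp(1.0f, 103);
}

// Adds a non-negative `delta` to FLT_MAX in single precision.
float add_to_max(float delta) {
    assert(delta >= 0.0f);
    volatile float sum = FLT_MAX + delta;  // volatile forces rounding to float
    return sum;
}
```

With these helpers, add_to_max(overflow_threshold()) should be infinite, while the next float below the threshold, std::nextafter(overflow_threshold(), 0.0f), still gives back FLT_MAX.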

How to overflow a float?

Try multiplying by 10, and it will overflow. The reason adding 1 doesn't overflow is the same reason why adding a small float to an already very large float doesn't actually change the value at all - it's a floating point format, meaning the number of digits of precision is limited.

Or, adding at least one unit in that last significant digit should work:

float f = 3.402823e38f; // just below FLT_MAX (about 3.4028235e38)
f = f + 0.000001e38f; // adds 1e32, which should overflow to infinity
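
The multiplication route can be sketched like this, assuming IEEE-754 floats; times_ten_overflows is a hypothetical helper, not a standard function.

```cpp
#include <cassert>
#include <cmath>

// Multiplies a finite x by 10 in single precision and reports whether
// the result overflowed to infinity.
bool times_ten_overflows(float x) {
    assert(std::isfinite(x));
    volatile float product = x * 10.0f;  // volatile forces rounding to float
    float result = product;
    return std::isinf(result);
}
```

On such a platform, times_ten_overflows(FLT_MAX) and times_ten_overflows(1e38f) should be true, while times_ten_overflows(1e37f) is false because the product still fits.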

Subtracting Float.MIN_VALUE from another float number has no effect in Android application

Since currentPayment is a Float, I would expect it should be able to hold any floating point value within the bounds of Float to its maximum precision (i.e. Float.MIN_VALUE).

This is a wrong assumption. Float is called "float" because the position of its binary point floats, so its precision is not fixed: it depends on how big the number is that you're storing. Float.MIN_VALUE, the smallest positive float value, is smaller than the spacing between representable values around almost any other number, so it is too small to affect them if you add or subtract it. At the high end, adjacent Float values are much more than 1 apart. If you subtract 999,000,000 from Float.MAX_VALUE, you still get Float.MAX_VALUE because the precision is so coarse at the highest end.
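
Here is the same idea as a rough C++ sketch (the question is about Kotlin, but IEEE-754 floats behave the same way). It assumes IEEE-754 single precision; gap_below and minus_smallest are made-up names, and std::numeric_limits<float>::denorm_min() plays the role of Float.MIN_VALUE.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from a positive finite x down to the next smaller float.
float gap_below(float x) {
    assert(x > 0.0f && std::isfinite(x));
    return x - std::nextafter(x, 0.0f);
}

// Subtracts the smallest positive float (the counterpart of Java's
// Float.MIN_VALUE) from x in single precision.
float minus_smallest(float x) {
    volatile float diff = x - std::numeric_limits<float>::denorm_min();
    return diff;
}
```

For example, gap_below(1e9f) should be 64 and the gap below the largest float is 2^104, so neither Float.MIN_VALUE nor 999,000,000 is big enough to move a value of that size.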

Also, since floating point numbers are not stored in base-10, they are inappropriate for storing currency amounts, because most decimal fractions (0.1, for example) cannot be represented exactly. (I mention that because your variable name has the word "payment" in it, which is a red flag.)

You should either use BigDecimal, Long, or Int to represent currency, so your currency amounts and arithmetic will be exact.
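
A minimal sketch of the integer approach, written in C++ for illustration and assuming amounts are kept as whole cents. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: adds the same payment `times` times to a balance held
// as whole cents. Integer addition is exact as long as it does not overflow.
std::int64_t add_payments_cents(std::int64_t balance_cents,
                                std::int64_t payment_cents, int times) {
    assert(times >= 0);
    for (int i = 0; i < times; ++i) {
        balance_cents += payment_cents;
    }
    return balance_cents;
}

// The same accumulation in float, for comparison. Each addition is rounded
// to single precision, so decimal amounts such as 0.10 drift.
float add_payments_float(float balance, float payment, int times) {
    assert(times >= 0);
    volatile float total = balance;  // volatile forces rounding on every step
    for (int i = 0; i < times; ++i) {
        total = total + payment;
    }
    return total;
}
```

Ten payments of 10 cents give exactly 100 cents, whereas ten additions of 0.1f do not come out as exactly 1.0f on an IEEE-754 implementation.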

Here's an analogy to help understand it, since it is hard to contemplate binary numbers. Floats are 32-bits in Java and Kotlin, but imagine we have a special kind of computer that can store a floating point number in base-10. Each bit on this computer is not just 0 or 1, but can be anything from 0 to 9. A Float on this computer can have 4 digits and a decimal place, but the decimal place is floating, so it can be placed anywhere relative to the four digits. So a Float on this computer is always five bits--four of the bits are the digits, and the fifth bit tells you where the decimal place goes.

In this imaginary computer's Float, the smallest possible positive number that can be represented is .0001 and the largest possible number is 9999 (with the decimal place at the far right). You can't represent 9999.5 or even 1000.5 because there aren't enough digits available. There's no fixed amount of precision--the precision is determined by where the decimal place is in the current number. Precision is better for numbers with a decimal place farther to the left.

For the number storage format to be able to have a fixed precision, we would have to fix the decimal point in one place for all numbers. We would have to choose a precision. Suppose we chose a precision of 0.001. Our fifth bit that told us where the decimal place goes in the floating point can now just be used for a fifth digit. Now we know the precision is always 0.001, but the largest possible number we can represent is 99.999 and the smallest possible number is 0.001, a much smaller possible range than with floating point. This limitation is the reason floating points are used instead.

Difference between decimal, float and double in .NET?

float (the C# alias for System.Single) and double (the C# alias for System.Double) are floating binary point types. float is 32-bit; double is 64-bit. In other words, they represent a number like this:

10001.10010110011
The binary number and the location of the binary point are both encoded within the value.

decimal (the C# alias for System.Decimal) is a floating decimal point type. In other words, it represents a number like this:

12345.65789
Again, the number and the location of the decimal point are both encoded within the value – that's what makes decimal still a floating point type instead of a fixed point type.

The important thing to note is that humans are used to representing non-integers in a decimal form, and expect exact results in decimal representations; not all decimal numbers are exactly representable in binary floating point – 0.1, for example – so if you use a binary floating point value you'll actually get an approximation to 0.1. You'll still get approximations when using a floating decimal point as well – the result of dividing 1 by 3 can't be exactly represented, for example.

As for what to use when:

  • For values which are "naturally exact decimals" it's good to use decimal. This is usually suitable for any concepts invented by humans: financial values are the most obvious example, but there are others too. Consider the score given to divers or ice skaters, for example.

  • For values which are more artefacts of nature which can't really be measured exactly anyway, float/double are more appropriate. For example, scientific data would usually be represented in this form. Here, the original values won't be "decimally accurate" to start with, so it's not important for the expected results to maintain the "decimal accuracy". Floating binary point types are much faster to work with than decimals.

Why two float type variables have different values

Floating point values have a finite size, and can therefore only represent real values with a finite precision. This leads to rounding errors when you need more precision than they can store.

In particular, when adding a small number (such as those you're summing) to a much larger number (such as your accumulator), the loss of precision can be quite large compared with the small number, giving a significant error; and the errors will be different depending on the order.

Typically, float has 24 bits of precision, corresponding to about 7 significant decimal digits. Your accumulator requires 10 decimal digits (around 30 bits), so you will experience this loss of precision. Typically, double has 53 bits (about 16 decimal digits), so your result can be represented exactly.

A 64-bit integer may be the best option here, since all the inputs are integers. Using an integer avoids loss of precision, but introduces a danger of overflow if the inputs are too many or too large.

To minimise the error if you can't use a wide enough accumulator, you could sort the input so that the smallest values are accumulated first; or you could use more complicated methods such as Kahan summation.
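
As a rough illustration of the options above, here is a sketch of naive summation, Kahan summation and an integer accumulator. It assumes IEEE-754 floats, no extended intermediate precision and no -ffast-math style optimisation (which can remove the compensation); the function names are invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Naive accumulation: every addition is rounded to 24 bits.
float naive_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float v : values) {
        sum += v;
    }
    return sum;
}

// Kahan (compensated) summation: `c` carries the low-order part that
// the previous addition rounded away. Inputs are assumed finite.
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float c = 0.0f;
    for (float v : values) {
        assert(std::isfinite(v));
        float y = v - c;        // apply the stored correction
        float t = sum + y;      // low bits of y may be lost here
        c = (t - sum) - y;      // recover what was lost (negated)
        sum = t;
    }
    return sum;
}

// Exact alternative when all inputs are integers and the total fits.
std::int64_t integer_sum(const std::vector<std::int32_t>& values) {
    std::int64_t sum = 0;
    for (std::int32_t v : values) {
        sum += v;
    }
    return sum;
}
```

For instance, summing 16777216 followed by a thousand 1s leaves the naive float sum stuck at 16777216, while the Kahan and integer versions reach 16778216.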

Why float variable saves value by cutting digits after point in a weird way?

When represented as a float, your number has an exponent of 16 (i.e. the value is its mantissa times 2^16, or 65536). The mantissa then becomes

123456.123456 / 65536 = 1.8837909462890625

In order to fit in a 32-bit float, the mantissa is rounded to 24 significant bits (23 of which are stored explicitly), so now it becomes approximately 1.883791. When multiplied back by 65536, it becomes 123456.125.

Note the 5 in the third position after the decimal point: the output routine of C++ that you used rounds it up, making your final number look like 123456.13.

EDIT: Explanation of the rounding (from Rick Regan's comment):

The rounding occurs first in binary (to 24 bits), in decimal to binary conversion, and then to decimal, in printf. The stored value is 1.1110001001000000001 x 2^16 = 1.8837909698486328125 x 2^16 = 123456.125. It prints as 123456.13, but only because Visual C++ uses "round half away from zero" rounding.
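
The stored value can be inspected with a short sketch like the following, assuming IEEE-754 single precision. binary_exponent and significand are hypothetical helper names, and the helpers are meant for positive finite inputs.

```cpp
#include <cassert>
#include <cmath>

// Binary exponent e such that x = m * 2^e with 1 <= m < 2.
int binary_exponent(float x) {
    assert(std::isfinite(x) && x > 0.0f);
    return std::ilogb(x);
}

// Significand m in [1, 2) such that x == m * 2^binary_exponent(x).
float significand(float x) {
    // Scaling by a power of two is exact, so no bits are lost here.
    return std::ldexp(x, -binary_exponent(x));
}
```

For 123456.123456f this should give an exponent of 16 and a significand of 1.8837909698486328125, and the float itself compares equal to 123456.125f.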

Rick has an outstanding article on the subject, too.

If you would like to play with other numbers and their float representations, here is a very useful IEEE-754 calculator.

