Is floating-point == ever OK?

There are two ways to answer this question:

  1. Are there cases where float == float gives the correct result?
  2. Are there cases where float == float is acceptable coding?

The answer to (1) is: Yes, sometimes. But it's going to be fragile, which leads to the answer to (2): No. Don't do that. You're begging for bizarre bugs in the future.

As for a call of the form foo(BAR): In that particular case the comparison will return true, but when you are writing foo you don't know (and shouldn't depend on) how it is called. For example, calling foo(BAR) will be fine, but foo(BAR + 0.2 - 0.2) can break because each operation rounds its result, and even foo(BAR * 1.0) might, depending on how the compiler evaluates and optimises the expression. You shouldn't be relying on the caller not performing any arithmetic!
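To make the fragility concrete, here is a minimal Python sketch (Python floats are IEEE 754 doubles). The names foo and BAR mirror the call discussed above; the value 0.1 for BAR, the body of foo, and the particular arithmetic the caller performs are all assumptions made up for illustration.

```python
BAR = 0.1  # assumed value, chosen only for illustration


def foo(x: float) -> bool:
    """Hypothetical callee that tests its argument with ==."""
    return x == BAR


# Passing the constant straight through compares a value with itself.
direct = foo(BAR)

# Arithmetic that is a no-op on paper rounds at every step, so the
# argument is no longer bit-for-bit the same value as BAR.
round_trip = foo(BAR + 0.2 - 0.2)

print(direct)      # True
print(round_trip)  # False
```

The callee did nothing wrong by its own lights; the caller changed, and the == silently stopped matching.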

Long story short, even though a == b will work in some cases, you really shouldn't rely on it. Even if you can guarantee the calling semantics today, you may not be able to guarantee them next week, so save yourself some pain and don't use ==.

To my mind, float == float is never* OK because it's pretty much unmaintainable.

*For small values of never.

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
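The claims above can be checked directly. This is a small sketch in Python, relying on the fact that Decimal(float) and Fraction(float) convert the stored binary value exactly; the variable names are just for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) shows the exact value that is actually stored.
stored_tenth = Decimal(0.1)
print(stored_tenth)  # 0.1000000000000000055511151231257827021181583404541015625
print((0.1).hex())   # 0x1.999999999999ap-4

# Signed error of each stored constant relative to the rational number.
err_01 = Fraction(0.1) - Fraction(1, 10)
err_02 = Fraction(0.2) - Fraction(2, 10)
err_03 = Fraction(0.3) - Fraction(3, 10)
print(err_01 > 0, err_02 > 0, err_03 < 0)  # True True True

# Two constants that are slightly too large add up to something that is
# larger than the (slightly too small) constant 0.3.
total = 0.1 + 0.2
print(total == 0.3)  # False
print(total)         # 0.30000000000000004
```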

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)
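If you want to check which fractions are "neat" in which base, a rough sketch follows. It uses the standard fact that a reduced fraction has a finite expansion in base N exactly when every prime factor of its denominator also divides N. The function name terminates is made up for this example, and it assumes a positive denominator.

```python
from math import gcd


def terminates(numerator: int, denominator: int, base: int) -> bool:
    """Return True if numerator/denominator has a finite expansion in `base`."""
    denominator //= gcd(numerator, denominator)  # reduce the fraction first
    # Strip every prime factor the denominator shares with the base.
    shared = gcd(denominator, base)
    while shared > 1:
        denominator //= shared
        shared = gcd(denominator, base)
    # Anything left over is a factor the base cannot express exactly.
    return denominator == 1


print(terminates(1, 3, 10))   # False: 0.333...
print(terminates(3, 10, 10))  # True:  0.3
print(terminates(3, 10, 2))   # False: the factor 5 is not a power of two
print(terminates(1, 16, 2))   # True:  0.0001 in binary
print(terminates(1, 16, 10))  # True:  0.0625
```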

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application: it depends on how much "wiggle room" you are prepared to allow and on the largest number you are going to compare (because the absolute spacing between floats grows with magnitude). Beware of "epsilon" style constants in your language of choice. They usually describe the spacing of floats near 1.0 (or, in some languages, the smallest positive value) and are not meant to be used as tolerance values.
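Here is a short sketch of both pieces of advice in Python. The tolerance 1e-9 is an arbitrary assumption for the example, not a recommendation, and the magnitude 1e16 is picked only because the spacing between doubles there is easy to state.

```python
import sys

x = 0.1 + 0.2
y = 0.3

# Round only for display; keep the full-precision value for further math.
shown = f"{x:.2f}"
print(shown)  # 0.30

# Absolute-tolerance comparison; the value is application-specific.
my_tolerance_value = 1e-9
exactly_equal = x == y
within_tolerance = abs(x - y) < my_tolerance_value
print(exactly_equal)     # False
print(within_tolerance)  # True

# Why the language's "epsilon" constant is a poor tolerance: it is the gap
# between 1.0 and the next float, but the gap grows with magnitude.
big = 1e16
neighbour = big + 2.0  # the next representable double above 1e16
gap = neighbour - big
print(gap)                           # 2.0
print(gap > sys.float_info.epsilon)  # True
```

Two adjacent doubles near 1e16 are 2.0 apart, so a tolerance of machine epsilon would treat even direct neighbours as wildly different.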

Cases where floating-point numbers are comparable using equality

The double-precision floating point representation (64 bits per number) is exact for integers up to ±2**53 (±9,007,199,254,740,992). If you are using floating point numbers but start from integers, do only integer computations with them, and never pass that limit, then the result is exact and using == is perfectly fine.

Numbers that in general can be represented exactly are N/M where N is integer and M is a power of two. Thus if you're just doing computations involving e.g. 1/4, 1/2, 3/4 and integer multiples of them you're fine too until you reach very big multipliers.
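A quick way to convince yourself of both statements, sketched in Python (whose floats are doubles). The particular numbers are arbitrary examples.

```python
LIMIT = 2.0 ** 53  # 9,007,199,254,740,992

# Integer-valued arithmetic that stays below the limit is exact.
small_ints_ok = 3.0 * 7.0 + 4.0 == 25.0
reaches_limit = float(2**53 - 1) + 1.0 == LIMIT

# Just past the limit, odd integers have no representation, so the sum
# rounds back to the limit itself.
past_limit_collapses = LIMIT + 1.0 == LIMIT

# Fractions whose denominator is a power of two are exact as well.
quarters_ok = 0.25 + 0.5 == 0.75
four_quarters_ok = sum([0.25] * 4) == 1.0

# 0.1 has no exact binary representation, so the same pattern fails.
tenths_ok = 0.1 * 3 == 0.3

print(small_ints_ok, reaches_limit, past_limit_collapses)  # True True True
print(quarters_ok, four_quarters_ok, tenths_ok)            # True True False
```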

When instead you deal with numbers that cannot be represented exactly (e.g. 0.1), the approximation introduced may lead to surprising results. One source of problems is that intermediate results may be stored in temporaries with higher precision, and thus the result of a formula may differ depending on whether you store it in memory explicitly or not, and it may also change depending on the optimization level.

Is it safe to check floating point values for equality to 0?

It is safe to expect that the comparison will return true if and only if the double variable has a value of exactly 0.0 (which in your original code snippet is, of course, the case). This is consistent with the semantics of the == operator. a == b means "a is equal to b".

It is not safe (because it is not correct) to expect that the result of some calculation will be zero in double (or more generally, floating point) arithmetic whenever the result of the same calculation in pure mathematics is zero. This is because as soon as calculations are actually carried out, floating point precision error appears, a concept which does not exist in real number arithmetic in mathematics.
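The distinction can be shown in a few lines of Python. The expressions are arbitrary examples picked because their results are easy to verify.

```python
# A value that really is zero compares equal to zero, as expected.
exact_zero = 0.0
print(exact_zero == 0.0)  # True
print(-0.0 == 0.0)        # True: negative zero compares equal to zero

# Zero in real-number arithmetic, but not in double arithmetic.
residue = 0.1 + 0.2 - 0.3
print(residue == 0.0)  # False
print(residue)         # 5.551115123125783e-17

# The same shape of calculation with exactly representable operands.
dyadic_residue = 0.25 + 0.5 - 0.75
print(dyadic_residue == 0.0)  # True
```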

How should I do floating point comparison?

Comparing for greater/smaller is not really a problem unless you're working right at the edge of the float/double precision limit.

For a "fuzzy equals" comparison, this (Java code, should be easy to adapt) is what I came up with for The Floating-Point Guide after a lot of work and taking into account lots of criticism:

public static boolean nearlyEqual(float a, float b, float epsilon) {
    final float absA = Math.abs(a);
    final float absB = Math.abs(b);
    final float diff = Math.abs(a - b);

    if (a == b) { // shortcut, handles infinities
        return true;
    } else if (a == 0 || b == 0 || diff < Float.MIN_NORMAL) {
        // a or b is zero or both are extremely close to it
        // relative error is less meaningful here
        return diff < (epsilon * Float.MIN_NORMAL);
    } else { // use relative error
        // clamp the denominator so that absA + absB overflowing to
        // infinity cannot turn the ratio into 0
        return diff / Math.min(absA + absB, Float.MAX_VALUE) < epsilon;
    }
}

It comes with a test suite. You should immediately dismiss any solution that doesn't, because it is virtually guaranteed to fail in some edge cases like having one value 0, two very small values opposite of zero, or infinities.
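As a rough illustration of what such edge cases look like, here is an adaptation of the function above to Python doubles, followed by a handful of checks. This is a sketch, not the guide's test suite: sys.float_info.min stands in for Float.MIN_NORMAL, the default epsilon of 1e-9 is an arbitrary assumption, and the clamp on the denominator is an addition to guard against overflow.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double
MAX_VALUE = sys.float_info.max   # largest finite double


def nearly_equal(a: float, b: float, epsilon: float = 1e-9) -> bool:
    """Relative-error comparison with special handling near zero."""
    abs_a, abs_b, diff = abs(a), abs(b), abs(a - b)
    if a == b:  # shortcut, handles infinities
        return True
    if a == 0 or b == 0 or diff < MIN_NORMAL:
        # a or b is zero, or both are extremely close to it;
        # relative error is less meaningful here
        return diff < epsilon * MIN_NORMAL
    # Relative error; the clamp keeps an overflowing sum from making
    # the ratio collapse to 0.
    return diff / min(abs_a + abs_b, MAX_VALUE) < epsilon


# The kinds of cases a test suite should cover:
print(nearly_equal(0.1 + 0.2, 0.3))        # True:  ordinary rounding noise
print(nearly_equal(1.0, 1.001))            # False: genuinely different
print(nearly_equal(0.0, 1e-300))           # False: one value is zero
print(nearly_equal(1e-320, -1e-320))       # True:  tiny, opposite signs
print(nearly_equal(math.inf, math.inf))    # True
print(nearly_equal(math.inf, -math.inf))   # False
print(nearly_equal(math.nan, math.nan))    # False
print(nearly_equal(MAX_VALUE, MAX_VALUE / 2))  # False despite the overflowing sum
```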

An alternative (see link above for more details) is to convert the floats' bit patterns to integer and accept everything within a fixed integer distance.
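The bit-pattern idea might look roughly like this for doubles in Python. The helper names and the default of 4 units in the last place are assumptions for the sake of the example; the explicit handling of infinities and NaN is also a choice made here, because the largest finite double and infinity are adjacent bit patterns.

```python
import math
import struct


def ordered_bits(x: float) -> int:
    """Map a double to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Doubles are stored sign-magnitude: for negative values, negate the
    # magnitude so the integers increase monotonically with the float.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable doubles between a and b (0 if identical)."""
    return abs(ordered_bits(a) - ordered_bits(b))


def nearly_equal_ulps(a: float, b: float, max_ulps: int = 4) -> bool:
    if a == b:  # also covers equal infinities and +0.0 vs -0.0
        return True
    if not (math.isfinite(a) and math.isfinite(b)):
        return False  # NaN, or an infinity against anything else
    return ulp_distance(a, b) <= max_ulps


print(ulp_distance(0.1 + 0.2, 0.3))        # 1
print(ulp_distance(5e-324, -5e-324))       # 2
print(nearly_equal_ulps(0.1 + 0.2, 0.3))   # True
print(nearly_equal_ulps(1.0, 1.0 + 1e-9))  # False
```

Note that a fixed integer distance treats the two smallest subnormals on either side of zero as close, which may or may not be what your application wants.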

In any case, there probably isn't any solution that is perfect for all applications. Ideally, you'd develop/adapt your own with a test suite covering your actual use cases.

Are there any floating-point comparison anomalies?

Assuming IEEE-754 floating-point:

  • a >= b is always equivalent to b <= a.*
  • a >= b is equivalent to !(a < b), unless one or both of a or b is NaN.
  • a == b is always equivalent to b == a.*
  • a == b is equivalent to !(a != b), unless one or both of a or b is NaN.

More generally: trichotomy does not hold for floating-point numbers. Instead, a related property holds [IEEE-754 (1985) §5.7]:

Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself.

Note that this is not really an "anomaly" so much as a consequence of extending the arithmetic to be closed in a way that attempts to maintain consistency with real arithmetic when possible.

[*] true in abstract IEEE-754 arithmetic. In real usage, some compilers might cause this to be violated in rare cases as a result of doing computations with extended precision (MSVC, I'm looking at you). Now that most floating-point computation on the Intel architecture is done on SSE instead of x87, this is less of a concern (and it was always a bug from the standpoint of IEEE-754, anyway).
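The NaN cases are easy to see for yourself. A small Python sketch follows; the relation helper is made up here just to name the four outcomes from the quoted passage.

```python
import math

nan = math.nan


def relation(a: float, b: float) -> str:
    """Classify a pair into one of the four mutually exclusive relations."""
    if a < b:
        return "less than"
    if a == b:
        return "equal"
    if a > b:
        return "greater than"
    return "unordered"  # only reachable when at least one operand is NaN


print(relation(1.0, 2.0))  # less than
print(relation(2.0, 2.0))  # equal
print(relation(3.0, 2.0))  # greater than
print(relation(nan, 2.0))  # unordered
print(relation(nan, nan))  # unordered

# With NaN, >= is no longer the negation of <, and == is not the
# negation of a "!=" that you might expect to be false for identical operands.
print(nan >= 1.0)       # False
print(not (nan < 1.0))  # True
print(nan == nan)       # False
print(nan != nan)       # True
```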

How dangerous is it to compare floating point values?

First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages. But if you're going to use floating point you need to be aware of how it works. Erring on the side of assuming floating point works like real numbers will get you code that quickly breaks. Erring on the side of assuming floating point results have large random fuzz associated with them (like most of the answers here suggest) will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases.

Before going further, if you want to program with floating point, you should read this:

What Every Computer Scientist Should Know About Floating-Point Arithmetic

Yes, read all of it. If that's too much of a burden, you should use integers/fixed point for your calculations until you have time to read it. :-)

Now, with that said, the biggest issues with exact floating point comparisons come down to:

  1. The fact that lots of values you may write in the source, or read in with scanf or strtod, do not exist as floating point values and get silently converted to the nearest approximation. This is what demon9733's answer was talking about.

  2. The fact that many results get rounded due to not having enough precision to represent the actual result. An easy example where you can see this is adding x = 0x1fffffe and y = 1 as floats. Here, x has 24 bits of precision in the mantissa (ok) and y has just 1 bit, but when you add them, their bits are not in overlapping places, and the result would need 25 bits of precision. Instead, it gets rounded (to 0x2000000 in the default rounding mode).

  3. The fact that many results get rounded due to needing infinitely many places for the correct value. This includes both rational results like 1/3 (which you're familiar with from decimal where it takes infinitely many places) but also 1/10 (which also takes infinitely many places in binary, since 5 is not a power of 2), as well as irrational results like the square root of anything that's not a perfect square.

  4. Double rounding. On some systems (particularly x86), floating point expressions are evaluated in higher precision than their nominal types. This means that when one of the above types of rounding happens, you'll get two rounding steps, first a rounding of the result to the higher-precision type, then a rounding to the final type. As an example, consider what happens in decimal if you round 1.49 to an integer (1), versus what happens if you first round it to one decimal place (1.5) then round that result to an integer (2). This is actually one of the nastiest areas to deal with in floating point, since the behaviour of the compiler (especially for buggy, non-conforming compilers like GCC) is unpredictable.

  5. Transcendental functions (trig, exp, log, etc.) are not specified to have correctly rounded results; the result is just specified to be correct within one unit in the last place of precision (usually referred to as 1ulp).
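Several of the points above can be reproduced in a few lines. This is a sketch in Python, where floats are doubles; the to_float32 helper is an assumption of this example and emulates single precision by round-tripping through struct. Since both operands in the second case are exact floats and their double sum is exact, the single rounding step matches what float addition would do. The double-rounding case uses the decimal analogy from the text rather than real extended-precision hardware.

```python
import math
import struct
from fractions import Fraction


def to_float32(x: float) -> float:
    """Round a Python double to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 1. Decimal literals are silently converted to the nearest representable value.
literal_is_exact = Fraction(0.1) == Fraction(1, 10)
print(literal_is_exact)  # False

# 2. 0x1fffffe + 1 needs 25 significant bits; a float has only 24.
float_sum = to_float32(float(0x1FFFFFE) + 1.0)
print(float_sum == float(0x2000000))  # True: rounded up, not 0x1ffffff

# 3. Results that need infinitely many binary places are rounded.
third_is_exact = Fraction(1 / 3) == Fraction(1, 3)
sqrt_round_trip = math.sqrt(2.0) ** 2
print(third_is_exact)   # False
print(sqrt_round_trip)  # 2.0000000000000004

# 4. Double rounding, using the decimal analogy from the text.
round_once = round(1.49)             # straight to an integer
round_twice = round(round(1.49, 1))  # first to one decimal place (1.5)
print(round_once, round_twice)       # 1 2
```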

When you're writing floating point code, you need to keep in mind what you're doing with the numbers that could cause the results to be inexact, and make comparisons accordingly. Often it will make sense to compare with an "epsilon", but that epsilon should be based on the magnitude of the numbers you are comparing, not an absolute constant. (In cases where an absolute constant epsilon would work, that's strongly indicative that fixed point, not floating point, is the right tool for the job!)

Edit: In particular, a magnitude-relative epsilon check should look something like:

if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y))

Where FLT_EPSILON is the constant from float.h (replace it with DBL_EPSILON for doubles or LDBL_EPSILON for long doubles) and K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be).

Finally, note that if you use this, some special care may be needed near zero, since FLT_EPSILON does not make sense for denormals. A quick fix would be to make it:

if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y) || fabs(x-y) < FLT_MIN)

and likewise substitute DBL_MIN if using doubles.
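For readers who want to experiment, the same check can be sketched for doubles in Python, with sys.float_info.epsilon and sys.float_info.min playing the roles of DBL_EPSILON and DBL_MIN. The function name and the default K of 4 are assumptions for this example; K has to come from your own error analysis.

```python
import sys

DBL_EPSILON = sys.float_info.epsilon  # gap between 1.0 and the next double
DBL_MIN = sys.float_info.min          # smallest positive normal double


def close_relative(x: float, y: float, k: float = 4.0) -> bool:
    """Magnitude-relative epsilon check with a fallback near zero."""
    diff = abs(x - y)
    return diff < k * DBL_EPSILON * abs(x + y) or diff < DBL_MIN


print(close_relative(0.1 + 0.2, 0.3))    # True:  one rounding step apart
print(close_relative(1e16, 1e16 + 2.0))  # True:  adjacent doubles
print(close_relative(1.0, 1.0001))       # False: far more than K ulps
print(close_relative(0.0, 5e-324))       # True:  caught by the near-zero test
print(close_relative(1.0, -1.0))         # False: x + y cancels to zero
```

The scale factor grows with the operands, so the same K works at 0.3 and at 1e16, which a fixed absolute tolerance cannot do.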


