How to Detect Double Precision Floating Point Overflow and Underflow

How to detect double precision floating point overflow and underflow?

A lot depends on context. To be perfectly portable, you have to
check before the operation, e.g. (for addition):

if ( (a < 0.0) == (b < 0.0)
     && std::abs( b ) > std::numeric_limits<double>::max() - std::abs( a ) ) {
    // Addition would overflow...
}

Similar logic can be used for the four basic operators.
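As one illustration of that "similar logic", here is a hedged sketch of the same pre-check idea for multiplication, assuming IEEE doubles and finite inputs (the function name is mine, not from the original):

```cpp
#include <cmath>
#include <limits>

// Sketch: pre-check whether a * b would exceed the double range.
// Dividing the maximum by |b| gives the largest |a| that is still safe.
bool mul_would_overflow(double a, double b)
{
    if (b == 0.0) return false;  // a * 0 is always representable
    double limit = std::numeric_limits<double>::max() / std::abs(b);
    return std::abs(a) > limit;
}
```

The division itself cannot overflow here because the numerator is the largest finite double.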

If all of the machines you target support IEEE (which is
probably the case if you don't have to consider mainframes), you
can just do the operations, then use isfinite or isinf on
the results.
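On IEEE machines the check-after approach can be as simple as this sketch (names are illustrative):

```cpp
#include <cmath>

// Do the addition first, then inspect the result: a finite pair of
// operands producing an infinite sum means the addition overflowed.
bool add_overflowed(double a, double b)
{
    double r = a + b;
    return std::isinf(r) && std::isfinite(a) && std::isfinite(b);
}
```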

For underflow, the first question is whether a gradual underflow
counts as underflow or not. If not, then simply checking if the
results are zero and a != -b would do the trick. If you want
to detect gradual underflow (which is probably only present if
you have IEEE), then you can use isnormal—this will
return false if the results correspond to gradual underflow.
(Unlike overflow, you test for underflow after the operation.)
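A sketch of that after-the-fact underflow test for multiplication, assuming IEEE doubles (the helper name is mine):

```cpp
#include <cmath>

// After the operation: an exact zero from two nonzero factors means the
// product underflowed all the way; a nonzero result that is not normal
// (and not infinite) is a subnormal, i.e. gradual underflow.
bool mul_underflowed(double a, double b)
{
    double r = a * b;
    if (r == 0.0) return a != 0.0 && b != 0.0;
    return !std::isnormal(r) && !std::isinf(r);
}
```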

How can I detect lost of precision due to rounding in both floating point addition and multiplication?

While in mathematics addition and multiplication of real numbers are associative operations, those operations are not associative when performed on floating point types, like float, due to their limited precision and range.

So the order matters.

Considering the examples, the number 10000000003.14 can't be exactly represented as a 32-bit float, so the result of (3.14f + 1e10f) would be equal to 1e10f, which is the closest representable number. Of course, 3.14f + (1e10f - 1e10f) would yield 3.14f instead.

Note that I used the f postfix, because in C the expression (3.14+1e10)-1e10 involves double literals, so that the result would indeed be 3.14 (or, more likely, something very close to it, like 3.1400001).

Something similar happens in the second example, where 1e20f * 1e20f is already beyond the range of float (but not of double), so the successive multiplication is meaningless, while (1e20f * 1e-20f), which is performed first in the other expression, has a well defined result (1) and the successive multiplication yields the correct answer.
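Both examples can be reproduced directly, assuming IEEE single-precision floats (the function names are mine, for illustration only):

```cpp
#include <cmath>

// Float addition is not associative: the small term is absorbed when
// added to the large one first, but survives when the large terms cancel.
float sum_big_first()    { return (3.14f + 1e10f) - 1e10f; } // 0.0f
float sum_cancel_first() { return 3.14f + (1e10f - 1e10f); } // 3.14f

// Neither is multiplication: squaring 1e20f overflows to infinity,
// while rescaling first keeps every intermediate in range.
bool square_overflows()    { return std::isinf(1e20f * 1e20f); }
bool rescaled_is_finite()  { return std::isfinite((1e20f * 1e-20f) * 1e20f); }
```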

In practice, there are some precautions you can adopt:

  • Use a wider type. double is the best fit for most applications, unless there are other requirements.
  • Reorder the operations, if possible. For example, if you have to add many terms and you know that some of them are smaller than others, start adding those, then the others. Avoid subtraction of numbers of the same order of magnitude. In general, there may be a more accurate way to evaluate an algebraic expression than the naive one (e.g. Horner's method for polynomial evaluation).
  • If you have some sort of knowledge of the problem domain, you may already know which part of the computation may have problematic values and check if those are greater (or lower) than some limits, before performing the calculation.
  • Check the results as soon as possible. There's no point in continuing a calculation when you already have an infinite value or a NaN, or in keeping iterating when your target value isn't changing at all.
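The reordering advice in the list above can be sketched as follows: sorting by magnitude before summing keeps small terms from being absorbed by an already-large partial sum (function name is mine; this is one simple strategy, not the only one):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sum smallest magnitudes first, so small contributions accumulate
// before they meet a large partial sum that would absorb them.
double sum_small_first(std::vector<double> v)
{
    std::sort(v.begin(), v.end(),
              [](double x, double y) { return std::abs(x) < std::abs(y); });
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}
```

Summing {1e16, 1.0, 1.0} left to right loses both ones; smallest-first adds them together (2.0) before meeting 1e16, and 1e16 + 2.0 is exactly representable.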

How to detect and prevent integer overflow when multiplying an integer by float in Java?

Below is a C approach that may shed light on the Java case.

Perform the multiplication using double, not float, math before the assignment, to gain the extra precision/range of double. Overflow is then not expected.

A compare like c > Integer.MAX_VALUE suffers from Integer.MAX_VALUE first being converted to double, which may lose precision.*1 Consider what happens if the converted value is Integer.MAX_VALUE + 1.0: then, if c is Integer.MAX_VALUE + 1.0, the code will attempt to return (int) (Integer.MAX_VALUE + 1.0) - not good. Better to use well-formed limits (negative ones too). In C, and likely in Java, floating point conversion to int truncates the fraction, so special care is needed near the edges.

#define INT_MAX_PLUS1_AS_DOUBLE ((INT_MAX/2 + 1)*2.0)

int mulInt(int a, float b) {
    // double c = a * b;
    double c = (double) a * b;

    // return c > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) c;
    if (c < INT_MAX_PLUS1_AS_DOUBLE && c - INT_MIN > -1.0) {
        return (int) c;
    }
    if (c > 0) return INT_MAX;
    if (c < 0) return INT_MIN;
    return 0; // `b` was a NaN
}

c - INT_MIN > -1 is like c > INT_MIN - 1, but since INT_MIN is the negative of a power of 2, INT_MIN - 1 might not convert precisely to double, whereas c - INT_MIN is expected to be exact near the edge cases.


*1 When int is 32-bit (or less) and double is 64-bit (with a 53-bit significand), this is not an issue. But it is important with wider integer types.
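The footnote's point can be demonstrated directly, assuming a typical platform (32-bit int, 64-bit long long, IEEE round-to-nearest; the function names are mine):

```cpp
#include <climits>

// INT_MAX (2^31 - 1) fits in double's 53-bit significand exactly,
// but LLONG_MAX (2^63 - 1) does not: it rounds up to 2^63,
// which is exactly why naive comparisons against MAX misbehave.
bool int_max_exact()    { return (double)INT_MAX == 2147483647.0; }
bool llong_max_rounds() { return (double)LLONG_MAX == 9223372036854775808.0; }
```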

What is overflow and underflow in floating point

Of course the following is implementation dependent, but if the numbers behave anything like what IEEE-754 specifies, floating point numbers do not overflow and underflow to a wildly incorrect answer the way integers do; e.g. you should not end up with two positive numbers being multiplied resulting in a negative number.

Instead, overflow means that the result is 'too large to represent'. Depending on the rounding mode, this usually gets represented either by the maximum finite float (RTZ, round toward zero) or by Inf (RNE, round to nearest even):

0 110 1111 * 0 110 1111 = 0 111 0000

(Note that the overflowing of integers as you know it could have been avoided in hardware by applying a similar clamping operation, it's just not the convention to do that.)

When dealing with floating point numbers the term underflow means that the number is 'too small to represent', which usually just results in 0.0:

0 000 0001 * 0 000 0001 = 0 000 0000

Note that I have also heard the term underflow used for overflow to a very large negative number, but this is not the best term for it. That case is one where the result is negative and too large in magnitude to represent, i.e. 'negative overflow':

0 110 1111 * 1 110 1111 = 1 111 0000
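The small bit patterns above play out the same way with ordinary IEEE doubles under the default round-to-nearest mode (these helpers are illustrative):

```cpp
#include <cmath>

// Positive overflow saturates to +Inf, negative overflow to -Inf
// (never a sign flip), and underflow flushes toward 0.0.
double pos_overflow() { return 1e300 * 1e300;  }  // +Inf
double neg_overflow() { return 1e300 * -1e300; }  // -Inf
double tiny_product() { return 1e-300 * 1e-300; } // 0.0
```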

Overflow and Underflow in Java Float and Double Data Types

These "weird" results are not really specific to Java. It's just that floats, as defined by the relevant IEEE standard, are much more complicated than most people suspect. But on to your specific results: Float.MIN_VALUE is the smallest positive float, so it's very close to 0, and hence Float.MIN_VALUE - 1 will be very close to -1. Since the float precision around -1 is coarser than that difference, it comes out as exactly -1. As for Float.MAX_VALUE, the spacing between floats around that value is much greater than 1, so adding one doesn't change the result.
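The same two absorption effects can be checked in C++ with float, assuming C11/C++17's FLT_TRUE_MIN (the analogue of Java's Float.MIN_VALUE; the helper names are mine):

```cpp
#include <cfloat>

// FLT_TRUE_MIN (~1.4e-45) is far below the spacing of floats near -1,
// and 1.0f is far below the spacing of floats near FLT_MAX,
// so in both cases the addition changes nothing.
bool min_minus_one() { return FLT_TRUE_MIN - 1.0f == -1.0f; }
bool max_plus_one()  { return FLT_MAX + 1.0f == FLT_MAX; }
```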

Underflow with floating point arithmetic checking

Is there something similar to checked keyword that works for doubles?

Nope.

Is there some way I can implement checks in an assisted way?

A bad solution: depending on what hardware you are using, the floating point arithmetic chip may set a flag that indicates whether an operation has underflowed. I do not recommend calling into unmanaged code to read that flag off the floating point chip. (I wrote the code to do that in the original Microsoft version of Javascript and it is a pain to get that logic right.)

A better solution: you might consider writing a symbolic logic library. Consider for example what happens if you make your own number type:

struct ExpNumber
{
    public double Exponent { get; }
    public ExpNumber(double e) => Exponent = e;
    public static ExpNumber operator *(ExpNumber x1, ExpNumber x2) =>
        new ExpNumber(x1.Exponent + x2.Exponent);
}

And so on. You can define your own addition, subtraction, powers, logarithms, and so on, using the identities you know for powers. Then when it is time to realize the thing back to a double, you can implement that using whatever stable algorithm that avoids underflow you prefer.
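A C++ sketch of the same idea, including one of those "identities for powers" for addition (log(e^a + e^b) = a + log1p(e^(b-a)) with a the larger exponent); the struct layout is my own illustration, not the original author's design:

```cpp
#include <cmath>

// Store log(x) instead of x: products that would underflow as plain
// doubles become modest sums of exponents.
struct ExpNumber {
    double exponent;  // represents exp(exponent)

    ExpNumber operator*(ExpNumber o) const { return {exponent + o.exponent}; }

    // log(e^a + e^b) = max + log1p(exp(min - max)), which never
    // evaluates exp() of a large positive argument.
    ExpNumber operator+(ExpNumber o) const {
        double hi = std::fmax(exponent, o.exponent);
        double lo = std::fmin(exponent, o.exponent);
        return {hi + std::log1p(std::exp(lo - hi))};
    }
};
```

Here ExpNumber{-800} * ExpNumber{-800} stays at exponent -1600, where the plain double product exp(-800) * exp(-800) would simply underflow to zero.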

The problem is that doubles intentionally trade off a decrease in representational power and accuracy for a massive increase in speed. If you need to accurately represent numbers smaller than 10e-200, doubles are not for you; they were designed to solve problems in physics computation, and there are no physical quantities that small.

Ignore floating-point overflow and underflow errors in C++

For Microsoft Visual C++ you can use _controlfp_s to get and to set the floating-point control word. For your code snippet a possible solution would look like:

#include <float.h>
#include <iostream>

int main()
{
    unsigned int fp_control;
    // Read the current control word
    _controlfp_s(&fp_control, 0, 0);
    // Mask (ignore) the overflow and underflow exceptions
    unsigned int new_fp_control = fp_control | _EM_OVERFLOW | _EM_UNDERFLOW;
    // Write the updated control word back
    _controlfp_s(&fp_control, new_fp_control, _MCW_EM);

    float a = 68440675640679078541805800652800.0f;
    float b = a * a;
    std::cout << b << std::endl;
}
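Outside MSVC, standard C++ offers <cfenv> to test (rather than unmask) the floating-point status flags. A hedged sketch, noting that the standard requires #pragma STDC FENV_ACCESS ON for guaranteed behavior and compiler support for that pragma varies:

```cpp
#include <cfenv>

// Clear the relevant flags, perform the operation, then test the flags.
// The volatile keeps the compiler from folding the multiply away.
bool multiplication_overflows(double a, double b)
{
    std::feclearexcept(FE_OVERFLOW | FE_UNDERFLOW);
    volatile double r = a * b;
    (void)r;
    return std::fetestexcept(FE_OVERFLOW) != 0;
}
```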

