What Are the Rules Governing C++ Single and Double Precision Mixed Calculations

What are the rules governing C++ single and double precision mixed calculations?

All operations are done on objects of the same type (assuming normal arithmetic operations).

If you write a program that uses different types, then the compiler will automatically promote ONE operand so that both are the same type.

In this situation, floats will be upgraded to doubles (assume `result`, `a`, and `d` are `double`, while `b` and `c` are `float`):

result = a * (b + c) * d;

float tmp1 = b + c; // Plus operation done on floats.
// So the result is a float

double tmp2 = a * (double)tmp1; // Multiplication done on double (as `a` is double)
// so tmp1 will be up converted to a double.

double tmp3 = tmp2 * d; // Multiplication done on doubles.
// So result is a double

result = tmp3; // No conversion as tmp3 is same type as result.

Order of commutative mathematical operations

Based on the standard

[intro.abstract] - Note 7 (non-normative):

Operators can be regrouped according to the usual mathematical rules
only where the operators really are associative or commutative.

The mathematical rule (MDAS) applies left to right, subject to the associativity and precedence of the operators. So `c * d * e * 2 / 3 * f` is evaluated as follows:

(((((c * d) * e) * 2) / 3) * f)

C-language data type arithmetic rules

This can get kind of ugly. The compiler looks at the types of the operands for a single operation, and promotes both to the "larger" type (e.g., if one is int and the other double, it'll convert the int to double, then do the operation).

In your case, that could have some rather unexpected results. Right now you have: 2*pi*j*X*Y/n. The operators group from left to right, so this is equivalent to ((((2*pi)*j)*X)*Y)/n. In this case, that'll probably work out reasonably well: one of the operands in the "first" operation is a float, so all the other operands will be converted to float as you want.

If, however, you rearrange the operands (even in a way that seems equivalent in normal math) the result could be completely different. Just for example, if you rearranged it to 2*Y/n*pi*j*X, the 2*Y/n part would be done using integer arithmetic because 2, Y, and n are all integers. This means the division would be done on integers, giving an integer result, and only after that integer result was obtained would that integer be converted to a float for multiplication by pi.

Bottom line: unless you're dealing with something like a large array, where converting to smaller types is likely to save quite a bit of memory, you're generally much better off keeping all the operands of the same type if possible. I'd also note that in this case, your attempt at "managing memory intelligently" probably won't do any good anyway: on many machines, a long int and a float are both 32 bits, so they use the same amount of memory in any case. Also note that exp takes a double as its operand, so even if you do float math for the rest, it'll be promoted to a double anyway. Finally, conversions from int to float (and back) can be fairly slow.

If you're really only dealing with a half dozen variables or so, you're almost certainly best off leaving them as double and being done with it. Converting to a combination of float and long will save about 14 bytes of data storage, but then add (around) 14 bytes of extra instructions to handle all the conversions between int, float, and double at the right times, so you'll end up with slower code that uses just as much memory anyway.

Order of operations to maximize precision

Really, if you don't use double, then precision presumably isn't your main concern in the first place.

Otherwise, you get the best error bounds if the first result is slightly lower than the next higher power of two. For example, calculating (pi * e) / sqrt (2), you get the best error bounds by calculating (e / sqrt (2)) * pi, because e / sqrt (2) ≈ 1.922 is close below 2. Results close to the next higher power of two have a lower relative error.

For addition and subtraction of a large number of items, it's best to first subtract items of equal magnitude and opposite sign (x - y is calculated exactly if y/2 ≤ x ≤ 2y, a result known as Sterbenz's lemma), and otherwise to combine the numbers that give the smallest possible intermediate results.

Why are double preferred over float?

In my opinion the answers so far don't really get the right point across, so here's my crack at it.

The short answer is that C++ developers use doubles over floats:

  • To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" is the thought process)
  • Habit
  • Culture
  • To match library function signatures
  • To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)

It's true a double may be as fast as a float for a single computation, because most FPUs have a wider internal representation than either the 32-bit float or the 64-bit double.

However, that's only a small piece of the picture. Nowadays, instruction-level optimizations don't mean anything if you're bottlenecked on cache/memory bandwidth.

Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:

  • They fit in half the memory, which is like having all your caches be twice as large. (big win!!!)
  • If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating point values have different instructions for 32-bit and 64-bit floating point representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over double because each instruction operates on twice as much data.

In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.

The order of multiplications

operator * has left to right associativity:

int m = ((a * b) * c) * d;

While in math it doesn't matter (multiplication is associative), in both C and C++ we may or may not have overflow depending on the order:

0 * INT_MAX * INT_MAX // 0
INT_MAX * INT_MAX * 0 // signed overflow: undefined behavior

And things get even more complex if we consider floating point types or operator overloading.

What is the standardized way to calculate floats with integers?

When numeric values of various types are combined in an expression, they are subject to the usual arithmetic conversions, a set of rules that dictate which operand is converted and to what type.

These conversions are spelled out in section 6.3.1.8 of the C standard:

Many operators that expect operands of arithmetic type cause conversions and yield result types in a similar way. The purpose is to determine a common real type for the operands and result. For the specified operands, each operand is converted, without change of type domain, to a type whose corresponding real type is the common real type. Unless explicitly stated otherwise, the common real type is also the corresponding real type of the result, whose type domain is the type domain of the operands if they are the same, and complex otherwise. This pattern is called the usual arithmetic conversions:

  • First, if the corresponding real type of either operand is long double, the other operand is converted, without change of type domain, to a type whose corresponding real type is long double.
  • Otherwise, if the corresponding real type of either operand is double, the other operand is converted, without change of type domain, to a type whose corresponding real type is double.
  • Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.
  • Otherwise, the integer promotions are performed on both operands. Then the following rules are applied to the promoted operands:

    • If both operands have the same type, then no further conversion is needed.
    • Otherwise, if both operands have signed integer types or both have unsigned integer types, the operand with the type of lesser integer conversion rank is converted to the type of the operand with greater rank.
    • Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.
    • Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.
    • Otherwise, both operands are converted to the unsigned integer type corresponding to the type of the operand with signed integer type.

Note in particular the second bullet (the double case), which is what applies in your case.

The floating point constant 0.5 has type double, so the value of the other operand is converted to type double, and the result of the multiplication operator * has type double. This result is then assigned back to a variable of type uint8_t, so the double value is converted to that type for the assignment.

So in this case Result will have the value 100.

C/C++ Math Order of Operation

In your example the compiler is free to evaluate "1" "2" and "3" in any order it likes, and then apply the divisions left to right.

It's the same for the i++ + i++ example. It can evaluate the i++'s in any order (in fact, two unsequenced modifications of i make this undefined behavior), and that's where the problem lies.

It's not that the function call's precedence is undefined; it's the order of evaluation of its arguments that is.

Type conversion in divisions

The first thing to note is that 12 is the true value of the expression:

20 / (10.0 / 6) = (20 * 6) / 10 = 120 / 10 = 12

The second thing is that floating point has finite precision. A double has 52 bits of mantissa, which is close to 16 decimal digits.

So the decimal value of 10.0/6 is close to 1.6666666666666667. But internally it is a binary number; written out exactly in decimal, it is 1.6666666666666667406815349750104360282421112060547.

The expression above evaluates so close to 12 that 12 is what gets returned, even as a double.


