What are the rules governing C++ single and double precision mixed calculations?
All operations are done on objects of the same type (assuming normal arithmetic operations).
If you write an expression that mixes types, the compiler will automatically convert ONE operand so that both have the same type.
In this situation floats will be converted to doubles:
result = a * (b + c) * d
float tmp1 = b + c; // Plus operation done on floats.
// So the result is a float
double tmp2 = a * (double)tmp1; // Multiplication done on double (as `a` is double)
// so tmp1 will be up converted to a double.
double tmp3 = tmp2 * d; // Multiplication done on doubles.
// So result is a double
result = tmp3; // No conversion as tmp3 is same type as result.
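The conversions in the breakdown above can be checked at compile time. The following is a minimal sketch, assuming `a` and `d` are `double` while `b` and `c` are `float` (the declarations are not shown above, so these types are an assumption). The name `mixed_product` is hypothetical.

```cpp
#include <type_traits>

// Assumed types: a and d are double, b and c are float.
inline double mixed_product(double a, float b, float c, double d) {
    // float + float has type float.
    static_assert(std::is_same<decltype(b + c), float>::value,
                  "float + float yields float");
    // double * float converts the float operand, so the result is double.
    static_assert(std::is_same<decltype(a * (b + c)), double>::value,
                  "double * float yields double");
    return a * (b + c) * d;
}
```

If either `static_assert` were wrong about the result type, the function would fail to compile, so a successful build is itself the check.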
Order of commutative mathematical operations
Based on the standard
[intro.abstract] - Note 7 (non-normative):
Operators can be regrouped according to the usual mathematical rules
only where the operators really are associative or commutative.
Multiplication and division have the same precedence and associate from left to right. So it is evaluated as follows:
(((((c * d) * e) * 2) / 3) * f)
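The reason the standard's note restricts regrouping is that floating-point operators are not truly associative. A small sketch, assuming IEEE 754 `double` arithmetic (the function names are hypothetical), compares the language's left-to-right grouping with a regrouped form:

```cpp
// The grouping the language uses for a + b + c.
inline double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

// A regrouped form that is equal in real-number math only.
inline double sum_right(double a, double b, double c) {
    return a + (b + c);
}
```

With the inputs 0.1, 0.2 and 0.3, the two functions are expected to return values that differ in the last bit, which is why a compiler may not silently swap one grouping for the other.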
C language data type arithmetic rules
This can get kind of ugly. The compiler looks at the types of the operands for a single operation, and promotes both to the "larger" type (e.g., if one is int and the other double, it'll convert the int to double, then do the operation).
In your case, that could have some rather unexpected results. Right now you have: 2*pi*j*X*Y/n. The operators group from left to right, so this is equivalent to ((((2*pi)*j)*X)*Y)/n. In this case, that'll probably work out reasonably well -- one of the operands in the "first" operation is a float, so all the other operands will be converted to float as you want. If, however, you rearrange the operands (even in a way that seems equivalent in normal math) the result could be completely different. Just for example, if you rearranged it to 2*Y/n*pi*j*X, the 2*Y/n part would be done using integer arithmetic because 2, Y, and n are all integers. This means the division would be done on integers, giving an integer result, and only after that integer result was obtained would that integer be converted to a float for multiplication by pi.
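The two orderings can be compared directly. This is a sketch under assumed types: `pi` is a `float` and `j`, `X`, `Y`, `n` are `long` (the answer only says the latter are integers). Both function names are hypothetical.

```cpp
// 2*pi comes first, so every later operand is converted to float.
inline float float_first(float pi, long j, long X, long Y, long n) {
    return 2 * pi * j * X * Y / n;
}

// 2*Y/n comes first and is evaluated entirely in integer arithmetic,
// so the division truncates before anything is converted to float.
inline float int_first(float pi, long j, long X, long Y, long n) {
    return 2 * Y / n * pi * j * X;
}
```

For example, with `pi = 1.0f`, `j = X = Y = 1` and `n = 4`, `float_first` yields 0.5 while `int_first` yields 0, because `2 * 1 / 4` truncates to 0 in integer arithmetic.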
Bottom line: unless you're dealing with something like a large array so converting to smaller types is likely to really save quite a bit of memory, you're generally much better off keeping all the operands of the same type if possible. I'd also note that in this case, your attempt at "managing memory intelligently" probably won't do any good anyway -- on a typical current machine, a long int and a float are both 32 bits, so they both use the same amount of memory in any case. Also note that exp takes a double as its operand, so even if you do float math for the rest, it'll be promoted to a double anyway. Also note that conversions from int to float (and back) can be fairly slow.
If you're really only dealing with a half dozen variables or so, you're almost certainly best off leaving them as double and being done with it. Converting to a combination of float and long will save about 14 bytes of data storage, but then add (around) 14 bytes of extra instructions to handle all the conversions between int, float, and double at the right times, so you'll end up with slower code that uses just as much memory anyway.
Order of operations to maximize precision
Really, if you don't use double then you are misguided, and you don't care about precision.
Otherwise, you get the best error bounds if the first result is slightly below the next higher power of two. For example, when calculating (pi * e) / sqrt (2), you get the best error bounds by calculating (e / sqrt (2)) * pi, because e / sqrt (2) ≈ 1.922 is just below 2. Results just below the next higher power of two have a lower relative error.
For addition and subtraction of a large number of items, it's best to first combine items of similar magnitude and opposite sign (x - y is calculated exactly if y/2 ≤ x ≤ 2y), and otherwise to combine the numbers that give the smallest possible results.
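The effect of summation order can be seen with three values. The sketch below assumes IEEE 754 `double` arithmetic with the default rounding mode; `sum_in_order` is a hypothetical helper that simply adds the values in the order given.

```cpp
#include <vector>

// Adds the values strictly in the order they appear.
inline double sum_in_order(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) {
        total += v;
    }
    return total;
}
```

Summing `{1e16, -1e16, 1.0}` cancels the two large terms exactly and then adds 1.0, giving 1.0. Summing `{1e16, 1.0, -1e16}` first rounds `1e16 + 1.0` back to `1e16` (adjacent doubles are 2 apart at that magnitude), so the 1.0 is lost and the total is 0.0.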
Why are double preferred over float?
In my opinion the answers so far don't really get the right point across, so here's my crack at it.
The short answer is C++ developers use doubles over floats:
- To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" is the thought process)
- Habit
- Culture
- To match library function signatures
- To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)
It's true that a double may be as fast as a float for a single computation, because most FPUs have a wider internal representation than either the 32-bit float or the 64-bit double.
However, that's only a small piece of the picture. Nowadays, operation-level optimizations don't mean anything if you're bottlenecked on cache/memory bandwidth.
Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:
- They fit in half the memory, which is like having all your caches be twice as large. (big win!!!)
- If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating point values have different instructions for 32-bit and 64-bit floating point representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over double because each instruction operates on twice as much data.
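Both points in the list above come down to simple arithmetic on element sizes. A small sketch, assuming a 128-bit SSE register and the usual 4-byte `float` and 8-byte `double` (the constant and function names are hypothetical):

```cpp
#include <cstddef>

// Assumption: a 128-bit (16-byte) SIMD register, as in SSE.
constexpr std::size_t kRegisterBytes = 16;

// Number of values of type T that fit in one register.
template <typename T>
constexpr std::size_t lanes_per_register() {
    return kRegisterBytes / sizeof(T);
}

// Storage needed for an array of `count` values of type T.
template <typename T>
constexpr std::size_t array_bytes(std::size_t count) {
    return count * sizeof(T);
}
```

On such a platform `lanes_per_register<float>()` is 4 and `lanes_per_register<double>()` is 2, and an array of floats takes half the bytes of an array of doubles with the same element count.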
In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.
The order of multiplications
operator * has left to right associativity:
int m = ((a * b) * c) * d;
While in math it doesn't matter (multiplication is associative), in both C and C++ we may or may not get overflow depending on the order.
0 * INT_MAX * INT_MAX // 0
INT_MAX * INT_MAX * 0 // overflow
And things get even more complex if we consider floating point types or operator overloading. See the comments of @delnan and @melpomene.
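Because signed overflow is undefined behavior, the two orderings cannot be compared safely by just running them. The sketch below instead evaluates `a * b * c` left to right in a wider type and reports whether any intermediate result leaves the range of `int`. It assumes `long long` is at least twice as wide as `int`; the function name is hypothetical.

```cpp
#include <climits>

// Returns true if evaluating (a * b) * c in int arithmetic would overflow.
// Assumption: long long is at least twice as wide as int.
inline bool left_to_right_overflows(int a, int b, int c) {
    const long long ab = static_cast<long long>(a) * b;
    if (ab > INT_MAX || ab < INT_MIN) {
        return true;  // the first multiplication already overflows
    }
    const long long abc = ab * c;
    return abc > INT_MAX || abc < INT_MIN;
}
```

For the two expressions above, `left_to_right_overflows(0, INT_MAX, INT_MAX)` is false, while `left_to_right_overflows(INT_MAX, INT_MAX, 0)` is true even though the mathematical product is 0 in both cases.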
What is the standardized way to calculate float with integers?
When numeric values of various types are combined in an expression, they are subject to the usual arithmetic conversions, which is a set of rules which dictate which operand should be converted and to what type.
These conversions are spelled out in section 6.3.1.8 of the C standard:
Many operators that expect operands of arithmetic type cause conversions and yield result types in a similar way. The purpose is to determine a common real type for the operands and result. For the specified operands, each operand is converted, without change of type domain, to a type whose corresponding real type is the common real type. Unless explicitly stated otherwise, the common real type is also the corresponding real type of the result, whose type domain is the type domain of the operands if they are the same, and complex otherwise. This pattern is called the usual arithmetic conversions:
- First, if the corresponding real type of either operand is long double, the other operand is converted, without change of type domain, to a type whose corresponding real type is long double.
- Otherwise, if the corresponding real type of either operand is double, the other operand is converted, without change of type domain, to a type whose corresponding real type is double.
- Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.
- Otherwise, the integer promotions are performed on both operands. Then the following rules are applied to the promoted operands:
  - If both operands have the same type, then no further conversion is needed.
  - Otherwise, if both operands have signed integer types or both have unsigned integer types, the operand with the type of lesser integer conversion rank is converted to the type of the operand with greater rank.
  - Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.
  - Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.
  - Otherwise, both operands are converted to the unsigned integer type corresponding to the type of the operand with signed integer type.
Note in particular the second rule (the one for double), which is what applies in your case.
The floating point constant 0.5 has type double, so the value of other operand is converted to type double, and the result of the multiplication operator * has type double. This result is then assigned back to a variable of type uint8_t, so the double value is converted to this type for assignment.
So in this case Result will have the value 100.
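The conversion chain described above can be verified at compile time. The following is a sketch, assuming the original value was 200 (the question's code is not shown here, so that input is an assumption); `halve` is a hypothetical name.

```cpp
#include <cstdint>
#include <type_traits>

// uint8_t * double: the integer operand is converted to double.
static_assert(std::is_same<decltype(std::uint8_t{} * 0.5), double>::value,
              "uint8_t * double yields double");

// With two uint8_t operands, the integer promotions apply first,
// so on typical platforms the arithmetic is done in int.
static_assert(std::is_same<decltype(std::uint8_t{} + std::uint8_t{}), int>::value,
              "uint8_t + uint8_t yields int");

// The multiplication is done in double; the result is converted back
// to uint8_t, discarding any fractional part.
inline std::uint8_t halve(std::uint8_t value) {
    return static_cast<std::uint8_t>(value * 0.5);
}
```

Under that assumption, `halve(200)` computes 100.0 as a double and converts it to the `uint8_t` value 100.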
C/C++ Math Order of Operation
In your example the compiler is free to evaluate "1", "2" and "3" in any order it likes, and then apply the divisions left to right.
It's the same for the i++ + i++ example. It can evaluate the i++'s in any order and that's where the problem lies.
It's not that the function's precedence isn't defined; it's that the order of evaluation of its arguments isn't.
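The distinction between grouping and evaluation order can be made visible by logging each operand as it is evaluated. This is a sketch with hypothetical names (`traced`, `ChainResult`, `divide_chain`); the values 24, 4 and 2 are made up for illustration.

```cpp
#include <vector>

// Records that an operand was evaluated, then returns it.
inline int traced(int value, std::vector<int>& order) {
    order.push_back(value);
    return value;
}

struct ChainResult {
    int value;               // result of the division chain
    std::vector<int> order;  // order in which the operands were evaluated
};

inline ChainResult divide_chain() {
    ChainResult r{0, {}};
    // The divisions always group as (24 / 4) / 2, but the three calls to
    // traced() may run in any order the compiler chooses.
    r.value = traced(24, r.order) / traced(4, r.order) / traced(2, r.order);
    return r;
}
```

The `value` member is always 3, because the grouping is fixed by the language. The contents of `order` always hold the three operands, but their sequence may vary between compilers.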
Type conversion in divisions
First, 12 is the true value of the expression:
$$20 / (10.0 / 6) = \frac{20}{\frac{10}{6}} = \frac{20 \cdot 6}{10} = \frac{120}{10} = 12$$
Second, floating point has finite precision. A double has 52 bits of mantissa, which is close to 16 decimal digits.
So the decimal value of 10.0 / 6 is close to 1.6666666666666667. Internally, though, it is a binary number, something close to
1.6666666666666667406815349750104360282421112060547 if we write the binary value out in decimal.
The expression above evaluates so close to 12 that 12 is what is returned, even as a double.
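This can be checked with a one-line function. The sketch assumes IEEE 754 `double` arithmetic with round-to-nearest; `nested_quotient` is a hypothetical name.

```cpp
// Computes a / (b / c) in double precision.
inline double nested_quotient(double a, double b, double c) {
    return a / (b / c);
}
```

Although `10.0 / 6` is stored slightly above 5/3, the error in `nested_quotient(20, 10.0, 6)` is smaller than half the spacing between adjacent doubles near 12, so the result rounds to exactly 12.0.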