Floating Point Division vs. Floating Point Multiplication

Floating point division vs floating point multiplication

Yes, many CPUs can perform multiplication in 1 or 2 clock cycles, but division always takes longer (although FP division is sometimes faster than integer division).

If you look at this answer you will see that division can exceed 24 cycles.

Why does division take so much longer than multiplication? If you think back to grade school, you may recall that multiplication can essentially be performed as many simultaneous additions: the partial products can all be summed in parallel. Division requires iterative subtraction in which each step depends on the previous one, so it cannot be parallelized the same way and takes longer. In fact, some FP units speed up division by computing a reciprocal approximation and multiplying by that. It isn't quite as accurate, but is somewhat faster.
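As a sketch of the same reciprocal-and-multiply idea in software (assuming SSE is available; this is only an illustration of the trade-off, not how the divider hardware itself is built), the rcpss reciprocal estimate can replace a divide, refined with one Newton-Raphson step:

#include <xmmintrin.h>   // SSE intrinsics

// Sketch only: approximate a / b as a * rcp(b), where rcp(b) is the
// ~12-bit reciprocal estimate from rcpss, refined once with
// Newton-Raphson: r' = r * (2 - b * r).
float approx_div(float a, float b) {
    __m128 vb  = _mm_set_ss(b);
    __m128 r   = _mm_rcp_ss(vb);                             // reciprocal estimate
    __m128 two = _mm_set_ss(2.0f);
    r = _mm_mul_ss(r, _mm_sub_ss(two, _mm_mul_ss(vb, r)));   // one refinement step
    return a * _mm_cvtss_f32(r);                             // multiply instead of divide
}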

Difference in accuracy with floating point division vs multiplication

There's nothing to be gained from this. You are only changing the scale; you don't get any more significant figures in your calculation.

The Wikipedia article on variance explains at a high level some of the options for calculating variance in a robust fashion.
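One of the robust approaches described there is Welford's online algorithm; a minimal sketch (the struct and names here are just for illustration):

// Welford's online algorithm for a running mean and variance.
struct RunningVariance {
    long long n = 0;
    double mean = 0.0;
    double m2   = 0.0;                 // sum of squared deviations from the mean

    void push(double x) {
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2   += delta * (x - mean);
    }
    double sample_variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
};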

Should I use multiplication or division for recurring floats?

For decimals that recur when inverted (i.e. 1 / num is a recurring decimal), I use division instead of multiplication (for example, x / 2.2 instead of x * 0.45454545454).

It is also common knowledge that 22/10 is not representable exactly in binary floating-point, so all you are achieving, instead of multiplying by a slightly inaccurate value, is dividing by a slightly inaccurate value.

In fact, if the intent is to divide by 22/10 or some other real value that isn't necessarily exactly representable in binary floating-point, then about half the time the multiplication is more accurate than the division, because it happens by coincidence that the relative error for 1/X is less than the relative error for X.

Another remark is that your micro-benchmark runs into subnormal numbers, where the timings are not representative of the timings for the usual operations on normal floating-point numbers, and after a short while the value is zero, which again means that the timings are not representative of the reality of multiplying and dividing normal numbers. As Mark Ransom says, you should at least make the operands the same for both measurements: as currently written, all the multiplications take a zero operand and produce zero. Also, since 2.2 and 0.45454545454 both have type double, your benchmark is measuring double-precision multiplication and division; if you are willing to implement a single-precision division as a double-precision multiplication, this need not involve any loss of accuracy (but you would have to provide more digits for 1/2.2).

But don't let yourself be fooled into trying to fix the micro-benchmark. You don't need it, because there is no trade-off when X is no more exactly representable than 1/X. There is no reason not to use multiplication.

Note: you should explicitly multiply by 1 / X, because the two operations / X and * (1 / X) give very slightly different results, so the compiler is not allowed to do the replacement itself. On the other hand, you don't need to replace / 2 by * 0.5, because any compiler worth its salt will do that for you.
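A minimal sketch of what that looks like in practice (function names are just illustrative): hoist the one division out of the loop yourself, since the compiler cannot do it for you unless the divisor is an exact power of two:

// Dividing inside the loop: one division per element.
void scale_div(float* v, int n, float x) {
    for (int i = 0; i < n; ++i) v[i] = v[i] / x;
}

// Multiplying by a precomputed reciprocal: one division total.
// The results may differ in the last bit, which is exactly why the
// compiler will not make this transformation itself (except for
// powers of two, e.g. / 2 becomes * 0.5 automatically).
void scale_mul(float* v, int n, float x) {
    const float inv = 1.0f / x;
    for (int i = 0; i < n; ++i) v[i] = v[i] * inv;
}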

Why float division is faster than integer division in c++?

Floating-point division is faster than integer division because of the exponent part of the floating-point representation: handling the exponents during a division is just a plain subtraction.

int32_t division requires fast division of 31-bit numbers, whereas float division requires fast division of 24-bit mantissas (the leading one of the mantissa is implicit and not stored in the floating-point number) plus a much faster subtraction of the 8-bit exponents.
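The idea can be illustrated (purely as a sketch, not as the FPU's actual circuitry) by decomposing a division into its mantissa and exponent parts with frexp/ldexp:

#include <cmath>

// Sketch: a / b == (ma / mb) * 2^(ea - eb), so the exponent handling is
// just an integer subtraction; only the short mantissas are divided.
// Zeros, infinities and NaNs are ignored here for brevity.
float div_by_parts(float a, float b) {
    int ea, eb;
    float ma = std::frexp(a, &ea);        // a = ma * 2^ea, ma in [0.5, 1)
    float mb = std::frexp(b, &eb);        // b = mb * 2^eb
    return std::ldexp(ma / mb, ea - eb);  // divide mantissas, subtract exponents
}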

See an excellent detailed explanation of how division is performed in the CPU.

It may be worth mentioning that SSE and AVX instructions only provide floating-point division, but no integer division. SSE instructions/intrinsics can easily be used to quadruple the speed of your float calculations.

If you look into Agner Fog's instruction tables, for example, for Skylake, the latency of the 32-bit integer division is 26 CPU cycles, whereas the latency of the SSE scalar float division is 11 CPU cycles (and, surprisingly, it takes the same time to divide four packed floats).
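A minimal sketch of using the packed divps instruction through intrinsics (assuming, for brevity, that n is a multiple of 4; a real version would handle the remaining tail elements):

#include <xmmintrin.h>

// Divide four pairs of floats per iteration with a single divps.
void div_arrays(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_div_ps(va, vb));
    }
}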

Also note that in C and C++ there is no division on numbers shorter than int, so uint8_t and uint16_t are first promoted to int and the division then happens on ints. uint8_t division looks faster than int division because its operands have fewer bits set once converted to int, which lets the (data-dependent) division complete faster.
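A small sketch of the promotion rule just described (function name is illustrative):

#include <cstdint>

// Both operands of / are promoted to int before the division, so this
// is an int division even though a and b are uint8_t.
uint8_t quotient(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(a / b);   // computed as (int)a / (int)b
}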

What's the relative speed of floating point add vs. floating point multiply

It also depends on the instruction mix. Your processor has several computation units standing by at any time, and you get maximum throughput if all of them are filled all the time. So executing a loop of muls is just as fast as executing a loop of adds - but the same doesn't hold if the expression becomes more complex.

For example, take this loop:

for (int j = 0; j < NUMITER; j++) {
    for (int i = 1; i < NUMEL; i++) {
        bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i];
    }
}

For NUMITER = 10^7 and NUMEL = 10^2, with both arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit processor. If I replace the loop body with

bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;

it only takes 1.7 seconds... so since we "overdid" the additions, the muls were essentially free, and the reduction in additions helped. It gets more confusing:

bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;

-- same mul/add distribution, but now the constant is added in rather than multiplied in -- takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently, so dot-product-like sums of muls and scaled sums are about as good as it gets; adding constants isn't nearly as common, so that's slower...

bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/

again takes 1.7 seconds.

bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/

(same as initial loop, but without expensive constant addition: 2.1 seconds)

bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/

(mostly muls, but one addition:1.9 seconds)

So, basically, it's hard to say which is faster; if you wish to avoid bottlenecks, it's more important to have a sane mix of operations, to avoid NaN or INF, and to avoid adding constants. Whatever you do, make sure you test, and test with various compiler settings, since small changes can often make the difference.

Some more cases:

bla *= someval;                        // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i];                        // arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i];   // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i];   // 32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i];   // 32-bit mode, floats, 2.2 seconds
bla += someval * arr1[i] * arr2[i];    // 0.9 in x64, 1.6 in x86
bla += someval * arr1[i];              // 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i];              // 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86

Integer division, or float multiplication?

You've said all the values are dynamic, which makes a difference. For the specific values 5 * j / 4, the integer operations are going to be blindingly fast, because pretty much the worst case is that the compiler optimises them to two shifts and one addition, plus some messing around to cope with the possibility that j is negative. If the CPU can do better (single-cycle integer multiplication or whatever) then the compiler typically knows about it. The limits of compilers' abilities to optimize this kind of thing basically come when you're compiling for a wide family of CPUs (generating lowest-common-denominator ARM code, for example), where the compiler doesn't really know much about the hardware and therefore can't always make good choices.
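As a sketch of the kind of strength reduction meant here (shown for non-negative values only; negative j needs the extra fix-up mentioned above, and the function name is just illustrative):

// 5 * j / 4 rewritten with shifts and an add, valid for unsigned j.
unsigned scale_5_over_4(unsigned j) {
    unsigned times5 = (j << 2) + j;   // j * 5
    return times5 >> 2;               // / 4
}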

I suppose that if a and b are fixed for a while (but not known at compile time), then computing k = double(a) / b once and using int(k * x) for many different values of x might be faster than computing a * x / b each time. I wouldn't count on it.
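A sketch of that idea (the struct and names are hypothetical), to be benchmarked rather than assumed faster:

// Precompute k = double(a) / b once, then reuse it for many x.
// Note: rounding can make this differ from a * x / b in edge cases.
struct Scaler {
    double k;
    Scaler(int a, int b) : k(static_cast<double>(a) / b) {}
    int apply(int x) const { return static_cast<int>(k * x); }   // vs. a * x / b
};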

If all the values vary each time, then it seems unlikely that the floating-point division to compute the 1.25, followed by floating-point multiplication, is going to be any faster than the integer multiplication followed by integer division. But you never know, test it.

It's not really possible to give simple relative timings for this on modern processors; it depends a lot on the surrounding code. The main costs in your code often aren't the "actual" ops: it's "invisible" stuff like instruction pipelines stalling on dependencies, spilling registers to the stack, or function call overhead. Whether or not the function that does this work can be inlined might easily make more difference than how the function actually does it. As far as definitive statements of performance are concerned, you can basically test real code or shut up. But the chances are that if your values start as integers, doing integer ops on them is going to be faster than converting to double and doing a similar number of double ops.

Floating number calculation precision: division vs multiplication

No, it's just an attempt to make it faster, as multiplications are often faster than divisions.

Should I use multiplication or division?

One divide and two multiplies vs. two divides may not always be faster, but it's almost certainly the reason it was written that way.

If the vector had 3 or more dimensions I'd be more convinced it's worth it, but always profile these micro-optimisations.

The rounding errors on the extra operations are causing the different results.
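A sketch of the two variants being compared (assuming the 2-D vector case implied above; function names are illustrative). The reciprocal is rounded once before the multiplications, which is where the small differences come from:

#include <cmath>

// Two divides: each component divided by the length.
void normalize_divs(float& x, float& y) {
    float len = std::sqrt(x * x + y * y);
    x /= len;
    y /= len;
}

// One divide, two multiplies: the reciprocal is rounded once here,
// so the results can differ from the version above in the last bits.
void normalize_mul(float& x, float& y) {
    float inv = 1.0f / std::sqrt(x * x + y * y);
    x *= inv;
    y *= inv;
}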


