Fast Multiplication/Division by 2 for Floats and Doubles (C/C++)


You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarly (especially when you get into subnormals). In the common cases, however, this should work:

const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;

void unsafe_shl(double* d, int shift) {
    unsigned long long* i = (unsigned long long*)d;
    /* Normal number (exponent field neither all zeros nor all ones):
       adjust the exponent field directly. */
    if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
        *i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
    /* Subnormal, infinity or NaN: fall back to an ordinary multiply
       (a no-op for infinity and NaN).  Zero is left untouched. */
    } else if (*i) {
        *d *= (1 << shift);
    }
}
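Note that the cast through unsigned long long* technically violates C/C++ strict-aliasing rules. A sketch of the same exponent trick done through memcpy, which compilers optimize to the same bit move (the function name here is mine, not from the original):

#include <string.h>

void unsafe_shl_memcpy(double* d, int shift) {
    unsigned long long i;
    memcpy(&i, d, sizeof i);                     /* read the bit pattern */
    if ((i & DOUBLE_EXP_MASK) && ((i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
        i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;  /* bump the exponent */
        memcpy(d, &i, sizeof i);
    } else if (i) {
        *d *= (1 << shift);                      /* subnormal: real multiply */
    }
}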

EDIT: After doing some timing, this method is oddly slower than a plain double multiply on my compiler and machine, even stripped to the minimum executed code:

#include <stdio.h>
#include <time.h>

int main(void) {
    double ds[0x1000];
    for (int i = 0; i != 0x1000; i++)
        ds[i] = 1.2;

    clock_t t = clock();

    for (int j = 0; j != 1000000; j++)
        for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
            ds[i] *= 1 << 4;
#else
            ((unsigned int*)&ds[i])[1] += 4 << 20;
#endif

    clock_t e = clock();

    printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
    return 0;
}
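For reference, the two variants can be selected at compile time, along these lines (assuming MSVC, which matches the VC10/VC11 results below; the file name is illustrative):

cl /O2 /DDOUBLE_SHIFT=1 shift_test.c
cl /O2 /DDOUBLE_SHIFT=0 shift_test.c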

With DOUBLE_SHIFT defined, this completes in 1.6 seconds, with an inner loop of:

movupd  xmm0,xmmword ptr [ecx]
lea     ecx,[ecx+10h]
mulpd   xmm0,xmm1
movupd  xmmword ptr [ecx-10h],xmm0

Versus 2.4 seconds otherwise, with an inner loop of:

add     dword ptr [ecx],400000h
lea     ecx,[ecx+8]

Truly unexpected!

EDIT 2: Mystery solved! One of the changes in VC11 is that it now always vectorizes floating-point loops, effectively forcing /arch:SSE2. VC10, even with /arch:SSE2, is still worse, at 3.0 seconds with an inner loop of:

movsd   xmm1,mmword ptr [esp+eax*8+38h]
mulsd   xmm1,xmm0
movsd   mmword ptr [esp+eax*8+38h],xmm1
inc     eax

VC10 without /arch:SSE2 (even with /arch:SSE) takes 5.3 seconds... with 1/100th of the iterations! Inner loop:

fld     qword ptr [esp+eax*8+38h]
inc     eax
fmul    st,st(1)
fstp    qword ptr [esp+eax*8+30h]

I knew the x87 FP stack was awful, but 500 times worse is kind of ridiculous. You probably won't see these kinds of speedups converting, e.g., matrix ops to SSE or int hacks, since this is the worst case: loading onto the FP stack, doing one op, and storing from it. But it's a good example of why x87 is not the way to go for anything performance-related.

How come multiplication is as fast as addition for C++ double type values?

On some processors, floating-point multiplication is as fast as addition because:

  • The hardware designers have put a lot of logic gates into the floating-point units.
  • Instructions may be divided into multiple stages which are executed in a pipeline. For example, a multiplication may perform part of its work in a unit M0, then pass the results to a unit M1 that does another part, then M2, then M3. While M1 is working on its part, M0 can start work on a different multiplication. With this arrangement, a multiplication may actually take four processor cycles to complete, but, because there are four units working on four stages, the processor can finish one multiplication every cycle. In contrast, a simpler instruction like XOR has just one stage.
  • Although some instructions could be executed quickly and some require more time, the entire processor is synchronized by a clock, and every pipeline stage in each execution unit has to finish its work in one clock cycle. This imposes some rigidity on the processor design: some simple operations will finish their work before a clock cycle is over, while complicated operations need the entire cycle. The designers make decisions about how long to make a clock cycle. If a clock cycle is too short (relative to the speed at which the logic gates work), then many instructions have to take multiple cycles and may require additional overhead to manage them. If a clock cycle is too long, then time is wasted needlessly waiting for instructions that could have completed sooner. For current processor technology, it is common that the floating-point multiplier stages work well with the processor cycle time.

Nonetheless, you may see differences between the times of addition and multiplication. Current processor designs are quite complicated, and processors typically have multiple units for doing various floating-point operations. A processor could have more units for doing addition than it does for doing multiplication, so it would be able to do more additions per unit of time than multiplications.

However, observe the expression you are using:

sum += q[i] + q[i - 1];

This causes sum to be serially dependent on its prior value. The processor can add q[i] to q[i-1] without waiting for prior additions, but then, to add to sum, it must wait for the prior add to sum to complete. This means that, if a processor has two units for addition, it could be working on both q[i] + q[i-1] and the prior addition to sum at the same time. But, if it had more addition units, it could not go any faster. It could use the extra units to do more of those q[i] + q[i - 1] additions for different values of i, but every addition to sum has to wait for the previous one. Therefore, with two or more addition units, this computation is dependent on the latency of addition, which is how long it takes to do a single addition. (This is in contrast to the throughput of addition, which is how many additions the processor can do per unit of time, if there is no serial dependency.)

If you used a different computation, such as sum += q[i]; or sum0 += q[i]; sum1 += q[i+1]; sum2 += q[i+2]; sum3 += q[i+3];, then you could see different times for addition and multiplication that depended on how many addition units and how many multiplication units the processor had.
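As a minimal sketch of that last idea (plain C; assumes n is a multiple of 4 for brevity), splitting the sum across several independent accumulators removes the serial dependency and lets the processor approach the throughput limit of its addition units:

/* Latency-bound: each addition must wait for the previous one. */
double sum_serial(const double* q, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += q[i];
    return sum;
}

/* Throughput-bound: four independent dependency chains, combined
   at the end.  Assumes n is a multiple of 4. */
double sum_unrolled(const double* q, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += q[i];
        s1 += q[i + 1];
        s2 += q[i + 2];
        s3 += q[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

Note that compilers generally won't make this transformation on their own, because floating-point addition is not associative, so the unrolled version can give a slightly different result (flags like GCC's -ffast-math permit it).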

Division and multiplication by a power of 2

It is trivial from the bit-operations perspective: multiplying by 2 is equivalent to a left shift by 1 bit, and dividing by 2 is a right shift by 1 bit. Similarly, it is just as trivial to multiply or divide by any power of 2.

#include <assert.h>

int a = 123;                 // in binary: a = 0b01111011

assert((a * 2) == (a << 1)); // 2 = 2^1, (a << 1) = 0b11110110
assert((a / 2) == (a >> 1)); // 2 = 2^1, (a >> 1) = 0b00111101

assert((a * 8) == (a << 3)); // 8 = 2^3, (a << 3) = 0b1111011000
assert((a / 8) == (a >> 3)); // 8 = 2^3, (a >> 3) = 0b0000001111

Also note that a*2 = a+a, and addition is sometimes even cheaper than shifting, depending on the hardware.

For signed division, beware that in some languages (such as C), integer division truncates towards zero, while arithmetic right shift (shifting in copies of the sign bit for 2's complement) always rounds towards -Infinity. e.g. (-5)/2 == -2, but (-5) >> 1 == -3. Implementing signed division by 2 can still be done with a shift + extra operations to add the sign bit to get the truncation behaviour. (Look at C compiler output.)
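A sketch of the fixup compilers typically emit, assuming 32-bit int and an arithmetic right shift for signed values (which mainstream compilers provide):

/* Signed division by 2 with C's truncate-toward-zero semantics:
   adding the sign bit (1 for negative x, 0 otherwise) before the
   arithmetic shift turns floor-rounding into truncation.
   E.g. x = -5: -5 + 1 = -4, then -4 >> 1 == -2, matching (-5)/2. */
int div2(int x) {
    return (x + (int)((unsigned)x >> 31)) >> 1;
}

/* General power of 2: add (2^k - 1), but only when x is negative. */
int div_pow2(int x, int k) {
    return (x + ((x >> 31) & ((1 << k) - 1))) >> k;
}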

Floating point division vs floating point multiplication

Yes, many CPUs can sustain a multiplication every 1 or 2 clock cycles, but division always takes longer (although FP division is sometimes faster than integer division).

If you look at this answer you will see that division can exceed 24 cycles.

Why does division take so much longer than multiplication? If you remember back to grade school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate but is somewhat faster.
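SSE exposes exactly this kind of reciprocal shortcut. A sketch (the function name is mine): _mm_rcp_ps gives an estimate accurate to roughly 12 bits, and one Newton-Raphson step recovers most of single precision, which is often still faster than a true divide:

#include <xmmintrin.h>

/* Approximate a / b as a * rcp(b).  One Newton-Raphson refinement,
   r = r * (2 - b*r), roughly doubles the number of correct bits. */
__m128 fast_div(__m128 a, __m128 b) {
    __m128 r   = _mm_rcp_ps(b);                            /* ~12-bit estimate */
    __m128 two = _mm_set1_ps(2.0f);
    r = _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(b, r)));  /* refine */
    return _mm_mul_ps(a, r);
}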

Simple math operations faster on double than on float datatype?

It has to do with how you generated the random numbers. Multiplying and dividing floating-point numbers does not always cost the same; the actual values of those numbers matter. In the case of floats, you're populating values over a fairly large range. If you create your floats such that they are between 0 and 1, just like the doubles, then it comes out more like you'd expect. Just change NextFloat to be this:

static float NextFloat(Random random)
{
    return (float) random.NextDouble();
}

I just ran a few tests, and with that change the floats were 33% faster at multiplication.

Of course, this is just the easiest way to make the comparison "fair". To get a better understanding of how floats really compare to doubles, you'd want to generate random floats and doubles each across the full range of the respective types, or better yet, both holding values representing the kind of data your program will be using.
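A likely reason the values matter: full-range random inputs can push intermediate results into the subnormal range, and on many CPUs operations on subnormals trap to a microcode assist that is an order of magnitude slower. A minimal C sketch of the effect (timings and the penalty are machine-dependent; some CPU/compiler flag combinations flush subnormals to zero and show no difference):

#include <stdio.h>
#include <time.h>

/* Times a chain of dependent multiplies.  The multiplier is close
   enough to 1 that a subnormal input rounds back to itself each pass,
   so that loop stays on the subnormal path for its whole duration. */
static double time_mul(float start) {
    volatile float v = start;          /* volatile defeats constant folding */
    clock_t t = clock();
    for (int i = 0; i < 100000000; i++)
        v = v * 1.0000001f;
    clock_t e = clock();
    return (double)(e - t) / CLOCKS_PER_SEC;
}

int main(void) {
    printf("normal:    %g s\n", time_mul(1.0f));    /* stays normal */
    printf("subnormal: %g s\n", time_mul(1e-40f));  /* 1e-40f is subnormal */
    return 0;
}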

A good way to do a fast divide in C++?

On just about every CPU, a floating-point divide is several times as expensive as a floating-point multiply, so multiplying by the inverse of your divisor is a good optimization. The downside is that there is a possibility that you will lose a very small amount of accuracy on certain processors: e.g., on modern x86 processors, 64-bit float operations are actually internally computed using 80 bits when using the default FPU mode, and storing the result in a variable will cause those extra precision bits to be rounded off according to your FPU rounding mode (which defaults to round-to-nearest). This only really matters if you are concatenating many float operations and have to worry about the error accumulation.
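A minimal sketch of the idiom (illustrative names), for the common case of dividing many values by the same divisor:

/* Divide every element by the same divisor.  Computing 1/divisor once
   and multiplying is far cheaper than n divides; the result may differ
   from true division in the last bit or so. */
void scale_all(double* data, int n, double divisor) {
    double inv = 1.0 / divisor;        /* one divide, hoisted out of the loop */
    for (int i = 0; i < n; i++)
        data[i] *= inv;                /* one multiply per element */
}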


