Does a C/C++ Compiler Optimize Constant Divisions by Power-Of-Two Value into Shifts

Does a c/c++ compiler optimize constant divisions by power-of-two value into shifts?
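
The function in question isn't reproduced in the answer, but judging from the mangled name _Z3divm (that is, div(unsigned long)) and the constants in the listing, it was presumably something along these lines:

unsigned long div(unsigned long x) {
    const unsigned long divisor = 64;   // constant power-of-two divisor
    return x / divisor;                 // becomes a shift, even at -O0
}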

Even with g++ -O0 (yes, -O0!), this happens. Your function compiles down to:

_Z3divm:
.LFB952:
pushq %rbp
.LCFI0:
movq %rsp, %rbp
.LCFI1:
movq %rdi, -24(%rbp)
movq $64, -8(%rbp)
movq -24(%rbp), %rax
shrq $6, %rax
leave
ret

Note the shrq $6, which is a right shift by 6 places (i.e. division by 2^6 = 64).

With -O1, the unnecessary junk is removed:

_Z3divm:
.LFB1023:
movq %rdi, %rax
shrq $6, %rax
ret

Results on g++ 4.3.3, x64.

Is multiplication and division using shift operators in C actually faster?

Short answer: Not likely.

Long answer:
Your compiler has an optimizer that knows how to multiply as quickly as your target processor architecture is capable of. Your best bet is to tell the compiler your intent clearly (i.e. write i*2 rather than i << 1) and let it decide what the fastest assembly/machine code sequence is. It's even possible that the processor itself has implemented the multiply instruction as a sequence of shifts and adds in microcode.

Bottom line--don't spend a lot of time worrying about this. If you mean to shift, shift. If you mean to multiply, multiply. Do what is semantically clearest--your coworkers will thank you later. Or, more likely, curse you later if you do otherwise.
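
As a quick illustration (a sketch; assuming gcc or clang with optimization enabled on x86-64), both spellings below typically compile to the same instruction:

unsigned twice_mul(unsigned i)   { return i * 2;  }   // written as a multiply
unsigned twice_shift(unsigned i) { return i << 1; }   // written as a shift
// gcc and clang at -O1 and above usually emit identical code for both,
// e.g. a single "lea eax, [rdi+rdi]" (or "add eax, eax").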

Which is better option to use for dividing an integer number by 2?

Use the operation that best describes what you are trying to do.

  • If you are treating the number as a sequence of bits, use bitshift.
  • If you are treating it as a numerical value, use division.

Note that they are not exactly equivalent. They can give different results for negative integers. For example:

-5 / 2  = -2
-5 >> 1 = -3

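To see this for yourself (a minimal snippet; note that right-shifting a negative signed value was implementation-defined before C++20, though on the usual two's-complement targets it is an arithmetic shift):

#include <cstdio>

int main() {
    int n = -5;
    // Division truncates toward zero; the arithmetic shift rounds toward negative infinity.
    std::printf("%d %d\n", n / 2, n >> 1);   // prints: -2 -3
    return 0;
}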

Optimization of float power of 2 division

If you are multiplying or dividing by a constant, a compiler of modest quality should optimize it. On many platforms, a hardware multiply instruction may be optimal.

For multiplying (or dividing) by a power of two, std::ldexp(x, p) multiplies x by 2^p, where p is an int (and divides if p is negated). I would not expect much benefit over simple multiplication on most platforms, as manual (software) exponent manipulation must include checks for overflow and underflow, so the resulting sequence of instructions is not likely to improve over a hardware multiply in most situations.
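For a concrete power-of-two case (a small sketch; the function names are mine):

#include <cmath>

double div_by_64_plain(double x) {
    // Division by a power of two is exact, so a decent compiler typically
    // rewrites this as x * 0.015625 (a single multiply).
    return x / 64.0;
}

double div_by_64_ldexp(double x) {
    // Multiplies x by 2^-6, with exponent overflow/underflow handled;
    // this often ends up as a libm call rather than a single multiply.
    return std::ldexp(x, -6);
}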

Multiply by 0.5 rather than divide by 2

It's true that some (if not most) processors can multiply faster than they can divide, but it's like the myth of ++i being faster than i++ in a for loop: yes, it once was true, but nowadays compilers are smart enough to optimize all of those things for you, so you should not care about this anymore.

And as for bit-shifting, it once was faster to shift << 2 than to multiply by 4, but those days are over, as most modern processors can multiply about as cheaply as they can shift.

A great example of this was the calculation of the pixel address in VGA 320x240 mode. They all did this:

address = x + (y << 8) + (y << 6);

to multiply y by 320 (since 256 + 64 = 320). On modern processors, this can be slower than just doing:

address = x + y * 320;

So, just write what you think and the compiler will do the rest :)

Efficiently dividing unsigned value by a power of two, rounding up

This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).

I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably good on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly just worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.


It looks like Chris Dodd's general idea of return ((p-1) >> lg(q)) + 1 with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's (p >> lg(q)) + (bool)(p & (q-1)) also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.
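In C, the two candidates just described look roughly like this (T and lg() are not defined in this excerpt; a 64-bit unsigned type and a log2-of-a-power-of-two helper are assumed here, and the function names are mine):

#include <cstdint>

typedef uint64_t T;
static inline T lg(T q) { return __builtin_ctzll(q); }   // log2 of the power-of-two divisor q (TZCNT with -mbmi)

T ceil_div_dodd(T d, T q) {      // ((d-1) >> lg(q)) + 1, with d==0 special-cased
    T e = lg(q);
    return d ? ((d - 1) >> e) + 1 : 0;
}

T ceil_div_chux(T p, T q) {      // (p >> lg(q)) + (bool)(p & (q-1))
    return (p >> lg(q)) + (bool)(p & (q - 1));
}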

If the same e = lg(q) is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also effectively vectorize with AVX2.

Branches are cheap and very efficient if they predict very well, so branching on d==0 will be best if it's almost never taken. If d==0 is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.

Since the best branchless asm implementations don't add much latency vs. a branchy implementation, branchless is probably the way to go unless the branch would go the same way maybe 99% of the time. This might be likely for branching on p==0, but probably less likely for branching on p & (q-1).


It's hard to guide gcc5.4 into emitting anything optimal. (This is my work-in-progress on Godbolt.)

I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).


Branch on carry from calculating d-1 to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.

ceil_div_pjc_branch:
xor eax,eax ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
sub rdi, 1
jc .d_was_zero ; fuses with the sub on SnB-family
tzcnt rax, rsi ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
shrx rax, rdi, rax
inc rax
.d_was_zero:
ret
  • Fused-domain uops: 5 (not counting ret), and one of them is an xor-zero (no execution unit)
  • HSW/SKL latency with successful branch prediction:

    • (d==0): No data dependency on d or q, breaks the dep chain. (control dependency on d to detect mispredicts and retire the branch).
    • (d!=0): q->result: tzcnt+shrx+inc = 5c
    • (d!=0): d->result: sub+shrx+inc = 3c
  • Throughput: probably just bottlenecked on uop throughput

I've tried but failed to get gcc to branch on CF from the subtract, but it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars: foo -= bar; if(foo>bar) carry_detected = 1;)
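
In C, that branchy pattern looks something like this (a sketch, not necessarily the exact source behind the experiments above; the borrow from computing d-1 is detected by comparing the wrapped result against the original value):

T ceil_div_branch(T d, T q) {
    T e = lg(q);
    T dm1 = d - 1;
    if (dm1 > d)               // unsigned wraparound, i.e. d was 0: the hoped-for JC path
        return 0;
    return (dm1 >> e) + 1;
}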


Branchless with ADC / SBB to handle the d==0 case. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to a sbb rax, -1 to make CF undo the -= -1. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.

ceil_div_pjc_asm_adc:
tzcnt rsi, rsi
sub rdi, 1
adc rdi, 0 ; d? d-1 : d. Sets CF=CF
shrx rax, rdi, rsi
sbb rax, -1 ; result++ if d was non-zero
ret
  • Fused-domain uops: 5 (not counting ret) on SKL. 7 on HSW
  • SKL latency:

    • q->result: tzcnt+shrx+sbb = 5c
    • d->result: sub+adc+shrx(dep on q begins here)+sbb = 4c
  • Throughput: TZCNT runs on p1. SBB, ADC, and SHRX only run on p06. So I think we bottleneck on 3 uops for p06 per iteration, making this run at best one iteration per 1.5c.

If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.

Getting this from C seems unlikely with gcc5.4. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and stores the carry into an integer reg between every operation. gcc7, clang3.9, and icc17 all make terrible code from this.

#include <x86intrin.h>
// compiles to completely horrible code, putting the flags into integer regs between ops.
// (T and lg() as sketched earlier: a 64-bit unsigned type and log2 of the power-of-two divisor.)
T ceil_div_adc(T d, T q) {
    T e = lg(q);
    unsigned long long dm1; // unsigned __int64
    unsigned char CF = _addcarry_u64(0, d, -1, &dm1);
    CF = _addcarry_u64(CF, 0, dm1, &dm1);
    T shifted = dm1 >> e;
    _subborrow_u64(CF, shifted, -1, &dm1);
    return dm1;
}
# gcc5.4 -O3 -march=haswell
mov rax, -1
tzcnt rsi, rsi
add rdi, rax
setc cl
xor edx, edx
add cl, -1
adc rdi, rdx
setc dl
shrx rdi, rdi, rsi
add dl, -1
sbb rax, rdi
ret

CMOV to fix the whole result: worse latency from q->result, since it's used sooner in the d->result dep chain.

ceil_div_pjc_asm_cmov:
tzcnt rsi, rsi
sub rdi, 1
shrx rax, rdi, rsi
lea rax, [rax+1] ; inc preserving flags
cmovc rax, zeroed_register
ret
  • Fused-domain uops: 5 (not counting ret) on SKL. 6 on HSW
  • SKL latency:

    • q->result: tzcnt+shrx+lea+cmov = 6c (worse than ADC/SBB by 1c)
    • d->result: sub+shrx(dep on q begins here)+lea+cmov = 4c
  • Throughput: TZCNT runs on p1. LEA is p15. CMOV and SHRX are p06. SUB is p0156. In theory only bottlenecked on fused-domain uop throughput, so one iteration per 1.25c. With lots of independent operations, resource conflicts from SUB or LEA stealing p1 or p06 shouldn't be a throughput problem because at 1 iter per 1.25c, no port is saturated with uops that can only run on that port.

CMOV to get an operand for SUB: I was hoping I could find a way to create an operand for a later instruction that would produce a zero when needed, without an input dependency on q, e, or the SHRX result. This would help if d is ready before q, or at the same time.

This doesn't achieve that goal, and needs an extra 7-byte mov rdx,-1 in the loop.

ceil_div_pjc_asm_cmov:
tzcnt rsi, rsi
mov rdx, -1
sub rdi, 1
shrx rax, rdi, rsi
cmovc rdx, rax ; rdx = d ? -1 : shift result, so the SUB below gives res+1 or 0
sub rax, rdx ; res += d ? 1 : -res
ret

Lower-latency version for pre-BDW CPUs with expensive CMOV, using SETCC to create a mask for AND.

ceil_div_pjc_asm_setcc:
xor edx, edx ; needed every iteration

tzcnt rsi, rsi
sub rdi, 1

setc dl ; d!=0 ? 0 : 1
dec rdx ; d!=0 ? -1 : 0 // AND-mask

shrx rax, rdi, rsi
inc rax
and rax, rdx ; zero the bogus result if d was initially 0
ret

Still 4c latency from d->result (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iter per 2 clocks.

Of course, for pre-HSW, you also need to replace SHRX.

We can get gcc5.4 to emit something nearly as good (it still uses a separate TEST instead of setting the mask based on sub rdi, 1, but otherwise the same instructions as above). See it on Godbolt.

T ceil_div_andmask(T p, T q) {
    T mask = -(T)(p!=0); // TEST+SETCC+NEG
    T e = lg(q);
    T nonzero_result = ((p-1) >> e) + 1;
    return nonzero_result & mask;
}

When the compiler knows that p is non-zero, it takes advantage and makes nice code:

// http://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#define assume(x) do{if(!(x)) __builtin_unreachable();}while(0)
#else
#define assume(x) (void)(x) // still evaluate it once, for side effects in case anyone is insane enough to put any inside an assume()
#endif

T ceil_div_andmask_nonzerop(T p, T q) {
    assume(p!=0);
    return ceil_div_andmask(p, q);
}
# gcc5.4 -O3 -march=haswell
xor eax, eax # gcc7 does tzcnt in-place instead of wasting an insn on this
sub rdi, 1
tzcnt rax, rsi
shrx rax, rdi, rax
add rax, 1
ret

Chux / user23_variant: only 3c latency from p->result, and a constant q can CSE a lot.

T divide_A_chux(T p, T q) {
    bool round_up = p & (q-1); // compiles differently from user23_variant with clang: AND instead of
    return (p >> lg(q)) + round_up;
}

xor eax, eax # in-place tzcnt would save this
xor edx, edx # target for setcc
tzcnt rax, rsi
sub rsi, 1
test rsi, rdi
shrx rdi, rdi, rax
setne dl
lea rax, [rdx+rdi]
ret

Doing the SETCC before TZCNT would allow an in-place TZCNT, saving the xor eax,eax. I haven't looked at how this inlines in a loop.

  • Fused-domain uops: 8 (not counting ret) on HSW/SKL
  • HSW/SKL latency:

    • q->result: (tzcnt+shrx(p) | sub+test(p)+setne) + lea(or add) = 5c
    • d->result: test(dep on q begins here)+setne+lea = 3c. (the shrx->lea chain is shorter, and thus not the critical path)
  • Throughput: Probably just bottlenecked on the frontend, at one iter per 2c. Saving the xor eax,eax should speed this up to one per 1.75c (but of course any loop overhead will be part of the bottleneck, because frontend bottlenecks are like that).

Repeated integer division by a runtime constant value

There is a library for this—libdivide:

libdivide is an open source library for optimizing integer division

libdivide allows you to replace expensive integer divides with
comparatively cheap multiplication and bitshifts. Compilers usually do
this, but only when the divisor is known at compile time. libdivide
allows you to take advantage of it at runtime. The result is that
integer division can become faster - a lot faster. Furthermore,
libdivide allows you to divide an SSE2 vector by a runtime constant,
which is especially nice because SSE2 has no integer division
instructions!

libdivide is free and open source with a permissive license. The name
"libdivide" is a bit of a joke, as there is no library per se: the
code is packaged entirely as a single header file, with both a C and a
C++ API.

You can read about the algorithm behind it on the libdivide author's blog.

Basically, the algorithm behind it is the same one that compilers use to optimize division by a constant, except that it allows these strength-reduction optimizations to be done at run-time.
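
Typical usage looks roughly like this (a sketch of libdivide's C++ API as documented in its README; worth checking against the current header):

#include <cstdint>
#include <cstddef>
#include "libdivide.h"

uint64_t sum_of_quotients(const uint64_t *a, std::size_t n, uint64_t divisor) {
    libdivide::divider<uint64_t> fast_d(divisor);   // precompute the multiply/shift constants once, at run time
    uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] / fast_d;                       // overloaded operator/ uses multiply + shift, no DIV instruction
    return sum;
}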

Note: you can create an even faster version of libdivide. The idea is that for every divisor you can always create a (mul, add, shift) triplet such that (num*mul+add)>>shift gives the result (the multiply is a widening multiply). Interestingly, this method can even beat the code the compiler generates for a compile-time-constant division in several microbenchmarks!


Here's my implementation (the u32/u64/s32 typedefs and the indexOfMostSignificantBit helper at the top are assumed here so the general algorithm can be seen as compilable code):

#include <cstdint>

// Assumed support code (not part of the original snippet): fixed-width typedefs
// and an MSB-index helper returning floor(log2(x)) for x != 0.
typedef uint32_t u32;
typedef uint64_t u64;
typedef int32_t  s32;

static s32 indexOfMostSignificantBit(u32 x) {
    return 31 - __builtin_clz(x);
}

struct Divider_u32 {
    u32 mul;
    u32 add;
    s32 shift;

    void set(u32 divider);
};

void Divider_u32::set(u32 divider) {
    s32 l = indexOfMostSignificantBit(divider);
    if (divider & (divider-1)) {
        // divider is not a power of two: compute a (mul, add, shift) triplet
        u64 m = static_cast<u64>(1) << (l+32);
        mul = static_cast<u32>(m/divider);

        u32 rem = static_cast<u32>(m) - mul*divider;
        u32 e = divider - rem;

        if (e < (static_cast<u32>(1) << l)) {
            mul++;
            add = 0;
        } else {
            add = mul;
        }
        shift = l;
    } else {
        // power-of-two divisors (including 1) get trivial constants
        if (divider == 1) {
            mul = 0xffffffff;
            add = 0xffffffff;
            shift = 0;
        } else {
            mul = 0x80000000;
            add = 0;
            shift = l-1;
        }
    }
}

u32 operator/(u32 v, const Divider_u32 &div) {
    // wide multiply, add, then shift: (v*mul + add) >> (32 + shift)
    u32 t = static_cast<u32>((static_cast<u64>(v)*div.mul + div.add) >> 32) >> div.shift;

    return t;
}
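
As a quick sanity check of the triplet approach (using the struct above):

#include <cstdio>

int main() {
    Divider_u32 d;
    d.set(7);                          // runtime divisor 7 -> (mul, add, shift) triplet
    std::printf("%u\n", 100u / d);     // prints 14, matching 100 / 7
    return 0;
}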

