Trial-Division Code Runs 2X Faster as 32-Bit on Windows Than 64-Bit on Linux

You don't say whether the Windows and Linux operating systems are 32-bit or 64-bit.

On a 64-bit Linux machine, if you change the size_t to an int, you'll find that the Linux execution times drop to values similar to the ones you have for Windows.

size_t is a 32-bit integer on Win32 and a 64-bit integer on Win64.
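
To make the point concrete, here is a minimal trial-division sketch (my own illustration, not the asker's code; all names are mine): the integer type used in the n % i inner loop decides whether the compiler emits a 32-bit or a 64-bit div instruction, and that is where the 2x difference comes from.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* With index_t = size_t on 64-bit Linux, the n % i below compiles to a
     * 64-bit div; with index_t = unsigned int (or on 32-bit Windows, where
     * size_t is 32 bits wide) it compiles to a much cheaper 32-bit div. */
    typedef size_t index_t;            /* try changing this to unsigned int */

    static bool is_prime(index_t n)
    {
        if (n < 2) return false;
        for (index_t i = 2; i * i <= n; i++)
            if (n % i == 0)            /* the hot division */
                return false;
        return true;
    }

    int main(void)
    {
        index_t count = 0;
        for (index_t n = 2; n < 200000; n++)
            count += is_prime(n);
        printf("%zu primes\n", (size_t)count);
        return 0;
    }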

EDIT: I've just seen your Windows disassembly.

Your Windows OS is the 32-bit variety (or at least you've compiled for 32-bit).

Intel(x86_64) 64 bit vs 32 bit integer arithmetic performance difference

In general, 64-bit arithmetic is as fast as 32-bit, ignoring things like larger operands taking up more memory and bandwidth, and the fact that on x86-64 addressing the full 64-bit registers requires longer instructions.

However, you have managed to hit one of the few exceptions to this rule, namely the div instruction used for integer division.

Division performance for a x32 ELF on a x64 OS

If you look at instruction timings for x86 processors, it turns out that on recent Intel processors a 64-bit divide is 3-4x as expensive as a 32-bit divide -- and if you look at the internals of alldiv (link in the comments above), you'll see that for your values, which always fit in 32 bits, it uses a single 32-bit divide...
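
The shape of that fast path is easy to sketch in C (this is an illustration of the idea, not Microsoft's actual alldiv source; the function name is mine): the helper checks whether both operands fit in 32 bits and, if they do, issues one cheap 32-bit division instead of the general long-division path.

    #include <stdint.h>

    /* Sketch of the fast path a 64-bit division helper can take in 32-bit
     * code: when both operands fit in 32 bits, a single 32-bit div is
     * enough; otherwise fall back to the general, much slower path.
     * Assumes d != 0. */
    static uint64_t udiv64_sketch(uint64_t n, uint64_t d)
    {
        if ((n >> 32) == 0 && (d >> 32) == 0)
            return (uint32_t)n / (uint32_t)d;   /* one 32-bit div */
        return n / d;   /* stand-in for the general multi-step path */
    }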

Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?

You're asking about optimizing uint64_t / uint64_t C division to a 64b / 32b => 32b x86 asm division when the divisor is known to be 32-bit. The compiler must of course avoid the possibility of a #DE exception on a perfectly valid (in C) 64-bit division, otherwise it wouldn't be following the as-if rule. So it can only do this if it's provable that the quotient will fit in 32 bits.

Yes, that's a win, or at least break-even. On some CPUs it's even worth checking for the possibility at runtime because 64-bit division is so much slower. But unfortunately current x86 compilers don't have an optimizer pass that looks for this optimization, even when you do manage to give them enough info to prove it's safe. e.g. if (edx >= ebx) __builtin_unreachable(); didn't help last time I tried.
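
For reference, the kind of hint that currently does not unlock the optimization looks something like this (my own sketch of the idea behind the if (edx >= ebx) __builtin_unreachable(); remark, not code from the answer):

    #include <stdint.h>

    /* Promise the compiler that the high half of the dividend is smaller
     * than the divisor, i.e. that the 64/32 quotient fits in 32 bits.
     * gcc and clang accept the hint but, at least at the time of that
     * answer, still emit a full 64-bit div rather than div r32. */
    static uint32_t div_with_hint(uint64_t dividend, uint32_t divisor)
    {
        if ((uint32_t)(dividend >> 32) >= divisor)
            __builtin_unreachable();            /* quotient promised to fit */
        return (uint32_t)(dividend / divisor);  /* still compiles to div r64 */
    }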



For the same inputs, 32-bit operand-size will always be at least as fast

16-bit or 8-bit could maybe be slower than 32-bit because they may have a false dependency writing their output, but writing a 32-bit register zero-extends to 64-bit to avoid that. (That's why mov ecx, ebx is a good way to zero-extend ebx to 64-bit, better than an and with a mask that's not encodeable as a 32-bit sign-extended immediate, as harold pointed out.) But other than partial-register shenanigans, 16-bit and 8-bit division are generally also as fast as 32-bit, or at least not worse.
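
A minimal C illustration of that zero-extension rule (my own example): the cast below needs at most one 32-bit mov, because writing the 32-bit register already clears the upper half of the 64-bit register.

    #include <stdint.h>

    /* Zero-extending a 32-bit value to 64 bits needs no explicit and or
     * movzx on x86-64: this typically compiles to a single mov eax, edi,
     * and the write to EAX implicitly zeroes the upper 32 bits of RAX. */
    uint64_t widen(uint32_t x)
    {
        return x;
    }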

On AMD CPUs, division performance doesn't depend on operand-size, just on the data. 0 / 1 with 128/64-bit should be faster than the worst case of any smaller operand-size. AMD's integer-division instruction is only 2 uops (presumably because it has to write 2 registers), with all the logic done in the execution unit.

16-bit / 8-bit => 8-bit division on Ryzen is a single uop (because it only has to write AH:AL = AX).


On Intel CPUs, div/idiv is microcoded as many uops. About the same number of uops for all operand-sizes up to 32-bit (Skylake = 10), but 64-bit is much much slower. (Skylake div r64 is 36 uops, Skylake idiv r64 is 57 uops). See Agner Fog's instruction tables: https://agner.org/optimize/

div/idiv throughput for operand-sizes up to 32-bit is fixed at 1 per 6 cycles on Skylake. But div/idiv r64 throughput is one per 24-90 cycles.

See also Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux for a specific performance experiment where modifying the REX.W prefix in an existing binary to change div r64 into div r32 made a factor of ~3 difference in throughput.

And Why does Clang do this optimization trick only from Sandy Bridge onward? shows clang opportunistically using 32-bit division when the dividend is small, when tuning for Intel CPUs. But you have a large dividend and a large-enough divisor, which is a more complex case. That clang optimization is still zeroing the upper half of the dividend in asm, never using a non-zero or non-sign-extended EDX.



I have failed to make the popular C compilers generate the latter code when dividing an unsigned 32-bit integer (shifted left 32 bits) by another 32-bit integer.

I'm assuming you cast that 32-bit integer to uint64_t first, to avoid UB and get a normal uint64_t / uint64_t in the C abstract machine.

That makes sense: your way wouldn't be safe; it would fault with #DE when edx >= ebx. x86 division faults when the quotient overflows AL / AX / EAX / RAX, instead of silently truncating. There's no way to disable that.

So compilers normally only use idiv after cdq or cqo, and div only after zeroing the high half, unless you use an intrinsic or inline asm to open yourself up to the possibility of your code faulting. In C, x / y only faults if y == 0 (or, for signed division, INT_MIN / -1 is also allowed to fault¹).
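
Concretely, this is what that safe code-gen looks like for a plain unsigned 64/64 division (a small example of my own; the asm in the comment is the typical gcc/clang output for x86-64, with exact registers depending on the ABI):

    #include <stdint.h>

    /* Typical x86-64 code-gen for an unsigned 64/64 division:
     *     mov  rax, rdi      ; dividend
     *     xor  edx, edx      ; zero the high half of RDX:RAX
     *     div  rsi           ; quotient can't overflow, so the only possible
     *                        ; #DE is divide-by-zero, same as in C
     * (The signed version uses cqo instead of xor edx, edx, then idiv.) */
    uint64_t plain_div(uint64_t a, uint64_t b)
    {
        return a / b;
    }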

GNU C doesn't have an intrinsic for wide division, but MSVC has _udiv64. (With gcc/clang, division wider than 1 register uses a helper function which does try to optimize for small inputs. But this doesn't help for 64/32 division on a 64-bit machine, where GCC and clang just use the 128/64-bit division instruction.)
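
For completeness, a GNU C inline-asm equivalent of MSVC's _udiv64 looks roughly like this (a sketch, with a helper name of my own; like the intrinsic, it exposes you to #DE if the divisor is zero or the quotient doesn't fit in 32 bits):

    #include <stdint.h>

    /* 64-bit / 32-bit => 32-bit quotient and remainder, using the div r/m32
     * instruction directly (x86 / x86-64, GNU C extended asm).
     * WARNING: faults with #DE if divisor == 0 or if
     * (dividend >> 32) >= divisor, i.e. if the quotient doesn't fit. */
    static inline uint32_t udiv64_32(uint64_t dividend, uint32_t divisor,
                                     uint32_t *remainder)
    {
        uint32_t quot, rem;
        __asm__ ("divl %[d]"
                 : "=a" (quot), "=d" (rem)
                 : "0" ((uint32_t)dividend),           /* EAX = low half  */
                   "1" ((uint32_t)(dividend >> 32)),   /* EDX = high half */
                   [d] "rm" (divisor)
                 : "cc");
        *remainder = rem;
        return quot;
    }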

Even if there were some way to promise the compiler that your divisor would be big enough to make the quotient fit in 32 bits, current gcc and clang don't look for that optimization in my experience. It would be a useful optimization for your case (if it's always safe), but compilers won't look for it.


Footnote 1: To be more specific, ISO C describes those cases as "undefined behaviour"; some ISAs like ARM have non-faulting division instructions. C UB means anything can happen, including just truncation to 0 or some other integer result. See Why does integer division by -1 (negative one) result in FPE? for an example of AArch64 vs. x86 code-gen and results. Allowed to fault doesn't mean required to fault.
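
A tiny illustration of that footnote (my own example, not from the linked Q&A): the division below is undefined behaviour in ISO C; on x86 it typically raises SIGFPE via #DE, while AArch64's sdiv doesn't trap and simply yields INT_MIN.

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* volatile stops the compiler from folding the division at compile
         * time, so the hardware behaviour is actually exercised. */
        volatile int a = INT_MIN, b = -1;
        int q = a / b;      /* undefined behaviour in ISO C */
        printf("%d\n", q);  /* x86: usually killed by SIGFPE before this;
                               AArch64: prints INT_MIN */
        return 0;
    }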

uops for integer DIV instruction

Not all of those uops run on the actual divider unit on port 0. It seems only signed idiv is that many uops on Skylake; div r64 is "only" 33 uops. Perhaps signed idiv r64 is taking absolute values to do extended-precision division using a narrower HW divider unit, like you'd do for software extended-precision? (Why is __int128_t faster than long long on x86-64 GCC?)

And idiv/div r32 is "only" 10 uops, with probably only 1 or 2 of them needing the actual divide unit on port 0 and the others doing who-knows-what on other ports. Note the counts for arith.divider_active shown in the Skylake profile results on Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux: div r64 with small inputs barely keeps the actual port 0 divider active for longer than div r32 does, but the other overhead makes it much slower.

FP division is actually single-uop because FP div performance is important in some real-world algorithms. (Especially the effect of one divpd on the front-end throughput of surrounding code.) See Floating point division vs floating point multiplication.

See also Do FP and integer division compete for the same throughput resources on x86 CPUs? - Ice Lake improves the divider HW.


See also discussion in comments clearing up other misconceptions.

Related:

  • Why is division more expensive than multiplication? Division is fundamentally hard to pipeline.

I think I've read that modern divider units are typically built with an iterative, not-fully-pipelined part, followed by 2 Newton-Raphson steps which are pipelined. So that's how division can be partially pipelined on modern CPUs: the next division can start as soon as the current one moves into the pipelined Newton-Raphson part of the execution unit.
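
For a sense of why those refinement steps pipeline well, here is the classic Newton-Raphson reciprocal iteration in C (an illustration of the math, not a model of any particular divider): each step is just multiplies and a subtraction, which map onto fully pipelined units, and each step roughly doubles the number of correct bits.

    #include <stdio.h>

    /* One Newton-Raphson refinement of a reciprocal estimate x ~= 1/d:
     *   x_next = x * (2 - d * x) */
    static double nr_recip_step(double d, double x)
    {
        return x * (2.0 - d * x);
    }

    int main(void)
    {
        double d = 3.0, x = 0.3;           /* crude initial guess for 1/d */
        for (int i = 0; i < 4; i++)
            x = nr_recip_step(d, x);
        printf("approx 1/3 = %.17g\n", x); /* converges toward 0.3333... */
        return 0;
    }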

How is this faster? 52-bit modulo multiply using FPU trick faster than inline ASM on x64

On pre-Ice Lake processors, such as Skylake, there is a big difference between a "full" 128-bit-by-64-bit division and a "half" 64-bit-by-64-bit division (where the upper qword is zero). The full one can take up to nearly 100 cycles (it varies a bit with the value in rdx, but there is a sudden "cliff" as soon as rdx is set even to 1); the half one is more like 30 to 40 cycles depending on the µarch.

The 64-bit floating-point division is (for a division) relatively fast at around 14 to 20 cycles depending on the µarch, so even with that and some other, even less significant, overhead thrown in, it's not enough to eat up the roughly 60-cycle advantage that the "half" division has over the "full" division. So the complicated floating-point version can come out ahead.

Ice Lake apparently has an amazing divider that can do a full division in 18 cycles (and a "half" division isn't any faster), so the inline asm should be good on Ice Lake.

On AMD Ryzen, divisions with a non-zero upper qword seem to get slower more gradually as rdx gets higher (less of a "perf cliff").
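
A rough sketch of the kind of FP-assisted modulo the question is about (my own reconstruction of the idea, assuming both operands fit in a double's 52-bit mantissa; the real trick wraps this around a modular multiply, and this is not the asker's exact code): one fast FP division produces the quotient, and integer ops fix up the remainder, so the slow "full" integer division is never executed.

    #include <stdint.h>

    /* a % m via floating-point division, valid when a and m both fit in
     * 52 bits (so they are exactly representable as doubles) and m != 0.
     * The truncated FP quotient can be off by at most one, so a single
     * fix-up step is enough. */
    static uint64_t mod52(uint64_t a, uint64_t m)
    {
        uint64_t q = (uint64_t)((double)a / (double)m);
        uint64_t r = a - q * m;
        if ((int64_t)r < 0)        /* q was one too high */
            r += m;
        else if (r >= m)           /* q was one too low  */
            r -= m;
        return r;
    }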

Do FP and integer division compete for the same throughput resources on x86 CPUs?

Intel CPU architect Ronak Singhal mentions on Twitter that Broadwell (and, by implication, subsequent architectures up until ICL) uses the FP hardware for division, but that Ice Lake has a dedicated integer division unit:

Keep in mind that Broadwell that this was benchmarked on does integer division on the FP divider. In Ice Lake, there is now a dedicated integer divide unit.

So I would expect significant competition. Many of the operations that integer division performs are no doubt plain ALU ops that don't use the divider, so I wouldn't necessarily expect their inverse throughputs to be strictly cumulative, but they will definitely compete.

Ronak doesn't imply anything about the pre-Broadwell implementation, but based on the similar port assignment and performance going back to at least Sandy Bridge, I think we can expect that the same sharing holds.


