How to Force the Use of Cmov in Gcc and VS

How to force the use of cmov in gcc and VS

As said in the comments, there's no easy way to force what you are asking, although it seems that recent (> 4.4) versions of gcc already optimize it as you describe. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov.

The usual __builtin_expect probably can't do much to push gcc toward cmov: cmov is generally convenient when the result of a comparison is hard to predict, while __builtin_expect tells the compiler what the likely outcome is - so you would just be pushing it in the wrong direction.

Still, if you find that this optimization is extremely important, your compiler version typically gets it wrong, and for some reason you cannot help it with PGO, the relevant gcc inline-assembly template should be something like:

    __asm__ (
        "ucomiss %[xi_mid],%[z]\n"
        "cmovb   %[mid],%[hi]\n"
        "cmovae  %[mid],%[lo]\n"
        : [hi] "+r"(hi), [lo] "+r"(lo)
        : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
        : "cc"
    );

The constraints used are:

  • hi and lo are in the "write" (output) list, with the +r constraint: cmov can only take a register as its target operand, and we are conditionally overwriting just one of them (we cannot use =, as it implies the value is always overwritten, so the compiler would be free to give us a different target register than the current one and use it to refer to that variable after our asm block);
  • mid is in the "read" (input) list with rm, as cmov can take either a register or a memory operand as its source value;
  • xi[mid] and z are also in the "read" list;

    • z has the special x constraint, which means "any SSE register" (required for the first ucomiss operand);
    • xi[mid] has xm, as the second ucomiss operand allows a memory operand; given the choice between z and xi[mid], I chose the latter as the better candidate to be taken directly from memory, since z is already in a register (due to the System V calling convention - and it is going to be cached between iterations anyway) while xi[mid] is used only in this comparison;
  • cc (the FLAGS register) is in the "clobber" list - we clobber the flags and nothing else.
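For context, here is a minimal sketch of how such a template might sit inside a float interval-search loop; the wrapper function and the surrounding loop are assumptions for illustration, not part of the original answer:

#include <stddef.h>

/* Hypothetical wrapper (not from the original answer): find lo such that
   xi[lo] <= z < xi[lo + 1], assuming xi[0] <= z < xi[n - 1]. */
static size_t find_interval(const float *xi, size_t n, float z)
{
    size_t lo = 0, hi = n - 1;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        __asm__ (
            "ucomiss %[xi_mid],%[z]\n"
            "cmovb   %[mid],%[hi]\n"     /* z <  xi[mid]  ->  hi = mid */
            "cmovae  %[mid],%[lo]\n"     /* z >= xi[mid]  ->  lo = mid */
            : [hi] "+r"(hi), [lo] "+r"(lo)
            : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
            : "cc"
        );
    }
    return lo;
}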

How to force compiler to generate conditional move by using inline assembly

You usually don't need inline asm for this, but the problem with yours is that [low] "=&r"(low) is a write-only output! You're telling the compiler that the old value of the variable is irrelevant because the operand is write-only. But that's not true for cmov: it's a 3-input instruction: src, dst, and FLAGS.

Use "+&r" read/write operands for low/high (the cmov destinations). Turns out this is basically a duplicate of how to force the use of cmov in gcc and VS which shows working inline asm almost identical to yours (using an FP compare instead of integer, but the same cmov instructions.)

That should work, but it forces the compiler to use more registers than it would otherwise need. (For example, it could use LEA after CMP to compute mid+1 and mid-1, overwriting mid. Or it could even use LEA + INC, because the B and AE conditions only read CF, and new enough CPUs don't have partial-flag stalls at all; they just have cmov read both parts of FLAGS separately if necessary. That's why cmovbe is a 2-uop instruction on Skylake, vs. 1 for most other conditions.)



Getting the compiler to go branchless:

Have you tried ternary operators, like low = (middle < key) ? new_low : low;? Have you tried profile-guided optimization, so GCC can see that the branch is in fact not very predictable? (gcc optimization flag -O3 makes code slower than -O2 shows that if-conversion to branchless code is usually only done at -O3, at least in some cases.)
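As a rough sketch of such a PGO workflow (file and input names are placeholders; -fprofile-generate and -fprofile-use are the standard GCC flags):

gcc -O3 -fprofile-generate search.c -o search    # instrumented build
./search < training-input                        # run on representative data; writes *.gcda profiles
gcc -O3 -fprofile-use search.c -o search         # rebuild using the collected profile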

Also, keep in mind that going branchless creates a data dependency, which can be worse for a binary search; speculative execution effectively gives you memory parallelism / prefetch instead of serializing the loads. (https://agner.org/optimize/)

For a small binary search that hits in L1d cache, branch misses cost more than load-use latency. But once you expect some L2 cache misses, branch-miss recovery is cheaper than load-use latency from L3. See About the branchless binary search.

Software prefetch of the 1/4 and 3/4 elements can help mitigate this, hiding load-use latency. This takes advantage of memory-level parallelism (having multiple loads in flight at once) even for the branchless form where there's a data dependency through the chain of demand loads.
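A hedged sketch of that idea (function and variable names are illustrative, not from the original discussion): prefetch the two elements that could become the next midpoint, so whichever way the comparison resolves, the relevant load is already in flight.

#include <stddef.h>

/* Branchless-style search with software prefetch of the ~1/4 and ~3/4
   elements: one of them will be the next iteration's midpoint. */
static size_t search_prefetch(const int *arr, size_t n, int key)
{
    size_t lo = 0, hi = n;                           /* search in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        __builtin_prefetch(&arr[lo + (mid - lo) / 2]);   /* ~1/4 point */
        __builtin_prefetch(&arr[mid + (hi - mid) / 2]);  /* ~3/4 point */
        lo = (arr[mid] <= key) ? mid : lo;
        hi = (arr[mid] <= key) ? hi  : mid;
    }
    return lo;       /* last index with arr[lo] <= key, assuming arr[0] <= key */
}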


If you can choose a different data structure, What is the most efficient way to implement a BST in such a way the find(value) function is optimized for random values in the tree on x86? shows that a wide implicit tree should be very quickly searchable using SIMD, balancing computation against load latency so that few steps are needed to reach a hit. It is still ordered, so you can traverse it in order, or find the previous or next element after getting a hit.


Related:

  • Conditional move (cmov) in GCC compiler
  • Make gcc use conditional moves

Convert GCC Inline Assembly CMOV to Visual Studio Assembler

Refactoring the code to better express intent to the compiler:

int binary_cmov(const int *arr, int n, int key)
{
    int min = 0, max = n;
    while (min < max)
    {
        int middle = (min + max) >> 1;
        min = key > arr[middle] ? middle + 1 : min;
        max = key > arr[middle] ? max : middle;
    }
    return min;
}

With gcc 5.3 at -O3, this yields:

binary_cmov(int const*, int, int):
        xorl    %eax, %eax
        testl   %esi, %esi
        jle     .L4
.L3:
        leal    (%rax,%rsi), %ecx
        sarl    %ecx
        movslq  %ecx, %r8
        leal    1(%rcx), %r9d
        movl    (%rdi,%r8,4), %r8d
        cmpl    %edx, %r8d
        cmovl   %r9d, %eax
        cmovge  %ecx, %esi
        cmpl    %eax, %esi
        jg      .L3
        rep ret
.L4:
        rep ret

Moral of the story: don't embed assembler. All you do is make the code non-portable.

Going further...

Why not express intent even more explicitly?

#include <type_traits>  // for std::common_type_t

template<class... Ts>
auto sum(Ts... ts)
{
    std::common_type_t<Ts...> result = 0;
    using expand = int[];
    void(expand{ 0, ((result += ts), 0)... });
    return result;
}

template<class... Ts>
auto average(Ts... ts)
{
    return sum(ts...) / sizeof...(Ts);
}

int binary_cmov(const int *arr, int n, int key)
{
    int min = 0, max = n;
    while (min < max)
    {
        int middle = average(min, max);
        auto greater = key > arr[middle];
        min = greater ? middle + 1 : min;
        max = greater ? max : middle;
    }
    return min;
}

Compiler output:

binary_cmov(int const*, int, int):
        xorl    %eax, %eax
        testl   %esi, %esi
        jle     .L4
.L3:
        leal    (%rax,%rsi), %ecx
        movslq  %ecx, %rcx
        shrq    %rcx
        movslq  %ecx, %r8
        leal    1(%rcx), %r9d
        movl    (%rdi,%r8,4), %r8d
        cmpl    %edx, %r8d
        cmovl   %r9d, %eax
        cmovge  %ecx, %esi
        cmpl    %eax, %esi
        jg      .L3
        rep ret
.L4:
        rep ret

Why does cmov always return t_val?

You're missing an early-clobber on the output (it should be "=&r"), so GCC may pick the same register for the output as for one of the inputs - most likely pred. In that case test %1,%1 is actually testing t_val (b). Single-step the asm with a debugger, and/or look at GCC's asm output (on https://godbolt.org/ or with gcc -S).

This also seems really inefficient. Use a "+r"(result) constraint for the output (with uint32_t result = t_val;) so you don't need a mov in the asm template; let the compiler do result = t_val for you, possibly by simply choosing the same register.

The test %2,%2 after the cmov is also doing nothing; you're not even using a GCC6 flag-output operand. It's a totally wasted instruction.

Also, this doesn't need to be volatile. The output is a pure function of the inputs, and doesn't need to run at all if the output is unused.
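Putting those fixes together, a minimal sketch of what the corrected wrapper could look like (the question's original code isn't reproduced here, so the signature and the pred / t_val / f_val names are assumptions):

#include <stdint.h>

/* Hypothetical corrected version: "+r" read/write output, no volatile,
   no redundant test after the cmov. */
static inline uint32_t select_u32(uint32_t pred, uint32_t t_val, uint32_t f_val)
{
    uint32_t result = t_val;            /* compiler handles result = t_val */
    __asm__ (
        "test   %[p], %[p]\n\t"         /* set ZF from pred */
        "cmovz  %[f], %[r]\n\t"         /* if pred == 0, result = f_val */
        : [r] "+r"(result)              /* read/write: cmov may leave it alone */
        : [p] "r"(pred), [f] "rm"(f_val)
        : "cc"
    );
    return result;
}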

It's probably a bad idea to use inline asm at all for just a cmov; compile with -O3 and write your source such that GCC thinks it's a good idea to do if-conversion into branchless code. Inline asm destroys constant propagation and defeats other optimizations. And this way forces you to use a test instruction in the template, instead of reading FLAGS set by some earlier add or whatever, and it doesn't let the compiler reuse the same FLAGS result for multiple cmov or other instructions. https://gcc.gnu.org/wiki/DontUseInlineAsm
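For comparison, a pure-C sketch of the same selection (names again assumed); with -O2/-O3 GCC can if-convert this into a cmov on its own, and it remains free to fold the comparison into surrounding code:

#include <stdint.h>

/* No inline asm: let the compiler decide whether cmov is worth it. */
static inline uint32_t select_u32_c(uint32_t pred, uint32_t t_val, uint32_t f_val)
{
    return pred ? t_val : f_val;
}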

Or if you can't hand-hold the compiler into making asm you want, write more of your real use-case in asm, not just a wrapper around cmov.

Generating CMOV instructions using Microsoft compilers

It is extremely difficult, if not downright impossible, to get Microsoft's 32-bit C/C++ compiler to emit CMOVcc instructions.

What you have to remember is that conditional moves were first introduced with the Pentium Pro processor, and while Microsoft had a compiler switch that would tune the generated code for this 6th-generation processor (the long-deprecated /G6), they never emitted code that would run exclusively on it. The code still needed to run on 5th-generation processors (i.e., Pentium and AMD K6), so it could not use CMOVcc instructions, which would have raised illegal-instruction exceptions there. And unlike Intel's compiler, Microsoft's never implemented global dynamic dispatching (and still doesn't).

Also, it is worth noting that no switch was ever introduced to target exclusively 6th generation processors and later. There's no /arch:CMOV or whatever they might call it. Supported values for the /arch switch go straight from IA32 (the lowest common denominator, for which CMOV would be potentially illegal) to SSE. However, the documentation does confirm that, as one might expect, enabling SSE or SSE2 code generation implicitly enables the use of conditional-move instructions and anything else that was introduced before SSE:

In addition to using the SSE and SSE2 instructions, the compiler also uses other instructions that are present on the processor revisions that support SSE and SSE2. An example is the CMOV instruction that first appeared on the Pentium Pro revision of the Intel processors.

Therefore, in order to have any hope of getting the compiler to emit CMOV instructions, you must set /arch:SSE or higher. Nowadays, of course, this is no big deal. You can simply set /arch:SSE or /arch:SSE2 and be safe, since all modern processors support these instruction sets.
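As a concrete sketch (the file name is a placeholder), a command line along these lines enables SSE2 code generation and writes an assembly listing you can inspect for CMOVcc:

cl /O2 /arch:SSE2 /FAs yourfile.cpp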

But that's only half of the battle. Even when you have the right compiler switches enabled, it is extremely difficult to get MSVC to emit CMOV instructions. Here are two important observations:

  1. MSVC 10 (Visual Studio 2010) and earlier virtually never generate CMOV instructions. I've never seen them in the output, no matter how many variations of source code I've tried. I say "virtually" because there might be some crazy edge case that I missed, but I very much doubt it. None of the optimization flags have any effect on this.

    However, MSVC 11 (Visual Studio 2012) introduced significant improvements to the code generator, at least in this aspect. This and later versions of the compiler now seem to be at least aware of the existence of the CMOVcc instructions, and may emit them under the right conditions (i.e., /arch:SSE or later, and use of the conditional operator, as described below).

  2. I've found that the most effective way to cajole the compiler into emitting a CMOV instruction is to use the conditional operator instead of a long-form if-else statement. Although these two constructs should be completely equivalent as far as the code generator is concerned, they are not.

    In other words, while you might see the following translated to a branchless CMOVLE instruction:

    int value = (a < b) ? a : b;

    you will always get branching code for the following sequence:

    int value;
    if (a < b) value = a;
    else value = b;

    At the very least, even if your use of the conditional operator doesn't cause a CMOV instruction (such as on MSVC 10 or earlier), you might still be lucky enough to get branchless code by some other means—e.g., SETcc or clever use of SBB and NEG/NOT/INC/DEC. This is what the disassembly you've shown in the question uses, and although it is not quite as optimal as CMOVcc, it's certainly comparable and the difference is not worth worrying about. (The only other branching instruction is part of the loop.)


If you truly want branchless code (which you often do when hand-optimizing), and you're not having any luck getting the compiler to generate the code you want, you'll need to get more clever in how you write the source code. I've had good luck with writing code that computes the result branchlessly using bitwise or arithmetic operators.

For example, you might wish that the following function generated optimal code:

int Minimum(int a, int b)
{
return (a < b) ? a : b;
}

You followed rule #2 and used a conditional operator, but if you're using an older version of the compiler, you'll get branching code anyway. Outsmart the compiler using the classic trick:

int Minimum_Optimized(int a, int b)
{
return (b + ((a - b) & -(a < b)));
}

The trick works because -(a < b) is either all-ones (when a < b, so b + ((a - b) & -1) == a) or zero (in which case the result is just b). The resulting object code is not perfectly optimal (it contains a CMP instruction that is redundant, since SUB already sets the flags), but it is branchless and will therefore still be significantly faster than the original attempt on random inputs that cause branch prediction to fail.

As another example, imagine that you want to determine whether a 64-bit integer is negative in a 32-bit application. You write the following self-evident code:

bool IsNegative(int64_t value)
{
return (value < 0);
}

and you will find yourself sorely disappointed in the results. GCC and Clang optimize this sensibly, but MSVC spits out a nasty conditional branch. The (non-portable) trick is realizing that the sign bit is in the upper 32 bits, so you can isolate and test that explicitly using bitwise manipulation:

bool IsNegative_Optimized(int64_t value)
{
return (static_cast<int32_t>((value & 0xFFFFFFFF00000000ULL) >> 32) < 0);
}

In addition, one of the commentators suggests using inline assembly. While this is possible (Microsoft's 32-bit compiler supports inline assembly), it is often a poor choice. Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.


Note that none of these contortions are necessary if you target x86-64. Microsoft's 64-bit C/C++ compiler is significantly more aggressive about using CMOVcc instructions whenever possible, even the older versions. As this blog post explains, the x64 compiler bundled with Visual Studio 2010 contains a number of code-quality improvements, including better identification and use of CMOV instructions.

No special compiler flags or other considerations are necessary here, since all processors that support 64-bit mode support conditional moves. I suppose this is why they were able to get it right for the 64-bit compiler. I also suspect that some of these changes made to the x86-64 compiler in VS 2010 were ported to the x86-32 compiler in VS 2012, explaining why it is at least aware of the existence of CMOV, but it still doesn't use it as aggressively as the 64-bit compiler.

The bottom line is, when targeting x86-64, write the code in the way that makes the most sense. The optimizer actually knows how to do its job!


