Is < Faster Than <=?

Is < faster than <=?

No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will typically be implemented in two machine instructions:

  • A test or cmp instruction, which sets EFLAGS
  • And a Jcc (jump) instruction, depending on the comparison type (and code layout):
    • jne - Jump if not equal --> ZF = 0
    • jz  - Jump if zero (equal) --> ZF = 1
    • jg  - Jump if greater --> ZF = 0 and SF = OF
    • (etc...)

Example (edited for brevity), compiled with $ gcc -m32 -S -masm=intel test.c:

    if (a < b) {
        // Do something 1
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jge     .L2                          ; jump if a is >= b
    ; Do something 1
.L2:

And

    if (a <= b) {
        // Do something 2
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jg      .L5                          ; jump if a is > b
    ; Do something 2
.L5:

So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.


I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping appears in the Optimization Reference Manual, Appendix C, "Latency and Throughput."

Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction.

Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.

The values for Jcc are:

      Latency   Throughput
Jcc   N/A       0.5

with the following footnote on Jcc:


  1. Selection of conditional jump instructions should be based on the recommendation of Section 3.4.1, "Branch Prediction Optimization," to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.

So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.

If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is, then, no reason that an instruction testing two bits should take any more or less time than one testing only one (ignoring gate propagation delay, which is much less than the clock period).
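To make that concrete, here is a minimal sketch in C (purely an illustration of the Boolean conditions from the Jcc table above, not how any real CPU is wired): both jg and jge reduce to one or two gate operations on the same flag bits.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical flag bits, for illustration only. */
    struct flags { bool zf, sf, of; };

    /* jge is taken when SF == OF. */
    static bool take_jge(struct flags f) { return f.sf == f.of; }

    /* jg is taken when ZF == 0 and SF == OF. */
    static bool take_jg(struct flags f)  { return !f.zf && (f.sf == f.of); }

    int main(void) {
        /* e.g. the flags after cmp where a > b without overflow */
        struct flags f = { .zf = false, .sf = false, .of = false };
        printf("jg taken: %d, jge taken: %d\n", take_jg(f), take_jge(f));
        return 0;
    }

Either way it is a fixed, tiny amount of combinational logic evaluated within the same clock cycle.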


Edit: Floating Point

This holds true for x87 floating point as well (pretty much the same code as above, but with double instead of int):

    fld     QWORD PTR [esp+32]
    fld     QWORD PTR [esp+40]
    fucomip st, st(1)       ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
    fstp    st(0)
    seta    al              ; Set al if above (CF=0 and ZF=0).
    test    al, al
    je      .L2
    ; Do something 1
.L2:

    fld     QWORD PTR [esp+32]
    fld     QWORD PTR [esp+40]
    fucomip st, st(1)       ; (same thing as above)
    fstp    st(0)
    setae   al              ; Set al if above or equal (CF=0).
    test    al, al
    je      .L5
    ; Do something 2
.L5:
    leave
    ret

Which operator is faster (< or <=), (> or >=)?

It varies. First, start by examining different instruction sets and how the compilers use those instruction sets. Take the OpenRISC 32 for example, which is clearly MIPS-inspired but does conditionals differently. For the or32 there are compare-and-set-flag instructions: compare these two registers, and if less than or equal (unsigned) set the flag; compare these two registers, and if equal set the flag; and so on. Then there are two conditional branch instructions: branch on flag set and branch on flag clear. The compiler has to follow one of these paths, but less than, less than or equal, greater than, etc. are all going to use the same number of instructions, the same execution time for a taken conditional branch, and the same execution time for a not-taken one.

Now it is definitely going to be true for most architectures that taking the branch costs more than not taking it, because of having to flush and re-fill the pipe. Some do branch prediction, etc., to help with that problem.

Now, on some architectures the size of the instruction may vary: compare gpr0 and gpr1 versus compare gpr0 and the immediate number 1234 may require a larger instruction; you will see this a lot with x86, for example. So although both cases may be a branch-if-less-than, how you encode the "less" (depending on which registers happen to hold which values) can make a performance difference (sure, x86 does a lot of pipelining, lots of caching, etc. to make up for these issues). Another similar example is MIPS and or32, where r0 is always a zero; it is not really a general purpose register, if you write to it it doesn't change, it is hardwired to a zero. So a compare against 0 can use r0 directly, while a compare against some other number MIGHT cost you more, if an extra instruction or two is required to fill a gpr with that immediate so that the compare can happen; the worst case is having to evict a register to the stack or memory to free one up for the immediate.
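To see this in practice, here is a tiny hedged example (the exact instruction counts depend on the target and the compiler; the function names are just placeholders, and you would inspect the output with gcc -S or objdump):

    /* On a zero-register ISA (MIPS, or32) the first compare can use r0 directly,
       the second usually fits an immediate field, and the third may need the
       constant materialized into a register before the compare can happen. */
    int against_zero(int x)  { return x == 0; }
    int against_small(int x) { return x == 7; }
    int against_large(int x) { return x == 0x12345678; }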

Some architectures have conditional execution, like ARM: for the full ARM (not Thumb) instruction set, execution can be made conditional on a per-instruction basis. So if you had the code

if(i==7) j=5; else j=9;

the pseudo code for arm would be

cmp   i,#7
moveq j,#5
movne j,#9

there is no actual branch, so there are no pipeline issues; you flywheel right on through, very fast.

Comparing one architecture to another, if that is an interesting comparison: on some, as mentioned (MIPS, or32), you have to explicitly perform some sort of instruction for the comparison; on others, like x86, msp430 and the vast majority, every ALU operation changes the flags; ARM and the like change the flags only if you tell them to, and otherwise don't, as shown above. So in a

while(--len)
{
    //do something
}

loop, the subtract of 1 also sets the flags; if the stuff in the loop is simple enough you could make the whole thing conditional, so you save on separate compare and branch instructions and you save on the pipeline penalty. MIPS solves this a little by making compare-and-branch one instruction, and by executing one instruction after the branch to save a little in the pipe.

The general answer is that you will not see a difference; the number of instructions, execution time, etc. are the same for the various conditionals. Special cases like small immediates vs. big immediates may have an effect in corner cases, or the compiler may simply choose to do it all differently depending on which comparison you write. If you try to re-write your algorithm to give the same answer but use a less-than instead of a greater-than-or-equal, you could be changing the code enough to get a different instruction stream. Likewise, if your performance test is too simple, the compiler can/will optimize the comparison away completely and just generate the results, which can vary depending on your test code, causing different execution. The key to all of this is to disassemble the things you want to compare and see how the instructions differ. That will tell you whether you should expect to see any execution differences.
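For example, a minimal harness for that kind of comparison might look like this (file and function names are just placeholders; compile each variant with gcc -O2 -S, or run objdump -d on the object file, and diff the output):

    /* cmp_test.c -- both functions should come out as one compare plus one
       conditional branch (or conditional set) on most targets. */
    int less(int a, int b)       { return a < b  ? 1 : 2; }
    int less_equal(int a, int b) { return a <= b ? 1 : 2; }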

What is faster, (x < 0) or (x == -1)?

That depends entirely on the ISA you're compiling for, and the quality of your compiler's optimizer. Don't optimize prematurely: profile first to find your bottlenecks.

That said, on x86 you'll find that both are equally fast in most cases. In both cases, you'll have a comparison (cmp) and a conditional jump (jCC) instruction. However, for (x < 0), there may be some instances where the compiler can elide the cmp instruction, speeding up your code by one whole cycle.

Specifically, if the value x is stored in a register and was recently the result of an arithmetic operation (such as add, or sub, but there are many more possibilities) that sets the sign flag SF in the EFLAGS register, then there's no need for the cmp instruction, and the compiler can emit just a js instruction. There's no simple jCC instruction that jumps when the input was -1.
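A sketch of the situation being described (whether the cmp is actually elided depends on the compiler and optimization level; the functions are made up purely for illustration):

    /* After the subtraction, the sign flag already reflects whether x is
       negative, so "x < 0" can often become just a conditional jump (js/jns),
       while "x == -1" still needs an explicit compare against -1. */
    int went_negative(int a, int b) {
        int x = a - b;        /* the subtraction sets SF as a side effect */
        if (x < 0)            /* may compile to just js, no separate cmp  */
            return 1;
        return 0;
    }

    int hit_minus_one(int a, int b) {
        int x = a - b;
        if (x == -1)          /* generally needs an explicit cmp against -1 */
            return 1;
        return 0;
    }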

Is the inequality operator faster than the equality operator?

Usually, the microprocessor does comparison using electrical gates and not step by step like that. It checks all bits at once.
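Loosely speaking, an equality comparator is an XOR of every bit pair followed by one big NOR, all evaluated in parallel. In software terms, purely as an illustration of the "all bits at once" point:

    #include <stdint.h>
    #include <stdbool.h>

    /* a == b exactly when no bit differs; the hardware evaluates every bit
       position simultaneously rather than scanning them one by one. */
    bool equal32(uint32_t a, uint32_t b) {
        return (a ^ b) == 0;
    }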

Why is === faster than == in PHP?

Because the equality operator == coerces, or converts, the data type temporarily to see if it’s equal to the other operand, whereas === (the identity operator) doesn’t need to do any converting whatsoever and thus less work is done, which makes it faster.

Why is 'x' in ('x',) faster than 'x' == 'x'?

As I mentioned to David Wolever, there's more to this than meets the eye; both methods dispatch to is; you can prove this by doing

min(Timer("x == x", setup="x = 'a' * 1000000").repeat(10, 10000))
#>>> 0.00045456900261342525

min(Timer("x == y", setup="x = 'a' * 1000000; y = 'a' * 1000000").repeat(10, 10000))
#>>> 0.5256857610074803

The first can only be so fast because it checks by identity.

To find out why one would take longer than the other, let's trace through execution.

They both start in ceval.c, from COMPARE_OP, since that is the bytecode involved:

TARGET(COMPARE_OP) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *res = cmp_outcome(oparg, left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    PREDICT(POP_JUMP_IF_FALSE);
    PREDICT(POP_JUMP_IF_TRUE);
    DISPATCH();
}

This pops the values from the stack (technically it only pops one)

PyObject *right = POP();
PyObject *left = TOP();

and runs the compare:

PyObject *res = cmp_outcome(oparg, left, right);

cmp_outcome is this:

static PyObject *
cmp_outcome(int op, PyObject *v, PyObject *w)
{
    int res = 0;
    switch (op) {
    case PyCmp_IS: ...
    case PyCmp_IS_NOT: ...
    case PyCmp_IN:
        res = PySequence_Contains(w, v);
        if (res < 0)
            return NULL;
        break;
    case PyCmp_NOT_IN: ...
    case PyCmp_EXC_MATCH: ...
    default:
        return PyObject_RichCompare(v, w, op);
    }
    v = res ? Py_True : Py_False;
    Py_INCREF(v);
    return v;
}

This is where the paths split. The PyCmp_IN branch does

int
PySequence_Contains(PyObject *seq, PyObject *ob)
{
    Py_ssize_t result;
    PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
    if (sqm != NULL && sqm->sq_contains != NULL)
        return (*sqm->sq_contains)(seq, ob);
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS);
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}

Note that a tuple is defined as

static PySequenceMethods tuple_as_sequence = {
    ...
    (objobjproc)tuplecontains,              /* sq_contains */
};

PyTypeObject PyTuple_Type = {
    ...
    &tuple_as_sequence,                     /* tp_as_sequence */
    ...
};

So the branch

if (sqm != NULL && sqm->sq_contains != NULL)

will be taken, and *sqm->sq_contains, which is the function (objobjproc)tuplecontains, will be called.

This does

static int
tuplecontains(PyTupleObject *a, PyObject *el)
{
    Py_ssize_t i;
    int cmp;

    for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i)
        cmp = PyObject_RichCompareBool(el, PyTuple_GET_ITEM(a, i),
                                       Py_EQ);
    return cmp;
}

...Wait, wasn't that PyObject_RichCompareBool what the other branch took? Nope, that was PyObject_RichCompare.

That code path was short so it likely just comes down to the speed of these two. Let's compare.

int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    ...
}

The code path in PyObject_RichCompareBool pretty much immediately terminates. For PyObject_RichCompare, it does

PyObject *
PyObject_RichCompare(PyObject *v, PyObject *w, int op)
{
    PyObject *res;

    assert(Py_LT <= op && op <= Py_GE);
    if (v == NULL || w == NULL) { ... }
    if (Py_EnterRecursiveCall(" in comparison"))
        return NULL;
    res = do_richcompare(v, w, op);
    Py_LeaveRecursiveCall();
    return res;
}

The Py_EnterRecursiveCall/Py_LeaveRecursiveCall combo is not used in the previous path, but these are relatively quick macros that short-circuit after incrementing and decrementing some globals.

do_richcompare does:

static PyObject *
do_richcompare(PyObject *v, PyObject *w, int op)
{
    richcmpfunc f;
    PyObject *res;
    int checked_reverse_op = 0;

    if (v->ob_type != w->ob_type && ...) { ... }
    if ((f = v->ob_type->tp_richcompare) != NULL) {
        res = (*f)(v, w, op);
        if (res != Py_NotImplemented)
            return res;
        ...
    }
    ...
}

This does some quick checks to call v->ob_type->tp_richcompare which is

PyTypeObject PyUnicode_Type = {
    ...
    PyUnicode_RichCompare,      /* tp_richcompare */
    ...
};

which does

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
    int result;
    PyObject *v;

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))
        Py_RETURN_NOTIMPLEMENTED;

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)
        return NULL;

    if (left == right) {
        switch (op) {
        case Py_EQ:
        case Py_LE:
        case Py_GE:
            /* a string is equal to itself */
            v = Py_True;
            break;
        case Py_NE:
        case Py_LT:
        case Py_GT:
            v = Py_False;
            break;
        default:
            ...
        }
    }
    else if (...) { ... }
    else { ... }
    Py_INCREF(v);
    return v;
}

Namely, this shortcuts on left == right... but only after doing

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)

All in all the paths then look something like this (manually recursively inlining, unrolling and pruning known branches)

POP()                          # Stack stuff
TOP()                          #
                               #
case PyCmp_IN:                 # Dispatch on operation
                               #
sqm != NULL                    # Dispatch to builtin op
sqm->sq_contains != NULL       #
*sqm->sq_contains              #
                               #
cmp == 0                       # Do comparison in loop
i < Py_SIZE(a)                 #
v == w                         #
op == Py_EQ                    #
++i                            #
cmp == 0                       #
                               #
res < 0                        # Convert to Python-space
res ? Py_True : Py_False       #
Py_INCREF(v)                   #
                               #
Py_DECREF(left)                # Stack stuff
Py_DECREF(right)               #
SET_TOP(res)                   #
res == NULL                    #
DISPATCH()                     #

vs

POP()                            # Stack stuff
TOP()                            #
                                 #
default:                         # Dispatch on operation
                                 #
Py_LT <= op                      # Checking operation
op <= Py_GE                      #
v == NULL                        #
w == NULL                        #
Py_EnterRecursiveCall(...)       # Recursive check
                                 #
v->ob_type != w->ob_type         # More operation checks
f = v->ob_type->tp_richcompare   # Dispatch to builtin op
f != NULL                        #
                                 #
!PyUnicode_Check(left)           # ...More checks
!PyUnicode_Check(right))         #
PyUnicode_READY(left) == -1      #
PyUnicode_READY(right) == -1     #
left == right                    # Finally, doing comparison
case Py_EQ:                      # Immediately short circuit
Py_INCREF(v);                    #
                                 #
res != Py_NotImplemented         #
                                 #
Py_LeaveRecursiveCall()          # Recursive check
                                 #
Py_DECREF(left)                  # Stack stuff
Py_DECREF(right)                 #
SET_TOP(res)                     #
res == NULL                      #
DISPATCH()                       #

Now, PyUnicode_Check and PyUnicode_READY are pretty cheap since they only check a couple of fields, but it should be obvious that the top one is a smaller code path: it has fewer function calls, only one switch statement, and is just a bit thinner.

TL;DR:

Both dispatch to if (left_pointer == right_pointer); the difference is just how much work they do to get there. in just does less.

In Java, can & be faster than &&?

Ok, so you want to know how it behaves at the lower level... Let's have a look at the bytecode then!

EDIT: added the generated assembly code for AMD64, at the end. Have a look for some interesting notes.

EDIT 2 (re: OP's "Update 2"): added asm code for Guava's isPowerOfTwo method as well.

Java source

I wrote these two quick methods:

public boolean AndSC(int x, int value, int y) {
    return value >= x && value <= y;
}

public boolean AndNonSC(int x, int value, int y) {
    return value >= x & value <= y;
}

As you can see, they are exactly the same, save for the type of AND operator.

Java bytecode

And this is the generated bytecode:

  public AndSC(III)Z
   L0
    LINENUMBER 8 L0
    ILOAD 2
    ILOAD 1
    IF_ICMPLT L1
    ILOAD 2
    ILOAD 3
    IF_ICMPGT L1
   L2
    LINENUMBER 9 L2
    ICONST_1
    IRETURN
   L1
    LINENUMBER 11 L1
   FRAME SAME
    ICONST_0
    IRETURN
   L3
    LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L3 0
    LOCALVARIABLE x I L0 L3 1
    LOCALVARIABLE value I L0 L3 2
    LOCALVARIABLE y I L0 L3 3
    MAXSTACK = 2
    MAXLOCALS = 4

  // access flags 0x1
  public AndNonSC(III)Z
   L0
    LINENUMBER 15 L0
    ILOAD 2
    ILOAD 1
    IF_ICMPLT L1
    ICONST_1
    GOTO L2
   L1
   FRAME SAME
    ICONST_0
   L2
   FRAME SAME1 I
    ILOAD 2
    ILOAD 3
    IF_ICMPGT L3
    ICONST_1
    GOTO L4
   L3
   FRAME SAME1 I
    ICONST_0
   L4
   FRAME FULL [test/lsoto/AndTest I I I] [I I]
    IAND
    IFEQ L5
   L6
    LINENUMBER 16 L6
    ICONST_1
    IRETURN
   L5
    LINENUMBER 18 L5
   FRAME SAME
    ICONST_0
    IRETURN
   L7
    LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L7 0
    LOCALVARIABLE x I L0 L7 1
    LOCALVARIABLE value I L0 L7 2
    LOCALVARIABLE y I L0 L7 3
    MAXSTACK = 3
    MAXLOCALS = 4

The AndSC (&&) method generates two conditional jumps, as expected:

  1. It loads value and x onto the stack, and jumps to L1 if value is lower. Otherwise it keeps running the next lines.
  2. It loads value and y onto the stack, and jumps to L1 as well if value is greater. Otherwise it keeps running the next lines.
  3. Which happen to be a return true in case neither of the two jumps was made.
  4. And then we have the lines marked as L1, which are a return false.

The AndNonSC (&) method, however, generates three conditional jumps!

  1. It loads value and x onto the stack and jumps to L1 if value is lower. Because now it needs to save the result to combine it with the other part of the AND, it has to execute either "save true" or "save false"; it can't do both with the same instruction.
  2. It loads value and y onto the stack and jumps to L3 if value is greater. Once again it needs to save true or false, and that's two different lines depending on the comparison result.
  3. Now that both comparisons are done, the code actually executes the AND operation -- and if the result is zero (i.e. one of the comparisons was false), it jumps (for a third time) to return false; otherwise it continues on to the next line and returns true.

(Preliminary) Conclusion

Though I'm not that experienced with Java bytecode and I may have overlooked something, it seems to me that & will actually perform worse than && in every case: it generates more instructions to execute, including more conditional jumps to predict and possibly fail at.

A rewriting of the code to replace comparisons with arithmetical operations, as someone else proposed, might be a way to make & a better option, but at the cost of making the code much less clear.
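For what it's worth, the kind of rewrite being alluded to looks something like this (shown in C for brevity, and only valid when the subtractions cannot overflow; treat it as a sketch, not a drop-in replacement):

    #include <stdbool.h>

    /* Branch-free range check: (value - x) and (y - value) are both
       non-negative exactly when x <= value <= y, so OR-ing them and testing
       the sign bit replaces the two comparisons and their conditional jumps.
       With values near INT_MIN/INT_MAX the subtractions can overflow and the
       result is wrong -- hence "much less clear", and riskier. */
    bool in_range(int x, int value, int y) {
        return ((value - x) | (y - value)) >= 0;
    }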

IMHO it is not worth the hassle for 99% of the scenarios (it may very well be worth it for the 1% of loops that need to be extremely optimized, though).

EDIT: AMD64 assembly

As noted in the comments, the same Java bytecode can lead to different machine code in different systems, so while the Java bytecode might give us a hint about which AND version performs better, getting the actual ASM as generated by the compiler is the only way to really find out.

I printed the AMD64 ASM instructions for both methods; below are the relevant lines (stripped entry points etc.).

NOTE: all methods compiled with java 1.8.0_91 unless otherwise stated.

Method AndSC with default options

  # {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest'
...
0x0000000002923e3e: cmp %r8d,%r9d
0x0000000002923e41: movabs $0x16da0a08,%rax ; {metadata(method data for {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest')}
0x0000000002923e4b: movabs $0x108,%rsi
0x0000000002923e55: jl 0x0000000002923e65
0x0000000002923e5b: movabs $0x118,%rsi
0x0000000002923e65: mov (%rax,%rsi,1),%rbx
0x0000000002923e69: lea 0x1(%rbx),%rbx
0x0000000002923e6d: mov %rbx,(%rax,%rsi,1)
0x0000000002923e71: jl 0x0000000002923eb0 ;*if_icmplt
; - AndTest::AndSC@2 (line 22)

0x0000000002923e77: cmp %edi,%r9d
0x0000000002923e7a: movabs $0x16da0a08,%rax ; {metadata(method data for {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest')}
0x0000000002923e84: movabs $0x128,%rsi
0x0000000002923e8e: jg 0x0000000002923e9e
0x0000000002923e94: movabs $0x138,%rsi
0x0000000002923e9e: mov (%rax,%rsi,1),%rdi
0x0000000002923ea2: lea 0x1(%rdi),%rdi
0x0000000002923ea6: mov %rdi,(%rax,%rsi,1)
0x0000000002923eaa: jle 0x0000000002923ec1 ;*if_icmpgt
; - AndTest::AndSC@7 (line 22)

0x0000000002923eb0: mov $0x0,%eax
0x0000000002923eb5: add $0x30,%rsp
0x0000000002923eb9: pop %rbp
0x0000000002923eba: test %eax,-0x1c73dc0(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923ec0: retq ;*ireturn
; - AndTest::AndSC@13 (line 25)

0x0000000002923ec1: mov $0x1,%eax
0x0000000002923ec6: add $0x30,%rsp
0x0000000002923eca: pop %rbp
0x0000000002923ecb: test %eax,-0x1c73dd1(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923ed1: retq

Method AndSC with -XX:PrintAssemblyOptions=intel option

  # {method} {0x00000000170a0810} 'AndSC' '(III)Z' in 'AndTest'
...
0x0000000002c26e2c: cmp r9d,r8d
0x0000000002c26e2f: jl 0x0000000002c26e36 ;*if_icmplt
0x0000000002c26e31: cmp r9d,edi
0x0000000002c26e34: jle 0x0000000002c26e44 ;*iconst_0
0x0000000002c26e36: xor eax,eax ;*synchronization entry
0x0000000002c26e38: add rsp,0x10
0x0000000002c26e3c: pop rbp
0x0000000002c26e3d: test DWORD PTR [rip+0xffffffffffce91bd],eax # 0x0000000002910000
0x0000000002c26e43: ret
0x0000000002c26e44: mov eax,0x1
0x0000000002c26e49: jmp 0x0000000002c26e38

Method AndNonSC with default options

  # {method} {0x0000000016da0908} 'AndNonSC' '(III)Z' in 'AndTest'
...
0x0000000002923a78: cmp %r8d,%r9d
0x0000000002923a7b: mov $0x0,%eax
0x0000000002923a80: jl 0x0000000002923a8b
0x0000000002923a86: mov $0x1,%eax
0x0000000002923a8b: cmp %edi,%r9d
0x0000000002923a8e: mov $0x0,%esi
0x0000000002923a93: jg 0x0000000002923a9e
0x0000000002923a99: mov $0x1,%esi
0x0000000002923a9e: and %rsi,%rax
0x0000000002923aa1: cmp $0x0,%eax
0x0000000002923aa4: je 0x0000000002923abb ;*ifeq
; - AndTest::AndNonSC@21 (line 29)

0x0000000002923aaa: mov $0x1,%eax
0x0000000002923aaf: add $0x30,%rsp
0x0000000002923ab3: pop %rbp
0x0000000002923ab4: test %eax,-0x1c739ba(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923aba: retq ;*ireturn
; - AndTest::AndNonSC@25 (line 30)

0x0000000002923abb: mov $0x0,%eax
0x0000000002923ac0: add $0x30,%rsp
0x0000000002923ac4: pop %rbp
0x0000000002923ac5: test %eax,-0x1c739cb(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923acb: retq

Method AndNonSC with -XX:PrintAssemblyOptions=intel option

  # {method} {0x00000000170a0908} 'AndNonSC' '(III)Z' in 'AndTest'
...
0x0000000002c270b5: cmp r9d,r8d
0x0000000002c270b8: jl 0x0000000002c270df ;*if_icmplt
0x0000000002c270ba: mov r8d,0x1 ;*iload_2
0x0000000002c270c0: cmp r9d,edi
0x0000000002c270c3: cmovg r11d,r10d
0x0000000002c270c7: and r8d,r11d
0x0000000002c270ca: test r8d,r8d
0x0000000002c270cd: setne al
0x0000000002c270d0: movzx eax,al
0x0000000002c270d3: add rsp,0x10
0x0000000002c270d7: pop rbp
0x0000000002c270d8: test DWORD PTR [rip+0xffffffffffce8f22],eax # 0x0000000002910000
0x0000000002c270de: ret
0x0000000002c270df: xor r8d,r8d
0x0000000002c270e2: jmp 0x0000000002c270c0
  • First of all, the generated ASM code differs depending on whether we choose the default AT&T syntax or the Intel syntax.
  • With AT&T syntax:
    • The ASM code is actually longer for the AndSC method, with every bytecode IF_ICMP* translated to two assembly jump instructions, for a total of 4 conditional jumps.
    • Meanwhile, for the AndNonSC method the compiler generates more straightforward code, where each bytecode IF_ICMP* is translated to only one assembly jump instruction, keeping the original count of 3 conditional jumps.
  • With Intel syntax:
    • The ASM code for AndSC is shorter, with just 2 conditional jumps (not counting the non-conditional jmp at the end). Actually it's just two CMPs, two conditional jumps (JL/JLE) and an XOR/MOV depending on the result.
    • The ASM code for AndNonSC is now longer than the AndSC one! However, it has just 1 conditional jump (for the first comparison), using the registers to directly combine the first result with the second, without any more jumps.

Conclusion after ASM code analysis

  • At AMD64 machine-language level, the & operator seems to generate ASM code with fewer conditional jumps, which might be better for high prediction-failure rates (random values for example).
  • On the other hand, the && operator seems to generate ASM code with fewer instructions (with the -XX:PrintAssemblyOptions=intel option, anyway), which might be better for really long loops with prediction-friendly inputs, where the smaller number of CPU cycles per comparison can make a difference in the long run.

