Branchless Code That Maps Zero, Negative, and Positive to 0, 1, 2

int Compare(int x, int y) {
    return (x < y) + ((y < x) << 1);
}

(Note the parentheses: << binds more loosely than +, so without them the whole sum would be shifted.)

Edit: Bitwise only? Guess < and > don't count, then?

int Compare(int x, int y) {
    int diff = x - y;
    return (!!diff) | (!!(diff & 0x80000000) << 1);
}

But there's that pesky -.

Edit: Shift the other way around.

Meh, just to try again:

int Compare(int x, int y) {
    int diff = y - x;
    return (!!diff) << ((diff >> 31) & 1);
}

But I'm guessing there's no standard ASM instruction for !!. Also, the << can be replaced with +, depending on which is faster...

Bit twiddling is fun!

Hmm, I just learned about setnz.

I haven't checked the assembler output (but I did test it a bit this time), and with a bit of luck it could save a whole instruction!:

IN THEORY. MY ASSEMBLER IS RUSTY

subl   %edi, %esi        # esi = y - x  (SysV: x in %edi, y in %esi)
setnz  %al               # al = (diff != 0); setcc takes an 8-bit register
movzbl %al, %eax         # zero-extend !!diff into %eax
movl   %esi, %ecx
shrl   $31, %ecx         # cl = sign bit of diff (1 if x > y)
shll   %cl, %eax         # eax = (!!diff) << sign bit; the count must live in %cl
ret

Rambling is fun.

I need sleep.

n is negative, positive or zero? return 1, 2, or 4

First, if this variable is to be updated after (nearly) every instruction, the obvious piece of advice is this:

don't

Only update it when the subsequent instructions need its value. At any other time, there's no point in updating it.

But anyway, when we update it, what we want is this behavior:

R <  0  =>  CR0 == 0b001
R >  0  =>  CR0 == 0b010
R == 0  =>  CR0 == 0b100

Ideally, we won't need to branch at all. Here's one possible approach:

  1. Set CR0 to the value 1. (If you really want speed, investigate whether this can be done without fetching the constant from memory; even if you have to spend a couple of instructions on it, it may well be worth it.)
  2. If R >= 0, left-shift by one bit.
  3. If R == 0, left-shift by one bit.

Where steps 2 and 3 can be transformed to eliminate the "if" part:

CR0 <<= (R >= 0);
CR0 <<= (R == 0);

Is this faster? I don't know. As always, when you are concerned about performance, you need to measure, measure, measure.

However, I can see a couple of advantages of this approach:

  1. we avoid branches completely,
  2. we avoid memory loads/stores, and
  3. the instructions we rely on (bit shifting and comparison) should have low latency, which isn't always the case for multiplication, for example.

The downside is that we have a dependency chain between all three lines: Each modifies CR0, which is then used in the next line. This limits instruction-level parallelism somewhat.

To minimize this dependency chain, we could do something like this instead:

CR0 <<= ((R >= 0) + (R == 0));

so we only have to modify CR0 once, after its initialization.

Or, doing everything in a single line:

CR0 = 1 << ((R >= 0) + (R == 0));

Of course, there are a lot of possible variations of this theme, so go ahead and experiment.

Get 1, 0, -1 as positive, zero, or negative for an integer with math only

col/max(1, abs(col))

Ugly, but it works. For integers, that is. For floating-point values, where there's no well-defined smallest positive value, you're stuck unless the language lets you inspect the value as a bit sequence; then you can do the same thing with the sign bit and the significand.

Whether this helps optimising anything is highly debatable though. It certainly makes things harder to read.

Branchless way to avoid division by zero

Assuming you have exceptions turned off, it's a simple matter to mask the result:

xorpd    %xmm2, %xmm2        # xmm2 = 0.0
cmpneqsd %xmm1, %xmm2        # xmm2 = all-ones mask if the divisor (xmm1) != 0, else zero
divsd    %xmm1, %xmm0        # xmm0 /= xmm1 (may produce inf/NaN when xmm1 == 0)
andpd    %xmm2, %xmm0        # clear the result when the divisor was zero
ret

How to speed up array intersection in Java?

I don't think there's much you could do to improve the performance of that Java code. However, I would note that it is not doing the same thing as the C version. The C version puts the intersection into an array that was preallocated by the caller. The Java version allocates the array itself ... and then reallocates and copies to a smaller array when it is finished.

I guess you could change the Java version to make two passes over the input arrays, with the first pass working out how big the output array needs to be ... but whether that helps or hinders will depend on the inputs.

There might be other special cases you could optimize for; e.g. if there were likely to be long runs of numbers in one array with nothing in that range in the other array, you might be able to "optimistically" try to skip multiple numbers in one go; i.e. increment i or j by more than 1.


But they mention that "it is possible to improve this approach using branchless implementation". How would one do it? Using switch?

Not a Java switch ... or a conditional expression, because both involve branches when translated to native code.

I think he is referring to something like this: Branchless code that maps zero, negative, and positive to 0, 1, 2

FWIW it is a bad idea to try to do this kind of thing in Java. The problem is that the performance of tricky code sequences like that depends on details of the hardware architecture, instruction set, clock counts, etc. that vary from one platform to the next. The Java JIT compiler's optimizer can do a pretty good job of optimizing your code ... but if you include tricky sequences:

  1. it is not at all obvious or predictable how they will be translated to native code, and
  2. you may well find that the trickiness actually inhibits useful optimizations that the JIT compiler might otherwise be able to do.

Having said that, it is not impossible that some future release of Java might include a superoptimizer ... along the lines of the one mentioned on the linked Q&A above ... that would be able to generate branchless sequences automatically. But bear in mind that superoptimization is very expensive to perform.

From a given number, set a var to -1 if negative and to 1 if positive (without using if)

Assuming you're using something like C or C++ that will convert true to 1 and false to 0, you could use: direction = (distY > 0) - (distY < 0);. You didn't say what you wanted when distY=0 -- this gives 0 (which seems like the obvious choice to me, but who knows).

Of course, there's no guarantee it'll do any good -- that will depend on a combination of compiler, compiler flags, CPU, and maybe even the phase of the moon. OTOH, I'd guess it has more chance of doing good than harm anyway.


