Performance-Wise, How Fast Are Bitwise Operators vs. Normal Modulus?

Performance-wise, how fast are bitwise operators vs. normal modulus?

Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).

In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler and a reasonably competent optimizer. "Reasonably" has a pretty broad definition here, too: even quite a few compilers that are decades old handle this sort of micro-optimization with no difficulty at all.
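
For a signed dividend the compiler cannot emit a bare AND, because i % 2 must be negative when i is. Here is a minimal Java sketch of the branchless pattern compilers use instead; the method name is invented for illustration, and the body mirrors the and/neg/test/cmovl sequence shown in the assembly further down:

static int remainderOfTwo(int i) {
    int r = i & 1;          // the low bit is already the answer for non-negative i
    return i < 0 ? -r : r;  // negate it for negative i, since i % 2 keeps the dividend's sign
}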

Performance comparison of modulo operator and bitwise AND

Let's try to reproduce this with JMH.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class MyBenchmark {

    int i = 10; // the field both benchmarks read; any value works

    @Benchmark
    @Measurement(timeUnit = TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int first() {
        return i % 2;
    }

    @Benchmark
    @Measurement(timeUnit = TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int second() {
        return i & 0x1;
    }
}

Okay, it is reproducible: the first is slightly slower than the second. Now let's figure out why. Run it with -prof perfnorm.
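
Assuming a standard JMH uberjar build (the jar path below is the Maven archetype default, not taken from the original post), the invocation looks like this:

java -jar target/benchmarks.jar MyBenchmark -prof perfnorm

This produces: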

Benchmark                                  Mode  Cnt   Score    Error  Units
MyBenchmark.first                          avgt   50   2.674 ±  0.028  ns/op
MyBenchmark.first:CPI                      avgt   10   0.301 ±  0.002   #/op
MyBenchmark.first:L1-dcache-load-misses    avgt   10   0.001 ±  0.001   #/op
MyBenchmark.first:L1-dcache-loads          avgt   10  11.011 ±  0.146   #/op
MyBenchmark.first:L1-dcache-stores         avgt   10   3.011 ±  0.034   #/op
MyBenchmark.first:L1-icache-load-misses    avgt   10  ≈ 10⁻³            #/op
MyBenchmark.first:LLC-load-misses          avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:LLC-loads                avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:LLC-store-misses         avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:LLC-stores               avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:branch-misses            avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:branches                 avgt   10   4.006 ±  0.054   #/op
MyBenchmark.first:cycles                   avgt   10   9.322 ±  0.113   #/op
MyBenchmark.first:dTLB-load-misses         avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:dTLB-loads               avgt   10  10.939 ±  0.175   #/op
MyBenchmark.first:dTLB-store-misses        avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:dTLB-stores              avgt   10   2.991 ±  0.045   #/op
MyBenchmark.first:iTLB-load-misses         avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:iTLB-loads               avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:instructions             avgt   10  30.991 ±  0.427   #/op
MyBenchmark.second                         avgt   50   2.263 ±  0.015  ns/op
MyBenchmark.second:CPI                     avgt   10   0.320 ±  0.001   #/op
MyBenchmark.second:L1-dcache-load-misses   avgt   10   0.001 ±  0.001   #/op
MyBenchmark.second:L1-dcache-loads         avgt   10  11.045 ±  0.152   #/op
MyBenchmark.second:L1-dcache-stores        avgt   10   3.014 ±  0.032   #/op
MyBenchmark.second:L1-icache-load-misses   avgt   10  ≈ 10⁻³            #/op
MyBenchmark.second:LLC-load-misses         avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:LLC-loads               avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:LLC-store-misses        avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:LLC-stores              avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:branch-misses           avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:branches                avgt   10   4.014 ±  0.045   #/op
MyBenchmark.second:cycles                  avgt   10   8.024 ±  0.098   #/op
MyBenchmark.second:dTLB-load-misses        avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:dTLB-loads              avgt   10  10.989 ±  0.161   #/op
MyBenchmark.second:dTLB-store-misses       avgt   10  ≈ 10⁻⁶            #/op
MyBenchmark.second:dTLB-stores             avgt   10   3.004 ±  0.042   #/op
MyBenchmark.second:iTLB-load-misses        avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:iTLB-loads              avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:instructions            avgt   10  25.076 ±  0.296   #/op

Note the difference in cycles and instructions: the first takes roughly one extra cycle and about six extra instructions per operation. The reason is now fairly obvious: the first has to care about the sign of the dividend, while the second is just a bitwise AND. To make sure this is the reason, take a look at the assembly fragments:

first:

0x00007f91111f8355: mov    0xc(%r10),%r11d   ;*getfield i
0x00007f91111f8359: mov    %r11d,%edx
0x00007f91111f835c: and    $0x1,%edx         ; edx = i & 1
0x00007f91111f835f: mov    %edx,%r10d
0x00007f6bd120a6e2: neg    %r10d             ; r10d = -(i & 1)
0x00007f6bd120a6e5: test   %r11d,%r11d       ; set flags from i
0x00007f6bd120a6e8: cmovl  %r10d,%edx        ; if i < 0, use the negated value

second:

0x00007ff36cbda580: mov    $0x1,%edx
0x00007ff36cbda585: mov    0x40(%rsp),%r10
0x00007ff36cbda58a: and    0xc(%r10),%edx    ; edx = i & 1, no sign fix-up needed

Are there any performance costs associated with using bitwise operators?

The only way you can tell is by measuring, although I doubt that bitwise operations in JavaScript will gain you much over an array/object.

I'd try:

var len = 100000;
var bitwise = 0xdeadbeef;
var arrays = [1,1,0,1,1,1,1,0,1,0,1,0,1,1,0,1,1,0,1,1,1,1,1,0,1,1,1,0,1,1,1,1];
var start = new Date().getTime();
for (var i = 0; i < len; i++) {
    arrays[i % 32];               // test 1: array lookup
    //bitwise & (1 << (i % 32));  // test 2: bit test (swap the comments to run this instead)
}
var end = new Date().getTime();
console.log((end - start) / len + "ms per access");

I got (Chrome, Windows 7, Core i7 920, 6 GB RAM):

0.0015 ms per access for arrays
0.0016 ms per access for bit shifting.

You make the decision: the difference is almost insignificant. Use arrays for slightly better speed; use bitwise packing if space (bandwidth) needs to be conserved.

Bitwise AND in place of the modulus operator

First of all, it's actually not accurate to say that

x % 2 == x & 1

Simple counterexample: x = -1. In many languages, including Java, -1 % 2 == -1. That is, % does not necessarily match the traditional mathematical definition of modulo; Java calls it the "remainder operator", for example.
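
A quick Java check of the counterexample; Math.floorMod is the standard-library way to get the always-non-negative mathematical modulus:

int x = -1;
System.out.println(x % 2);                // -1: Java's % keeps the dividend's sign
System.out.println(x & 1);                //  1: just the low bit
System.out.println(Math.floorMod(x, 2));  //  1: mathematical modulus; equals (x & 1) for a positive power-of-two divisor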

With regard to bitwise optimization, only modulo powers of two can "easily" be done with bitwise arithmetic. Generally speaking, only modulo powers of the base b can "easily" be done with a base-b representation of numbers.

In base 10, for example, for non-negative N, N mod 10^k is just the least significant k digits.
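
The same point in Java, for base 10 and base 2 (the value of n is chosen arbitrarily for illustration):

int n = 987654;
System.out.println(n % 1000);   // 654: the least significant 3 decimal digits (k = 3, base 10)
System.out.println(n & 0xFF);   // 6: n mod 2^8, the least significant 8 bits (k = 8, base 2)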

References

  • JLS §15.17.3, Remainder Operator %
  • Wikipedia: Modulo operation

Why were bitwise operations slightly faster than addition/subtraction operations on older microprocessors?

In any binary bitwise operation, each output bit depends only on the two corresponding bits in the inputs. In an add operation, each output bit depends on the corresponding bits in the inputs and all the bits to the right (the less significant positions).

For example, the leftmost bit of 01111111 + 00000001 is 1, but the leftmost bit of 01111110 + 00000001 is 0.

In its simplest form, an adder adds the two low bits and produces one output bit and a carry. Then the next two lowest bits are added, and the carry is added in, producing another output bit and another carry. This repeats. So the highest output bit is at the end of a chain of adds. If you do the operation bit by bit, as older processors did, then it takes time to get to the end.
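
To make the chain concrete, here is a minimal Java model of a ripple-carry adder. It is a sketch of the hardware idea in software, not how any real ALU or library is implemented:

// Adds two ints one bit position at a time, threading the carry through
// all 32 positions; the top sum bit is ready only after every lower
// position has produced its carry.
static int rippleCarryAdd(int a, int b) {
    int sum = 0;
    int carry = 0;
    for (int pos = 0; pos < 32; pos++) {
        int x = (a >>> pos) & 1;              // input bit of a at this position
        int y = (b >>> pos) & 1;              // input bit of b at this position
        sum |= (x ^ y ^ carry) << pos;        // output bit needs the incoming carry...
        carry = (x & y) | (carry & (x ^ y));  // ...which itself came from every lower position
    }
    return sum;
}

By contrast, a & b computes every output bit independently; there is no carry chain to wait for.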

There are ways to speed this up, by feeding several input bits into more complicated logic arrangements (carry-lookahead adders, for example). But that of course requires more area in the chip and more power.

Today's processors have many different units for performing various sorts of work: loads, stores, addition, multiplication, floating-point operations, and more. Given today's capabilities, the work of doing an add is small compared to other tasks, so it fits within a single processor cycle.

Perhaps in theory you could make a processor that did a bitwise operation more quickly than an add. (And there are, at least on paper, exotic processors that operate asynchronously, with different units doing work at their own paces.) However, with the designs in use, you need some regular fixed cycle to coordinate many things in the processor—loading instructions, dispatching them to execution units, sending results from execution units to registers, and much, much more. Some execution units do require multiple cycles to complete their jobs (e.g., some floating-point units take about four cycles to do a floating-point add). So you can have a mix. However, with current scales, making the cycle time smaller so that it fits a bitwise operation but not an add is likely not economical.

Why are bitwise operators slower than multiplication/division/modulo?

*, %, and / all have fast paths for single-"limb" integers (a limb being one machine-word-sized chunk of an arbitrary-precision number). <<, >>, and & don't; they go through the general-purpose arbitrary-precision code path.

Are bitwise operators on different int widths in C well defined?

As per the C11 standard, §6.5.12 (Bitwise inclusive OR operator):

Each of the operands shall have integer type.

and

The usual arithmetic conversions are performed on the operands.

So, the bitwise operation should be fine.

However, in the case of printf(), %d expects an int argument and you're supplying an unsigned int value. That is undefined behavior.

You can use the PRIu32 macro from <inttypes.h> to print a uint32_t.

Why bitwise operators are popular in competitive ("sport") programming

Code golf isn't focused on bitwise operators; it's about code length.

Bitwise operators are not necessarily faster, but they are generally fast enough. They are concise (and usually harder to read, but that's a side effect).

~~ is a shorter (and usually more performant) alternative to parseInt, with a considerable number of caveats. In regular (non-golf) code it should be used only if it provides behaviour that is more desirable than that of parseInt, or in a performance-sensitive context.

~a is roughly equal to parseInt(a) * -1 - 1. It can be used as a shorter alternative to ~~a in this particular example, s(b) - s(a), because the * -1 - 1 part is eliminated by the subtraction (though the resulting sign flip has to be taken into account).
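
The underlying two's-complement identity, ~x == -x - 1, is language-independent; here is a quick Java illustration (variables chosen arbitrarily) of why the -1 terms cancel under subtraction:

int a = 5, b = 9;
System.out.println(~a);       // -6, i.e. -a - 1
System.out.println(~a - ~b);  //  4: the -1 terms cancel, leaving b - a (note the flipped sign)
System.out.println(b - a);    //  4, the same value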

Any significant performance improvement by using bitwise operators instead of plain int sums in C#?

The conditional operator (?:) will generate nearly the same IL as the standard if statement in the second example. The only difference you will see here is in the particular operations the processor will be doing, and I'd bet that ADD is quicker than SHL.

Since you're going to be adding the results anyway, I would opt for the second example (plus it is far easier to read).

EDIT
I just checked the IL of both examples, and it goes against what I said above.

The first example generates far less IL (34 fewer lines), so you'll have to run a performance test to determine whether it is actually faster as well.


