Performance wise, how fast are Bitwise Operators vs. Normal Modulus?
Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2
using a bitwise AND
instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).
In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.
Performance comparison of modulo operator and bitwise AND
Let's try to reproduce with JMH.
@Benchmark
@Measurement(timeUnit = TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public int first() throws IOException {
return i % 2;
}
@Benchmark
@Measurement(timeUnit = TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public int second() throws IOException {
return i & 0x1;
}
Okay, it is reproducable. The first
is slightly slower than the second
. Now let's figure out why. Run it with -prof perfnorm
:
Benchmark Mode Cnt Score Error Units
MyBenchmark.first avgt 50 2.674 ± 0.028 ns/op
MyBenchmark.first:CPI avgt 10 0.301 ± 0.002 #/op
MyBenchmark.first:L1-dcache-load-misses avgt 10 0.001 ± 0.001 #/op
MyBenchmark.first:L1-dcache-loads avgt 10 11.011 ± 0.146 #/op
MyBenchmark.first:L1-dcache-stores avgt 10 3.011 ± 0.034 #/op
MyBenchmark.first:L1-icache-load-misses avgt 10 ≈ 10⁻³ #/op
MyBenchmark.first:LLC-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:LLC-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:LLC-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:LLC-stores avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:branch-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:branches avgt 10 4.006 ± 0.054 #/op
MyBenchmark.first:cycles avgt 10 9.322 ± 0.113 #/op
MyBenchmark.first:dTLB-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:dTLB-loads avgt 10 10.939 ± 0.175 #/op
MyBenchmark.first:dTLB-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:dTLB-stores avgt 10 2.991 ± 0.045 #/op
MyBenchmark.first:iTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:iTLB-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:instructions avgt 10 30.991 ± 0.427 #/op
MyBenchmark.second avgt 50 2.263 ± 0.015 ns/op
MyBenchmark.second:CPI avgt 10 0.320 ± 0.001 #/op
MyBenchmark.second:L1-dcache-load-misses avgt 10 0.001 ± 0.001 #/op
MyBenchmark.second:L1-dcache-loads avgt 10 11.045 ± 0.152 #/op
MyBenchmark.second:L1-dcache-stores avgt 10 3.014 ± 0.032 #/op
MyBenchmark.second:L1-icache-load-misses avgt 10 ≈ 10⁻³ #/op
MyBenchmark.second:LLC-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:LLC-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:LLC-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:LLC-stores avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:branch-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:branches avgt 10 4.014 ± 0.045 #/op
MyBenchmark.second:cycles avgt 10 8.024 ± 0.098 #/op
MyBenchmark.second:dTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:dTLB-loads avgt 10 10.989 ± 0.161 #/op
MyBenchmark.second:dTLB-store-misses avgt 10 ≈ 10⁻⁶ #/op
MyBenchmark.second:dTLB-stores avgt 10 3.004 ± 0.042 #/op
MyBenchmark.second:iTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:iTLB-loads avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:instructions avgt 10 25.076 ± 0.296 #/op
Note the difference in cycles and instructions. And now that's kind of obvious. The first
does care about the sign, but the second
does not (just bitwise and). To make sure this is the reason take a look at the assembly fragment:
first:
0x00007f91111f8355: mov 0xc(%r10),%r11d ;*getfield i
0x00007f91111f8359: mov %r11d,%edx
0x00007f91111f835c: and $0x1,%edx
0x00007f91111f835f: mov %edx,%r10d
0x00007f6bd120a6e2: neg %r10d
0x00007f6bd120a6e5: test %r11d,%r11d
0x00007f6bd120a6e8: cmovl %r10d,%edx
second:
0x00007ff36cbda580: mov $0x1,%edx
0x00007ff36cbda585: mov 0x40(%rsp),%r10
0x00007ff36cbda58a: and 0xc(%r10),%edx
Are there any performance costs associated with using bitwise operators?
The only way you can do it is by measuring - although I doubt doing bitwise operations in JavaScript will get you much over an array/object.
I'd try:
var len = 100000;
var bitwise = 0xdeadbeef;
var arrays = [1,1,0,1,1,1,1,0,1,0,1,0,1,1,0,1,1,0,1,1,1,1,1,0,1,1,1,0,1,1,1,1];
var start = new Date().getTime();
for (var i = 0; i < len; i++) {
arrays[i % 32]; //Uncomment to test this
//bitwise & (1 << (i % 32)); //Uncomment to test this
}
var end = new Date().getTime();
console.log((end-start)/len +"ms per access");
I got (Chrome, Windows 7, Core i7 920, 6 GB RAM):
0.0015 ms per access for arrays
0.0016 ms per access for bit shifting.
You make the decision - it is almost insignificant. Use Arrays for slightly better speed, use bitwise packing if space (bandwidth) is to be conserved.
Bitwise and in place of modulus operator
First of all, it's actually not accurate to say that
x % 2 == x & 1
Simple counterexample: x = -1
. In many languages, including Java, -1 % 2 == -1
. That is, %
is not necessarily the traditional mathematical definition of modulo. Java calls it the "remainder operator", for example.
With regards to bitwise optimization, only modulo powers of two can "easily" be done in bitwise arithmetics. Generally speaking, only modulo powers of base b can "easily" be done with base b representation of numbers.
In base 10, for example, for non-negative N
, N mod 10^k
is just taking the least significant k
digits.
References
- JLS 15.17.3 Remainder Operator %
- Wikipedia/Modulo Operation
Why were bitwise operations slightly faster than addition/subtraction operations on older microprocessors?
In any binary bitwise operation, each output bit depends only on the two corresponding bits in the inputs. In an add operation, each output bit depends on the corresponding bits in the inputs and all the bits to the right (toward lower values).
For example, the leftmost bit of 01111111 + 00000001 is 1, but the leftmost bit of 01111110 + 00000001 is 0.
In its simplest form, an adder adds the two low bits and produces one output bit and a carry. Then the next two lowest bits are added, and the carry is added in, producing another output bit and another carry. This repeats. So the highest output bit is at the end of a chain of adds. If you do the operation bit by bit, as older processors did, then it takes time to get to the end.
There are ways to speed this up some, by feeding several input bits into more complicated logic arrangements. But that of course requires more area in the chip and more power.
Today‘s processors have many different units for performing various sorts of work—loads, stores, addition, multiplication, floating-point operations, and more. Given today’s capabilities, the work of doing an add is small compared to other tasks, so it fits within a single processor cycle.
Perhaps in theory you could make a processor that did a bitwise operation more quickly than an add. (And there are, at least on paper, exotic processors that operate asynchronously, with different units doing work at their own paces.) However, with the designs in use, you need some regular fixed cycle to coordinate many things in the processor—loading instructions, dispatching them to execution units, sending results from execution units to registers, and much, much more. Some execution units do require multiple cycles to complete their jobs (e.g., some floating-point units take about four cycles to do a floating-point add). So you can have a mix. However, with current scales, making the cycle time smaller so that it fits a bitwise operation but not an add is likely not economical.
Why are bitwise operators slower than multiplication/division/modulo?
*
, %
, and /
all have fast paths for single-"limb" integers. <<
, >>
, and &
don't. They're going through the general-purpose arbitrary-precision code path.
Are bitwise operators on different int widths in C well defined?
As per C11
standard, chapter §6.5.12, Bitwise inclusive OR operator
Each of the operands shall have integer type.
and
The usual arithmetic conversions are performed on the operands.
So, the bitwise operation should be fine.
However, in case of printf()
, the %d
expects int
argument and you're supplying unsigned int
value. That is undefined behavior.
You can use PRIu32
macro to print uint32_t
from inttypes.h
.
Why bitwise operators are popular in sport programming
Code golf isn't focused on bitwise operators, is about code length.
Bitwise operators are not necessary faster, but generally fast enough. They are concise (and usually harder to read, but that's side effect).
~~
is a shorter (and usually more preformant) alternative to parseInt
with a considerable number of remarks. In regular (non-golf) code it should be used only if it provides the behaviour that is more desirable than parseInt
or in performance-sensitive context.
~a
is roughly equal to parseInt(a) * -1 - 1
. It can be used as a shorter alternative to ~~a
in this particular example, s(b) - s(a)
, because * -1 - 1
part is eliminated on subtraction (the sign should be taken into account).
Any significant performance improvement by using bitwise operators instead of plain int sums in C#?
The inline if
operator (?
, :
) will generate nearly the same IL as the standard if
list in the second example. The only difference you will see here are the particular operations the processor will be doing, and I can bet that ADD
is quicker than SHL
.
Since you're going to be adding the results anyway, I would opt for the second example (plus it makes it far easier to read).
EDIT
I just checked the IL of both examples, and it goes against what I have said above.
The first example generates far less IL (34 lines less) so you'll have to run a performance test to really determine if it is faster too.
Related Topics
What Does an Ampersand After This Assignment Operator Mean
C++ Trying to Swap Values in a Vector
Using Std::Move() When Returning a Value from a Function to Avoid to Copy
Eclipse C++:"Program "G++" Not Found in Path"
Erase/Remove Contents from the Map (Or Any Other Stl Container) While Iterating It
How to Access Variables Defined and Declared in One Function in Another Function
How to Break Shared_Ptr Cyclic Reference Using Weak_Ptr
Is It Wise to Ignore Gcc/Clang's "-Wmissing-Braces" Warning
Interesting Behavior of Compiler with Namespaces
Namespaces and Operator Overloading in C++
Utilizing C++ in iOS and MAC Os X Applications
Switch "Transfer of Control Bypasses Initialization Of:" When Calling a Function
How to Pass Derived Classes by Reference to a Function Taking Base Class as a Parameter
Fastest Inline-Assembly Spinlock
Win32: How to Hide 3Rd Party Windows in Taskbar by Hwnd