Simple Python Challenge: Fastest Bitwise XOR on Data Buffers

First Try

Using scipy.weave and SSE2 intrinsics gives a marginal improvement. The first invocation is a bit slower, since the code needs to be loaded from the disk and cached; subsequent invocations are faster:

import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa, bb):
    b = numpy.fromstring(bb, dtype=numpy.uint64)  # copies bb into a writable array
    numpy.bitwise_xor(numpy.frombuffer(aa, dtype=numpy.uint64), b, b)  # XOR in place into b
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
xmm1 = _mm_loadu_si128(pa); // must use unaligned access
xmm2 = _mm_load_si128(pb); // numpy will align at 16 byte boundaries
_mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
++pa;
++pb;
}
"""

def inline_xor(aa, bb):
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    weave.inline(code, ["a", "b", "arr_size"], headers=['"emmintrin.h"'])
    return b.tostring()

Second Try

Taking into account the comments, I revisited the code to find out if the copying could be avoided. Turns out I read the documentation of the string object wrong, so here goes my second try:

support = """
#define ALIGNMENT 16
static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
const char* end = in1 + n;
while (in1 < end) {
*out = *in1 ^ *in2;
++in1;
++in2;
++out;
}
}
"""

code2 = """
PyObject* res = PyString_FromStringAndSize(NULL, real_size);

const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;

memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);

const __m128i* pa = (__m128i*)((char*)a + head);
const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
const __m128i* pb = (__m128i*)((char*)b + head);
__m128i xmm1, xmm2;
__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
while (pa < pend) {
xmm1 = _mm_loadu_si128(pa);
xmm2 = _mm_loadu_si128(pb);
_mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));
++pa;
++pb;
++pc;
}
memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
return_val = res;
Py_DECREF(res);
"""

def inline_xor_nocopy(aa, bb):
    real_size = len(aa)
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.frombuffer(bb, dtype=numpy.uint64)
    return weave.inline(code2, ["a", "b", "real_size"],
                        headers=['"emmintrin.h"'],
                        support_code=support)

The difference is that the output string is allocated inside the C code. There is no way to guarantee that it is aligned on a 16-byte boundary, as the aligned SSE2 instructions require, so the misaligned regions at the beginning and the end are handled with byte-wise access.

The input data is handed in as numpy arrays anyway, because weave insists on copying Python str objects to std::strings. frombuffer doesn't copy, so this is fine, but the memory is not aligned on a 16-byte boundary, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.
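A quick way to see that difference (a small sketch, not part of the benchmark): frombuffer returns a read-only view that shares the buffer's memory, whereas fromstring (deprecated in modern numpy) makes a copy:

import numpy

buf = b'\x00' * 32
view = numpy.frombuffer(buf, dtype=numpy.uint8)   # no copy: read-only view of buf
copy = numpy.fromstring(buf, dtype=numpy.uint8)   # makes a copy (deprecated in newer numpy)

print(view.flags.owndata)   # False -- the data belongs to the original bytes object
print(copy.flags.owndata)   # True  -- the array owns its own freshly allocated buffer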

Instead of using _mm_store_si128, we use _mm_stream_si128, which makes sure that any writes are streamed to main memory as soon as possible; this way, the output array does not use up valuable cache lines.

Timings

As for the timings: in the first edit, the slow_xor entry referred to my improved version (inline bitwise xor, uint64); I have removed that confusion. Here, slow_xor refers to the code from the original question. All timings are for 1000 runs (a rough harness for reproducing them is sketched after the list):

  • slow_xor: 1.85s (1x)
  • faster_slow_xor: 1.25s (1.48x)
  • inline_xor: 0.95s (1.95x)
  • inline_xor_nocopy: 0.32s (5.78x)
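A minimal harness for such measurements might look like this (a sketch: slow_xor comes from the original question and is not reproduced here, and a and b are assumed to be two SIZE-byte buffers from os.urandom):

import time
from os import urandom

def benchmark(fn, a, b, runs=1000):
    # time `runs` calls of fn(a, b) and return the total in seconds
    start = time.time()
    for _ in range(runs):
        fn(a, b)
    return time.time() - start

a, b = urandom(SIZE), urandom(SIZE)
for fn in (faster_slow_xor, inline_xor, inline_xor_nocopy):
    print("%s: %.2fs" % (fn.__name__, benchmark(fn, a, b)))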

The code was compiled using gcc 4.4.3 and I've verified that the compiler actually uses the SSE instructions.

fast XORing bytes in python 3

When XORing bytes objects with one million elements each, a loop that builds the result by repeatedly concatenating bytes objects creates roughly one million temporary bytes objects and copies each byte, on average, roughly 500 thousand times from one temporary to the next. Note that the exact same problem exists for strings (in many other languages, too). The string solution is to create a list of string parts and use ''.join at the end to concatenate them efficiently. You can do the same thing with bytes:

def bxor(b1, b2):  # use xor for bytes
    parts = []
    for x, y in zip(b1, b2):
        parts.append(bytes([x ^ y]))
    return b''.join(parts)

Alternatively, you can use a bytearray which is mutable and can therefore avoid the problem. It also allows you to not allocate a new bytes object on every iteration, you can just append the byte/int.

def bxor(b1, b2):  # use xor for bytes
    result = bytearray()
    for x, y in zip(b1, b2):
        result.append(x ^ y)
    return result

You can alternatively return bytes(result) if you want/need a bytes object.
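A hypothetical usage example for the bytearray version (the input values here are made up for illustration):

a = bytes([0x12, 0x34, 0x56])
b = bytes([0xff, 0x0f, 0xf0])

res = bxor(a, b)
res          # bytearray(b'\xed;\xa6')
bytes(res)   # b'\xed;\xa6'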

how to do bitwise exclusive or of two strings in python?

You can convert the characters to integers and xor those instead:

l = [ord(a) ^ ord(b) for a,b in zip(s1,s2)]

Here's an updated function in case you need a string as a result of the XOR:

def sxor(s1, s2):
    # convert strings to a list of character pair tuples
    # go through each tuple, converting them to ASCII code (ord)
    # perform exclusive or on the ASCII code
    # then convert the result back to ASCII (chr)
    # merge the resulting array of characters as a string
    return ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(s1, s2))

See it working online: ideone
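A small, hypothetical illustration; applying the same key string twice recovers the original, because XOR is its own inverse:

sxor('hello', 'world')                  # '\x1f\n\x1e\x00\x0b'
sxor(sxor('hello', 'world'), 'world')   # 'hello'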

Bit-operations on large number of bytes

No matter what, it appears that each byte would have to be

  • read from memory,
  • modified in some fashion, and
  • written back to memory.

You can save a bit (no pun intended) of time by operating on multiple bytes at a time, for example by performing the XOR on 4- or even 8-byte integers, which divides the loop-management overhead by roughly a factor of 4 or 8. Still, this improvement is unlikely to amount to a significant gain for the overall algorithm.
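For illustration, here is one possible pure-Python sketch of the word-at-a-time idea (Python 3; the function name and chunking strategy are mine, not from the original answer):

from array import array

def xor_words(data1, data2):
    # XOR 8 bytes per iteration as unsigned 64-bit words,
    # then handle any trailing bytes one at a time.
    n8 = len(data1) // 8 * 8
    words = array('Q', data1[:n8])
    for i, w in enumerate(array('Q', data2[:n8])):
        words[i] ^= w
    tail = bytes(x ^ y for x, y in zip(data1[n8:], data2[n8:]))
    return words.tobytes() + tail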

Additional improvements can sometimes be found by replacing the "native" bit operations (XOR, shifts, rotations and the like) of the CPU/language with lookups into pre-computed tables. Beware, however, that these native operations are typically well optimized, so you must be very diligent in designing the equivalent table-based operations and in measuring their relative performance precisely.
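As a deliberately naive illustration of the table idea (not from the original answer; whether it beats the native ^ operator has to be measured, as cautioned above):

# Precompute the XOR of every possible pair of bytes (a 64 K entry table).
XOR_TABLE = [a ^ b for a in range(256) for b in range(256)]

def xor_via_table(data1, data2):
    # Look up each byte pair instead of using ^ directly (Python 3 bytes inputs).
    return bytes(XOR_TABLE[(x << 8) | y] for x, y in zip(data1, data2))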

Edit: Oops, I just noted the [Python] tag, and also the reference to numpy in another response.

Beware... while the Numpy bitwise array suggestion is plausible, it all depends on actual parameters of the problem at hand. For example a fair amount of time may be lost in lining up the underlying arrays implied by numpy's bitwise function.
See this Stack Overflow question which seems quite relevant. While focused on the XOR operation, this question provides quite a few actionable hints for both improving loops etc. and for profiling in general.

xor each byte with 0x71

If you want to treat something as an array of bytes, then usually you want a bytearray as it behaves as a mutable array of bytes:

b = bytearray(open('a.out', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
open('b.out', 'wb').write(b)

Indexing a byte array returns an integer between 0x00 and 0xff, and modifying it in place avoids the need to create a list and join everything up again. Note also that the file was opened in binary mode ('rb'); in your example you used 'r', which isn't a good idea.
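A short illustration of that indexing behaviour (values chosen arbitrarily):

b = bytearray(b'ABC')
b[0]          # 65 -- indexing yields an int
b[0] ^ 0x71   # 48
b[0:1]        # bytearray(b'A') -- slicing yields another bytearray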

Python bitwise operation on large binary strings

I'm guessing that you do not like the idea of using integers because it obfuscates your underlying data. Plus, it makes it difficult to work with strings that start with '0' (because the leading zeros are trimmed off when converting to an integer), not to mention the subtleties of signed integers and endianness.
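To make the leading-zero point concrete (a tiny sketch):

s = '0011'
n = int(s, 2)
bin(n)            # '0b11' -- the two leading zeros are gone
format(n, '04b')  # '0011' -- recoverable only if you still know the original length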

Try using the bitarray module, which can be installed with pip: pip install bitarray.

from bitarray import bitarray
ba1 = bitarray('0' + '1'*100)
ba2 = bitarray('1' + '0'*100)

len(ba1) # 101
len(ba2) # 101
ba1[0] # False
ba2[0] # True

ba1 | ba2 # bitarray('1111111111.......)

# get your string back
ba1.to01() # "01111111......."

I can't speak to its efficiency, but at least it is clear what you are working with.

It also works in Python 3.

Docs: https://pypi.python.org/pypi/bitarray/0.8.1


