Simple Python Challenge: Fastest Bitwise XOR on Data Buffers
First Try

Using scipy.weave and SSE2 intrinsics gives a marginal improvement. The first invocation is a bit slower, since the code needs to be loaded from disk and cached; subsequent invocations are faster:
import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa, bb):
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    numpy.bitwise_xor(numpy.frombuffer(aa, dtype=numpy.uint64), b, b)
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa); // must use unaligned access
    xmm2 = _mm_load_si128(pb);  // numpy will align at 16 byte boundaries
    _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
    ++pa;
    ++pb;
}
"""

def inline_xor(aa, bb):
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    weave.inline(code, ["a", "b", "arr_size"], headers=['"emmintrin.h"'])
    return b.tostring()
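For readers who cannot build the weave variants (scipy.weave was Python 2 only and has since been removed from SciPy), the pure-NumPy approach above can be sketched on its own; the function name numpy_xor and the buffer sizes here are illustrative:

```python
import numpy as np
from os import urandom

SIZE = 2**20

def numpy_xor(aa, bb):
    # view both byte buffers as uint64 arrays (no copy) and XOR elementwise
    a = np.frombuffer(aa, dtype=np.uint64)
    b = np.frombuffer(bb, dtype=np.uint64)
    return np.bitwise_xor(a, b).tobytes()

aa = urandom(SIZE)
bb = urandom(SIZE)
out = numpy_xor(aa, bb)
```

This assumes the buffer length is a multiple of 8 bytes, as in the benchmark.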
Second Try
Taking into account the comments, I revisited the code to find out if the copying could be avoided. Turns out I read the documentation of the string object wrong, so here goes my second try:
support = """
#define ALIGNMENT 16
static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
    const char* end = in1 + n;
    while (in1 < end) {
        *out = *in1 ^ *in2;
        ++in1;
        ++in2;
        ++out;
    }
}
"""
code2 = """
PyObject* res = PyString_FromStringAndSize(NULL, real_size);
const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;
memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);
const __m128i* pa = (__m128i*)((char*)a + head);
const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
const __m128i* pb = (__m128i*)((char*)b + head);
__m128i xmm1, xmm2;
__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa);
    xmm2 = _mm_loadu_si128(pb);
    _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));
    ++pa;
    ++pb;
    ++pc;
}
memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
return_val = res;
Py_DECREF(res);
"""

def inline_xor_nocopy(aa, bb):
    real_size = len(aa)
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.frombuffer(bb, dtype=numpy.uint64)
    return weave.inline(code2, ["a", "b", "real_size"],
                        headers=['"emmintrin.h"'],
                        support_code=support)
The difference is that the output string is allocated inside the C code. It's impossible to guarantee that it is aligned at a 16-byte boundary as required by the SSE2 instructions, therefore the unaligned regions at the beginning and the end are XORed using byte-wise access.
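The head/tail bookkeeping in the C code above can be sketched in plain Python (the function name alignment_split is illustrative; it mirrors the tail/head arithmetic performed on the output buffer's address):

```python
ALIGNMENT = 16

def alignment_split(addr):
    # 'tail' is the buffer's offset within its 16-byte block;
    # 'head' is how many bytes must be handled byte-wise before
    # the first 16-byte boundary is reached.
    tail = addr % ALIGNMENT
    head = (ALIGNMENT - tail) % ALIGNMENT
    return head, tail
```

For an already-aligned address both values are zero, so the whole buffer goes through the SSE2 loop.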
The input data is handed in using numpy arrays anyway, because weave insists on copying Python str objects to std::strings. frombuffer doesn't copy, so this is fine, but the memory is not aligned at 16 bytes, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.

Instead of using _mm_store_si128, we use _mm_stream_si128, which makes sure that any writes are streamed to main memory as soon as possible; this way, the output array does not use up valuable cache lines.
Timings
As for the timings: the slow_xor entry in the first edit referred to my improved version (inline bitwise xor, uint64); I have removed that confusion. slow_xor now refers to the code from the original question. All timings are done for 1000 runs.

- slow_xor: 1.85s (1x)
- faster_slow_xor: 1.25s (1.48x)
- inline_xor: 0.95s (1.95x)
- inline_xor_nocopy: 0.32s (5.78x)
The code was compiled using gcc 4.4.3 and I've verified that the compiler actually uses the SSE instructions.
Fast XORing bytes in Python 3
When XORing bytes objects with one million elements each, this loop creates roughly one million temporary bytes objects and copies each byte, on average, roughly 500 thousand times from one temporary bytes to the next. Note that the exact same problem exists for strings (in many other languages, too). The string solution is to build a list of string parts and use ''.join at the end to concatenate them efficiently. You can do the same thing with bytes:
def bxor(b1, b2):  # use xor for bytes
    parts = []
    for b1, b2 in zip(b1, b2):
        parts.append(bytes([b1 ^ b2]))
    return b''.join(parts)
Alternatively, you can use a bytearray, which is mutable and can therefore avoid the problem. It also lets you skip allocating a new bytes object on every iteration; you can just append the byte/int.
def bxor(b1, b2):  # use xor for bytes
    result = bytearray()
    for b1, b2 in zip(b1, b2):
        result.append(b1 ^ b2)
    return result
You can alternatively return bytes(result) if you want/need a bytes object.
How to do bitwise exclusive or of two strings in Python?
You can convert the characters to integers and xor those instead:
l = [ord(a) ^ ord(b) for a, b in zip(s1, s2)]
Here's an updated function in case you need a string as a result of the XOR:
def sxor(s1, s2):
    # convert strings to a list of character pair tuples
    # go through each tuple, converting them to ASCII code (ord)
    # perform exclusive or on the ASCII code
    # then convert the result back to ASCII (chr)
    # merge the resulting array of characters as a string
    return ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(s1, s2))
See it working online: ideone
Bit-operations on large number of bytes
No matter what, it appears that each byte would have to be
- read from memory,
- modified in some fashion, and
- written back to memory.
You can save a bit (no pun intended) of time by operating on multiple bytes at a time, for example by performing the XOR operation on 4- or even 8-byte integers, hence dividing the overhead associated with the management of the loop by, roughly, a factor of 4 or 8; but this improvement would likely not amount to a significant gain for the overall algorithm.
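The multi-byte idea can be sketched in Python with the struct module (the function name xor_by_words is illustrative): process the input 8 bytes at a time and handle any remainder byte-wise.

```python
import struct

def xor_by_words(a: bytes, b: bytes) -> bytes:
    # XOR 8 bytes per loop iteration, amortizing the loop overhead
    n = len(a) // 8 * 8
    out = bytearray(len(a))
    for i in range(0, n, 8):
        (wa,) = struct.unpack_from("<Q", a, i)
        (wb,) = struct.unpack_from("<Q", b, i)
        struct.pack_into("<Q", out, i, wa ^ wb)
    for i in range(n, len(a)):  # leftover tail, byte-wise
        out[i] = a[i] ^ b[i]
    return bytes(out)
```

In CPython the interpreter overhead per iteration dominates, so the gain here is smaller than in a compiled language, which is the caveat made above.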
Additional improvements can be found by replacing the "native" bit operations (XOR, shifts, rotations and the like) of the CPU/language with reads of pre-computed values from a table. Beware, however, that these native operations are typically rather optimized, and that you must be very diligent in designing the equivalent operations externally and in measuring precisely the relative performance of these operations.
Edit: Oops, I just noted the [Python]
tag, and also the reference to numpy in another response.
Beware... while the Numpy bitwise array suggestion is plausible, it all depends on actual parameters of the problem at hand. For example a fair amount of time may be lost in lining up the underlying arrays implied by numpy's bitwise function.
See this Stack Overflow question which seems quite relevant. While focused on the XOR operation, this question provides quite a few actionable hints for both improving loops etc. and for profiling in general.
XOR each byte with 0x71
If you want to treat something as an array of bytes, then usually you want a bytearray
as it behaves as a mutable array of bytes:
b = bytearray(open('a.out', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
open('b.out', 'wb').write(b)
Indexing a bytearray returns an integer between 0x00 and 0xff, and modifying it in place avoids the need to create a list and join everything up again. Note also that the file was opened as binary ('rb'); in your example you use 'r', which isn't a good idea.
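For a fixed single-byte key like 0x71, a translation table is another option worth sketching: bytes.translate maps every byte through a precomputed 256-entry table in C, with no Python-level loop.

```python
# precompute table[i] = i ^ 0x71 for every possible byte value
table = bytes(i ^ 0x71 for i in range(256))

data = bytes([0x00, 0x71, 0xff])
encoded = data.translate(table)  # XOR each byte with 0x71
```

Since XOR with a fixed key is its own inverse, translating twice returns the original data.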
Python bitwise operation on large binary strings
I'm guessing that you do not like the idea of using integers because it obscures your underlying data. Plus, it makes it difficult to work with strings that start with '0' (because the leading zeros are trimmed off when converting to an integer), not to mention the subtleties of signed integers and endianness.
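If you do go the integer route, the leading-zero problem can be worked around by remembering the original width and padding the result back to it (a small sketch; xor_bitstrings is an illustrative name):

```python
def xor_bitstrings(s1: str, s2: str) -> str:
    # XOR two equal-length '0'/'1' strings via ints, then zero-pad
    # the result back to the original width so leading zeros survive
    assert len(s1) == len(s2)
    return format(int(s1, 2) ^ int(s2, 2), "0{}b".format(len(s1)))
```

The signedness and endianness concerns remain, which is why a dedicated type can still be preferable.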
Try using the bitarray module, which can be installed with pip: pip install bitarray.
from bitarray import bitarray

ba1 = bitarray('0' + '1'*100)
ba2 = bitarray('1' + '0'*100)
len(ba1)    # 101
len(ba2)    # 101
ba1[0]      # False
ba2[0]      # True
ba1 | ba2   # bitarray('1111111111...')
# get your string back
ba1.to01()  # "01111111..."
I can't speak for the efficiency. But at least it feels clear as to what you are working with.
It also works in Python 3.
Docs: https://pypi.python.org/pypi/bitarray/0.8.1