Multithreaded Blas in Python/Numpy

Not all of NumPy uses BLAS, only some functions do -- specifically dot(), vdot(), and innerproduct(), as well as several functions from the numpy.linalg module. Also note that many NumPy operations are limited by memory bandwidth for large arrays, so an optimised implementation is unlikely to give any improvement there. Whether multi-threading can give better performance when you are limited by memory bandwidth depends heavily on your hardware.
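If you are not sure which BLAS your NumPy build actually links against, you can ask NumPy itself (the exact output format varies between NumPy versions and builds):

import numpy as np

# Prints the build configuration, including the BLAS/LAPACK libraries
# (e.g. OpenBLAS or MKL) that dot() and numpy.linalg dispatch to.
np.show_config()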

Multi-threaded integer matrix multiplication in NumPy/SciPy

Note that as this answer ages, NumPy might gain optimized integer support. Please verify that this answer still runs faster on your setup.

  • Option 5 - Roll a custom solution: Partition the matrix product into a few sub-products and perform these in parallel. This can be implemented relatively easily with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. Thus, it is possible to use threads, which are relatively lightweight and can access the arrays from the main thread for memory efficiency.

Implementation:

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    # np.dot(a, b, out)  # does not work. maybe because out is not C-contiguous?
    out[:] = np.dot(a, b)  # less efficient because the output is stored in a temporary array?


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func,
                                  args=(a_blocks[i, 0, :, :],
                                        b_blocks[0, j, :, :],
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))

With this implementation I get a speedup of approximately x4, which matches the number of physical cores in my machine:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken

How do you stop numpy from multithreading?

Set the MKL_NUM_THREADS environment variable to 1. As you might have guessed, this environment variable controls the behavior of the Math Kernel Library which is included as part of Enthought's numpy build.

I just do this in my startup file, .bash_profile, with export MKL_NUM_THREADS=1. You should also be able to do it from inside your script to make it process-specific.
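A minimal sketch of the process-specific variant: set the variable before NumPy (and hence the BLAS library) is imported for the first time. OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are the analogous knobs for non-MKL builds.

import os

# Must happen before the first "import numpy", because the BLAS library
# reads these variables when it is loaded.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # BLAS now stays single-threaded for this process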

Getting numpy.linalg.svd and numpy matrix multiplication to use multithreading

The main issue is that the size of the matrices is too small for threads to be really worth it on all platforms. Indeed, OpenBLAS uses OpenMP to create threads depending on the size of the matrix. Threads are generally created once, but the creation can take from dozens of microseconds to dozens of milliseconds depending on the machine (typically hundreds of microseconds on a regular PC). The more cores the machine has, the more threads have to be created, and so the bigger the overhead. When the OpenMP thread pool is reused there are still overheads to pay, mainly due to the distribution of the work and the synchronization between threads, though these overheads are generally significantly smaller (typically an order of magnitude smaller).

That being said, OpenBLAS clearly makes sub-optimal choices when the output matrix is tiny compared to the input ones (which is your case). Indeed, OpenBLAS can hardly know the parallel overhead before running the target kernel, so it has to make a choice: set a threshold, typically based on the size of the input matrices, that decides whether the kernel is executed sequentially or with multiple threads. This is critical so that very small kernels are still fast and huge ones remain competitive with other BLAS implementations. The trouble is that this threshold is not perfectly chosen. It looks like OpenBLAS only looks at the size of the output matrix, which is clearly sub-optimal for "thin" matrices like the ones in your code (eg. 50x1000000 @ 1000000x50). An empirical analysis shows that the threshold is arbitrarily set to 100x100 in your case: beyond this threshold OpenBLAS uses multiple threads, but not below it. The problem is that threads are already useful for significantly smaller matrices in your case on most platforms (eg. for 64x64x64x64 tensors).
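To find out where threading actually starts to pay off on your own machine, a rough benchmark sketch like the one below can help. It assumes the threadpoolctl package is available to cap the BLAS thread count at run time; the matrix sizes are only illustrative.

import numpy as np
from time import perf_counter
from threadpoolctl import threadpool_limits

# "Thin" matrices similar to the case discussed above; shrink the inner
# dimension if memory is tight.
a = np.random.randn(50, 1_000_000)
b = np.random.randn(1_000_000, 50)

for n_threads in (1, 2, 4):
    # Temporarily limit the number of threads the BLAS library may use.
    with threadpool_limits(limits=n_threads):
        start = perf_counter()
        a @ b
        print('{} thread(s): {:.3f} s'.format(n_threads, perf_counter() - start))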

This threshold is tuned by compile-time definitions like GEMM_MULTITHREAD_THRESHOLD, which is used in gemm.c (and gemv.c). Note that in the code the k dimension matters, but this is not what benchmarks show on my machine (possibly due to an older version of OpenBLAS being used). You can rebuild OpenBLAS with a smaller threshold (like 1 instead of 4).

An alternative solution is to use another BLAS implementation, like BLIS or the Intel MKL, which should use different thresholds (possibly better ones). A last solution is to write a dedicated implementation that efficiently computes the matrices in your code (possibly using Numba or Cython), but BLAS implementations are heavily optimized, so it is often hard to actually write faster code (unless you are very familiar with low-level optimizations, compilers, and modern processor architectures).
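For that last option, a minimal Numba sketch might look like the following. The thin_gram function and the sizes are purely illustrative, and a hand-rolled kernel like this will rarely beat a well-tuned BLAS outside the small-size regime where BLAS declines to use threads.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def thin_gram(a):
    # Computes a @ a.T for a thin (n x k) matrix, one parallel iteration per row.
    n, k = a.shape
    out = np.empty((n, n))
    for i in prange(n):
        for j in range(n):
            s = 0.0
            for l in range(k):
                s += a[i, l] * a[j, l]
            out[i, j] = s
    return out

a = np.random.randn(50, 100_000)
np.testing.assert_allclose(thin_gram(a), a @ a.T)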

Why isn't numpy.mean multithreaded?

I've been looking for ways to easily multithread some of my simple analysis code, since I had noticed that numpy was only using one core, despite the fact that it is supposed to be multithreaded.

Who says it's supposed to be multithreaded?

numpy is primarily designed to be as fast as possible on a single core, and to be as parallelizable as possible if you need to do so. But you still have to parallelize it.

In particular, you can operate on independent sub-objects at the same time, and slow operations release the GIL when possible—although "when possible" may not be nearly enough. Also, numpy objects are designed to be shared or passed between processes as easily as possible, to facilitate using multiprocessing.

There are some specialized methods that are automatically parallelized, but most of the core methods are not. In particular, dot is implemented on top of BLAS when possible, and BLAS is automatically parallelized on most platforms, but mean is implemented in plain C code.

See Parallel Programming with numpy and scipy for details.


So, how do you know which methods are parallelized and which aren't? And, of those which aren't, how do you know which ones can be nicely manually-threaded and which need multiprocessing?

There's no good answer to that. You can make educated guesses (X seems like it's probably implemented on top of ATLAS, and my copy of ATLAS is implicitly threaded), or you can read the source.

But usually, the best thing to do is try it and test. If the code is using 100% of one core and 0% of the others, add manual threading. If it's now using 100% of one core and 10% of the others and barely running faster, change the multithreading to multiprocessing. (Fortunately, Python makes this pretty easy, especially if you use the Executor classes from concurrent.futures or the Pool classes from multiprocessing. But you still often need to put some thought into it, and test the relative costs of sharing vs. passing if you have large arrays.)
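As a concrete illustration of the "add manual threading" step, here is a minimal sketch of a chunked mean built on ThreadPoolExecutor. It assumes the array is large enough that the per-chunk reductions (which spend their time in compiled code) dominate the threading overhead; the name threaded_mean is just an example.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_mean(a, n_chunks=4):
    # Split along the first axis, reduce each chunk in its own thread,
    # then combine the partial sums into a single mean.
    chunks = np.array_split(a, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as ex:
        partial_sums = list(ex.map(np.sum, chunks))
    return sum(partial_sums) / a.size

x = np.random.randn(10_000_000)
assert np.isclose(threaded_mean(x), np.mean(x))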

Also, as kwatford points out, just because some method doesn't seem to be implicitly parallel doesn't mean it won't be parallel in the next version of numpy, or the next version of BLAS, or on a different platform, or even on a machine with slightly different stuff installed on it. So, be prepared to re-test. And do something like my_mean = numpy.mean and then use my_mean everywhere, so you can just change one line to my_mean = pool_threaded_mean.


