Why Does Multiprocessing Use Only a Single Core After I Import Numpy

Why does multiprocessing use only a single core after I import numpy?

After some more googling I found the answer here.

It turns out that certain Python modules (numpy, scipy, tables, pandas, skimage...) mess with core affinity on import. As far as I can tell, this problem seems to be specifically caused by them linking against multithreaded OpenBLAS libraries.

A workaround is to reset the task affinity using

os.system("taskset -p 0xff %d" % os.getpid())

With this line pasted in after the module imports, my example now runs on all cores:

[htop screenshot: the example now running on all cores after the workaround]

My experience so far has been that this doesn't seem to have any negative effect on numpy's performance, although this is probably machine- and task-specific.
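
For reference, here is a minimal sketch of where the workaround line goes. The work function and the 16-element input are placeholders of my own, and the 0xff mask assumes a machine with up to 8 cores:

import os

import numpy as np  # the import that (via OpenBLAS) resets the CPU affinity
from multiprocessing import Pool

# Reset affinity so the pool can use every core again
# (0xff allows CPUs 0-7; adjust the mask to your core count).
os.system("taskset -p 0xff %d" % os.getpid())


def work(x):
    return x * x  # placeholder workload


if __name__ == '__main__':
    with Pool() as pool:
        print(pool.map(work, range(16)))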

Update:

There are also two ways to disable the CPU affinity-resetting behaviour of OpenBLAS itself. At run-time you can use the environment variable OPENBLAS_MAIN_FREE (or GOTOBLAS_MAIN_FREE), for example

OPENBLAS_MAIN_FREE=1 python myscript.py

Or alternatively, if you're compiling OpenBLAS from source you can permanently disable it at build-time by editing the Makefile.rule to contain the line

NO_AFFINITY=1
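
If you would rather keep everything in the script, setting the variable from Python before numpy (and therefore OpenBLAS) is imported should have the same effect as setting it on the command line, since OpenBLAS reads it when the library is loaded - a minimal sketch:

import os

# Must be set before numpy/OpenBLAS is imported to have any effect.
os.environ["OPENBLAS_MAIN_FREE"] = "1"

import numpy as np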

Multiprocessing only utilizing a single core

The reason it only uses one process is simple: you only passed a 1-element list to Pool.map.

What Pool(n).map(function, iterable) does is apply the provided function to each element of the provided iterable (in this case, a list) using n worker processes.

Since nums contains only the single value 128, it only creates one task, so no other processes are ever used.

Proper usage would look like this:

from multiprocessing import pool


def work(val):
    # CPU-heavy dummy workload
    return hash(val ** 10000000)


if __name__ == '__main__':
    p = pool.Pool(8)

    with p:
        # Distribute the 30 inputs across the 8 worker processes
        output = p.map(work, [i for i in range(1, 31)])

    print(output)

This example uses 8 processes, and therefore up to 8 logical cores. Since we gave it the numbers from 1 to 30, p.map applies the function work to each of those numbers across the 8 processes, so up to 8 calls can run simultaneously.

When we run it, we can see the effect on CPU usage.


Of course it also uses some extra resources to communicate between the processes, etc. - I admit that I don't know the underlying details.


Side note: to make your code more efficient, you should avoid appending to lists as much as possible.

Check this simple example, which appends a lot of times and measures how long it takes.

>>> import timeit
>>> def appending():
...     output = []
...     for i in range(1000000):
...         output.append(i)
...     return output
...
>>> def list_comp():
...     return [i for i in range(1000000)]
...
>>> print(f"{timeit.timeit(appending, number=100):.2}")
8.1
>>> print(f"{timeit.timeit(list_comp, number=100):.2}")
5.2

As you can see, appending is quite a bit slower than the list comprehension - but this doesn't mean you should never use list.append - just don't use it excessively.

Why does my multiprocessing job in Python take longer than a single process?

I suspect that (perhaps due to memory overhead) your program is simply not CPU bound when running multiple copies, so process parallelism is not speeding it up as much as you might expect.

An easy way to test this is just to have something else do the parallelism and see if anything is improved.

With N = 8 it took 31.2 seconds on my machine. With N = 1 my machine took 7.2 seconds. I then just fired off 8 copies of the N = 1 version.

$ for i in $(seq 8) ; do python paralleltest &  done
...
Time taken = 32.07925891876221
Done
Time taken = 33.45247411727905
Done
Done
Done
Done
Done
Done
Time taken = 34.14084982872009
Time taken = 34.21410894393921
Time taken = 34.44455814361572
Time taken = 34.49029612541199
Time taken = 34.502259969711304
Time taken = 34.516881227493286

I also tried this with the multiprocessing machinery removed entirely and just a single call to f(0), and the results were quite similar. So Python's multiprocessing is doing what it can, but this job has constraints beyond just the CPU.

When I replaced the workload with something less memory-intensive (math.factorial(1400000)), the parallelism looked a lot better: 14.0 seconds for 1 copy, 16.44 seconds for 8 copies.
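
For illustration, a rough sketch of that kind of CPU-bound test; the Pool setup, the task count N and the timing scaffold are my assumptions, since the original script isn't shown here:

import math
import time
from multiprocessing import Pool

N = 8  # number of worker processes / tasks


def f(_):
    # CPU-heavy but memory-light workload
    return math.factorial(1400000) % 10


if __name__ == '__main__':
    start = time.time()
    with Pool(N) as pool:
        pool.map(f, range(N))
    print("Time taken =", time.time() - start)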

SLURM has the ability to make use of multiple hosts as well (depending on your configuration). The multiprocessing pool does not. This method will be limited to the resources on a single host.

multiprocessing is always worse than single process no matter how many

Importing numpy can change the CPU affinity of the parent process so that it only runs on one core. You can call os.system("taskset -p 0xff %d" % os.getpid()) after you import numpy to reset the CPU affinity so that all cores are used.

See this question for more details.

Multiprocessing: use only the physical cores?

I found a solution that doesn't involve changing the source code of a Python module. It uses the approach suggested here. One can check that only the physical cores are active after running that script by executing

lscpu

in a bash shell, which returns:

CPU(s):               8
On-line CPU(s) list:  0,2,4,6
Off-line CPU(s) list: 1,3,5,7
Thread(s) per core:   1

(One can also run the script linked above from within Python.) In any case, after running it, typing these commands in Python:

import multiprocessing
multiprocessing.cpu_count()

returns 4.
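
As an aside (not part of the answer above): if you only need a pool sized to the physical cores, without offlining the hyperthread siblings, the third-party psutil package is one common way to get that count - a minimal sketch:

import multiprocessing

import psutil

if __name__ == '__main__':
    physical = psutil.cpu_count(logical=False)   # physical cores only
    logical = multiprocessing.cpu_count()        # logical CPUs, including SMT siblings
    print(physical, logical)

    # Size the pool to the physical cores instead of the logical CPUs.
    with multiprocessing.Pool(physical) as pool:
        print(pool.map(abs, range(-4, 4)))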


