Limit Number of Threads in Numpy

Limit number of threads in numpy

There are a few common multi-CPU libraries used for numerical computations, including inside NumPy. There are a few environment variables you can set before running your script to limit the number of CPUs they use.

Try setting all of the following:

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1

Sometimes it's a bit tricky to see where exactly multithreading is introduced.

Other answers show environment flags for other libraries. They may also work.
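
If you're not sure which of these libraries your numpy build actually pulls in, the threadpoolctl package can list them at runtime. A minimal sketch, assuming threadpoolctl is installed (pip install threadpoolctl):

import numpy  # import first so its BLAS/OpenMP libraries get loaded
from threadpoolctl import threadpool_info

# Prints one entry per thread pool (OpenBLAS, MKL, OpenMP, ...), including
# which API it belongs to and how many threads it is currently using.
for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])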

What am I setting when I limit the number of threads?

(This might be better as a comment, feel free to remove this if a better answer comes up, as it's based on my experience using the libraries.)

I had a similar issue when multiprocessing parts of my code. The numpy/scipy libraries appear to spin up extra threads for vectorised operations if they were compiled against BLAS or MKL (or if the conda repo you pulled them from also included a BLAS/MKL library), to accelerate certain calculations.

This is fine when running your script in a single process, since it will spawn threads up to the number specified by OPENBLAS_NUM_THREADS or MKL_NUM_THREADS (depending on whether you have a BLAS library or an MKL library; you can identify which with numpy.__config__.show()). But if you are explicitly using a multiprocessing.Pool, then you likely want to control the number of processes through multiprocessing itself. In that case it makes sense to set n=1 (before importing numpy and scipy), or some other small number, to make sure you are not oversubscribing:

import os

n = '1'  # must be set before numpy/scipy are imported
os.environ["OMP_NUM_THREADS"] = n
os.environ["MKL_NUM_THREADS"] = n

If you then create multiprocessing.Pool(processes=4), it will run 4 worker processes with n threads each, i.e. 4*n threads in total. In your case, it looks like you have a pool of 4 processes and each fires up 4 threads, hence the 16 python processes you see.
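
For illustration, here is a minimal sketch of that setup: 4 worker processes, each restricted to a single BLAS/OpenMP thread. The work() function and its matrix size are made up for the example.

import os
os.environ["OMP_NUM_THREADS"] = "1"   # must happen before numpy is imported
os.environ["MKL_NUM_THREADS"] = "1"

import multiprocessing as mp
import numpy as np

def work(seed):
    # Hypothetical workload: a dense matrix product that BLAS would
    # otherwise parallelise across all available cores.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((500, 500))
    return float((a @ a.T).sum())

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:   # 4 processes x 1 thread each
        print(pool.map(work, range(4)))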

The htop output shows 100% per logical CPU, since Linux treats each hardware thread as a CPU (I might be wrong in the terminology here). With 4 threads per process, the full load is actually 400%. That might not be maxed out, depending on the operations being performed (and on caching, as your machine looks hyperthreaded).

So if you're doing numpy/scipy operations in parts of the code that run in a single process/single thread, you are better off setting a larger n; for the multiprocessing sections, it is better to use a larger pool with a single (or small) n per worker. Unfortunately, if you're passing the limits in through environment variables, you can only set this once, at the beginning of your script. If you want to set it dynamically, I saw in a numpy issues discussion somewhere that you should use threadpoolctl (I'll add a link if I can find it again).
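
For reference, here is roughly what the threadpoolctl approach looks like; a sketch, assuming the threadpoolctl package is installed and numpy is linked against a supported BLAS:

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)

with threadpool_limits(limits=1, user_api="blas"):
    a @ a          # BLAS restricted to a single thread inside the block
a @ a              # back to the default thread count afterwards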

Setting number of threads in python

It works, but you have to set the environment variables before the first time you load the module in your script (including any of its submodules).

To be safe, you should do this in your main program before any other import:

import os
nthreads = 1
os.environ["OMP_NUM_THREADS"] = str(nthreads)
os.environ["OPENBLAS_NUM_THREADS"] = str(nthreads)
os.environ["MKL_NUM_THREADS"] = str(nthreads)
import numpy

An alternative is to set the environment variables before running the script as suggested by this answer.

One thing I thought was that reloading the module with importlib could do the trick to allow setting it dynamically, but no, it doesn't work.

If you can use pytorch instead of numpy, then you can use torch.set_num_threads, which is effective; check an example usage here.
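
For completeness, a tiny sketch of the pytorch route; unlike the environment variables, this can be changed at runtime:

import torch

torch.set_num_threads(1)          # intra-op CPU parallelism for torch ops
print(torch.get_num_threads())    # -> 1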

numpy OpenBLAS set maximum number of threads

Firstly, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of those are BLAS libraries, and numpy should only link to one of them at runtime. You should probably check which of these two numpy is actually using. You can do this by calling:

$ ldd <path-to-site-packages>/numpy/core/_dotblas.so

Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.
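
If you don't know where your site-packages directory is, something like the following should locate a candidate extension module to pass to ldd. This is only a sketch: the exact filename varies between numpy versions (multiarray.so, multiarray.cpython-*.so or _multiarray_umath.cpython-*.so), and the directory is numpy/core in the versions this answer targets.

import glob
import os
import numpy

# Look inside numpy's core directory for the compiled extension module
core_dir = os.path.join(os.path.dirname(numpy.__file__), "core")
print(glob.glob(os.path.join(core_dir, "*multiarray*.so")))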

For example, I link against OpenBLAS:

...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...

If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that in the Makefile.rule there is a commented option:

...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the the script.
# NUM_THREADS = 24
...

By default OpenBLAS will try to set the maximum number of threads to use automatically, but you could try uncommenting and editing this line yourself if it is not detecting this correctly.

Also, bear in mind that you will probably see diminishing returns in terms of performance from using more threads. Unless your arrays are very large it is unlikely that using more than 6 threads will give much of a performance boost because of the increased overhead involved in thread creation and management.

Set max number of threads at runtime on numpy/openblas

You can do this by calling the openblas_set_num_threads function using ctypes. I often find myself wanting to do this, so I wrote a little context manager:

import contextlib
import ctypes
from ctypes.util import find_library

# Prioritize hand-compiled OpenBLAS library over version in /usr/lib/
# from Ubuntu repos
try_paths = ['/opt/OpenBLAS/lib/libopenblas.so',
             '/lib/libopenblas.so',
             '/usr/lib/libopenblas.so.0',
             find_library('openblas')]
openblas_lib = None
for libpath in try_paths:
    try:
        openblas_lib = ctypes.cdll.LoadLibrary(libpath)
        break
    except OSError:
        continue
if openblas_lib is None:
    raise EnvironmentError('Could not locate an OpenBLAS shared library', 2)


def set_num_threads(n):
    """Set the current number of threads used by the OpenBLAS server."""
    openblas_lib.openblas_set_num_threads(int(n))


# At the time of writing these symbols were very new:
# https://github.com/xianyi/OpenBLAS/commit/65a847c
try:
    openblas_lib.openblas_get_num_threads()
    def get_num_threads():
        """Get the current number of threads used by the OpenBLAS server."""
        return openblas_lib.openblas_get_num_threads()
except AttributeError:
    def get_num_threads():
        """Dummy function (symbol not present), returns -1."""
        return -1


try:
    openblas_lib.openblas_get_num_procs()
    def get_num_procs():
        """Get the total number of physical processors."""
        return openblas_lib.openblas_get_num_procs()
except AttributeError:
    def get_num_procs():
        """Dummy function (symbol not present), returns -1."""
        return -1


@contextlib.contextmanager
def num_threads(n):
    """Temporarily changes the number of OpenBLAS threads.

    Example usage:

        print("Before: {}".format(get_num_threads()))
        with num_threads(n):
            print("In thread context: {}".format(get_num_threads()))
        print("After: {}".format(get_num_threads()))
    """
    old_n = get_num_threads()
    set_num_threads(n)
    try:
        yield
    finally:
        set_num_threads(old_n)

You can use it like this:

with num_threads(8):
    np.dot(x, y)

As mentioned in the comments, openblas_get_num_threads and openblas_get_num_procs were very new features at the time of writing, and might therefore not be available unless you compiled OpenBLAS from the latest version of the source code.

How do you stop numpy from multithreading?

Set the MKL_NUM_THREADS environment variable to 1. As you might have guessed, this environment variable controls the behavior of the Math Kernel Library which is included as part of Enthought's numpy build.

I just do this in my startup file, .bash_profile, with export MKL_NUM_THREADS=1. You should also be able to do it from inside your script to make it process-specific.

How to limit number of CPU's used by a python script w/o terminal or multiprocessing library?

I solved the problem in the example code given in the original question by setting the BLAS environment variables (from this link). But this is not the answer to my actual question. My first try (second update) was wrong. I needed to set the number of threads not just before importing numpy, but before importing the library (IncrementalPCA) that itself imports numpy.

So, what was the problem in the example code? It wasn't an actual problem but a feature of the BLAS library that numpy uses. Trying to limit it with the multiprocessing library didn't work because, by default, OpenBLAS uses all available threads.

Credits: @Amir and @Darkonaut
Sources: OpenBLAS 1, OpenBLAS 2, Solution

import os
os.environ["OMP_NUM_THREADS"] = "1" # export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = "1" # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = "1" # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = "1" # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = "1" # export NUMEXPR_NUM_THREADS=1
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

import numpy as np

X, _ = load_digits(return_X_y=True)

# Copy-paste and increase the size of the dataset to see the behavior in htop.
for _ in range(8):
    X = np.vstack((X, X))

print(X.shape)
transformer = IncrementalPCA(n_components=7, batch_size=200)

transformer.partial_fit(X[:100, :])

X_transformed = transformer.fit_transform(X)

print(X_transformed.shape)

But you can explicitly set only the relevant BLAS environment variable by checking which BLAS your numpy build actually uses, like this:

>>> import numpy as np
>>> np.__config__.show()

Gave these results...

blas_mkl_info:
    NOT AVAILABLE
blis_info:
    NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
    NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

...meaning OpenBLAS is used by my numpy build. And all I need to write is os.environ["OPENBLAS_NUM_THREADS"] = "2" in order to limit thread usage by the numpy library.

Is there a way to specify a maximum number of cores for a single Process with python multiprocessing?

It's up to your function to determine how many cores it uses (or rather is able to take advantage of), not your Process instance.

I think this is something of an inversion of how you're designing this, but processes are not locked to any particular core or set of cores. The operating system decides where and when they run, giving them time on the resources it manages (CPU, memory, network, disk, ...), which they effectively request through its API.

Unless your process goes out of its way to take advantage of multiple cores, it will only be able to use at most one core (one hardware thread) at a time, and will likely do less work overall than that core is capable of.

However, many third-party libraries (such as NumPy) will take advantage of more cores unless instructed otherwise, by creating additional threads, normally as many as the system supports by default. If this is the case, you can adjust the number of threads they use, usually via arguments or environment variables.
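
As a sketch of those two styles, using numexpr (already mentioned above via NUMEXPR_NUM_THREADS) as the example library; this assumes numexpr is installed:

import os
os.environ["NUMEXPR_NUM_THREADS"] = "2"   # environment variable, read at import time

import numexpr as ne
ne.set_num_threads(2)                     # or the explicit API call at runtime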

Take a look at these resources

  • Limit number of threads in numpy
  • What's the difference between ThreadPool vs Pool in the multiprocessing module?

Once created, you could set operating-system-level restrictions (cgroup settings on most Linux-like systems) on your new process to change how it's scheduled, provided the parent process has sufficient permissions to do so. This may result in worse performance than expected (for example, if you create 8 threads but restrict usage to 2 cores, time will be wasted switching between threads that cannot all make progress at once).
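
cgroups themselves are configured outside Python, but a related OS-level restriction, CPU affinity, can be set from within the process on Linux; a minimal sketch:

import os

os.sched_setaffinity(0, {0, 1})    # pin the current process (pid 0) to logical CPUs 0 and 1
print(os.sched_getaffinity(0))     # the set of CPUs the process may now run on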

See also

  • Using Cgroups to limit cpu usage
  • Niceness (nice on Wikipedia)

