High Kernel CPU When Running Multiple Python Programs

High Kernel CPU when running multiple python programs

If the problem is in the kernel, you should narrow it down using a profiler such as OProfile or perf.

I.e. run perf record -a -g and then read the profiling data saved into perf.data using perf report. See also: linux perf: how to interpret and find hotspots.


In your case the high CPU usage is caused by competition for /dev/urandom -- it allows only one thread to read from it, but multiple Python processes are doing so.

The Python random module uses it only for initialization. I.e.:

$ strace python -c 'import random;
while True:
    random.random()'
open("/dev/urandom", O_RDONLY)      = 4
read(4, "\16\36\366\36}"..., 2500)  = 2500
close(4)                            <--- /dev/urandom is closed

You may also be explicitly asking for /dev/urandom by using os.urandom or the SystemRandom class, so check the parts of your code that deal with random numbers.
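As a rough illustration (not from the original answer; note that newer Python versions on Linux may use the getrandom() syscall instead of opening /dev/urandom directly), the difference looks like this:

import os
import random

# os.urandom asks the kernel for random bytes on every call, so calling it in a
# tight loop across many processes creates exactly this kind of kernel-side contention
token = os.urandom(16)

# random.SystemRandom draws every value from os.urandom as well,
# so it also goes through the kernel for each number
sr = random.SystemRandom()
value = sr.random()

# the module-level functions only touch the kernel once, to seed the
# Mersenne Twister; after that they run in pure user space
value2 = random.random()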

Why is my Python app stalled with 'system' / kernel CPU time

OK. I have the answer to my own question. Yes, it's taken me over 3 months to get this far.

It appears that GIL thrashing in Python is the reason for the massive 'system' CPU spikes and associated pauses. Here is a good explanation of where the thrashing comes from. That presentation also pointed me in the right direction.

Python 3.2 introduced a new GIL implementation to avoid this thrashing. The result can be shown with a simple threaded example (taken from the presentation above):

from threading import Thread
import psutil

def countdown():
    n = 100000000
    while n > 0:
        n -= 1

t1 = Thread(target=countdown)
t2 = Thread(target=countdown)
t1.start(); t2.start()
t1.join(); t2.join()

print(psutil.Process().cpu_times())

On my Macbook Pro with Python 2.7.9 this uses 14.7s of 'user' CPU and 13.2s of 'system' CPU.

Python 3.4 uses 15.0s of 'user' (slightly more) but only 0.2s of 'system'.

So, the GIL is still in place and the code still runs only as fast as a single-threaded version, but it avoids all the GIL contention of Python 2 that manifests as kernel ('system') CPU time. This contention, I believe, is what was causing the issues in the original question.

Update

An additional cause of the CPU problem was found to be OpenCV/TBB. Fully documented in this SO question.
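(Not part of the original answer, but if you hit the same OpenCV/TBB issue, OpenCV exposes a switch for its internal thread pool, which is the usual mitigation:)

import cv2

# disable OpenCV's internal threading; a positive number caps the pool instead
cv2.setNumThreads(0)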

Limit total CPU usage in python multiprocessing

The solution depends on what you want to do. Here are a few options:

Lower priorities of processes

You can nice the subprocesses. This way, although they will still eat 100% of the CPU, the OS gives preference to other applications when you start them. If you want to leave a work-intensive computation running in the background on your laptop and don't care about the CPU fan running all the time, then setting the nice value with psutil is your solution. This script is a test script which runs on all cores long enough for you to see how it behaves.

from multiprocessing import Pool, cpu_count
import math
import psutil
import os

def f(i):
    return math.sqrt(i)

def limit_cpu():
    "is called at every process start"
    p = psutil.Process(os.getpid())
    # set to lowest priority; this is Windows only, on Unix use p.nice(19)
    p.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)

if __name__ == '__main__':
    # start "number of cores" processes
    pool = Pool(None, limit_cpu)
    for p in pool.imap(f, range(10**8)):
        pass

The trick is that limit_cpu is run at the beginning of every process (see the initializer argument in the docs). Whereas Unix has nice levels from -20 (highest priority) to 19 (lowest priority), Windows has a few distinct priority classes. BELOW_NORMAL_PRIORITY_CLASS probably fits your requirements best; there is also IDLE_PRIORITY_CLASS, which tells Windows to run your process only when the system is idle.
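If the script has to run on both Windows and Unix, a cross-platform variant of limit_cpu (a sketch using the same psutil calls as above) could look like this:

import sys
import psutil

def limit_cpu():
    "initializer: drop the priority of the current worker process"
    p = psutil.Process()
    if sys.platform == 'win32':
        # or psutil.IDLE_PRIORITY_CLASS to run only when the system is idle
        p.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)
    else:
        p.nice(19)  # lowest priority on Unix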

You can view the priority if you switch to the Details tab in Task Manager and right-click on the process.


Lower number of processes

Although you have rejected this option, it might still be a good one: say you limit the number of subprocesses to half the CPU cores using pool = Pool(max(cpu_count()//2, 1)). The OS then initially runs those processes on half of the CPU cores, while the others stay idle or just run the other applications that are currently active. After a short time, the OS reschedules the processes and might move them to other CPU cores, and so on. Both Windows and Unix-based systems behave this way.

Windows: running 2 processes on 4 cores.

OSX: running 4 processes on 8 cores.

You can see that both operating systems balance the processes between the cores, although not evenly, so you still see a few cores with higher percentages than others.
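For completeness, a minimal Python 3 sketch of the halved pool described above, reusing the same f as in the first script:

from multiprocessing import Pool, cpu_count
import math

def f(i):
    return math.sqrt(i)

if __name__ == '__main__':
    # use only half of the available cores, but at least one
    with Pool(max(cpu_count() // 2, 1)) as pool:
        for _ in pool.imap(f, range(10**6)):
            pass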

Sleep

If you absolutely want to make sure that your processes never eat 100% of a certain core (e.g. if you want to prevent the CPU fan from spinning up), then you can call sleep in your processing function:

from time import sleep
import math

def f(i):
    sleep(0.01)
    return math.sqrt(i)

This makes the OS "schedule out" your process for 0.01 seconds per computation and makes room for other applications. If there are no other applications, the CPU core stays partly idle and will never reach 100%. You'll need to play around with different sleep durations, and they will also vary from computer to computer. If you want to make it very sophisticated, you could adapt the sleep depending on what cpu_times() reports.
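As a rough sketch of that last idea (this uses psutil's cpu_percent() rather than cpu_times() for simplicity, and the thresholds and durations are arbitrary):

import math
import time
import psutil

_proc = psutil.Process()  # one handle per worker process

def f(i):
    # back off harder when this worker has recently been burning a lot of CPU;
    # cpu_percent(interval=None) reports usage since the previous call
    # (the very first call returns 0.0, which is fine here)
    cpu = _proc.cpu_percent(interval=None)
    if cpu > 75:
        time.sleep(0.05)
    elif cpu > 25:
        time.sleep(0.01)
    return math.sqrt(i)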

How to limit number of CPU's used by a python script w/o terminal or multiprocessing library?

I solved the problem in the example code given in the original question by setting the BLAS environment variables (from this link). But this is not the answer to my actual question. My first try (second update) was wrong: I needed to set the number of threads not just before importing the numpy library, but before importing the library (IncrementalPCA) that imports numpy itself.

So, what was the problem in the example code? It wasn't an actual problem but a feature of the BLAS library that numpy uses. Trying to limit it with the multiprocessing library didn't work because, by default, OpenBLAS is set to use all available threads.

Credits: @Amir and @Darkonaut
Sources: OpenBLAS 1, OpenBLAS 2, Solution

import os
os.environ["OMP_NUM_THREADS"] = "1" # export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = "1" # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = "1" # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = "1" # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = "1" # export NUMEXPR_NUM_THREADS=1
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

import numpy as np

X, _ = load_digits(return_X_y=True)

# Copy-paste and increase the size of the dataset to see the behavior in htop.
for _ in range(8):
    X = np.vstack((X, X))

print(X.shape)
transformer = IncrementalPCA(n_components=7, batch_size=200)

transformer.partial_fit(X[:100, :])

X_transformed = transformer.fit_transform(X)

print(X_transformed.shape)

But you can also set only the relevant BLAS environment variable, by checking which BLAS implementation your numpy build uses, like this:

>>> import numpy as np
>>> np.__config__.show()

Gave these results...

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

...meaning OpenBLAS is used by my numpy build, so all I need to write is os.environ["OPENBLAS_NUM_THREADS"] = "2" in order to limit the number of threads used by the numpy library.
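The crucial detail, as noted at the top of this answer, is the ordering: the variable must be set before numpy (or anything that imports numpy) is loaded in that process. A minimal sketch:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "2"  # must come before numpy is imported

import numpy as np  # OpenBLAS reads the variable when it is loaded here
np.__config__.show()  # confirm which BLAS backend this build uses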

If statement based on percentage usage of certain CPU cores

Inside the for loop you should write:

if percentage > 50 and i < 4:
    # DO
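For context, here is a sketch of the kind of loop this answer assumes, using psutil's per-core percentages (the names percentage and i come from the question; the rest is an assumption):

import psutil

# one percentage per logical core, sampled over one second
for i, percentage in enumerate(psutil.cpu_percent(interval=1, percpu=True)):
    if percentage > 50 and i < 4:
        # DO: react only when one of the first four cores is above 50%
        print("core %d is at %.1f%%" % (i, percentage))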

Operating-System-Level Changes To Speed Up Python's Multiprocessing?

I don't believe your benchmarks are executing as independent tasks as you might think they are. You didn't show the code of the function, but I suspect it does some synchronization.

I wrote the following benchmark. If I run the script with either the --fork or the --mp option, I always get 400% CPU utilization (on my quad-core machine) and a comparable overall execution time of about 18 seconds. If called with the --threads option, however, the program effectively runs sequentially, achieving only about 100% CPU utilization and taking a minute to complete, for the reason mentioned by dave.

import multiprocessing
import os
import random
import sys
import threading

def find_lucky_number(x):
    prng = random.Random()
    prng.seed(x)
    for i in range(100000000):
        prng.random()
    return prng.randint(0, 100)

def with_threading(inputs):
    callback = lambda x: print(find_lucky_number(x))
    threads = [threading.Thread(target=callback, args=(x,)) for x in inputs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def with_multiprocessing(inputs):
    with multiprocessing.Pool(len(inputs)) as pool:
        for y in pool.map(find_lucky_number, inputs):
            print(y)

def with_forking(inputs):
    pids = list()
    for x in inputs:
        pid = os.fork()
        if pid == 0:
            print(find_lucky_number(x))
            sys.exit(0)
        else:
            pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)

if __name__ == '__main__':
    inputs = [1, 2, 3, 4]
    if sys.argv[1] == '--threads':
        with_threading(inputs)
    elif sys.argv[1] == '--mp':
        with_multiprocessing(inputs)
    elif sys.argv[1] == '--fork':
        with_forking(inputs)
    else:
        print("What should I do?", file=sys.stderr)
        sys.exit(1)

