Why Is My Python App Stalled with 'System'/Kernel CPU Time

OK. I have the answer to my own question. Yes, it's taken me over 3 months to get this far.

It appears that GIL thrashing in Python is the reason for the massive 'system' CPU spikes and the associated pauses. Here is a good explanation of where the thrashing comes from. That presentation also pointed me in the right direction.

Python 3.2 introduced a new GIL implementation to avoid this thrashing. The result can be shown with a simple threaded example (taken from the presentation above):

from threading import Thread
import psutil

def countdown():
    # Pure CPU-bound loop; two threads running this fight over the GIL.
    n = 100000000
    while n > 0:
        n -= 1

t1 = Thread(target=countdown)
t2 = Thread(target=countdown)
t1.start(); t2.start()
t1.join(); t2.join()

# cpu_times() reports the 'user' and 'system' CPU consumed by this process.
print(psutil.Process().cpu_times())

On my MacBook Pro with Python 2.7.9 this uses 14.7s of 'user' CPU and 13.2s of 'system' CPU.

Python 3.4 uses 15.0s of 'user' (slightly more) but only 0.2s of 'system'.

So the GIL is still in place and the code still runs only as fast as it does single-threaded, but it avoids all the GIL contention of Python 2 that manifests as kernel ('system') CPU time. This contention, I believe, is what was causing the issues in the original question.
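For CPU-bound loops like the countdown above, the usual way to get real parallelism on either Python version is to use processes rather than threads, since each process has its own interpreter and GIL. A minimal sketch of that workaround (not from the presentation):

from multiprocessing import Process
import time

def countdown():
    n = 100000000
    while n > 0:
        n -= 1

if __name__ == "__main__":
    start = time.time()
    p1 = Process(target=countdown)
    p2 = Process(target=countdown)
    p1.start(); p2.start()
    p1.join(); p2.join()
    # Two processes means no shared GIL, so on a multi-core machine the
    # wall-clock time is roughly half that of the threaded version.
    print("wall time:", time.time() - start)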

Update

An additional cause of the CPU problem was found to be OpenCV/TBB. It is fully documented in this SO question.

High Kernel CPU when running multiple python programs

If the problem is in the kernel, you should narrow it down using a profiler such as OProfile or perf.

I.e. run perf record -a -g and then read the profiling data saved into perf.data using perf report. See also: linux perf: how to interpret and find hotspots.


In your case the high CPU usage is caused by competition for /dev/urandom: it allows only one thread to read from it at a time, but multiple Python processes are doing so.

Python's random module uses it only for initialization (seeding the PRNG). For example:

$ strace python -c 'import random;
while True:
    random.random()'
open("/dev/urandom", O_RDONLY) = 4
read(4, "\16\36\366\36}"..., 2500) = 2500
close(4)   <--- /dev/urandom is closed

You may also explicitly ask for /dev/urandom by using os.urandom or the SystemRandom class, so check any of your code that deals with random numbers.
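To illustrate the difference (a small sketch, not from the original answer): random.random() draws from an in-process Mersenne Twister and only touches /dev/urandom once at seeding time, while os.urandom and SystemRandom go back to the kernel on every call.

import os
import random

# Uses the in-process PRNG; /dev/urandom was only read once, to seed it.
x = random.random()

# These read from the OS entropy source on every call, so a tight loop
# across many processes can contend on /dev/urandom.
key = os.urandom(16)
sr = random.SystemRandom()
y = sr.random()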

Python / OpenCV application lockup issue

I've managed to get a thread dump from gdb right at the point where 40+ threads are showing 100% 'system' CPU time.

Here's the backtrace which is the same for every one of those threads:

#0  0x00007fffebe9b407 in cv::ThresholdRunner::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#1 0x00007fffecfe44a0 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#2 0x00007fffe967496a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#3 0x00007fffe96705a6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
#4 0x00007fffe966fc6b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
#5 0x00007fffe966d65f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
#6 0x00007fffe966d859 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#7 0x00007ffff76e9df5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff6d0e1ad in clone () from /lib64/libc.so.6

My original question put Python and Linux front and center, but the issue appears to lie with TBB and/or OpenCV. Since OpenCV with TBB is so widely used, I presume the problem also involves the interplay with my specific environment somehow, maybe because it's a 64-core machine.

I have recompiled OpenCV with TBB turned off and the problem has not reappeared so far, but my app now runs slower.
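If rebuilding OpenCV isn't practical, capping its internal thread pool at runtime may be a gentler mitigation. This is only a sketch of the general knob; I haven't verified that it avoids the TBB spin shown in the backtrace above:

import cv2

# Limit OpenCV's own parallel_for_ workers to a single thread. How this is
# honoured depends on the parallel backend (TBB, OpenMP, ...) OpenCV was
# built with.
cv2.setNumThreads(1)

print(cv2.getNumThreads())  # confirm the setting took effect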

I have posted this as a bug to OpenCV and will update this answer with anything that comes from that.

root.query_pointer()._data causes high CPU usage

To keep the CPU usage steady I moved display.Display().screen() before the loop so that it doesn't have to do that work on every iteration. The screen shouldn't change, so neither should that value, so it made sense to set it up once beforehand.

import time
from Xlib import display

# Look up the screen once, outside the loop.
disp = display.Display().screen()

while True:
    d = disp.root.query_pointer()._data
    print(d["root_x"], d["root_y"])
    time.sleep(0.1)

I've tested it and it stays at about 0.3% CPU for me.

Hope this helps :)

Trouble with TensorFlow in Jupyter Notebook

Update

The TensorFlow website describes five installation methods.

To my understanding, a direct pip installation is fine for importing TensorFlow in Jupyter Notebook (as long as Jupyter Notebook is installed and there are no other issues) because it doesn't create any virtual environment.

With the virtualenv and conda installs, you need to install jupyter into the newly created TensorFlow environment to make TensorFlow work in Jupyter Notebook (see the Original Post section below for more details).

I believe the docker install may require some port setup in VirtualBox to make TensorFlow work in Jupyter Notebook (see this post).

Installing from source also depends on which environment the code is built and installed into. If it's installed into a freshly created virtual environment, or one that doesn't have Jupyter Notebook, you also need to install Jupyter Notebook into that environment to use TensorFlow in Jupyter Notebook.

Original Post

To use tensorflow in IPython and/or Jupyter (IPython) Notebook, you'll need to install IPython and Jupyter (after installing tensorflow) inside the activated tensorflow environment.

Before installing IPython and Jupyter in the tensorflow environment, if you run the following commands in a terminal:

username$ source activate tensorflow

(tensorflow)username$ which ipython
(tensorflow)username$ /Users/username/anaconda/bin/ipython

(tensorflow)username$ which jupyter
(tensorflow)username$ /Users/username/anaconda/bin/jupyter

(tensorflow)username$ which python
(tensorflow)username$ /Users/username/anaconda/envs/tensorflow/bin/python

This is telling you that when you open python from the terminal, it uses the one installed in the environment where tensorflow is installed, which is why you can actually import tensorflow successfully. However, if you try to run ipython and/or jupyter notebook, they are not installed in the environment equipped with tensorflow, so they fall back to the regular environment, which has no tensorflow module, and you get an import error.

You can verify this by listing the items in the envs/tensorflow/bin directory:

(tensorflow) username$ ls /Users/username/anaconda/envs/tensorflow/bin/

You will see that neither "ipython" nor "jupyter" is listed there.
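Another quick check, from inside the notebook or ipython session itself, is to print which interpreter the kernel is actually running (a small diagnostic sketch, not from the original answer):

import sys

# If this prints .../anaconda/bin/python rather than
# .../anaconda/envs/tensorflow/bin/python, the kernel is running in the
# base environment and will not see the tensorflow package.
print(sys.executable)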

To use tensorflow with IPython and/or Jupyter Notebook, simply install them into the tensorflow environment:

(tensorflow) username$ conda install ipython
(tensorflow) username$ pip install jupyter #(use pip3 for python3)

After installing them, "jupyter" and "ipython" should show up in the envs/tensorflow/bin/ directory.

Notes:
Before trying to import the tensorflow module in a jupyter notebook, close the notebook, run "source deactivate tensorflow", and then reactivate it with "source activate tensorflow" to make sure things are "on the same page". Then reopen the notebook and try importing tensorflow. It should import successfully (it worked for me at least).

Htop cpu bar red, 100% kernel time

I solved the problem and found possible causes.

  1. High CPU usage means the CPU is busy, so no disk I/O limitation is occurring.

  2. Low GPU usage means the GPU is not being fed properly.

  3. This means RAM is the most likely bottleneck in my case.

As mentioned in the GitHub issue, multiple processes accessing the same Python object cause that object's refcount to change. In fork mode this dirties copy-on-write pages and triggers page allocation, which slows down system performance.

This system behavior cannot be detected by Python memory allocation profilers such as Memray (https://github.com/bloomberg/memray), but it might be detected by system-level memory tools such as Valgrind (https://valgrind.org/).

https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662

The final solution is to reduce access to Python objects from the forked processes.
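One common form of that fix, discussed in the linked PyTorch issue, is to keep large lookup structures as numpy arrays rather than lists of Python objects, so forked worker processes (e.g. a DataLoader in fork mode) touch far fewer refcounted objects. A hedged sketch of the idea; the class and field names here are illustrative:

import numpy as np
from torch.utils.data import Dataset

class PathsDataset(Dataset):
    def __init__(self, paths):
        # One contiguous numpy array instead of a Python list of str objects:
        # reading items does not bump a per-string refcount, so forked workers
        # do not dirty the parent's copy-on-write pages on every access.
        self.paths = np.array(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return str(self.paths[idx])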


