How to Run TensorFlow on Multiple Cores and Threads

How can I run Tensorflow on one single core?

To run Tensorflow on one single CPU thread, I use:

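import tensorflow as tf  # TensorFlow 0.12/1.x API
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)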

device_count limits the number of CPUs being used, not the number of cores or threads.

tensorflow/tensorflow/core/protobuf/config.proto says:

message ConfigProto {
// Map from device type name (e.g., "CPU" or "GPU" ) to maximum
// number of devices of that type to use. If a particular device
// type is not found in the map, the system picks an appropriate
// number.
map<string, int32> device_count = 1;

On Linux, you can run sudo dmidecode -t 4 | egrep -i "Designation|Intel|core|thread" to see how many CPUs, cores, and threads you have. For example, the machine below has 2 CPUs with 8 cores each and 2 threads per core, for a total of 2*8*2=32 threads:

fra@s:~$ sudo dmidecode -t 4 | egrep -i "Designation|Intel|core|thread"
Socket Designation: CPU1
Manufacturer: Intel
HTT (Multi-threading)
Version: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Core Count: 8
Core Enabled: 8
Thread Count: 16
Hardware Thread
Socket Designation: CPU2
Manufacturer: Intel
HTT (Multi-threading)
Version: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Core Count: 8
Core Enabled: 8
Thread Count: 16
Hardware Thread
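
The arithmetic above can be checked with a short script. This is only a minimal sketch: it assumes the dmidecode output contains the "Core Count" and "Thread Count" fields shown above, the helper name count_cores_and_threads is made up for illustration, and the sample text is trimmed from the output above.

```python
import os
import re

# Sample text trimmed from the dmidecode output above (two sockets).
SAMPLE = """\
Socket Designation: CPU1
Core Count: 8
Core Enabled: 8
Thread Count: 16
Socket Designation: CPU2
Core Count: 8
Core Enabled: 8
Thread Count: 16
"""

def count_cores_and_threads(dmidecode_text):
    """Return (sockets, total_cores, total_threads) summed over all sockets."""
    cores = [int(n) for n in
             re.findall(r"^\s*Core Count:\s*(\d+)", dmidecode_text, re.M)]
    threads = [int(n) for n in
               re.findall(r"^\s*Thread Count:\s*(\d+)", dmidecode_text, re.M)]
    return len(cores), sum(cores), sum(threads)

sockets, total_cores, total_threads = count_cores_and_threads(SAMPLE)
print(sockets, total_cores, total_threads)  # 2 16 32

# os.cpu_count() reports the logical CPUs of the machine running this script,
# so it only matches total_threads when run on the host that produced SAMPLE.
print(os.cpu_count())
```

On the machine that produced the output, os.cpu_count() would be expected to agree with the summed thread count; no root access is needed for that call.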

Tested with Tensorflow 0.12.1 and 1.0.0 with Ubuntu 14.04.5 LTS x64 and Ubuntu 16.04 LTS x64.

TensorFlow Execution on a single (multi-core) CPU Device

Re ThreadPool: When TensorFlow uses DirectSession (as it does in your case), it uses Eigen's ThreadPool. I could not find a web link to the official version of Eigen used in TensorFlow, but here is a link to the thread pool code. This thread pool uses the RunQueue queue implementation, with one queue per thread.

Re inline_ready: Executor:Process is scheduled in some Eigen Thread. When it runs it executes some nodes. As these nodes are done, they make other nodes (tensorflow operations) ready. Some of these nodes are not expensive. They are added to inline_ready and executed in the same thread, without yielding. Other nodes are expensive and are not executed "immediately" in the same thread. Their execution is scheduled through the Eigen thread pool.
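
The inline-versus-scheduled split described above can be imitated with a toy executor. This is not TensorFlow's code, just a sketch of the idea in plain Python: the graph, the is_expensive flag, and the names run_graph and process are all hypothetical, and a standard ThreadPoolExecutor stands in for the Eigen thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy graph: node -> (is_expensive, successors).
GRAPH = {
    "a": (True, ["b", "c"]),
    "b": (False, ["d"]),
    "c": (True, ["d"]),
    "d": (False, []),
}

def run_graph(graph, roots, num_threads=2):
    """Run each node once all its inputs are done; return {node: thread name}."""
    pending = {node: 0 for node in graph}  # unfinished inputs per node
    for _, successors in graph.values():
        for succ in successors:
            pending[succ] += 1

    lock = threading.Lock()
    ran_on = {}
    all_done = threading.Event()

    with ThreadPoolExecutor(num_threads, thread_name_prefix="worker") as pool:

        def process(first_node):
            inline_ready = [first_node]
            while inline_ready:
                node = inline_ready.pop()
                with lock:
                    ran_on[node] = threading.current_thread().name
                    if len(ran_on) == len(graph):
                        all_done.set()
                for succ in graph[node][1]:
                    with lock:
                        pending[succ] -= 1
                        is_ready = pending[succ] == 0
                    if not is_ready:
                        continue
                    if graph[succ][0]:
                        pool.submit(process, succ)  # expensive: hand to the pool
                    else:
                        inline_ready.append(succ)  # cheap: run in this thread

        for root in roots:
            pool.submit(process, root)
        all_done.wait(timeout=10)  # keep the pool open until every node has run
    return ran_on

ran_on = run_graph(GRAPH, roots=["a"])
print(ran_on)
```

In this sketch the cheap node "b" always runs on the same worker as "a", while the expensive node "c" is submitted to the pool and may land on a different worker. The sketch assumes every node is reachable from the given roots.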

Re sync/async kernels: TensorFlow operations can be backed by synchronous kernels (most CPU kernels) or asynchronous kernels (most GPU kernels). Synchronous kernels are executed in the thread running Process. Asynchronous kernels are dispatched to their device (usually a GPU) to be executed. When asynchronous kernels are done, they invoke the NodeDone method.

Re Intra Op ThreadPool: The intra op thread pool is made available to kernels to run their computation in parallel. Most CPU kernels don't use it (and GPU kernels just dispatch to GPU) and run synchronously in the thread that called the Compute method. Depending on configuration there is either one intra op thread pool shared by all devices (CPUs), or each device has its own. Kernels simply schedule their work on this thread pool. Here is an example of one such kernel. If there are more tasks than threads, they are scheduled and executed in unspecified order. Here is the ThreadPool interface exposed to kernels.
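
To make the "kernels schedule their work on this thread pool" pattern concrete, here is a rough Python analogy. It is a sketch, not the TensorFlow interface: parallel_sum_of_squares is a made-up toy "kernel", and a standard ThreadPoolExecutor stands in for the intra op thread pool. In pure Python the shards will not actually speed up because of the GIL; only the scheduling pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum_of_squares(values, pool, num_shards):
    """Toy 'kernel': split the input into shards and reduce them on the pool."""
    shard_size = max(1, -(-len(values) // num_shards))  # ceiling division
    shards = [values[i:i + shard_size]
              for i in range(0, len(values), shard_size)]
    # Shards may execute in any order; the result only depends on their sum.
    partials = pool.map(lambda shard: sum(x * x for x in shard), shards)
    return sum(partials)

# One shared pool stands in for the intra op thread pool.
intra_op_pool = ThreadPoolExecutor(max_workers=2)
result = parallel_sum_of_squares(list(range(10)), intra_op_pool, num_shards=4)
intra_op_pool.shutdown()
print(result)  # 285
```

Here four shards are queued on two threads, which mirrors the case of more tasks than threads: the order in which shards run is unspecified, but the reduced result is the same.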

I don't know of any way tensorflow influences the scheduling of OS threads. You can ask it to do some spinning (i.e. not immediately yield the thread to OS) to minimize latency (from OS scheduling), but that is about it.

These internal details are intentionally undocumented because they are subject to change. If you are using TensorFlow through the Python API, all you should need to know is that your ops will execute when their inputs are ready. If you want to enforce some order beyond this, you should use:

with tf.control_dependencies(<tensors_that_you_want_computed_before_the_ops_inside_this_block>):

If you are writing a custom CPU kernel and want to parallelize work inside it (rarely needed, and usually only for very expensive kernels), the thread pool interface linked above is what you can rely on.

Keras/TF CPU creating too many threads

This is why tensorflow created many threads.

Using the two types of parallelism mentioned (inter-op and intra-op), you have limited control over the number of threads TensorFlow creates. The minimum number of threads you can get by setting these two variables is N, where N is the number of cores on your CPU (I don't know if you use a GPU).

intra_op_parallelism_threads = 1
inter_op_parallelism_threads = 1

Even setting the environment variables OMP_NUM_THREADS and MKL_NUM_THREADS does not help reduce the number of threads further.
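
To see how many threads a process really has, you can count them before and after creating a session. The sketch below is one way to do that, assuming Linux, where /proc/self/status exposes a "Threads:" field; the helper name count_os_threads is made up for illustration, and a plain Python thread stands in for whatever TensorFlow would create.

```python
import threading

def count_os_threads():
    """Return the number of OS threads in this process.

    Reads the "Threads:" field of /proc/self/status on Linux; elsewhere it
    falls back to threading.active_count(), which only sees Python threads.
    """
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return threading.active_count()

before = count_os_threads()

# Stand-in for the real workload: to measure TensorFlow, import it and
# create the session here instead of starting this thread.
stop = threading.Event()
worker = threading.Thread(target=stop.wait)
worker.start()
during = count_os_threads()
stop.set()
worker.join()

print(before, during)
```

Note that the fallback cannot see threads created by native code, so on non-Linux systems it would under-report threads started inside TensorFlow.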

The following discussions suggest that, without changing the TensorFlow source code, it is not possible to reduce the number of threads below N.

  • How can I confine TensorFlow C API to use one and only one thread in total
  • How to disable Tensorflow's multi-threading?
  • How to stop TensorFlow from multi-threading
