Running Simulation with Hyperthreading Doubles Runtime

Maybe the context switches produce more overhead, caused by 6 heavily calculating processes and only 4 real cores. If the processes compete for the CPU resources, they may use the CPU caches inefficiently.

If you enable only 4 cores instead of 6, what's the result?

Understanding software parallelization on a Linux workstation

If you have 8 CPU threads available and each of your programs consumes 100% of a single CPU, it does not make sense to run more than 8 programs at a time.

If your programs are multi-threaded, then you may want to have fewer than 8 processes running at a time. If your programs occasionally use less than 100% of a single CPU (perhaps if they're waiting for IO), then you may want to run more than 8 processes at a time.

Even if the process limit for your user is extremely high, other resources could be exhausted much sooner - for instance, RAM. If you launch 200 processes and they exhaust RAM, then the operating system will respond to requests for RAM by swapping some other process's memory out to disk; and now the computer needlessly crawls to a halt because 200 processes are waiting on IO to get their memory back from disk, only to have it written out again because some other process wants to run. This is called thrashing.

If your goal is to perform some batch computation, it does not make sense to load the computer any more than enough processes to keep all CPU cores at 100% utilization. Anything more is waste.
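
As a rough illustration, here is a minimal Python sketch of that sizing rule - it caps the number of worker processes at the number of logical CPUs and at what fits in RAM. The worker function is a hypothetical stand-in, and psutil is a third-party package assumed here only to query available memory:

import os
import concurrent.futures

import psutil  # third-party, assumed available; used only to check free RAM

PER_JOB_RAM = 2 * 1024**3  # assumed worst-case memory footprint per job: 2 GB

def worker(job):
    # hypothetical single-threaded, CPU-bound job
    return sum(i * i for i in range(10**6))

def run_batch(jobs):
    cpu_limit = os.cpu_count() or 1                               # logical CPU threads
    ram_limit = psutil.virtual_memory().available // PER_JOB_RAM  # jobs that fit in RAM
    n_workers = max(1, min(cpu_limit, ram_limit))
    with concurrent.futures.ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, jobs))

With max_workers chosen this way, all cores stay busy without oversubscribing the CPU or pushing the machine into swap.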

Edit - Clarification on terminology.

  • A single computer can have more than one CPU socket.
  • A single CPU can have more than one CPU core.
  • A single CPU core can support simultaneous execution of more than one stream of instructions. Hyperthreading is an example of this.
  • A stream of instructions is what we typically call a "thread", either in the context of the operating system, processes, or in the CPU.

So I could have a computer with 2 sockets, with each socket containing a 4-core CPU, where each of those CPUs supports hyperthreading and thus supports two threads per core.

Such a computer could execute 2 * 4 * 2 = 16 threads simultaneously.
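
On Linux you can check these numbers for yourself. A small sketch (assuming the third-party psutil package, since os.cpu_count() alone only reports logical CPUs; lscpu prints the same information as Socket(s), Core(s) per socket and Thread(s) per core):

import os

import psutil  # third-party, assumed available

logical = os.cpu_count()                    # hardware threads the OS can schedule on
physical = psutil.cpu_count(logical=False)  # physical cores across all sockets
print("logical CPUs (hardware threads):", logical)
print("physical cores:", physical)
if physical:
    print("threads per core:", logical // physical)

On the 2-socket example above, this would report 16 logical CPUs and 8 physical cores.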

A single process can have as many threads as it wants, until some resource is exhausted - raw RAM, internal operating system data structures, etc. Each process has at least one thread.

It's important to note that tricks like hyperthreading may not scale performance linearly. When you have non-hyperthreaded CPU cores, each core contains enough parts to execute a single stream of instructions all by itself; aside from memory access, it doesn't share anything with the other cores, and so performance can scale linearly.

However, each core has a lot of parts - and during some types of computations, some of those parts are inactive while others are active. During other types of computations, it could be the opposite. Doing a lot of floating-point math? Well, then the integer math unit in the core might be idle. Doing a lot of integer math? Well, then the floating-point math unit might be idle.

Hyperthreading seeks to increase performance, even if only a little bit, by exploiting these temporarily unused units within a core; while the floating-point unit is busy, schedule something that can use the integer unit.

...

What the operating system cares about when it comes to scheduling is how many threads across all processes are runnable. If I have one process with 3 runnable threads, a second process with one runnable thread, and a third process with 10 runnable threads, then the OS will want to run a total of 3 + 1 + 10 = 14 threads.

If there are more runnable program threads than there are CPU execution threads, then the operating system will run as many as it can, and the others will sit there doing nothing, waiting. Meanwhile, those programs and those threads may have allocated a bunch of memory.

Let's say I have a computer with 128 GB of RAM and CPU resources such that the hardware can execute a total of 16 threads at the same time. I have a program that uses 2 GB of memory to perform a simple simulation, that program creates only one thread to perform its execution, and each instance needs 100 s of CPU time to finish. What would happen if I were to try to run 16 instances of that program at the same time?

Each program would allocate 2 GB of RAM to hold its state (2 GB * 16 = 32 GB in total), and then begin performing its calculations. Since each program creates a single thread, and there are 16 CPU execution threads available, every program can run on the CPU without competing for CPU time. The total time we'd need to wait for the whole batch to finish would be 100 s: 16 processes / 16 CPU execution threads * 100 s.

Now what if I increase that to 32 programs running at the same time? Well, we'll allocate a total of 64 GB of RAM, and at any one point in time, only 16 of them will be running. This is fine; nothing bad will happen because we've not exhausted RAM (or, presumably, any other resource), and the programs will all run efficiently and eventually finish. Runtime will be approximately twice as long, at 200 s.

Ok, now what happens if we try to run 128 programs at the same time? We'll run out of memory: 128 * 2 GB = 256 GB of RAM, more than double what the hardware has. The operating system will respond by swapping memory to disk and reading it back in as needed, but it'll have to do this very frequently, and it'll have to wait for the disk.

If you had enough RAM, this would run in 800 s (128 / 16 * 100 s). Since you don't, it's very possible it could take an order of magnitude longer.
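
The arithmetic above is easy to reproduce. A tiny sketch using the same assumed numbers (16 hardware threads, 2 GB and 100 s per program, 128 GB of RAM):

import math

HW_THREADS = 16    # CPU execution threads
RAM_GB = 128       # installed memory
JOB_RAM_GB = 2     # memory footprint per program
JOB_TIME_S = 100   # CPU time each program needs

def estimate(n_jobs):
    waves = math.ceil(n_jobs / HW_THREADS)       # how many rounds of 16 are needed
    fits_in_ram = n_jobs * JOB_RAM_GB <= RAM_GB
    return waves * JOB_TIME_S, fits_in_ram

for n in (16, 32, 128):
    seconds, ok = estimate(n)
    print(n, "programs:", seconds, "s ideal,",
          "fits in RAM" if ok else "exceeds RAM - expect swapping")

This reproduces the 100 s, 200 s and 800 s figures; the 800 s case is only the ideal, because with 256 GB of demand against 128 GB of RAM the actual runtime is dominated by swapping.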

User thread, Kernel thread, software thread and hardware thread

"Hardware thread" is a bad name. It was chosen as a term of art by CPU designers, without much regard for what software developers think "thread" means.

When an operating system interrupts a running thread so that some other thread may be allowed to use the CPU, it must save enough of the state of the CPU so that the thread can be resumed again later on. Mostly that saved state consists of the program counter, the stack pointer, and other CPU registers that are part of the programmer's model of the CPU.

A so-called "hyperthreaded CPU" has two or more complete sets of those registers. That allows it to execute instructions on behalf of two or more program threads without any need for the operating system to intervene.

Experts in the field like nice, short names for things. Instead of talking about "complete sets of context registers," they just call them "hardware threads."

Efficiency in multithreading

The best way to apportion the workload is workload-dependent.

Broadly - for a parallelizable workload, use OpenMP; for a heterogeneous workload, use a thread pool. Avoid managing your own threads if you can.

Monte Carlo simulation should be a good candidate for truly parallel code rather than a thread pool.
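
To make the "truly parallel" structure concrete, here is a hedged sketch in Python of splitting independent Monte Carlo trials over a process pool; the function names are illustrative only, and an OpenMP parallel for over the same per-chunk loop would be the C++ equivalent:

import random
from concurrent.futures import ProcessPoolExecutor

def count_hits(n_trials, seed):
    # each worker gets its own RNG, so there is no shared state to contend on
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(total_trials=4000000, n_workers=4):
    per_worker = total_trials // n_workers
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        hits = pool.map(count_hits, [per_worker] * n_workers, range(n_workers))
    return 4.0 * sum(hits) / (per_worker * n_workers)

if __name__ == "__main__":
    print(estimate_pi())

Because every chunk of trials is independent, the work divides evenly across workers with no coordination, which is what makes Monte Carlo a better fit for a parallel-for style than for a task-oriented thread pool.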

By the way - in case you are on Visual C++, there is an interesting new Concurrency Runtime in Visual C++ v10 for precisely this type of problem. This is somewhat analogous to the Task Parallel Library that was added to .NET Framework 4 to ease the implementation of multicore/multi-CPU code.

Is it worth using IPython parallel with scipy's eig?

Interesting problem. Because I would think it should be possible to achieve better scaling, I investigated the performance with a small "benchmark". With this test I compared the performance of single- and multi-threaded eig (multi-threading being delivered through MKL LAPACK/BLAS routines) with IPython-parallelized eig. To see what difference it would make, I varied the view type, the number of engines and MKL threading, as well as the method of distributing the matrices over the engines.

Here are the results on an old AMD dual core system:

 m_size=300, n_mat=64, repeat=3
+------------------------------------+----------------------+
|              settings              |    speedup factor    |
+--------+------+------+-------------+-----------+----------+
|  func  | neng | nmkl |  view type  | vs single | vs multi |
+--------+------+------+-------------+-----------+----------+
| ip_map |  2   |  1   | direct_view |   1.67    |   1.62   |
| ip_map |  2   |  1   | loadb_view  |   1.60    |   1.55   |
| ip_map |  2   |  2   | direct_view |   1.59    |   1.54   |
| ip_map |  2   |  2   | loadb_view  |   0.94    |   0.91   |
| ip_map |  4   |  1   | direct_view |   1.69    |   1.64   |
| ip_map |  4   |  1   | loadb_view  |   1.61    |   1.57   |
| ip_map |  4   |  2   | direct_view |   1.15    |   1.12   |
| ip_map |  4   |  2   | loadb_view  |   0.88    |   0.85   |
| parfor |  2   |  1   | direct_view |   0.81    |   0.79   |
| parfor |  2   |  1   | loadb_view  |   1.61    |   1.56   |
| parfor |  2   |  2   | direct_view |   0.71    |   0.69   |
| parfor |  2   |  2   | loadb_view  |   0.94    |   0.92   |
| parfor |  4   |  1   | direct_view |   0.41    |   0.40   |
| parfor |  4   |  1   | loadb_view  |   1.62    |   1.58   |
| parfor |  4   |  2   | direct_view |   0.34    |   0.33   |
| parfor |  4   |  2   | loadb_view  |   0.90    |   0.88   |
+--------+------+------+-------------+-----------+----------+

As you can see, the performance gain varies greatly over the different settings used, with a maximum of 1.64 times that of the regular multi-threaded eig. In these results the parfor function you used performs badly unless MKL threading is disabled on the engines (using view.apply_sync(mkl.set_num_threads, 1)).

Varying the matrix size also gives a noteworthy difference. The speedup of using ip_map on a direct_view with 4 engines and MKL threading disabled vs the regular multi-threaded eig:

 n_mat=32, repeat=3
+--------+----------+
| m_size | vs multi |
+--------+----------+
|   50   |   0.78   |
|  100   |   1.44   |
|  150   |   1.71   |
|  200   |   1.75   |
|  300   |   1.68   |
|  400   |   1.60   |
|  500   |   1.57   |
+--------+----------+

Apparently for relatively small matrices there is a performance penalty, for intermediate sizes the speedup is largest, and for larger matrices the speedup decreases again. If you could achieve a performance gain of 1.75, that would make using IPython.parallel worthwhile in my opinion.

I also did some tests earlier on an Intel dual-core laptop, but I got some funny results; apparently the laptop was overheating. On that system the speedups were generally a little lower, around 1.5-1.6 at most.

Now I think the answer to your question should be: it depends. The performance gain depends on the hardware, the BLAS/LAPACK library, the problem size and the way IPython.parallel is deployed, among other things that I'm perhaps not aware of. And last but not least, whether it's worth it also depends on how much of a performance gain you think is worthwhile.

The code that I used:

from __future__ import print_function
from numpy.random import rand
from IPython.parallel import Client
from mkl import set_num_threads
from timeit import default_timer as clock
from scipy.linalg import eig
from functools import partial
from itertools import product

eig = partial(eig, right=False)  # desired keyword arg as standard

class Bench(object):
    def __init__(self, m_size, n_mat, repeat=3):
        self.n_mat = n_mat
        self.matrix = rand(n_mat, m_size, m_size)  # n_mat random m_size x m_size matrices
        self.repeat = repeat
        self.rc = Client()

    def map(self):
        # baseline: plain (MKL-threaded) eig in the local process
        results = map(eig, self.matrix)

    def ip_map(self):
        # distribute the matrices over the engines with map_sync
        results = self.view.map_sync(eig, self.matrix)

    def parfor(self):
        # submit each matrix separately, then collect the results
        results = {}
        for i in range(self.n_mat):
            results[i] = self.view.apply_async(eig, self.matrix[i, :, :])
        for i in range(self.n_mat):
            results[i] = results[i].get()

    def timer(self, func):
        t = clock()
        func()
        return clock() - t

    def run(self, func, n_engines, n_mkl, view_method):
        self.view = view_method(range(n_engines))
        self.view.apply_sync(set_num_threads, n_mkl)  # set MKL threads on the engines
        set_num_threads(n_mkl)                        # and locally
        return min(self.timer(func) for _ in range(self.repeat))

    def run_all(self):
        funcs = self.ip_map, self.parfor
        n_engines = 2, 4
        n_mkls = 1, 2
        views = self.rc.direct_view, self.rc.load_balanced_view
        times = []
        for n_mkl in n_mkls:
            args = self.map, 0, n_mkl, views[0]
            times.append(self.run(*args))
        for args in product(funcs, n_engines, n_mkls, views):
            times.append(self.run(*args))
        return times

Dunno if it matters, but to start 4 IPython parallel engines I typed at the command line:

ipcluster start -n 4

Hope this helps :)

Async method returning a completed task unexpectedly slow

Your garbage collector is most probably configured to workstation mode (the default), which uses a single thread to reclaim the memory allocated by unused objects. For a machine with 32 cores, one core will certainly not be enough to clean up the mess that the other 31 cores are constantly producing! So you should probably switch to server mode:

<configuration>
  <runtime>
    <gcServer enabled="true"></gcServer>
  </runtime>
</configuration>

Background server garbage collection uses multiple threads, typically a dedicated thread for each logical processor.

By using ValueTasks instead of Tasks you avoid memory allocations on the heap, because a ValueTask is a struct that lives on the stack and has no need for garbage collection. But this is the case only if it wraps the result of a completed task; if it wraps an incomplete task, it offers no advantage. It is suitable for cases where you have to await tens of millions of tasks and you expect that the vast majority of them will already be completed.

Dividing work to more threads takes more time, why?

The most expensive operation your thread performs is calling rand(). rand() is a naive, simplistic, and generally non-MT-scalable function (since it guarantees to produce the same sequence of random numbers for the same seed). I think the lock inside rand() is serializing all the threads. (*)

A simple trick to confirm whether this is the problem or not is to start the program under a debugger, and then, several times: pause it, capture the stack traces of the threads, and continue. Whatever appears most often in the stack traces is very likely the bottleneck.

(*) What makes it even slower is the fact that lock contention causes an additional performance penalty. Also, the many threads add the additional overhead of process scheduling and context switches.
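
The debugger-sampling trick described above can also be scripted when the hot code runs in Python threads of the current process. A rough sketch (intended to run on a separate sampling thread):

import collections
import sys
import threading
import time
import traceback

def sample_stacks(duration=5.0, interval=0.05):
    # poor man's profiler: repeatedly snapshot every thread's stack and
    # count which innermost function shows up most often
    counts = collections.Counter()
    deadline = time.time() + duration
    while time.time() < deadline:
        for tid, frame in sys._current_frames().items():
            if tid == threading.get_ident():
                continue  # skip the sampling thread itself
            top = traceback.extract_stack(frame)[-1]
            counts[(top.filename, top.lineno, top.name)] += 1
        time.sleep(interval)
    for location, n in counts.most_common(5):
        print(n, location)

Whatever function dominates the counts is the most likely bottleneck, which is the same reasoning as the debugger-based approach above.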


