Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

What timings do you get without the -pg flag? That flag isn't for debugging symbols (which don't affect code generation); it's for profiling (which does).

It's quite plausible that profiling a multithreaded process requires additional locking, which slows the multithreaded version down, even to the point of making it slower than the single-threaded version.

Splitting up a program into 4 threads is slower than a single thread

Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.

As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads tried to modify this state concurrently, it would cause a race condition, so the default implementation in glibc locks on every call to make the function thread-safe. This is terrible for performance.

Unfortunately the solutions to this problem that I've seen on Stack Overflow are all local, i.e. they deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random-number generation for each thread, requiring no synchronization.
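For context, a minimal sketch of such a local fix, assuming an OpenMP loop, might look like the following (this is my own illustration; the function and variable names are hypothetical):

#include <cstdlib>
#include <omp.h>

// Hypothetical example: each thread keeps its own seed on its stack,
// so rand_r() never has to take a lock.
void fill_random(int *out, int n) {
    #pragma omp parallel
    {
        unsigned int seed = 1234u + 17u * omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < n; ++i) {
            out[i] = rand_r(&seed);
        }
    }
}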

I have tested the code below, and it works: there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.

threadrand.h

#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_

// max number of thread states to store
const int maxThreadNum = 100;

void init_threadrand();

// requires openmp, for thread number
int threadrand();

#endif // _THREAD_RAND_H_

threadrand.cpp

#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>

// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];

// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
    for (int i = 0; i < maxThreadNum; ++i) {
        randThreadStates[i].reset(new unsigned int(std::rand()));
    }
}

// requires openmp, for thread number, to index into array of states.
int threadrand() {
    int i = omp_get_thread_num();
    return rand_r(randThreadStates[i].get());
}

Now you can initialize the per-thread random states from main() using init_threadrand(), and subsequently get a random number by calling threadrand() when using several threads with OpenMP.
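For illustration, here is a minimal usage sketch. It is my own addition rather than part of the original answer, the loop body is hypothetical, and it assumes maxThreadNum is at least as large as the number of OpenMP threads:

#include <cstdio>
#include "threadrand.h"

int main() {
    init_threadrand();   // seed one state per potential thread

    const int n = 1000000;
    long long sum = 0;

    // each thread indexes its own state via omp_get_thread_num(),
    // so no locking happens inside the loop
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += threadrand() % 10;
    }

    std::printf("average draw: %f\n", double(sum) / n);
    return 0;
}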

Multi-threaded C++ code slower with more physical cores? (threaded C++ mex function)

TL;DR: NUMA effects combined with false sharing are very likely the reason the slowdown appears only on the 2-socket system. Low-level profiling information is needed to confirm or disprove this hypothesis.


Multi-processor systems are subject to NUMA effects. Non-uniform memory access platforms are composed of NUMA nodes, each with its own local memory. Accessing the memory of another node is more expensive (higher latency and/or lower throughput). Multiple threads/processes located on several NUMA nodes that all access the memory of the same node can saturate it.

Allocated memory is split into pages that are mapped to NUMA nodes. The exact mapping policy depends heavily on the operating system (OS), its configuration, and that of the target processes. The first-touch policy is quite common: the idea is to allocate a page on the NUMA node of the core that performs the first access to it. Depending on the chosen policy, the OS can also migrate pages from one NUMA node to another based on the amount of remote access they receive. Controlling the policy is critical on NUMA platforms, especially if the application is not NUMA-aware.

The memory of the different NUMA nodes is kept coherent thanks to a cache-coherence protocol and a high-performance inter-processor interconnect (Ultra Path Interconnect in your case). Cache coherence also applies between cores of the same processor. The catch is that moving a cache line from the L2 cache of one core to that of another core is much faster than moving it from the L3 cache of one processor to that of another processor. Here is an analogy with human communication: neurons in different cortical areas of the same brain communicate faster than two separate people do.

If threads of your application operate in parallel on the same cache line, false sharing can cause a cache-line bouncing effect that is much more visible between threads spread across different processors.

This is a very complex topic. That being said, you can analyse these effects using low-level profilers like VTune (or perf on Linux). The idea is to inspect low-level hardware performance counters such as L2/L3 cache misses/hits, RFOs, remote NUMA accesses, etc. This can be complex and tedious for someone not familiar with how processors and the OS work, but VTune helps a bit. Note that Intel also provides more specialised tools to analyse (more easily) such effects, which typically show up in parallel applications; AFAIK they are part of the Intel XE suite (which is not free). The best thing to do is to avoid false sharing using padding, to design your application so that each thread operates on its own memory locations as much as possible (i.e. good locality), to control the NUMA allocation policy, and finally to bind threads/processes to cores (to avoid unexpected migrations).
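As a minimal sketch of the padding advice (my own example, not code from the question; the struct and array names are made up), aligning each thread's data to its own cache line prevents the bouncing described above:

// Without alignment, adjacent counters can end up on the same 64-byte
// cache line, so a write by one thread invalidates the line in the
// caches of the other cores (false sharing).
struct alignas(64) PaddedCounter {
    long value = 0;
    // alignas(64) makes sizeof(PaddedCounter) a multiple of 64,
    // so every element of an array starts on its own cache line.
};

constexpr int kMaxThreads = 64;
PaddedCounter per_thread_counts[kMaxThreads];  // one cache line per thread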

Experimental benchmarks can also be used to quickly check whether NUMA effects and false sharing occur. For example, you can bind all the threads/processes to the same NUMA node and tell the OS to allocate pages on that node; this helps you find issues related to NUMA effects. Another example is to bind two threads/processes to two different logical cores (i.e. hardware threads) of the same physical core, and then to different physical cores, to see whether performance is affected; this helps you locate false-sharing issues. That being said, such experiments can be disturbed by many other effects that add noise and make the analysis fairly complex in practice for large applications, so a low-level analysis based on hardware performance counters is more reliable.
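A minimal sketch of the second experiment, using the Linux-specific pthread_setaffinity_np call (my own example; the core numbers you pass in depend on how the OS enumerates hardware threads, so check the topology with a tool like lscpu first):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical core. Run the benchmark once with
// the two threads on sibling hardware threads of the same physical core,
// and once with them on different physical cores, then compare timings.
static void pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}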

Note that some processors, like AMD Zen ones, are composed of multiple sub-parts (called CCDs/CCXs) that can be seen as multiple NUMA nodes even though there is only one processor and one socket. Such architectures will certainly become more widespread in the future. In fact, Intel has also started to go in this direction with Sub-NUMA Clustering.

Will multi threading provide any performance boost?

There is only one way to optimize code: figure out what you're doing that's slow, and do less of it. A special case of "doing less of it" is to do something else instead that's faster.

So first of all, here's what I'm doing based on your posted code:

#include <fstream>
#include <sstream>
using std::ios_base;

template<typename Iterator, typename Value>
void iota(Iterator start, Iterator end, Value val) {
    while (start != end) {
        *(start++) = val++;
    }
}

int main() {

    const int dim = 1000;
    const int cubesize = dim*dim*dim;
    const int squaresize = dim*dim;
    const int steps = 7; //ranges from 1 to 255
    typedef unsigned char uchar;

    uchar *partMap = new uchar[cubesize];
    // dummy data. I timed this separately and it takes about
    // a second, so I won't worry about its effect on overall timings.
    iota(partMap, partMap + cubesize, uchar(7));
    uchar *projection = new uchar[squaresize];

    for (int stage = 1; stage < steps; stage++) {
        for (int j = 0; j < dim; j++) {
            for (int i = 0; i < dim; i++)
            {
                int sum = 0;
                for (int k = 0; k < dim; k++)
                    if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                        sum++;

                projection[(j*dim) + i] = sum;
            }
        }

        std::stringstream filename;
        filename << "results" << stage << ".bin";
        std::ofstream file(filename.str().c_str(),
            ios_base::out | ios_base::binary | ios_base::trunc);
        file.write((char *)projection, squaresize);
    }

    delete[] projection;
    delete[] partMap;
}

(Edit: just noticed that "projection" should be an array of int, not uchar. My bad. This will make a difference to some of the timings, but hopefully not too big of one.)

Then I copied result*.bin to gold*.bin, so I can check my future changes as follows:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++ -O3 -pedantic -Wall big.cpp -o big

real 1m41.978s
user 1m39.450s
sys 0m0.451s

OK, so 100 seconds at the moment.

So, speculating that it's striding through the billion-item data array that's slow, let's try only going through once, instead of once per stage:

uchar *projections[steps];
for (int stage = 1; stage < steps; stage++) {
    projections[stage] = new uchar[squaresize];
}

for (int j = 0; j < dim; j++) {
    for (int i = 0; i < dim; i++)
    {
        int counts[256] = {0};
        for (int k = 0; k < dim; k++)
            counts[partMap[(((i * dim) + k) * dim) + j]]++;

        int sum = 0;
        for (int idx = 255; idx >= steps; --idx) {
            sum += counts[idx];
        }
        for (int stage = steps-1; stage > 0; --stage) {
            sum += counts[stage];
            projections[stage][(j*dim) + i] = sum;
        }
    }
}

for (int stage = 1; stage < steps; stage++) {
    std::stringstream filename;
    filename << "results" << stage << ".bin";
    std::ofstream file(filename.str().c_str(),
        ios_base::out | ios_base::binary | ios_base::trunc);
    file.write((char *)projections[stage], squaresize);
}

for (int stage = 1; stage < steps; stage++) delete[] projections[stage];
delete[] partMap;

It's a bit faster:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++ -O3 -pedantic -Wall big.cpp -o big

real 1m15.176s
user 1m13.772s
sys 0m0.841s

Now, steps is quite small in this example, so we're doing a lot of unnecessary work with the "counts" array. Without even profiling, I'm guessing that counting to 256 twice (once to clear the array and once to sum it) is quite significant compared with counting to 1000 (to run along our column). So let's change that:

for (int j = 0; j < dim; j++) {
    for (int i = 0; i < dim; i++)
    {
        // steps+1, not steps. I got this wrong the first time,
        // which at least proved that my diffs work as a check
        // of the answer...
        int counts[steps+1] = {0};
        for (int k = 0; k < dim; k++) {
            uchar val = partMap[(((i * dim) + k) * dim) + j];
            if (val >= steps)
                counts[steps]++;
            else counts[val]++;
        }

        int sum = counts[steps];
        for (int stage = steps-1; stage > 0; --stage) {
            sum += counts[stage];
            projections[stage][(j*dim) + i] = sum;
        }
    }
}

Now we're only using as many buckets as we actually need.

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++ -O3 -pedantic -Wall big.cpp -o big

real 0m27.643s
user 0m26.551s
sys 0m0.483s

Hurrah. The code is nearly 4 times as fast as the first version, and produces the same results. All I've done is change the order in which the maths is done: we haven't even looked at multi-threading or prefetching yet. And I haven't attempted any highly technical loop optimisation, just left it to the compiler. So this can be considered a decent start.

However it's still taking an order of magnitude longer than the 1s which iota runs in. So there are probably big gains still to find. One main difference is that iota runs over the 1d array in sequential order, instead of leaping about all over the place. As I said in my first answer, you should aim to always use sequential order on the cube.

So, let's make a one-line change, switching the i and j loops:

for (int i = 0; i < dim; i++)
    for (int j = 0; j < dim; j++) {

This still isn't sequential order, but it does mean we're focussing on one million-byte slice of our cube at a time. A modern CPU has at least 4MB cache, so with a bit of luck we'll only hit main memory for any given part of the cube once in the entire program. With even better locality we could reduce the traffic in and out of L1 cache, too, but main memory is the slowest.

How much difference does it make?

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++ -O3 -pedantic -Wall big.cpp -o big

real 0m8.221s
user 0m4.507s
sys 0m0.514s

Not bad. In fact, this change alone brings the original code from 100s to 20s. So this is responsible for a factor of 5, and everything else I did is responsible for another factor of 5 (I think the difference between 'user' and 'real' time in the above is mostly accounted for by the fact my virus scanner is running, which it wasn't earlier. 'user' is how much time the program occupied a CPU, 'real' includes time spent suspended, either waiting on I/O or giving another process time to run).

Of course, my bucket sort relies on the fact that whatever we're doing with the values in each column is commutative and associative. Reducing the number of buckets only worked because large values are all treated the same. This might not be true for all your operations, so you'll have to look at the inner loop of each one in turn to figure out what to do with it.

And the code is a bit more complicated. Instead of running over the data doing "blah" for each stage, we're computing all the stages at the same time in a single run over the data. If you start doing row and column computations in a single pass, as I recommended in my first answer, this will get worse. You may have to start breaking your code into functions to keep it readable.

Finally, a lot of my performance gain came from an optimisation for the fact that "steps" is small. With steps=100, I get:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++ -O3 -pedantic -Wall big.cpp -o big

real 0m22.262s
user 0m10.108s
sys 0m1.029s

This isn't so bad. With steps=100 the original code probably takes about 1400 seconds, although I'm not going to run it to prove that. But it's worth remembering that I haven't completely taken away the time dependency on "steps", just made it sub-linear.
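Since the original question asked about multi-threading, here is a rough sketch (my addition, not part of the answer above) of how the outer loop of the final version could be parallelised with OpenMP once the memory-order issues are sorted out. Each (i, j) pair writes distinct elements of the projections arrays and all the scratch variables are declared inside the loop, so the iterations are independent; compile with -fopenmp:

#pragma omp parallel for schedule(static)
for (int i = 0; i < dim; i++) {
    for (int j = 0; j < dim; j++) {
        // same body as the final version above
        int counts[steps+1] = {0};
        for (int k = 0; k < dim; k++) {
            uchar val = partMap[(((i * dim) + k) * dim) + j];
            if (val >= steps)
                counts[steps]++;
            else counts[val]++;
        }

        int sum = counts[steps];
        for (int stage = steps-1; stage > 0; --stage) {
            sum += counts[stage];
            projections[stage][(j*dim) + i] = sum;
        }
    }
}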

C++ Merge sort slower with threads

I have compiled and run your exact program with no modifications other than adding the necessary includes, and the results are more or less as you expected:

size > 250: 169
size > 500: 85
size > 1000: 50
size > 10000: 29
size > 250000: 42
size > 500000: 89

Based on your screen shot, I gather you are running your code from within Visual Studio. The default run button will attach a debugger to your executable and reduce the run time performance. Instead, press Ctrl+F5 to run without a debugger or from the menu Debug -> Start Without Debugging.

Why does the second for loop always execute faster than the first one?

Probably because the classes (e.g. Console) need to be JIT-compiled the first time through. You'll get the best metrics by calling all methods first (to JIT-compile them and warm them up), then performing the test.

As other users have indicated, 4 passes is never going to be enough to show you the difference.

Incidentally, the difference in performance between for and foreach will be negligible and the readability benefits of using foreach almost always outweigh any marginal performance benefit.

How to profile multi-threaded C++ application on Linux?

Edit: I added another answer on the poor man's profiler, which IMHO is better for multithreaded apps.

Have a look at oprofile. The profiling overhead of this tool is negligible and it supports multithreaded applications, as long as you don't want to profile mutex contention (which is a very important part of profiling multithreaded applications).

Why does make -j perform better when it is passed a number larger than the number of cores available?

There's more to compiling than CPU speed and number of cores available: disk bandwidth and memory bandwidth matter a lot too.

In your case, I imagine that each CPU HT sibling is getting roughly 4 processes to execute. As it starts one, that process blocks on disk I/O and the sibling moves on to the next process. The second one tries to open a second file, blocks on disk I/O, and the sibling moves on to the next process. Starting four compilers before the first disk I/O completes wouldn't surprise me.

So when the first one has finally read in the program source, the compiler must start hunting through directories to find the #included files. Each one requires some open() calls followed by read() calls, all of which can block, and all of which will relinquish the sibling for other processes to run.

Now multiply that by eight siblings -- each HT core will run until it blocks on memory access, at which point it'll swap over to the other sibling, and run for a while. Once the memory of the first sibling has been fetched into the cache, it is probably time for the second sibling to stall while waiting for memory.

There is an upper limit on how much faster you can get your compiles to run by using make -j, but twice-number-of-cpus has been a good starting point for me in the past.

Appengine, performance degradation with python27

Somewhere on Usenet I read a statement like this from Google: "The Python 2.7 runtime is slower than the Python 2.5 runtime in some cases and faster in others. We aren't publicizing the reasons why at this point." It seems nobody has so far found a scenario where 2.7 is faster than 2.5, thus ...

I read into this:

  1. Google is aware of a performance issue.
  2. They are not sure how to handle it.

My profiling indicates that multithreaded Python 2.7 applications spend ca. 35% of their time in {method 'acquire' of 'thread.lock' objects}, and seemingly that happens in Google's RPC code. There are also indications that importing has serious locking issues.

Basically you can't do anything about it except wait for App Engine to fix it. See also this comprehensive documentation about the slowdown.

Factors which almost certainly don't play a significant role in this are:

  • uploading pyc files (the GAE infrastructure does that for you)
  • the server provisioning (GAE runs at such a massive scale, and 2.7 was slow even in the closed beta)
  • WSGI vs. CGI (wouldn't explain such a huge performance hit)

The ugly thing is that Google did a massive price hike in two steps but told us "by using multithreaded Python 2.7 you can run so much more efficiently that the new prices don't look so bad". Unfortunately the Python 2.7 runtime is still labeled "experimental" and doesn't deliver production-quality performance.








