Code Runs 6 Times Slower with 2 Threads Than with 1

Why are 2 threads 6x slower than 1 thread?

You are getting hit by a bad case of false sharing.

After getting rid of the false sharing, why are 2 threads not faster than 1 thread?

You are bottlenecked by your memory bandwidth.


False Sharing:

The problem here is that each thread accesses its own result variable, and those variables sit at adjacent memory locations. They almost certainly fall on the same cache line, so every time a thread accesses its result, it bounces the cache line between the cores.

Each thread is running this loop:

for (uint32_t i = 0; i < length; i++) {
    *result += (*datavec).at(start + i);
}

And you can see that the result variable is accessed very often (on every iteration). So on every iteration, the threads fight over the same cache line, which holds both values of result.

Normally, the compiler should keep *result in a register, thereby removing the repeated access to that memory location. But since you never turned on optimizations, the compiler is very likely still accessing the memory location, and thus incurring the false-sharing penalty, on every iteration of the loop.
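
The usual fix is to accumulate into a thread-local variable and write to *result only once at the end, so the shared cache line is touched once per thread instead of once per element. A minimal sketch, assuming int elements and the same names as the snippet above:

void partialSum(const vector<int>* datavec, uint32_t start,
                uint32_t length, int* result)
{
    int local = 0;                        // thread-private: register or this thread's stack
    for (uint32_t i = 0; i < length; i++)
        local += datavec->at(start + i);  // no shared memory touched in the loop
    *result = local;                      // single write to the shared location
}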

Memory Bandwidth:

Once you have eliminated the false sharing and gotten rid of the 6x slowdown, the reason you see no further improvement is that you have maxed out your memory bandwidth.

Sure, your processor may have 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little computational work per memory access: one addition per element loaded. A single thread is already enough to max out your memory bandwidth (as an illustrative figure, streaming 1 GB of data through a ~10 GB/s memory interface takes about 0.1 s no matter how many cores issue the loads). Therefore going to more threads is not likely to get you much improvement.

In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.

Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).

It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.

C++ threaded application runs slower than non-threaded

To me it seems you are starting a new thread for every single prime-number check. That is not good, IMHO, because thread startup/shutdown plus the synchronization adds overhead to the check of each candidate. Starting a thread can be quite slow.

I would recommend starting those 4 threads outside of the main for loop and processing 1/4 of the range in each thread, as sketched below. But this might need some additional synchronization, because to check a candidate, the code above apparently needs to first have the primes up to sqrt(N) available.
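
A rough sketch of that partitioning (checkRange is a hypothetical stand-in for the per-range primality worker; the sqrt(N) synchronization issue is left out):

void checkRange(unsigned long long lo, unsigned long long hi)
{
    // stand-in: run the prime check for every candidate in [lo, hi]
}

void runChunked(unsigned long long n, unsigned threadCount)
{
    vector<thread> threads;
    unsigned long long chunk = (n - 1) / threadCount + 1;
    for (unsigned t = 0; t < threadCount; ++t) {
        unsigned long long lo = 2 + t * chunk;
        unsigned long long hi = min(n, lo + chunk - 1);
        threads.emplace_back(checkRange, lo, hi);
    }
    for (auto& th : threads)
        th.join();
}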

From my point of view, it might be easier to use the Sieve of Eratosthenes algorithm, which might be much easier to parallelize without any locking (though it might still suffer from the problem known as "false sharing").

EDIT

Here I quickly created a version using the Sieve of Eratosthenes:

void processSieve(const unsigned long long& l,
                  const unsigned long long& start,
                  const unsigned long long& end,
                  const unsigned long long& step,
                  vector<char>& is_prime)
{
    // Each caller takes every step-th candidate in [start, end]
    // and crosses off all of its multiples up to l.
    for (unsigned long long i = start; i <= end; i += step)
        if (is_prime[i])
            for (unsigned long long j = i + i; j <= l; j += i)
                is_prime[j] = 0;
}

void runSieve(const unsigned long long& l)
{
    vector<char> is_prime(l + 1, 1);
    unsigned long long end = sqrt(l);
    processSieve(l, 2, end, 1, is_prime);   // single thread: step 1

    // primeContainer comes from the question's code
    primeContainer.clear();
    for (unsigned long long i = 2; i <= l; ++i)   // start at 2: 1 is not a prime
        if (is_prime[i])
            primeContainer.insert(i);
}

void runSieveThreads(const unsigned long long& l)
{
    vector<char> is_prime(l + 1, 1);
    unsigned long long end = sqrt(l);

    vector<thread> threads;
    threads.reserve(cpuCount);
    // Thread t handles candidates 2 + t, 2 + t + cpuCount, ... up to sqrt(l).
    for (unsigned long long i = 0; i < cpuCount; ++i)
        threads.emplace_back(processSieve, l, 2 + i, end, cpuCount, ref(is_prime));
    for (unsigned long long i = 0; i < cpuCount; ++i)
        threads[i].join();

    primeContainer.clear();
    for (unsigned long long i = 2; i <= l; ++i)   // start at 2: 1 is not a prime
        if (is_prime[i])
            primeContainer.insert(i);
}

The measurement results, primes up to 1 000 000 (MSVC 2013, Release):

runLoop: 204.02 ms
runLoopThread: 43947.4 ms
runSieve: 30.003 ms
runSieveThreads (8 cores): 24.0024 ms

Up to 10 000 000:

runLoop: 4387.44 ms
// runLoopThread disabled, taking too long
runSieve: 350.035 ms
runSieveThreads (8 cores): 285.029 ms

The times include final processing of the vector and pushing the results to the prime set.

As you can see, the sieve version is much faster than your version, even single-threaded (for your mutex version I had to change the locks to regular mutex locks, because MSVC 2013 does not have shared_lock, so the results are likely much worse than yours).

But you can see that the multithreaded version of the sieve still does not run as fast as expected (8 cores, i.e. 8 threads; a linear speedup would be 8x faster than a single thread), although there is no locking. The trade-off is that a thread may do unnecessary work on numbers that other threads have not yet marked as "not prime", but the result stays stable, because cells are only ever set to 0 and it does not matter if multiple threads set the same cell simultaneously. The reason the speedup is not linear is most likely the "false sharing" problem I mentioned before: the threads writing the zeros invalidate each other's cache lines.
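
One way to mitigate it (my sketch, not the code measured above) is to partition the is_prime array itself instead of the candidate primes: each thread gets one contiguous block, so its writes land only in its own range of cache lines. The small primes up to sqrt(l) must be sieved serially first and are then shared read-only:

void sieveBlock(unsigned long long blockLo, unsigned long long blockHi,
                const vector<unsigned long long>& smallPrimes,
                vector<char>& is_prime)
{
    // assumes blockLo >= 2 and smallPrimes holds all primes up to sqrt(l)
    for (unsigned long long p : smallPrimes) {
        unsigned long long j = ((blockLo + p - 1) / p) * p;   // first multiple of p in the block
        if (j == p)
            j += p;                                           // never cross off p itself
        for (; j <= blockHi; j += p)
            is_prime[j] = 0;   // writes stay inside this thread's block
    }
}

With cpuCount threads, each gets a block of roughly l / cpuCount cells, and no two threads write to the same cache line except at the block boundaries.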

Sorting seems to be slower with 2 threads instead of 1

How can this be possible?

Unlike C++, Python is quite difficult to parallelize because of the GIL.

While collections.deque's append and popleft are thread-safe, this does not guarantee that they'll perform well in a non-serial paradigm.

Is it linked to this question?

No. The GIL is a property of CPython. It is totally disjoint from false sharing.

It takes way longer (almost twice as long) with two threads than it does when I do it normally.

This is because the GIL doesn't allow shared-memory multithreading: only one thread executes Python bytecode at a time. As such, you're essentially running your code serially, twice over, plus the thread-switching overhead.

Multi-Threaded program runs slower than single threaded

I understand that the result will not be correct, but it is crawling compared to the single-threaded one

I kill the program after 5-6 minutes of execution since it is running very slowly

First off, I assume you are using an Executors.newFixedThreadPool(), which allocates a fixed number of threads, and not a cached thread pool.

It seems to me that you may be creating some huge number of jobs and your program is running out of memory. As you fill up memory, the JVM works harder and harder on GC, which shows up as your process getting slower and slower. You could connect to the application using jconsole to verify the number of threads and the memory use. You could also do a thread dump on it (kill -QUIT pid) and see how many jobs you have forked.

If you are creating some huge number of jobs and your ExecutorService just can't keep up, then you will need to throttle the job production. There are a couple different ways to do that. Here's what I use:

Process Large File for HTTP Calls in Java

A couple of other solutions are linked from there.

I thought creating threads itself might have overhead, so I reduced the pool size to 4 (I have an eight-core processor), but that did nothing.

Yeah, this doesn't seem like a processor issue. I would move it back to 8. If that really makes the box unusable, then try 7 or 6 threads in your ExecutorService.

Edit:

After looking at the code more, you are doing a bunch of unsynchronized data updates that are going to cause strange results. Any time you modify shared memory (in your case, shared static fields), you need some synchronization, both for mutual exclusion (the ++ operations) and for memory visibility.

I'd consider using AtomicLong and friends if you can, but you should also read some threading tutorials on shared memory and synchronization: http://docs.oracle.com/javase/tutorial/essential/concurrency/sync.html

C++ multithreading performance slower than single threaded code

You did not pass any std::launch policy to std::async, so it leaves the implementation a lot of freedom.

Behaves as if (2) is called with policy being std::launch::async | std::launch::deferred. In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.
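
If you want a real thread every time, say so explicitly. A minimal, self-contained sketch (sumRange is just a stand-in for dist_calculation in the question):

#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Stand-in task: sum the half-open range [lo, hi) of v.
int sumRange(const std::vector<int>& v, std::size_t lo, std::size_t hi)
{
    return std::accumulate(v.begin() + lo, v.begin() + hi, 0);
}

int main()
{
    std::vector<int> data(1000000, 1);
    // std::launch::async forces a new thread; with the default policy the
    // call may instead be deferred and run serially inside get() below.
    auto fut = std::async(std::launch::async, sumRange, std::cref(data),
                          std::size_t(0), data.size() / 2);
    int upper = sumRange(data, data.size() / 2, data.size()); // this thread
    int total = fut.get() + upper;                            // join and combine
    return total == 1000000 ? 0 : 1;
}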

But also be aware that, more generally, using more threads, especially for small tasks, may not be faster:

  • Where dist_calculation, or any task you want to thread, is a small amount of work, be aware of the overheads. Creating a new thread has a relatively high cost, and there is also overhead for whatever internal pool std::async uses, for the promises, and for the futures.
  • Additionally, as written, you likely create more vectors, with more dynamic memory allocation, and you will need to merge the results, which also has some cost.
  • In more complex cases, if synchronization (e.g. with std::mutex) is involved, it may cost more performance than the additional threads gain.
  • In some cases the bottleneck is not the CPU at all. It may be disk/storage speed (including the page/swap file), network speed (including remote servers), or even memory bandwidth (excepting NUMA-aware optimizations, which are a lot more complex than just using std::async). Multithreading there just adds overhead with no benefit.

You should make use of other basic optimisations where possible first, such as reserving the size of the vectors to avoid unneeded allocations and copies, or resizing and using vector[index] = a instead of push_back (see the sketch below), etc.

For something as simple as abs(centre - input[i]) you might get a lot more improvement from SIMD (Single Instruction, Multiple Data) optimisations. E.g. make sure you are compiling with optimisations such as SSE2 enabled, and if the compiler doesn't vectorise the loop appropriately (I think the push_back may interfere; test!), alter it slightly so it does, or maybe even use the vector instructions explicitly (for x86, check out _mm_add_epi32 etc.).
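
A hedged sketch combining both suggestions (centre and input are the names from the question; int elements are an assumption): pre-size the output and use a plain indexed loop, which compilers can usually auto-vectorize:

#include <cstdlib>
#include <vector>

// Pre-sized output plus a branch-free indexed loop: no push_back,
// so the optimizer is free to vectorize the body with SSE2/AVX.
std::vector<int> distances(const std::vector<int>& input, int centre)
{
    std::vector<int> out(input.size());   // one allocation up front
    for (std::size_t i = 0; i < input.size(); ++i)
        out[i] = std::abs(centre - input[i]);
    return out;
}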


