Code runs 6 times slower with 2 threads than with 1
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. The two locations very likely fall on the same cache line, so each time a thread accesses its result, it bounces the cache line between the cores.
Each thread is running this loop:
for (uint32_t i = 0; i < length; i++) {
    *result += (*datavec).at(start + i);
}
And you can see that the result variable is accessed very often (on every iteration). So on each iteration, the threads fight over the same cache line, which holds both values of result.
Normally, the compiler would keep *result in a register, removing the repeated access to that memory location. But since you never turned on optimizations, the compiler very likely is still accessing the memory location on every iteration of the loop, and thus incurring the false-sharing penalty each time.
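A minimal sketch of the fix, with the names (datavec, result, start, length) assumed from the question: accumulate into a local variable and write *result once at the end, so the shared cache line is touched at most once per thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: the local accumulator lives in a register / on this thread's
// stack, so the loop no longer touches the shared cache line that holds
// both threads' result slots.
void partialSum(const std::vector<uint32_t>* datavec,
                std::size_t start, std::size_t length,
                uint32_t* result)
{
    uint32_t local = 0;
    for (std::size_t i = 0; i < length; ++i)
        local += (*datavec).at(start + i);
    *result = local;   // single shared write, one cache-line bounce at most
}
```

This also works with optimizations disabled, because the accumulator is now thread-private regardless of whether the compiler registerizes it.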
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?
What do you get without the -pg flag? That flag is not for debugging symbols (which don't affect code generation), it is for profiling (which does).
It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
C++ threaded application runs slower than non-threaded
To me it seems you are starting a new thread for each single prime-number check. That is not good IMHO, because the thread startup/shutdown plus the synchronization adds overhead to the check of each prime. Starting a thread can be quite slow.
I would recommend starting those 4 threads outside of the main for loop and processing 1/4 of the range in each thread. But this might need some additional synchronization, because to check a prime, the code above apparently first needs the primes up to sqrt(N) available.
From my point of view, it might be easier to use the Sieve of Eratosthenes algorithm, which can be parallelized without any locking (though it may still suffer from the problem known as "false sharing").
EDIT
Here is a version I quickly created using the Sieve of Eratosthenes:
void processSieve(const unsigned long long& l,
                  const unsigned long long& start,
                  const unsigned long long& end,
                  const unsigned long long& step,
                  vector<char>& is_prime)
{
    for (unsigned long long i = start; i <= end; i += step)
        if (is_prime[i])
            for (unsigned long long j = i + i; j <= l; j += i)
                is_prime[j] = 0;
}
void runSieve(const unsigned long long& l)
{
    vector<char> is_prime(l + 1, 1);
    unsigned long long end = sqrt(l);
    processSieve(l, 2, end, 1, is_prime);

    primeContainer.clear();
    for (unsigned long long i = 2; i <= l; ++i)   // start at 2: 1 is not prime
        if (is_prime[i])
            primeContainer.insert(i);
}
void runSieveThreads(const unsigned long long& l)
{
    vector<char> is_prime(l + 1, 1);
    unsigned long long end = sqrt(l);

    vector<thread> threads;
    threads.reserve(cpuCount);
    for (unsigned long long i = 0; i < cpuCount; ++i)
        threads.emplace_back(processSieve, l, 2 + i, end, cpuCount, ref(is_prime));
    for (unsigned long long i = 0; i < cpuCount; ++i)
        threads[i].join();

    primeContainer.clear();
    for (unsigned long long i = 2; i <= l; ++i)   // start at 2: 1 is not prime
        if (is_prime[i])
            primeContainer.insert(i);
}
The measurement results, primes up to 1 000 000 (MSVC 2013, Release):
runLoop: 204.02 ms
runLoopThread: 43947.4 ms
runSieve: 30.003 ms
runSieveThreads (8 cores): 24.0024 ms
Up to 10 000 000:
runLoop: 4387.44 ms
// runLoopThread disabled, taking too long
runSieve: 350.035 ms
runSieveThreads (8 cores): 285.029 ms
The times include final processing of the vector and pushing the results to the prime set.
As you can see, the sieve version is much faster than yours even single-threaded (for your mutex version I had to change the lock to regular mutex locks, because MSVC 2013 does not have shared_lock, so its results are likely worse than what you measured).
But you can also see that the multithreaded version of the sieve still does not scale as expected (with 8 cores, i.e. 8 threads, a linear speedup would be 8x over a single thread), even though there is no locking. The trade-off is that a number may be processed unnecessarily if another thread has not yet marked it as "not prime", but the results remain stable: the threads only ever write 0, so it does not matter if several threads write the same element simultaneously. The reason the speedup is not linear is most likely the "false sharing" problem I mentioned before - the threads writing the zeros invalidate each other's cache lines.
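One way to reduce that cache-line ping-pong is to give each thread a contiguous block of the sieve instead of an interleaved stride, so writes from different threads land on different cache lines. A sketch of that idea (my own names, not the author's code; the base primes up to sqrt(l) are sieved sequentially first):

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

// Cross off multiples of the base primes inside one contiguous block [lo, hi].
void sieveBlock(unsigned long long lo, unsigned long long hi,
                const std::vector<unsigned long long>& basePrimes,
                std::vector<char>& is_prime)
{
    for (unsigned long long p : basePrimes) {
        unsigned long long first = ((lo + p - 1) / p) * p;  // first multiple >= lo
        if (first < p * p) first = p * p;                   // smaller ones hit by smaller primes
        for (unsigned long long j = first; j <= hi; j += p)
            is_prime[j] = 0;
    }
}

std::vector<char> runSieveBlocked(unsigned long long l, unsigned threadCount)
{
    std::vector<char> is_prime(l + 1, 1);
    is_prime[0] = is_prime[1] = 0;
    unsigned long long root = static_cast<unsigned long long>(std::sqrt((double)l));

    // Base primes up to sqrt(l), sequential.
    std::vector<unsigned long long> base;
    for (unsigned long long i = 2; i <= root; ++i)
        if (is_prime[i]) {
            base.push_back(i);
            for (unsigned long long j = i * i; j <= root; j += i)
                is_prime[j] = 0;
        }

    // One contiguous block per thread: disjoint write ranges.
    std::vector<std::thread> pool;
    unsigned long long blockLen = (l - root) / threadCount + 1;
    for (unsigned t = 0; t < threadCount; ++t) {
        unsigned long long lo = root + 1 + t * blockLen;
        if (lo > l) break;
        unsigned long long hi = std::min(l, lo + blockLen - 1);
        pool.emplace_back(sieveBlock, lo, hi, std::cref(base), std::ref(is_prime));
    }
    for (auto& th : pool) th.join();
    return is_prime;
}
```

Apart from the block boundaries themselves, no two threads write to the same cache line, so the sharing penalty mostly disappears.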
Sorting seems to be slower with 2 threads instead of 1
How can this be possible?
Unlike C++, Python is quite difficult to parallelize because of the GIL.
While collections.deque's append and popleft are thread-safe, this does not guarantee that they will perform well in a non-serial paradigm.
Is it linked to this question?
No. The GIL is a property of CPython. It is totally disjoint from false sharing.
It takes way longer (almost twice as long) with two threads than it does when I do it normally.
This is because the GIL allows only one thread to execute Python bytecode at a time, so there is no shared-memory parallelism to be had. You are essentially running your code serially, plus paying the thread-switching overhead.
Multi-Threaded program runs slower than single threaded
I understand that the result will not be correct but it is crawling compared to single threaded one
I kill the program after 5-6 minutes of execution since it is running very slowly
First off, I assume you are using Executors.newFixedThreadPool(), which allocates a fixed number of threads, and not a cached thread pool.
It seems to me that you may be creating some huge number of jobs and your program is running out of memory. As you fill up memory, the JVM works harder and harder on GC, which shows up as your process getting slower and slower. You could connect to the application using jconsole to verify the number of threads and the memory use. You could also take a thread dump (kill -QUIT pid) and see how many jobs you have forked.
If you are creating some huge number of jobs and your ExecutorService just can't keep up, then you will need to throttle the job production. There are a couple of different ways to do that. Here's what I use:
Process Large File for HTTP Calls in Java
A couple of other solutions are linked from there.
I thought creating threads itself might have overhead, so I reduced the pool size to 4 (I have an eight-core processor) but that did nothing.
Yeah, this doesn't seem like a processor issue. I would move it back to 8. If that really makes the box unusable, then try 7 or 6 threads in your ExecutorService.
Edit:
After looking at the code more, you are doing a bunch of unsynchronized data updates that are going to cause strange results. Any time you modify shared memory (in your case shared static fields), you need some synchronization, both for mutual exclusion (++ is not atomic) and for memory visibility.
I'd consider using AtomicLong and friends if you can, but you should also read some threading tutorials on shared memory and synchronization: http://docs.oracle.com/javase/tutorial/essential/concurrency/sync.html
C++ multithreading performance slower than single threaded code
You did not pass any std::launch policy to std::async, which leaves the implementation a lot of freedom:
Behaves as if (2) is called with policy being std::launch::async | std::launch::deferred. In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.
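A minimal sketch of forcing a real thread by passing the policy explicitly (the summing task and names here are my own, for illustration):

```cpp
#include <future>
#include <numeric>
#include <vector>

// Without an explicit policy, the implementation may defer the call and
// run it on the calling thread only when .get() is invoked. Passing
// std::launch::async guarantees the work runs on a new thread.
int parallelSum(const std::vector<int>& data)
{
    auto mid = data.begin() + data.size() / 2;
    auto fut = std::async(std::launch::async,
                          [&] { return std::accumulate(data.begin(), mid, 0); });
    int rest = std::accumulate(mid, data.end(), 0);  // this thread does the other half
    return fut.get() + rest;
}
```

Whether the forced thread actually helps still depends on the task being large enough to pay for the thread startup, per the points below.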
But also be aware that more generally, using more threads, especially on small tasks may not be faster.
- Where dist_calculation or any task you want to thread is a small amount of work, be aware of the overheads. Creating a new thread has a relatively high cost, and there is also overhead for whatever internal pool std::async uses, plus the promises and futures.
- Additionally, as written, it is likely you create more vectors, with more dynamic memory allocation, and you will need to merge the results, which also has a cost.
- In more complex cases, if synchronization, e.g. with std::mutex, is involved, it may cost more performance than the additional threads gain.
- In some cases the bottleneck will not be the CPU. For example, it may be disk/storage speed (including the page/swap file), network speed (including remote servers), or even memory bandwidth (excepting NUMA-aware optimizations, which are a lot more complex than just using std::async). Multithreading these cases will just add overhead, with no benefit.
You should first make use of other basic optimisations where possible, such as calling reserve to size the vectors and avoid unneeded allocations and copies, or maybe calling resize and using vector[index] = a instead of push_back, etc.
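A small sketch of that, with the names centre/input borrowed from the question (the function name is my own):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Pre-size the output once, so the loop does no reallocations and no
// push_back bookkeeping - just plain indexed stores.
std::vector<int> distances(const std::vector<int>& input, int centre)
{
    std::vector<int> out(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
        out[i] = std::abs(centre - input[i]);
    return out;
}
```

The indexed-store form is also easier for the compiler to auto-vectorise than a push_back loop.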
For something as simple as abs(centre - input[i]), you might get a lot more improvement from SIMD (Single Instruction, Multiple Data) optimisations. E.g. make sure you are compiling with optimisations such as SSE2 enabled, and if the compiler doesn't vectorise the loop appropriately (I think the push_back may interfere - test!), alter it slightly so it does, or maybe even use the vector instructions explicitly (for x86, check out _mm_add_epi32 etc.).
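A sketch of the explicit-intrinsics route (x86 with SSE2 assumed; the function name and the int32 element type are my own). Since _mm_abs_epi32 needs SSSE3, the absolute value is done with the classic sign-mask trick:

```cpp
#include <cstdint>
#include <cstdlib>
#include <emmintrin.h>  // SSE2
#include <vector>

// Sum abs(centre - input[i]) four 32-bit lanes at a time.
// Note: the 32-bit lane accumulator can overflow for very large inputs;
// this is only an illustration of the technique.
int64_t sumAbsDiff(const std::vector<int32_t>& input, int32_t centre)
{
    const __m128i c = _mm_set1_epi32(centre);
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 4 <= input.size(); i += 4) {
        __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&input[i]));
        __m128i d    = _mm_sub_epi32(c, v);                         // centre - input[i]
        __m128i sign = _mm_srai_epi32(d, 31);                       // all-ones where negative
        __m128i ad   = _mm_sub_epi32(_mm_xor_si128(d, sign), sign); // abs(d)
        acc = _mm_add_epi32(acc, ad);
    }
    int32_t lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int64_t total = (int64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < input.size(); ++i)                                   // scalar tail
        total += std::abs(centre - input[i]);
    return total;
}
```

In practice, check first whether the compiler already produces equivalent vector code from the plain loop before hand-writing intrinsics.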