Why OpenMP Under Ubuntu 12.04 Is Slower Than the Serial Version

Why OpenMP under Ubuntu 12.04 is slower than the serial version

There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.

Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake since it does not return the wall-clock (real) time but instead the accumulated CPU time for all process threads (and on some Unix flavours even the accumulated CPU time for all child processes). Your parallel code shows better performance on Windows since there clock() returns the real time and not the accumulated CPU time.

The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():

double start = omp_get_wtime();
#pragma omp parallel for
for (int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf\n", finish - start);

For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:

struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for (int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf\n", (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
                          (start.tv_sec + 1.e-9 * start.tv_nsec));

Simple OpenMP parallel for loop slower than serial computation

As Gilles pointed out in the comments, the problem was that I was measuring time with clock(), which adds up the CPU ticks of all cores. With

chrono::high_resolution_clock::now();

I get the expected speed-up.
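
For illustration, a minimal, self-contained sketch of that timing approach (the names sinTable and sizen are reused from the earlier snippets; the size here is just a placeholder, and the loop needs -fopenmp to actually run in parallel):

#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const int sizen = 10000000;              // placeholder size
    std::vector<double> sinTable(sizen);

    auto start = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for
    for (int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    auto finish = std::chrono::high_resolution_clock::now();

    // Wall-clock time, not the accumulated CPU time that clock() reports
    std::printf("elapsed: %f s\n",
                std::chrono::duration<double>(finish - start).count());
}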

For me the question is settled, but maybe we can leave this as an example for future newcomers like me to be referred to. If a mod believes otherwise, the post can be deleted.
Thanks again for the help.

Why OpenMP with random number generation is slower than serial code

You're using one random number generator across many threads. Each call for a new random number blocks all the other parallel calls until it has finished.

If you profile the code, chances are that all (or most) of the execution time is spent in some form of mutex locking/unlocking. This problem is called contention, and your scenario is a textbook example of how to cause it.

If you use std::thread and give each thread a separate RNG, you will achieve pretty much 100% parallelization for that part of the code.

Some code to get you started with std::thread is below. Note the use of std::ref.

#include <array>
using std::array;
#include <cstddef>
using std::size_t;
#include <functional>
using std::ref;
#include <iostream>
using std::cout;
#include <thread>
using std::thread;
#include <vector>
using std::vector;
#include <random>
using mersenne_twister = std::mt19937;

// Generates the sequence 0, 1, ..., N-1 (used below as per-engine seeds).
template<class T, T N>
array<T, N> series_of_numbers()
{
    array<T, N> arr;
    for (T i = 0; i < N; ++i)
        arr[i] = i;

    return arr;
}

// Fills [begin, end) with random numbers drawn from the engine owned by one thread.
template<class Iterator, class Engine>
void generate_rng(Iterator begin, Iterator end, Engine& engine)
{
    std::uniform_real_distribution<double> dist;
    for (auto it = begin; it != end; ++it)
        *it = dist(engine);
}

int main()
{
    const size_t amount_of_random_numbers = 1024;

    // Engines: one per thread, each with its own seed
    const size_t Nrng = 4;
    auto seed_values = series_of_numbers<size_t, Nrng>(); // choose other seeds if you wish
    array<mersenne_twister, Nrng> engines;
    for (size_t i = 0; i < Nrng; ++i)
        engines[i].seed(seed_values[i]);

    vector<thread> threads;
    vector<double> rngs(amount_of_random_numbers);

    // Relevant iterators with offsets: thread n gets its own quarter of rngs
    vector<vector<double>::iterator> begins = { rngs.begin(),
                                                rngs.begin() +     amount_of_random_numbers/Nrng,
                                                rngs.begin() + 2 * amount_of_random_numbers/Nrng,
                                                rngs.begin() + 3 * amount_of_random_numbers/Nrng };

    vector<vector<double>::iterator> ends = { rngs.begin() +     amount_of_random_numbers/Nrng,
                                              rngs.begin() + 2 * amount_of_random_numbers/Nrng,
                                              rngs.begin() + 3 * amount_of_random_numbers/Nrng,
                                              rngs.end() };

    // Create threads; std::ref makes each thread use its own engine by reference
    for (size_t n = 0; n < Nrng; ++n)
        threads.emplace_back(generate_rng<vector<double>::iterator, mersenne_twister>,
                             begins[n], ends[n], ref(engines[n]));

    // Join threads -> wait here until all the work is done.
    for (size_t n = 0; n < Nrng; ++n)
        threads[n].join();

    // rngs is filled with magical values!
    for (auto number : rngs)
        cout << number << '\n';
}

Live demo at Coliru, and another version where you can change the number of threads to any multiple of 4.
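
If you prefer to stay with OpenMP instead of std::thread, the same idea carries over: give each thread its own engine. A minimal sketch, with the per-thread seeding chosen purely for illustration (use better seeds for serious work):

#include <cstddef>
#include <omp.h>
#include <random>
#include <vector>

int main()
{
    const std::size_t amount_of_random_numbers = 1024;
    std::vector<double> rngs(amount_of_random_numbers);

    #pragma omp parallel
    {
        // One engine per thread, seeded with the thread id so the streams differ.
        std::mt19937 engine(omp_get_thread_num());
        std::uniform_real_distribution<double> dist;

        // No shared state inside the loop, so no contention.
        #pragma omp for
        for (long i = 0; i < (long)amount_of_random_numbers; ++i)
            rngs[i] = dist(engine);
    }
}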

Why OpenMP is slower than a sequential program for a simple reduction?

As Max Langhof and user463035818 suggested, the program is memory bound. I changed the program to do something more than a plain accumulation. That is, I changed sum += ary[i] to sum += (pow(ary[i], 1.1) + pow(ary[i], 1.2)) / 100000000.0, made the same change in the parallel program, and measured the time again. The parallel program became 2x faster. If the program is memory bound, I guess there is not much I can do to make it faster with OpenMP. Please let me know if you think otherwise.
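
For reference, a minimal sketch of the kind of change described above (the array name, size, and initialization are placeholders, not the asker's exact program):

#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    const long N = 100000000;                 // placeholder size
    std::vector<double> ary(N, 1.0);          // placeholder contents

    double start = omp_get_wtime();
    double sum = 0.0;

    // With only `sum += ary[i]` the loop is limited by memory bandwidth;
    // the extra pow() calls add enough arithmetic per element for the
    // parallel version to pull ahead.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; ++i)
        sum += (std::pow(ary[i], 1.1) + std::pow(ary[i], 1.2)) / 100000000.0;

    std::printf("sum = %f, time = %f s\n", sum, omp_get_wtime() - start);
}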

OpenMP vectorised code runs way slower than O3 optimized code

The non-OpenMP vectorizer is defeating your benchmark with loop inversion.

Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this, where the function is large enough that call/ret overhead is minor and constant propagation isn't important, this is a pretty good way to make sure the compiler doesn't hoist work out of the loop.

And in the future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. E.g. increasing the repeat count from 500 to 1000 should give the same average time per iteration in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)
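
A minimal sketch of that pattern; the function and array names and the loop body are placeholders, not the asker's code:

#include <cstddef>

// noinline/noclone keeps GCC from inlining the kernel into the repeat loop
// and hoisting the per-element work out of it.
__attribute__((noinline, noclone))
void add_arrays(const float* a, const float* b, float* result, std::size_t n)
{
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        result[i] = a[i] + b[i];
}

// Benchmark repeat loop: each call now really re-reads the inputs and
// re-writes the output, so the memory traffic can't be optimized away.
void benchmark(const float* a, const float* b, float* result, std::size_t n, int repeats)
{
    for (int r = 0; r < repeats; ++r)
        add_arrays(a, b, result, n);
}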


After adding the missing #pragma omp simd to the code, yeah, I can reproduce this. On an i7-6700K Skylake (3.9 GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get an average time of 18266 us, but with -O3 -fopenmp I get an average time of 39772 us.

With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)

But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits, just as it reaches the max of 770 MiB.

So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.

The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.

.L4:                                            # outer loop
        movss   xmm0, DWORD PTR [rbx+rdx*4]
        addss   xmm0, DWORD PTR [r13+0+rdx*4]   # one scalar operation
        mov     eax, 500
.L3:                                            # do {
        sub     eax, 1                          #   empty inner loop after inversion
        jne     .L3                             # } while(--i);

        add     rdx, 1
        movss   DWORD PTR [rcx], xmm0
        add     rcx, 4
        cmp     rdx, 67108864
        jne     .L4

This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.

Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.

There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.


The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)



Looking at memory bandwidth while the benchmark is running:

Running intel_gpu_top -l while the proper benchmark is running (OpenMP, or with __attribute__((noinline, noclone))) shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus; that's why a GPU-monitoring program is useful here.

$ intel_gpu_top -l
 Freq MHz      IRQ RC6 Power     IMC MiB/s           RCS/0           BCS/0           VCS/0          VECS/0
 req  act       /s   %     W     rd     wr       %  se  wa       %  se  wa       %  se  wa       %  se  wa
   0    0        0  97  0.00  20421   7482    0.00   0   0    0.00   0   0    0.00   0   0    0.00   0   0
   3    4       14  99  0.02  19627   6505    0.47   0   0    0.00   0   0    0.00   0   0    0.00   0   0
   7    7       20  98  0.02  19625   6516    0.67   0   0    0.00   0   0    0.00   0   0    0.00   0   0
  11   10       22  98  0.03  19632   6516    0.65   0   0    0.00   0   0    0.00   0   0    0.00   0   0
   3    4       13  99  0.02  19609   6505    0.46   0   0    0.00   0   0    0.00   0   0    0.00   0   0

Note the ~19.6 GB/s read / 6.5 GB/s write. Read is roughly 3x write since it's not using NT stores for the output stream.
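
For illustration only, a minimal sketch of what writing the output stream with NT stores could look like (this is not part of the benchmark under discussion); it assumes AVX is enabled, 32-byte-aligned arrays, and a length that is a multiple of 8:

#include <immintrin.h>

void add_nt(const float* a, const float* b, float* result, long n)
{
    for (long i = 0; i < n; i += 8) {
        __m256 sum = _mm256_add_ps(_mm256_load_ps(a + i), _mm256_load_ps(b + i));
        _mm256_stream_ps(result + i, sum);  // NT store: bypasses cache, no RFO read of the output
    }
    _mm_sfence();  // order the NT stores before later loads/stores that depend on them
}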

But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.

 Freq MHz      IRQ RC6 Power     IMC MiB/s           RCS/0           BCS/0           VCS/0          VECS/0
 req  act       /s   %     W     rd     wr       %  se  wa       %  se  wa       %  se  wa       %  se  wa
...
   8    8       17  99  0.03    365     85    0.62   0   0    0.00   0   0    0.00   0   0    0.00   0   0
   9    9       17  99  0.02    349     90    0.62   0   0    0.00   0   0    0.00   0   0    0.00   0   0
   4    4        5 100  0.01    303     63    0.25   0   0    0.00   0   0    0.00   0   0    0.00   0   0
   7    7       15 100  0.02    345     69    0.43   0   0    0.00   0   0    0.00   0   0    0.00   0   0
  10   10       21  99  0.03    350     74    0.64   0   0    0.00   0   0    0.00   0   0    0.00   0   0

vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)

Parallel execution using OpenMP takes longer than serial execution in C++, am I calculating execution time in the right way?

OpenMP implements multithreading internally for parallel processing, and the benefit of multithreading can only be measured with a large volume of data. With a very small volume of data you cannot measure the performance of a multithreaded application. The reasons:

a) To create a thread, the OS needs to allocate memory for each thread, which takes time (even if only a tiny bit).

b) When you create multiple threads, context switching is needed, which also takes time.

c) The memory allocated to the threads needs to be released, which also takes time.

d) It depends on the number of processors and the total memory (RAM) in your machine.

So when you run a small operation with multiple threads, its performance will be about the same as that of a single thread (the OS assigns one thread to every process by default, which is called the main thread). Your outcome is therefore expected in this case. To measure the performance of a multithreaded architecture, use a large amount of data with a complex operation; only then will you see the difference.
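
A minimal sketch of such a comparison, with array sizes and the loop body chosen purely for illustration: time the same loop serially and with OpenMP for a tiny and a larger input (compile with -fopenmp).

#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

// Time one pass over the data, serial or parallel.
double time_pass(std::vector<double>& v, bool parallel)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel for if(parallel)
    for (long i = 0; i < (long)v.size(); ++i)
        v[i] = std::sqrt(v[i]) * 2.0;
    return omp_get_wtime() - t0;
}

int main()
{
    for (long n : {1000L, 10000000L})        // tiny vs. larger workload (illustrative sizes)
    {
        std::vector<double> v(n, 1.0);
        std::printf("n = %ld: serial %f s, parallel %f s\n",
                    n, time_pass(v, false), time_pass(v, true));
        // For the tiny n, thread startup/synchronization overhead dominates,
        // so the parallel time is no better (often worse) than the serial time.
    }
}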


