C++: Timing in Linux (using clock()) is out of sync (due to OpenMP?)

user 0m45.735s

clock() measures the CPU time the process has used (as well as it can), per §7.27.2.1 of the C standard:

The clock function returns the implementation’s best approximation to the processor time used by the program since the beginning of an implementation-defined era related only to the program invocation.

and not wall-clock time. Thus clock() reporting a value close to the user time shown by the time command is normal and standard-conforming.
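For a concrete illustration of the difference, here is a minimal sketch that prints both the CPU time from clock() and the wall time from omp_get_wtime(); with OpenMP enabled (compile with -fopenmp for GCC), the clock() figure accumulates the CPU time of all threads and can therefore exceed the elapsed time. The commented-out call is a placeholder for your own computation:

#include <omp.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t c0 = clock();
    double  w0 = omp_get_wtime();

    // do_parallel_work();   // hypothetical placeholder for your OpenMP computation

    double cpu_seconds  = (double)(clock() - c0) / CLOCKS_PER_SEC; // summed over all threads
    double wall_seconds = omp_get_wtime() - w0;                    // elapsed wall-clock time
    printf("CPU: %f s, wall: %f s\n", cpu_seconds, wall_seconds);
    return 0;
}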

To measure elapsed time, if you can assume POSIX, clock_gettime is probably the best option. The standard function time() can also be used for that, but it is not very fine-grained.
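A minimal sketch of the clock_gettime approach, assuming a POSIX system (CLOCK_MONOTONIC is a common choice for elapsed time; very old glibc versions may need -lrt when linking):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   // wall-clock start

    // ... the work to be timed ...

    clock_gettime(CLOCK_MONOTONIC, &t1);   // wall-clock end
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %f s\n", elapsed);
    return 0;
}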

OpenMP takes more time in my program with gcc 5.4

You should use omp_get_wtime() instead to measure the wall-clock time:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double start = omp_get_wtime();   // start the timer

    // ... computation to be timed ...

    double end = omp_get_wtime();     // end the timer
    double dif = end - start;         // elapsed wall-clock time in seconds
    printf("the time of dif is %f\n", dif);
    return 0;
}

C++ ctime.h will not correctly calculate time

The clock function actually measures the time your process spends actively on the CPU, not the wall time. It is not very useful in your case because it measures the combined CPU time of all threads, which is usually more than the wall time.

If you do not need high time resolution, you can use the time() function, which measures wall time but has only one-second resolution. If you need more precise timing, you can take a look at this answer.
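The linked answer is not reproduced here, but one common portable C++ option for more precise wall-clock timing is std::chrono::steady_clock; a minimal sketch:

#include <chrono>
#include <cstdio>

int main()
{
    auto t0 = std::chrono::steady_clock::now();   // wall-clock start

    // ... the work to be timed ...

    auto t1 = std::chrono::steady_clock::now();   // wall-clock end
    double elapsed = std::chrono::duration<double>(t1 - t0).count();
    std::printf("elapsed: %f s\n", elapsed);
    return 0;
}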

Parallel execution using OpenMP takes longer than serial execution in C++, am I calculating execution time in the right way?

OpenMP uses multithreading internally for parallel processing, and the benefit of multithreading only becomes measurable with a large volume of data. With a very small volume of data you cannot measure the performance gain of a multithreaded application. The reasons:

a) To create a thread, the OS needs to allocate memory for each thread, which takes time (even if only a tiny bit).

b) Creating multiple threads requires context switching, which also takes time.

c) The memory allocated to the threads has to be released, which also takes time.

d) It depends on the number of processors and the total memory (RAM) in your machine.

So when you run a small operation with multiple threads, its performance will be about the same as with a single thread (by default the OS assigns one thread to every process, which is called the main thread). So your result is expected in this case. To measure the performance of a multithreaded architecture, use a large amount of data with a complex operation; only then will you see the difference (a rough timing sketch follows below).
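As an illustration of that point (the element counts below are arbitrary choices, not taken from the question), you can time the same parallel reduction on a tiny and a large input with omp_get_wtime() and compare the two:

#include <omp.h>
#include <stdio.h>
#include <vector>

// Time a simple parallel sum over n elements and report the elapsed wall time.
static double timed_parallel_sum(size_t n)
{
    std::vector<double> a(n, 1.0);
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += a[i];
    double t1 = omp_get_wtime();
    printf("n = %zu, sum = %f, time = %f s\n", n, sum, t1 - t0);
    return t1 - t0;
}

int main()
{
    timed_parallel_sum(1000);        // tiny input: thread start-up overhead dominates
    timed_parallel_sum(100000000);   // large input: the parallel speedup can show
    return 0;
}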

Why did my C program take more time than the time it calculated itself?

There is always overhead for starting up the process, starting the runtime, and closing the program, and time itself probably also has some overhead.

On top of that, in a multi-process operating system your process can be "switched out", meaning that other processes run while yours is put on hold. This can affect timings too.

Let me explain the output of time:

  • real is the actual elapsed wall-clock time, including all overhead.
  • user is the CPU time spent in your program's own (user-mode) code.
  • sys is the CPU time spent in the kernel on behalf of your process (handling system calls, for example).

Note that user + sys is very close to your time: 1m7.607s + 0m3.785s == 71.392s.

Finally, how did you calculate the time? Without that information it's hard to tell exactly what the problem (if any) is.
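If you want to see the user/sys split from inside the program itself, so that the value the program prints can be compared directly with what time reports, one POSIX option is getrusage(); a minimal sketch, assuming a Linux/POSIX system:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* ... the work to be timed ... */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);   // resource usage of the calling process so far
    printf("user: %ld.%06ld s, sys: %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}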

OpenMP parallel prefix sum speedup

First of all, just to be sure, since you state that htop shows that a single core is being used, make sure that you have enabled OpenMP support in your compiler. The option to do so is -fopenmp for GCC, -xopenmp for Sun/Oracle compilers and -openmp for Intel compilers.

Second of all, n = 20 might be too low a cut-off for the parallel implementation. A shameless plug - see this course material from a workshop on OpenMP that a colleague of mine gave some months ago. Several parallel versions with tasking are discussed there, starting from slide 20.

Third, ptime is a Solaris command, not specific to SPARC since it is also available in the x86 version. Many process related Solaris commands have the p prefix in their names. Note that in your case time is more likely to be the built-in implementation that Bash provides rather than the standalone binary.

Fourth, and maybe the real answer to your question - you are missing a parallel region in your code, so the task directives don't work at all :) You should rewrite your code as follows:

long comp_fib_numbers(int n)
{
    long fnm1, fnm2, fn;
    if ( n == 0 || n == 1 ) return(n);

    // In case the sequence gets too short, execute the serial version
    if ( n < 20 )
    {
        return(comp_fib_numbers(n-1) + comp_fib_numbers(n-2));
    }
    else
    {
        #pragma omp parallel // <--- You are missing this one parallel region
        {
            #pragma omp single
            {
                #pragma omp task shared(fnm1)
                fnm1 = comp_fib_numbers(n-1);
                #pragma omp task shared(fnm2)
                fnm2 = comp_fib_numbers(n-2);
            }
            #pragma omp taskwait
        }

        fn = fnm1 + fnm2;
        return(fn);
    }
}

You could make the code even more terse by using the if clause to switch off the parallel region:

long comp_fib_numbers(int n)
{
    long fnm1, fnm2, fn;
    if ( n == 0 || n == 1 ) return(n);

    #pragma omp parallel if(n >= 20)
    {
        #pragma omp single
        {
            #pragma omp task shared(fnm1)
            fnm1 = comp_fib_numbers(n-1);
            #pragma omp task shared(fnm2)
            fnm2 = comp_fib_numbers(n-2);
        }
        #pragma omp taskwait
    }

    fn = fnm1 + fnm2;
    return(fn);
}

If n happens to be less than 20, then the parallel region will execute single-threaded. Since parallel regions are usually extracted into separate functions, there would still be an additional function call, unless the compiler chooses to produce duplicate code. That's why it is recommended that the serial implementation be extracted into its own function:

long comp_fib_numbers_serial(int n)
{
    if ( n == 0 || n == 1 ) return(n);

    return (comp_fib_numbers_serial(n-1) + comp_fib_numbers_serial(n-2));
}

long comp_fib_numbers(int n)
{
    long fnm1, fnm2, fn;
    if ( n < 20 ) return comp_fib_numbers_serial(n);

    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task shared(fnm1)
            fnm1 = comp_fib_numbers(n-1);
            #pragma omp task shared(fnm2)
            fnm2 = comp_fib_numbers(n-2);
        }
        #pragma omp taskwait
    }

    fn = fnm1 + fnm2;
    return(fn);
}

Edit: Now that I've looked at the code that you have linked to, I can see that the call to comp_fib_numbers is embedded in a parallel region, so just disregard my comment about the missing parallel region if you already have one in your code; I will leave it here for completeness. Try tweaking the value at which the switch between the parallel and the serial version occurs - on modern processors it might be quite high, and the example that you have seen is quite old. Also make sure that no dynamic teams are used, either by setting the environment variable OMP_DYNAMIC to false (or FALSE) or by calling omp_set_dynamic(0); somewhere before the parallel region.
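A minimal sketch of the omp_set_dynamic() route (the alternative is simply exporting OMP_DYNAMIC=false in the shell before running); the call just has to happen before the first parallel region:

#include <omp.h>

int main(void)
{
    omp_set_dynamic(0);   // disable dynamic adjustment of the number of threads

    #pragma omp parallel
    {
        #pragma omp single
        {
            // comp_fib_numbers(n) would be called here, as in the code above
        }
    }
    return 0;
}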

You haven't stated what your compiler is but mind that OpenMP 3.0 is supported by GCC since version 4.4, by Intel compilers since version 11.0, by Sun/Oracle compilers since version I_dont_know and is not supported at all by the Visual C/C++ compilers.

Observed speedup on a quad-socket Intel Xeon X7350 system (an old pre-Nehalem system with FSB):

$ time OMP_NUM_THREADS=1 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=1 ./fib.x 40 1.86s user 0.00s system 99% cpu 1.866 total
$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40 1.96s user 0.00s system 169% cpu 1.161 total

With the cut-off set to 25 (seems to be the optimal value for the X7350):

$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40 1.95s user 0.00s system 169% cpu 1.153 total

With the cut-off set to 25 and a separate function for the serial implementation:

$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40 1.52s user 0.00s system 171% cpu 0.889 total

See how the user time decreases by some 400 ms. This is because of the removed overhead.

These were measured with the code from the site that you have linked to. The compiler used was GCC 4.4.6 on a 64-bit Scientific Linux 6.2 system.


