C++: Timing in Linux (using clock()) is out of sync (due to OpenMP?)
user 0m45.735s
clock() measures the CPU time the process used (as well as it can), per C11 §7.27.2.1:

The clock function returns the implementation's best approximation to the processor time used by the program since the beginning of an implementation-defined era related only to the program invocation.

It does not measure wall-clock time. Thus clock() reporting a time close to the user time that time reports is normal and standard-conforming.
To measure elapsed (wall-clock) time, clock_gettime is probably the best option if you can assume POSIX. The standard function time() can also be used for that, but it is not very fine-grained.
OpenMP takes more time in my program with gcc 5.4
You should use omp_get_wtime() instead to measure the wall-clock time:

double start = omp_get_wtime();  // start the timer

// ... computation to be timed ...

double end = omp_get_wtime();    // end the timer
double dif = end - start;        // elapsed wall-clock time in seconds
printf("elapsed time: %f seconds\n", dif);
C++ ctime.h will not correctly calculate time
The clock function actually measures the time you spend actively on the CPU, not the wall time. It is not very useful in your case because it measures the combined CPU time of all threads, which is usually more than the wall time.

If you do not need high time resolution, you can use the time function, which measures wall time but only has one-second resolution. If you need more precise timing, you can take a look at this answer.
Parallel execution using OpenMP takes longer than serial execution in C++, am I calculating execution time in the right way?
OpenMP implements multithreading internally for parallel processing, and multithreading's performance can only be measured with a large volume of data. With a very small volume of data you cannot measure the performance of a multithreaded application. The reasons:

a) To create a thread, the OS needs to allocate memory for each thread, which takes time (even if only a tiny amount).
b) Multiple threads require context switching, which also takes time.
c) The memory allocated to the threads needs to be released, which also takes time.
d) The gains depend on the number of processors and the total memory (RAM) in your machine.

So when you try a small operation with multiple threads, its performance will be the same as a single thread (by default the OS assigns one thread to every process, called the main thread). Your outcome is therefore expected in this case. To measure the performance of a multithreaded architecture, use a large amount of data with a complex operation; only then will you see the difference.
Why did my C program take more time than the time it calculated itself?
There is always overhead for starting up the process, starting the runtime, and closing the program, and time itself probably also has overhead.

On top of that, in a multi-process operating system your process can be "switched out", meaning that other processes run while yours is put on hold. This can affect the timings too.
Let me explain the output of time:

real is the actual clock time, including all overhead.
user is time spent in the actual program.
sys is time spent by the kernel (the switching out I talked about earlier, for example).

Note that user + sys is very close to your time: 1m7.607s + 0m3.785s == 71.392s.
Finally, how did you calculate the time? Without that information it's hard to tell exactly what the problem (if any) is.
OpenMP parallel prefix sum speedup
First of all, just to be sure, since you state that htop
shows that a single core is being used, make sure that you have enabled OpenMP support in your compiler. The option to do so is -fopenmp
for GCC, -xopenmp
for Sun/Oracle compilers and -openmp
for Intel compilers.
Second of all, n = 20
might be too low a cut-off for the parallel implementation. A shameless plug: see this course material from a workshop on OpenMP that a colleague of mine gave some months ago. Several parallel versions with tasking are discussed there, starting from slide 20.
Third, ptime
is a Solaris command, not specific to SPARC since it is also available in the x86 version. Many process related Solaris commands have the p
prefix in their names. Note that in your case time
is more likely to be the built-in implementation that Bash provides rather than the standalone binary.
Fourth, and maybe the real answer to your question: you are missing a parallel
region in your code, so the task directives don't work at all :) You should rewrite your code as follows:
long comp_fib_numbers(int n)
{
   long fnm1, fnm2, fn;
   if ( n == 0 || n == 1 ) return(n);

   // In case the sequence gets too short, execute the serial version
   if ( n < 20 )
   {
      return( comp_fib_numbers(n-1) + comp_fib_numbers(n-2) );
   }
   else
   {
      #pragma omp parallel // <--- You are missing this parallel region
      {
         #pragma omp single
         {
            #pragma omp task shared(fnm1)
            fnm1 = comp_fib_numbers(n-1);
            #pragma omp task shared(fnm2)
            fnm2 = comp_fib_numbers(n-2);
         }
         #pragma omp taskwait
      }

      fn = fnm1 + fnm2;
      return(fn);
   }
}
You could make the code even more terse by using the if
clause to switch off the parallel region:
long comp_fib_numbers(int n)
{
   long fnm1, fnm2, fn;
   if ( n == 0 || n == 1 ) return(n);

   #pragma omp parallel if(n >= 20)
   {
      #pragma omp single
      {
         #pragma omp task shared(fnm1)
         fnm1 = comp_fib_numbers(n-1);
         #pragma omp task shared(fnm2)
         fnm2 = comp_fib_numbers(n-2);
      }
      #pragma omp taskwait
   }

   fn = fnm1 + fnm2;
   return(fn);
}
If n
happens to be less than 20, the parallel region executes single-threaded. Since parallel regions are usually extracted into separate functions, there would still be an additional function call, unless the compiler chooses to produce duplicate code. That's why it is recommended that the serial implementation be extracted into its own function:
long comp_fib_numbers_serial(int n)
{
   if ( n == 0 || n == 1 ) return(n);
   return( comp_fib_numbers_serial(n-1) + comp_fib_numbers_serial(n-2) );
}

long comp_fib_numbers(int n)
{
   long fnm1, fnm2, fn;
   if ( n < 20 ) return comp_fib_numbers_serial(n);

   #pragma omp parallel
   {
      #pragma omp single
      {
         #pragma omp task shared(fnm1)
         fnm1 = comp_fib_numbers(n-1);
         #pragma omp task shared(fnm2)
         fnm2 = comp_fib_numbers(n-2);
      }
      #pragma omp taskwait
   }

   fn = fnm1 + fnm2;
   return(fn);
}
Edit: Now that I've looked at the code that you have linked to, I can see that the call to comp_fib_numbers
is embedded into a parallel
region. So just disregard my comment about the missing parallel
region if you already have one in your code. I will leave it here just for completeness. Try tweaking the value at which the switch between the parallel and the serial version occurs. On modern processors it might be quite high and the example that you have seen is quite old. Also make sure that no dynamic teams are used by either setting the environment variable OMP_DYNAMIC
to false
(or to FALSE
) or by calling omp_set_dynamic(0);
someplace before the parallel region.
You haven't stated what your compiler is but mind that OpenMP 3.0 is supported by GCC since version 4.4, by Intel compilers since version 11.0, by Sun/Oracle compilers since version I_dont_know and is not supported at all by the Visual C/C++ compilers.
Observed speedup on a quad-socket Intel Xeon X7350 system (an old pre-Nehalem system with FSB):
$ time OMP_NUM_THREADS=1 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=1 ./fib.x 40 1.86s user 0.00s system 99% cpu 1.866 total
$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40 1.96s user 0.00s system 169% cpu 1.161 total
With the cut-off set to 25
(seems to be the optimal value for the X7350):
$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40 1.95s user 0.00s system 169% cpu 1.153 total
With the cut-off set to 25
and a separate function for the serial implementation:
$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40 1.52s user 0.00s system 171% cpu 0.889 total
See how the user time decreases by some 400 ms. This is because of the removed overhead.
These were measured with the code from the site that you linked to. The compiler used was GCC 4.4.6 on a 64-bit Scientific Linux 6.2 system.