How to Profile Multi-Threaded C++ Application on Linux

How to profile a multi-threaded C++ application on Linux?

Edit: added another answer on poor man's profiler, which IMHO is better for multithreaded apps.

Have a look at OProfile. Its profiling overhead is negligible and it supports multithreaded applications, as long as you don't need to profile mutex contention (which is a very important part of profiling multithreaded applications).

Does gprof support multithreaded applications?

Unless you change the processing model, gprof works fine.

Changing the processing model means using a co-processor or GPUs as computing units. In the worst case you have to call the setitimer function manually in every thread so that each thread is sampled, but with recent versions (2013-14) this is no longer needed.
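If you do hit that per-thread limitation on an older toolchain, the usual workaround is to arm ITIMER_PROF from inside each thread so gprof's SIGPROF samples are delivered there. A minimal C++ sketch (compile with -pg -pthread); the worker name and the 10 ms interval are illustrative, not part of any particular gprof recipe:

    // Sketch only: arm the profiling timer inside each new thread so gprof's
    // SIGPROF samples fire in it (only needed where ITIMER_PROF is not shared
    // across threads). profiled_worker and the interval are illustrative.
    #include <sys/time.h>
    #include <pthread.h>

    static void arm_profiling_timer() {
        itimerval itv{};
        itv.it_interval.tv_usec = 10000;   // 10 ms sampling period (assumed)
        itv.it_value = itv.it_interval;
        setitimer(ITIMER_PROF, &itv, nullptr);
    }

    static void* profiled_worker(void*) {
        arm_profiling_timer();             // make samples fire in this thread
        // ... the thread's real work goes here ...
        return nullptr;
    }

    int main() {
        pthread_t tid;
        pthread_create(&tid, nullptr, profiled_worker, nullptr);
        pthread_join(tid, nullptr);
    }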

In certain cases gprof behaves erratically, so I advise using Intel VTune, which gives more accurate and more detailed information.

Profiling C++ multi-threaded applications

The following are good tools for multithreaded applications; evaluation copies are available.

  1. Runtime sanity check tool
     • Thread Checker -- Intel Thread Checker / VTune
  2. Memory consistency-check tools (memory usage, memory leaks)
     • Memory Validator
  3. Performance analysis (CPU usage)
     • AQTime

EDIT: Intel Thread Checker can be used to diagnose data races, deadlocks, stalled threads, abandoned locks, etc. Have plenty of patience when analyzing the results, as it is easy to get confused; a minimal example of the kind of data race these tools catch is sketched after the tips below.

A few tips:

  1. Disable the features that are not required. (When identifying deadlocks, data-race detection can be disabled, and vice versa.)
  2. Use an instrumentation level that matches your need. Levels like "All Functions" and "Full Image" are used for data races, whereas "API Imports" can be used for deadlock detection.
  3. Use the context-sensitive "Diagnostic Help" menu often.
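For reference, the kind of data race these checkers flag can be as small as two threads bumping a shared counter with no synchronization. A minimal sketch (my own example, not taken from the Thread Checker documentation):

    // Sketch of a textbook data race: two threads increment the same counter
    // without a mutex or atomic, so read-modify-write updates can be lost.
    #include <thread>
    #include <iostream>

    int main() {
        int counter = 0;                       // shared, unsynchronized
        auto bump = [&counter] {
            for (int i = 0; i < 100000; ++i)
                ++counter;                     // racy increment
        };

        std::thread a(bump), b(bump);
        a.join();
        b.join();

        // Often prints less than 200000 because increments interleave.
        std::cout << counter << '\n';
    }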

Need thoughts on profiling of multi-threading in C on Linux

Since an increase in the size of the input data may also increase the memory requirement of each thread, loading all data in advance is definitely not a workable option. Therefore, in order not to increase the memory requirement of each thread, each thread reads data in small chunks, processes a chunk, then reads and processes the next chunk, and so on.

Just this, on its own, can cause a drastic slowdown.

If there is sufficient memory, reading one large chunk of input data will always be faster than reading data in small chunks, especially from each thread. Any I/O benefit from read-ahead and caching disappears when you break the reads into small pieces. Even allocating one big block of memory once is much cheaper than allocating small blocks many, many times.
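As a rough illustration, here is a hedged C++ sketch contrasting one pre-sized read with a loop of small reads and small allocations; the file path and chunk size are placeholders, not anything from the original question:

    // Sketch: one size query, one allocation, one large read...
    #include <fstream>
    #include <vector>
    #include <string>
    #include <cstddef>

    std::vector<char> read_all(const std::string& path) {
        std::ifstream in(path, std::ios::binary | std::ios::ate);
        if (!in) return {};
        const std::streamsize size = in.tellg();
        in.seekg(0, std::ios::beg);
        std::vector<char> buf(static_cast<std::size_t>(size));
        in.read(buf.data(), size);
        return buf;
    }

    // ...versus paying for an allocation and a read call on every chunk.
    void read_chunked(const std::string& path, std::size_t chunk = 4096) {
        std::ifstream in(path, std::ios::binary);
        while (in) {
            std::vector<char> buf(chunk);      // repeated small allocations
            in.read(buf.data(), static_cast<std::streamsize>(chunk));
            // ... process buf.data(), in.gcount() bytes ...
        }
    }

Even if you have to keep the chunked design for memory reasons, hoisting the buffer out of the loop and reusing it removes most of the allocation cost.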

As a sanity check, you can run htop to ensure that at least all your cores are being topped out during the run. If not, your bottleneck could be outside of your multi-threading code.

Within the threading,

  • threading context switches due to many threads can cause sub-optimal speedup
  • as mentioned by others, a cold cache due to not reading memory contiguously can cause slowdowns

But re-reading your OP, I suspect the slowdown has something to do with your data input/memory allocation. Where exactly are you reading your data from? Some kind of socket? Are you sure you need to allocate memory more than once in your thread?

Some algorithm in your worker threads is likely to be suboptimal/expensive.

Automatic use of multi-threading in Linux

You could try GCC's auto-parallelization flags (-floop-parallelize-all -ftree-parallelize-loops=8), which use pthreads under the hood. You have to be careful how you write your code, of course: the compiler has to know there is no dependence between iterations of your loop in order to be able to parallelize it.
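A minimal sketch of the kind of loop that qualifies; the compile command in the comment is an assumption about your setup, not a guaranteed recipe:

    // Sketch: each iteration touches only its own index, so there is no
    // cross-iteration dependence and GCC may split the loop across threads.
    // Assumed compile command:
    //   g++ -O2 -floop-parallelize-all -ftree-parallelize-loops=8 autopar.cpp
    #include <vector>
    #include <cstddef>

    int main() {
        const std::size_t n = 1 << 24;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

        for (std::size_t i = 0; i < n; ++i)
            c[i] = 2.0 * a[i] + b[i];       // independent iterations

        return static_cast<int>(c[0]);      // keep the result live
    }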

But to be honest, you get nothing for free; unless your code is designed for multiple processors, you will never gain much.

Can I take advantage of multiple cores in a multi-threaded application that I develop?

You don't need to do anything special. Create as many threads as you want and the OS will schedule them, together with the threads from all the other processes, over all available cores.
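A minimal std::thread sketch; the busy-work inside each thread is purely illustrative. The kernel spreads the threads over whatever cores are free, with no extra code on your part:

    // Sketch: spawn one thread per hardware thread and let the kernel place them.
    #include <algorithm>
    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::atomic<long> total{0};

        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([&total] {
                long local = 0;
                for (long j = 0; j < 10000000; ++j)
                    local += j % 7;            // some CPU-bound work
                total += local;
            });

        for (auto& t : workers)
            t.join();                          // scheduled across all cores by the OS

        std::cout << n << " threads, total = " << total << '\n';
    }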


