Profiling Arbitrary Cuda Applications

Is there some in-code profiling of CUDA program

For quick, lightweight timing, you may want to have a look at the cudaEvent API.

Excerpt from the link above:

cudaEvent_t start, stop;

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

printf("Elapsed time: %f ms\n", milliseconds);

If you want a more full-featured profiling library, you should look at CUPTI.

cuFFT profiling issue

According to the nvprof documentation, api-trace-mode:

API-trace mode shows the timeline of all CUDA runtime and driver API calls

cuFFT is neither the CUDA runtime API nor the CUDA driver API. It is a library of routines for FFT, whose documentation is here.

You can still use either nvprof, the command line profiler, or the visual profiler, to gather data about how cuFFT uses the GPU, of course.

Any CUDA operation after cudaStreamSynchronize blocks until all streams are finished

I haven't found why the operations are blocking but I concluded that I can not do anything about it so I decided ti implement memory and streams pooling (as suggested in comments) to re-use GPU memory, pinned CPU memory and streams to avoid any kind of deletion.

In case anybody would be interested here is my solution. The start kernel behaves as asynchronous operation that schedules kernel and callback is called after the kernel is finished.

std::vector<Instance*> m_idleInstances;
std::vector<Instance*> m_workingInstances;

void startKernelAsync(...) {
// Search for finished stream.
while (m_idleInstances.size() == 0) {
if (m_idleInstances.size() == 0) {
std::chrono::milliseconds dur(10);

Instance* instance = m_idleInstances.back();

// Fill CPU pinned memory

cudaMemcpyAsync(..., stream);
runKernel<<<32, 32, 0, stream>>>(gpuMemory);
cudaMemcpyAsync(..., stream);


void findFinishedInstance() {
for (auto it = m_workingInstances.begin(); it != m_workingInstances.end();) {
Instance* inst = *it;
cudaError_t status = cudaStreamQuery(inst->stream);
if (status == cudaSuccess) {
it = m_workingInstances.erase(it);
else {

And at the and just wait for everybody to finish:

virtual void waitForFinish() {
while (m_workingInstances.size() > 0) {
Instance* instance = m_workingInstances.back();

And here is a graph form profiler, works as a charm!

Graph of streams

CUDA development on different cards?

Yes, it's a pretty common practice to use different GPUs for development and production. nVidia GPU generations are backward-compatible, so if your program runs on older card (that is if 320M (CC1.3)), it would certainly run on M2070 (CC2.0)).

If you want to get maximum performance, you should, however, profile your program on same architecture you are going to use it, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator provide much worse view of what's going on than running on no-matter-how-old GPU.

Regarding recursion: an attempt to compile a program with obvious recursion for 1.3 architecture produces compile-time error:

nvcc -arch=sm_13
./ Error: Recursive function call is not supported yet: factorial(int)

In more complex cases the program might compile (I don't know how smart the compiler is in detecting recursions), but certainly won't work: in 1.x architecture there was no call stack, and all function calls were actually inlined, so recursion is technically impossible.

However, I would strongly recommend you to avoid recursion at any cost: it goes against GPGPU programming paradigm, and would certainly lead to very poor performance. Most algorithms are easily rewritten without the use of recursion, and it is much more preferable way to utilize them, not only on GPU, but on CPU as well.

Related Topics

Leave a reply