Trouble launching CUDA kernels from static initialization code

The short version:

The underlying reason for the problem when class A is instantiated outside of main is that a particular hook routine, which is required to initialise the CUDA runtime library with your kernels, is not run before the constructor of class A is called. This happens because the C++ execution model makes no guarantees about the order in which static objects are instantiated and initialised. Your global scope class is being instantiated before the global scope objects which perform the CUDA setup have been initialised, so your kernel code is never loaded into the context before it is called, and a runtime error results.

As best as I can tell, this is a genuine limitation of the CUDA runtime API and not something easily fixed in user code. In your trivial example, you could replace the kernel call with a call to cudaMemset or one of the non-symbol based runtime API memset functions and it will work. This problem is completely limited to user kernels or device symbols loaded at runtime via the runtime API. For this reason, an empty default constructor would also solve your problem. From a design point of view, I would be very dubious of any pattern which calls kernels in the constructor. Adding a specific method for class GPU setup/teardown which doesn't rely on the default constructor or destructor would be a much cleaner and less error prone design, IMHO.
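To make that suggestion concrete, here is a minimal sketch of the explicit setup/teardown pattern; the class layout, the initKernel, and the buffer size are invented for illustration:

__global__ void initKernel(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1.0f;
}

class A
{
    float *devPtr;
    int n;
public:
    A() : devPtr(0), n(0) {}                 // constructor does no CUDA work

    void gpuSetup(int size)                  // explicit setup, called from main
    {
        n = size;
        cudaMalloc(&devPtr, n * sizeof(float));
        initKernel<<<(n + 255) / 256, 256>>>(devPtr, n);
        cudaDeviceSynchronize();
    }

    void gpuTeardown()                       // explicit teardown, also called from main
    {
        cudaFree(devPtr);
        devPtr = 0;
    }
};

A a_object;                                  // safe: no CUDA calls at static initialisation time

int main()
{
    a_object.gpuSetup(1024);
    // ... use the object ...
    a_object.gpuTeardown();
    return 0;
}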

In detail:

There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error. This is a part of the "lazy" context initialisation feature of the runtime API. You can confirm this for yourself as follows:

Here is a gdb trace of the revised example you posted. Note that I set a breakpoint on __cudaRegisterFatBinary, and that it is not reached before your static A constructor is called and the kernel launch fails:

talonmies@box:~$ gdb a.out 
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/talonmies/a.out...done.
(gdb) break '__cudaRegisterFatBinary'
Breakpoint 1 at 0x403180
(gdb) run
Starting program: /home/talonmies/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Scalar Field
[New Thread 0x7ffff5a63700 (LWP 10774)]
Class A
Kernel : invalid device function
[Thread 0x7ffff5a63700 (LWP 10774) exited]
[Inferior 1 (process 10771) exited with code 0377]

Here is the same procedure, this time with A instantiation inside main (which is guaranteed to happen after the objects which perform lazy setup have been initialised):

talonmies@box:~$ cat main.cu
#include "classA.h"

int main() {
    A a_object;
    std::cout << "Main" << std::endl;
    return 0;
}

talonmies@box:~$ nvcc --keep -arch=sm_30 -g main.cu
talonmies@box:~$ gdb a.out
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/talonmies/a.out...done.
(gdb) break '__cudaRegisterFatBinary'
Breakpoint 1 at 0x403180
(gdb) run
Starting program: /home/talonmies/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, 0x0000000000403180 in __cudaRegisterFatBinary ()
(gdb) cont
Continuing.
Scalar Field
[New Thread 0x7ffff5a63700 (LWP 11084)]
Class A
Main
[Thread 0x7ffff5a63700 (LWP 11084) exited]
[Inferior 1 (process 11081) exited normally]

If this is really a crippling problem for you, I would suggest contacting NVIDIA developer support and raising a bug report.

Static __device__ variable and kernels in separate file

Your code structure, where device code in one compilation unit references device code or device entities in another compilation unit, will require CUDA relocatable device code compilation and linking.

In the case of __device__ variables such as what you have here:

  1. Add -rdc=true to your nvcc compilation command line, to enable relocatable device code generation
  2. Add extern in front of devData in functionsGPU.cuh, so the header contains only a declaration
  3. Add the definition __device__ float devData; to staticGlobalMemory.cu

In the case of linking to a __device__ function defined in a separate file, you provide a prototype, typically via a header file, just as you would for any C++ function, and you also need to add -rdc=true to your nvcc compilation command line to enable device code linking. Steps 2 and 3 above are not needed.

That should fix the issue. Step 1 provides the necessary cross-module linkage, and steps 2 and 3 fix the duplicate definition problem you would otherwise have, since you are including the same variable via a header file in separate compilation units.
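By way of illustration, here is a minimal sketch of that layout. The file names, the devData variable and the three steps are from the question above; the useData kernel and the main.cu contents are invented for the example.

// functionsGPU.cuh
extern __device__ float devData;          // declaration only (step 2)
__global__ void useData(float *out);

// staticGlobalMemory.cu
#include "functionsGPU.cuh"
__device__ float devData;                 // the single definition (step 3)
__global__ void useData(float *out) { *out = devData; }

// main.cu
#include "functionsGPU.cuh"
int main()
{
    float val = 3.0f, *d_out, h_out = 0.0f;
    cudaMemcpyToSymbol(devData, &val, sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    useData<<<1,1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    return 0;
}

// compiled and linked with relocatable device code (step 1), e.g.:
//   nvcc -rdc=true staticGlobalMemory.cu main.cu -o app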

For a reference on how to set up device code compilation in Windows Visual Studio, see here.

Accessing class data members from within cuda kernel - how to design proper host/device interaction?

Your approach should be workable. When you pass an object by value as a kernel parameter (as you have indicated) there really isn't much setup that needs to be done for the transfer from host to device.

You need to properly allocate data on the host and the device, and use cudaMemcpy type operations at appropriate points to move the data, just as you would in an ordinary CUDA program.

One thing to be aware of when declaring an object at global scope, as you have done, is that it is recommended not to use CUDA API calls in the object's constructor or destructor. The reasons are covered here; I won't repeat them. Although that treatment mostly focuses on kernels launched before main, CUDA lazy initialization can also impact any CUDA API call that is executed outside of main scope, which applies to constructors and destructors of objects instantiated at global scope.

What follows is a fleshed out example from what you have shown. I mostly didn't change the code you had already written, just added some method definitions for the ones you hadn't. There's obviously a lot of different possible approaches here. For more examples you might want to look at the CUDA C++ integration sample code.

Here's a worked example around what you have shown:

$ cat t1236.cu
#include <cstdio>

class myClass
{
public:
    bool bool_var;    // Set from host and readable from device
    int data_size;    // Set from host
    __host__ myClass();
    __host__ ~myClass();
    __host__ void setValues(bool iftrue, int size);
    __device__ void dosomething(int device_parameter);
    __host__ void export_data();

    // completely unknown methods
    __host__ void prepareDeviceObj();
    __host__ void retrieveDataToHost();
private:
    int *data;        // Filled in device, shared between threads, at the end copied back to host for data output
    int *h_data;
};

__host__ myClass::myClass()
{
}

__host__ myClass::~myClass()
{
}

__host__ void myClass::prepareDeviceObj(){
    cudaMemcpy(data, h_data, data_size*sizeof(h_data[0]), cudaMemcpyHostToDevice);
}
__host__ void myClass::retrieveDataToHost(){
    cudaMemcpy(h_data, data, data_size*sizeof(h_data[0]), cudaMemcpyDeviceToHost);
}

__host__ void myClass::setValues(bool iftrue, int size)
{
    bool_var = iftrue;
    data_size = size;
    cudaMalloc(&data, data_size*sizeof(data[0]));
    h_data = (int *)malloc(data_size*sizeof(h_data[0]));
    memset(h_data, 0, data_size*sizeof(h_data[0]));
}

__device__ void myClass::dosomething(int idx)
{
    int toadd = idx+data_size;
    atomicAdd(&(data[idx]), toadd);  // data should be unique among threads
}
__host__ void myClass::export_data(){
    for (int i = 0; i < data_size; i++) printf("%d ", h_data[i]);
    printf("\n");
    cudaFree(data);
    free(h_data);
}

__global__ void myKernel(myClass obj)
{
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(idx < obj.data_size)
    {
        if(!obj.bool_var)
            printf("Object is not up to any task here!");
        else
        {
            //printf("Object is ready!");
            obj.dosomething(idx);
        }
    }
}

myClass globalInstance;

int main(int argc, char** argv)
{
    int some_number = 40;
    globalInstance.setValues(true, some_number);
    globalInstance.prepareDeviceObj();
    myKernel<<<1,some_number>>>(globalInstance);
    globalInstance.retrieveDataToHost();
    globalInstance.export_data();
    exit(EXIT_SUCCESS);
}
$ nvcc -o t1236 t1236.cu
$ cuda-memcheck ./t1236
========= CUDA-MEMCHECK
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
========= ERROR SUMMARY: 0 errors
$

cuda header files

I expect that you have a build rule that automatically compiles all .cu files, meaning KernelUtil.cu is effectively compiled twice: once on its own and once where it is included in main.cu. As a result, add is defined in both objects and you get a duplicate definition at link time.

Try renaming KernelUtil.cu to KernelUtil.h (or .cuh).
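As a sketch of what that looks like, assuming a kernel called add as in your code (its signature here is invented) and a build rule that compiles every .cu file it finds:

// KernelUtil.cuh -- header-only device code; never compiled as a standalone translation unit
#ifndef KERNELUTIL_CUH
#define KERNELUTIL_CUH

__global__ void add(int n, const float *x, const float *y, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i] + y[i];
}

#endif

// main.cu -- the only file the build rule compiles, so add is defined exactly once
#include "KernelUtil.cuh"

int main()
{
    // ... allocate buffers and launch add<<<grid, block>>>(n, x, y, out) ...
    return 0;
}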

How to declare a static variable in cuda

For the use case you describe, a static variable that is accessible from device code is created using the __device__ qualifier. Refer to the documentation.

In addition, for the use case you describe (inter-thread/block/kernel communication) I would also mark that variable with the volatile qualifier. Refer to the documentation.

Something like this:

__device__ volatile int found = 0;

or

__device__ volatile bool found = false;

Here is an example from the programming guide that uses this construct for inter-thread communication.

You can then set that variable to 1 or true from any thread, and later query it for status.

Since your use-case description only involves setting the variable to a single value, regardless of which thread does it, there is no concern about simultaneous access from multiple threads, as long as the only operation you do is a write to that variable:

found = 1;

(and for this specific case, volatile may not be necessary either, depending on your exact usage.)

If you know that only one thread will find the item, and you also wish to record the x,y coordinates, that would be a trivial extension:

__device__ volatile int found = 0;
__device__ volatile int x = -1;
__device__ volatile int y = -1;

then your device code could be:

if (item_found){
    found = 1;
    x = item.x;
    y = item.y;
}
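Putting the pieces together, here is a minimal sketch; the search kernel, the grid contents and the launch configuration are invented for illustration, and cudaMemcpyFromSymbol is used to read the flag and coordinates back on the host:

#include <cstdio>

__device__ volatile int found = 0;
__device__ volatile int x = -1;
__device__ volatile int y = -1;

__global__ void search(const int *grid, int width, int height, int target)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height && grid[row * width + col] == target) {
        found = 1;                 // any thread that finds the item sets the flag
        x = col;
        y = row;
    }
}

int main()
{
    const int width = 8, height = 8, target = 42;
    int h_grid[width * height] = {0};
    h_grid[3 * width + 5] = target;                      // plant the item at (5,3)

    int *d_grid;
    cudaMalloc(&d_grid, sizeof(h_grid));
    cudaMemcpy(d_grid, h_grid, sizeof(h_grid), cudaMemcpyHostToDevice);

    search<<<1, dim3(8, 8)>>>(d_grid, width, height, target);
    cudaDeviceSynchronize();

    int h_found = 0, h_x = -1, h_y = -1;
    cudaMemcpyFromSymbol(&h_found, found, sizeof(int));  // read the flag back on the host
    cudaMemcpyFromSymbol(&h_x, x, sizeof(int));
    cudaMemcpyFromSymbol(&h_y, y, sizeof(int));
    if (h_found) printf("found at (%d,%d)\n", h_x, h_y);

    cudaFree(d_grid);
    return 0;
}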

Why is CUDA crashing upon accessing member of class?

You may have some misconceptions about how managed memory, or cudaMallocManaged works. In some respects, cudaMallocManaged is conceptually like the C-library malloc, except that it allocates managed memory. For a more detailed introduction to the use of managed memory in a simple CUDA application, you may want to refer to this blog.

Specifically, here are some issues with your code:

  1. When using a managed allocation for an array of objects that have embedded pointers, all levels of allocation need to be replaced with a managed allocation, if you intend to use that on the device. Therefore, using new in the constructor won't work, if you want to access that member/field in device code. We can substitute cudaMallocManaged there.

  2. We don't allocate a pointer using new and then re-allocate the same pointer with cudaMallocManaged; the first allocation is simply leaked.

  3. The sizeof operator in C (or C++), when used on a pointer, returns the size of the pointer itself, not the size of whatever it points to. So this isn't a sensible way to size the allocation for an array of objects.

  4. CUDA kernel launches are asynchronous, so after a kernel launch, if you want to use managed data, it will be necessary to create some kind of synchronization.

What follows is a minimal set of modifications to address the above issues. I've omitted proper CUDA error checking, but I strongly recommend it when developing CUDA code. I also recommend running this code with cuda-memcheck if you have difficulty with it:

$ cat t51.cu
#include <iostream>
using namespace std;
class bulkArray {
public:
    double* value;
    int xSize;

    void init(int xSize) {
        cudaMallocManaged(&value, xSize*sizeof(double));
        this->xSize = xSize;
    }
};

__global__ void addArrays(bulkArray *a, bulkArray *b, bulkArray *c, int N) {
    int id = blockIdx.x*blockDim.x + threadIdx.x;

    if (id < N)
        c->value[id] = a->value[id] + b->value[id];
}

int main() {
    int N = 50000;

    bulkArray *a;
    bulkArray *b;
    bulkArray *c;

    // allocate unified memory.
    cudaMallocManaged(&a, sizeof(bulkArray));
    cudaMallocManaged(&b, sizeof(bulkArray));
    cudaMallocManaged(&c, sizeof(bulkArray));
    a->init(N);
    b->init(N);
    c->init(N);
    // init vectors on host.
    for (int i = 0; i < N; i++) {
        a->value[i] = sin(i) * cos(i);
        b->value[i] = sin(i) * cos(i);
    }

    int blockSize = 1024;
    int gridSize = (int)ceil((float)N / blockSize);

    addArrays<<<gridSize, blockSize>>>(a, b, c, N);
    cudaDeviceSynchronize();
    // sum up vector c.
    double sum = 0;
    double sum2 = 0;
    for (int i = 0; i < N; i++) {
        sum += c->value[i];
        sum2 += a->value[i] + b->value[i];
    }

    cout << "Final result: " << sum << " should be: " << sum2 << endl;

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);

    return 0;
}
$ nvcc -arch=sm_35 -o t51 t51.cu
$ cuda-memcheck ./t51
========= CUDA-MEMCHECK
Final result: 0.624013 should be: 0.624013
========= ERROR SUMMARY: 0 errors
$

cuda call fails in destructor

[Expanding comments into a summary answer]

Your code is unknowingly relying on undefined behaviour (the order of destruction of translation unit scope objects), and there is no real workaround other than to explicitly control the lifespan of objects which make CUDA runtime API calls in their destructors, or simply to avoid using those API calls in destructors altogether.

In detail:

The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API call which relies on a CUDA context can be executed. If an object makes CUDA runtime API calls in its destructor and is destroyed after the context has been torn down, those calls will fail with a runtime error. C++ does not define the relative order in which static-duration objects in different translation units are destroyed. Your singleton or object needs to be destroyed before the CUDA context is torn down, but there is no guarantee that will happen. This is effectively undefined behaviour.

You can see a more complete example of what happens (in the context of a kernel launch) in this answer.
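To illustrate the point, here is a minimal sketch of the recommended approach, with the class name, members and sizes invented for the example: keep the CUDA calls out of the destructor of anything with static storage duration and use an explicit release method called from main instead.

class DeviceBuffer
{
    float *d_ptr;
public:
    DeviceBuffer() : d_ptr(0) {}                       // no CUDA calls here
    void acquire(size_t n) { cudaMalloc(&d_ptr, n * sizeof(float)); }
    void release() { cudaFree(d_ptr); d_ptr = 0; }
    // ~DeviceBuffer() { cudaFree(d_ptr); }            // risky for a static-duration object:
                                                       //  this could run after the CUDA context
                                                       //  has already been torn down
};

DeviceBuffer globalBuffer;                             // static storage duration

int main()
{
    globalBuffer.acquire(1024);
    // ... use globalBuffer ...
    globalBuffer.release();                            // runs before any static destructor,
                                                       //  so the context is still alive
    return 0;
}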

initializer not allowed for __shared__ variable for cuda

Static initialization of shared variables is illegal in CUDA. The problem is that the semantics of how every thread should treat static initialization of shared memory are undefined in the programming model. Which thread should do the write? What happens if the value is not uniform between threads? How should the compiler emit code for such a case, and how should the hardware run it?

In your nonsensical example you are asking every thread in the block to initialize the same shared variable with a value -- basically a statically compiled memory race.
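As a sketch of the usual workaround (the kernel and variable names here are invented), have a single thread perform the initialisation at run time and synchronise before any other thread uses the value:

__global__ void kernel(int *out)
{
    __shared__ int s_count;           // no initializer allowed here

    if (threadIdx.x == 0)             // exactly one thread writes the initial value
        s_count = 0;
    __syncthreads();                  // make the write visible to the whole block

    atomicAdd(&s_count, 1);           // now every thread can safely use it
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = s_count;    // one value per block
}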


