CUDA and Classes

Define the class in a header that you #include, just like in C++.

Any method that must be called from device code should be defined with both the __device__ and __host__ qualifiers, including the constructor and destructor if you plan to use new/delete on the device (note that device-side new/delete require CUDA 4.0 and a GPU of compute capability 2.0 or higher).

You probably want to define a macro like

#ifdef __CUDACC__
#define CUDA_CALLABLE_MEMBER __host__ __device__
#else
#define CUDA_CALLABLE_MEMBER
#endif

Then use this macro on your member functions

class Foo {
public:
    CUDA_CALLABLE_MEMBER Foo() {}
    CUDA_CALLABLE_MEMBER ~Foo() {}
    CUDA_CALLABLE_MEMBER void aMethod() {}
};

The reason for this is that only the CUDA compiler recognizes __device__ and __host__; a plain host C++ compiler will raise an error on them.
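
With the members decorated this way, device code can also construct and destroy Foo objects with new/delete (on compute capability 2.0 or later, as noted above). A minimal sketch (the kernel name useFoo is just illustrative):

__global__ void useFoo() {
    Foo* f = new Foo();   // invokes the __device__-compiled constructor
    f->aMethod();
    delete f;             // invokes the __device__-compiled destructor
}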

Edit:
Note that __CUDACC__ is defined by NVCC when it is compiling CUDA code: either when compiling a .cu file, or when compiling any file with the command-line option -x cu.
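
For example, a .cpp file that includes such a header could be compiled as CUDA code with something like (file names here are illustrative):

nvcc -x cu -c wrapper.cpp -o wrapper.o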

How do I properly implement classes whose members are called both from host and device code in Cuda/C++?

The problem here has to do with how you allocate for TestClass:

TestClass* test_class = new TestClass();

test_class is now an ordinary pointer to host memory. If you have any intent of using that pointer in device code:

void TestClass::RunKernel() {
    test_kernel<<<1,1>>>(this, test_table_);
                         ^^^^

and:

void test_kernel(TestClass* test_class, TestTable* test_table) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < 1e6; i += stride) {
        const float val = test_class->GetValue();
                          ^^^^^^^^^^

that won't work. Dereferencing a host pointer in device code is a fundamental problem in CUDA.

We can fix this by using placement new with a managed allocator, for the top-level class:

//TestClass* test_class = new TestClass();
TestClass* test_class;
cudaMallocManaged(&test_class, sizeof(TestClass));
new(test_class) TestClass();

When we do so, it's necessary to also change the deallocation. And, as indicated in the comment, you should also make sure the destructor is called before deallocation:

// delete test_class;
test_class->~TestClass();
cudaFree(test_class);

When I make those changes, your code runs without runtime error for me.
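
Putting those pieces together, a minimal self-contained sketch of the pattern (the class, member, and kernel names here are illustrative stand-ins, not the original code) might look like:

#include <cstdio>
#include <new>   // placement new

class TestClass {
public:
    __host__ __device__ TestClass() : value_(42.0f) {}
    __host__ __device__ ~TestClass() {}
    __host__ __device__ float GetValue() const { return value_; }
private:
    float value_;
};

__global__ void test_kernel(TestClass* test_class) {
    // Safe: test_class points to managed memory, which device code can dereference.
    printf("value = %f\n", test_class->GetValue());
}

int main() {
    TestClass* test_class;
    cudaMallocManaged(&test_class, sizeof(TestClass));  // managed allocation
    new (test_class) TestClass();                       // construct in place

    test_kernel<<<1,1>>>(test_class);
    cudaDeviceSynchronize();

    test_class->~TestClass();                           // destroy before freeing
    cudaFree(test_class);
    return 0;
}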

Accessing class data members from within cuda kernel - how to design proper host/device interaction?

Your approach should be workable. When you pass an object by value as a kernel parameter (as you have indicated), there really isn't much setup needed for the transfer from host to device.

You need to properly allocate data on the host and the device, and use cudaMemcpy type operations at appropriate points to move the data, just as you would in an ordinary CUDA program.

One thing to be aware of when declaring an object at global scope, as you have done, is that it is recommended not to use CUDA API calls in the object's constructor or destructor. The reasons are covered elsewhere, so I won't repeat them here. Although that treatment mostly focuses on kernels launched before main, CUDA's lazy initialization can also impact any CUDA API call executed outside of main scope, which includes the constructors and destructors of objects instantiated at global scope.
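
For example, one common way to respect that recommendation (sketched here with illustrative names, not code from the question) is to keep the constructor and destructor trivial and put the CUDA allocation and teardown into explicit methods called from main:

class DeviceBuffer {
public:
    DeviceBuffer() : d_data(nullptr), n(0) {}  // no CUDA calls here
    ~DeviceBuffer() {}                         // or here
    void init(int size) {                      // called from main()
        n = size;
        cudaMalloc(&d_data, n * sizeof(float));
    }
    void cleanup() {                           // called before main() returns
        cudaFree(d_data);
    }
private:
    float* d_data;
    int n;
};

DeviceBuffer globalBuf;  // safe at global scope: construction does no CUDA work

int main() {
    globalBuf.init(1024);
    // ... launch kernels that use the buffer ...
    globalBuf.cleanup();
    return 0;
}

The worked example below follows the same idea: setValues/prepareDeviceObj and export_data do the CUDA work, and they are called from main.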

What follows is a fleshed-out example of what you have shown. I mostly didn't change the code you had already written, just added method definitions for the ones you hadn't. There are obviously a lot of different possible approaches here. For more examples you might want to look at the CUDA C++ integration sample code.

Here's a worked example around what you have shown:

$ cat t1236.cu
#include <cstdio>

class myClass
{
public:
    bool bool_var;  // Set from host and readable from device
    int data_size;  // Set from host
    __host__ myClass();
    __host__ ~myClass();
    __host__ void setValues(bool iftrue, int size);
    __device__ void dosomething(int device_parameter);
    __host__ void export_data();

    // completely unknown methods
    __host__ void prepareDeviceObj();
    __host__ void retrieveDataToHost();
private:
    int *data;      // Filled in device, shared between threads, at the end copied back to host for data output
    int *h_data;
};

__host__ myClass::myClass()
{
}

__host__ myClass::~myClass()
{
}

__host__ void myClass::prepareDeviceObj(){
    cudaMemcpy(data, h_data, data_size*sizeof(h_data[0]), cudaMemcpyHostToDevice);
}
__host__ void myClass::retrieveDataToHost(){
    cudaMemcpy(h_data, data, data_size*sizeof(h_data[0]), cudaMemcpyDeviceToHost);
}

__host__ void myClass::setValues(bool iftrue, int size)
{
    bool_var = iftrue;
    data_size = size;
    cudaMalloc(&data, data_size*sizeof(data[0]));
    h_data = (int *)malloc(data_size*sizeof(h_data[0]));
    memset(h_data, 0, data_size*sizeof(h_data[0]));
}

__device__ void myClass::dosomething(int idx)
{
    int toadd = idx+data_size;
    atomicAdd(&(data[idx]), toadd); // data should be unique among threads
}
__host__ void myClass::export_data(){
    for (int i = 0; i < data_size; i++) printf("%d ", h_data[i]);
    printf("\n");
    cudaFree(data);
    free(h_data);
}

__global__ void myKernel(myClass obj)
{
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(idx < obj.data_size)
    {
        if(!obj.bool_var)
            printf("Object is not up to any task here!");
        else
        {
            //printf("Object is ready!");
            obj.dosomething(idx);
        }
    }
}

myClass globalInstance;

int main(int argc, char** argv)
{
    int some_number = 40;
    globalInstance.setValues(true, some_number);
    globalInstance.prepareDeviceObj();
    myKernel<<<1,some_number>>>(globalInstance);
    globalInstance.retrieveDataToHost();
    globalInstance.export_data();
    exit(EXIT_SUCCESS);
}
$ nvcc -o t1236 t1236.cu
$ cuda-memcheck ./t1236
========= CUDA-MEMCHECK
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
========= ERROR SUMMARY: 0 errors
$

Can a class simply be passed to a CUDA kernel for parallel evaluation?

It's possible to use classes/objects on the GPU, including their methods (e.g. fun()). Such classes would at a minimum have to have methods that are decorated with __host__ __device__, but the code refactoring may not be any more involved than that.

However, such methods (like any other code with no refactoring) will probably not access any parallel power of the GPU. The most basic comparison would be that methods that run in a single CPU thread would then run in a single GPU thread. This normally is not faster and frequently would be a lot slower, if all you did was pass a single object to the GPU and run the equivalent single-threaded code on the GPU (in a single GPU thread).

One possible strategy is this: if you have a great many of these objects (or, equivalently in your case, a great many "points"), each representing independent work, then you could hand each one (object, or point) to a GPU thread and work on them that way, enabling the massively multithreaded operation that GPUs like. Ideally you would have 10,000 or more points to process this way.

This still usually isn't the most efficient use of the GPU, for a variety of reasons: one has to do with efficient data access, another with (possible) thread divergence. Nevertheless, some people do pursue this "simple", "embarrassingly parallel" approach to code porting, occasionally with interesting speed-ups.

Depending on your actual code, you might see attractive results if you pass the points to the GPU in a fashion that allows adjacent threads to access adjacent data for each operation that touches the points. It's quite reasonable to expect a worthwhile speed-up that way, perhaps with relatively little code refactoring beyond attention to data organization for optimal GPU access.
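
To illustrate that last point (a generic sketch, not taken from the question's code): with a structure-of-arrays layout, neighboring threads read neighboring memory locations, whereas an array of point structures interleaves the members. The worked example below uses the structure-of-arrays style, with separate x, y, and z arrays.

struct Point { float x, y, z; };

// Array-of-structures: points[idx].x/.y/.z are interleaved in memory, so
// neighboring threads read strided locations (less friendly to coalescing).
__global__ void aos_kernel(Point* points, float incr, int n) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) {
        points[idx].x += incr;
        points[idx].y += incr;
        points[idx].z += incr;
    }
}

// Structure-of-arrays: neighboring threads read neighboring elements of each
// array, which coalesces well on the GPU.
__global__ void soa_kernel(float* x, float* y, float* z, float incr, int n) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) {
        x[idx] += incr;
        y[idx] += incr;
        z[idx] += incr;
    }
}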

Here's a fully worked example:

$ cat t30.cu
#include <iostream>
#include <cstdlib>

const int dsize = 3;
const int nTPB = 256;
const int rng = 8;

class myclass
{

    int increment;
public:
    myclass(int _incr): increment(_incr) {};
    // methods callable on the device need the __device__ decoration
    __host__ __device__ void fun(int &x, int &y, int &z){
        x += increment;
        y += increment;
        z += increment;}

};

// this is the actual device routine that is run per thread
__global__ void mykernel(myclass obj, int *dx, int *dy, int *dz, int dsize){

    int idx = threadIdx.x+blockDim.x*blockIdx.x;  // figure out which thread we are
    if (idx < dsize)
        obj.fun(dx[idx], dy[idx], dz[idx]);       // apply method
}

int main(){

    // allocate host data
    int *p_x, *p_y, *p_z, *d_x, *d_y, *d_z;
    p_x = new int[dsize];
    p_y = new int[dsize];
    p_z = new int[dsize];

    // allocate device data
    cudaMalloc(&d_x, dsize*sizeof(int));
    cudaMalloc(&d_y, dsize*sizeof(int));
    cudaMalloc(&d_z, dsize*sizeof(int));

    // initialize host data
    std::cout << "Before:" << std::endl;
    for (int i = 0; i < dsize; i++){
        p_x[i] = rand()%rng;
        p_y[i] = rand()%rng;
        p_z[i] = rand()%rng;
        std::cout << p_x[i] << "," << p_y[i] << "," << p_z[i] << std::endl;}

    // copy to device
    cudaMemcpy(d_x, p_x, dsize*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, p_y, dsize*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_z, p_z, dsize*sizeof(int), cudaMemcpyHostToDevice);

    // instantiate object on host
    myclass test(1);

    // copy object to device as kernel parameter
    mykernel<<<(dsize+nTPB-1)/nTPB, nTPB>>>(test, d_x, d_y, d_z, dsize);

    // copy data back to host
    cudaMemcpy(p_x, d_x, dsize*sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(p_y, d_y, dsize*sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(p_z, d_z, dsize*sizeof(int), cudaMemcpyDeviceToHost);

    std::cout << "After:" << std::endl;
    for (int i = 0; i < dsize; i++){
        std::cout << p_x[i] << "," << p_y[i] << "," << p_z[i] << std::endl;}

    return 0;
}
$ nvcc -o t30 t30.cu
$ ./t30
Before:
7,6,1
3,1,7
2,4,1
After:
8,7,2
4,2,8
3,5,2
$

For brevity of presentation, I've omitted proper cuda error checking but I would always recommend you use it when you are developing CUDA codes.
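
For reference, one common form of such error checking (many variants exist; this is just a sketch) wraps each runtime API call in a macro, and kernel launches are typically checked afterwards with cudaGetLastError() plus a cudaDeviceSynchronize():

#include <cstdio>
#include <cstdlib>

// Usage: cudaCheck(cudaMalloc(&d_x, dsize*sizeof(int)));
#define cudaCheck(call)                                              \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)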

CUDA: passing class to device with a class member that is a pointer function

This is approximately the smallest set of changes I could make to your code to get it to function as you appeared to intend. Also note that there are many other questions about function pointers in CUDA; this answer links to several.

  1. Decorate f1 with __host__ __device__. This is necessary to get the compiler to generate a device-callable routine for it. Otherwise, only host code is generated.

  2. We need to capture the device entry address of the device-callable version of f1 created in step 1 above. There are a number of methods to do this. I will capture it "statically" with another __device__ variable (f1_d) and then use cudaMemcpyFromSymbol to pull it into host code.

  3. Your genericFunction class is modified to be able to hold both __host__ and separate __device__ entry points (function pointers) for the desired function. Also, the class is modified to select the proper one, based on whether we are compiling the host or device version of the class (__CUDA_ARCH__ macro), and the class constructor is modified to accept and assign both entry points.

  4. Finally, we also need to initialize the d_g2 object on the device. In the case of the d_g1 object, there are no class data members for that object, so we can "get away with" creating an "empty" object pointed to by d_g1 and it works correctly because the entry points for that object's class member functions are already known in device code. However, in the case of d_g2, we are accessing the functions indirectly via class data members which are pointers to the respective host and device versions (entry points) of the function. Therefore, after initializing the h_g2 object in host code, and establishing storage for the d_g2 object in device code, we must initialize d_g2 with the contents of h_g2 using cudaMemcpy after the cudaMallocManaged for d_g2.

With those changes, your code works as written according to my test:

$ cat t353.cu
#include <iostream>
#include <stdio.h>

class fixedFunction{
public:
    __host__ fixedFunction() {}
    __host__ __device__ double operator()(double x) {
        return x*x;
    }
};

__host__ __device__ double f1(double x){
    return x*x;
}

typedef double (*pf) (double var);

__device__ pf f1_d = f1;

class genericFunction{
public:
    __host__ genericFunction(double (*h_infunc)(double), double (*d_infunc)(double)) : h_func(h_infunc),d_func(d_infunc){}
    __host__ __device__ double operator()(double x) {
#ifdef __CUDA_ARCH__
        return d_func(x);
#else
        return h_func(x);
#endif
    }
private:
    pf h_func;
    pf d_func;
};

__global__ void kernel1(fixedFunction* g1){
    unsigned int tid = blockIdx.x *blockDim.x + threadIdx.x;
    printf("Func val is: %f\n", (*g1)(tid));
}

__global__ void kernel2(genericFunction* g1){
    unsigned int tid = blockIdx.x *blockDim.x + threadIdx.x;
    printf("Func val is: %f\n", (*g1)(tid));
}

int main(){

    fixedFunction h_g1;
    fixedFunction* d_g1;
    cudaMallocManaged(&d_g1, sizeof(h_g1));

    //Host call
    std::cout << h_g1(2.0) << "\n";

    //device call
    kernel1<<<1,32>>>(d_g1);
    cudaDeviceSynchronize();
    pf d_f1;
    cudaMemcpyFromSymbol(&d_f1, f1_d, sizeof(void*));
    genericFunction h_g2(f1, d_f1);
    genericFunction* d_g2;
    cudaMallocManaged(&d_g2, sizeof(h_g2));
    cudaMemcpy(d_g2, &h_g2, sizeof(h_g2), cudaMemcpyDefault);
    //Host call
    std::cout << h_g2(3.0) << "\n";

    //device call
    kernel2<<<1,32>>>(d_g2);
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_61 -o t353 t353.cu
$ cuda-memcheck ./t353
========= CUDA-MEMCHECK
4
Func val is: 0.000000
Func val is: 1.000000
Func val is: 4.000000
Func val is: 9.000000
Func val is: 16.000000
Func val is: 25.000000
Func val is: 36.000000
Func val is: 49.000000
Func val is: 64.000000
Func val is: 81.000000
Func val is: 100.000000
Func val is: 121.000000
Func val is: 144.000000
Func val is: 169.000000
Func val is: 196.000000
Func val is: 225.000000
Func val is: 256.000000
Func val is: 289.000000
Func val is: 324.000000
Func val is: 361.000000
Func val is: 400.000000
Func val is: 441.000000
Func val is: 484.000000
Func val is: 529.000000
Func val is: 576.000000
Func val is: 625.000000
Func val is: 676.000000
Func val is: 729.000000
Func val is: 784.000000
Func val is: 841.000000
Func val is: 900.000000
Func val is: 961.000000
9
Func val is: 0.000000
Func val is: 1.000000
Func val is: 4.000000
Func val is: 9.000000
Func val is: 16.000000
Func val is: 25.000000
Func val is: 36.000000
Func val is: 49.000000
Func val is: 64.000000
Func val is: 81.000000
Func val is: 100.000000
Func val is: 121.000000
Func val is: 144.000000
Func val is: 169.000000
Func val is: 196.000000
Func val is: 225.000000
Func val is: 256.000000
Func val is: 289.000000
Func val is: 324.000000
Func val is: 361.000000
Func val is: 400.000000
Func val is: 441.000000
Func val is: 484.000000
Func val is: 529.000000
Func val is: 576.000000
Func val is: 625.000000
Func val is: 676.000000
Func val is: 729.000000
Func val is: 784.000000
Func val is: 841.000000
Func val is: 900.000000
Func val is: 961.000000
========= ERROR SUMMARY: 0 errors
$

Accessing Class Member in different CUDA kernels

When you pass a parameter to a CUDA kernel, it is a pass-by-value mechanism. You have started with a pointer to an object:

T* obj;

then, instead of allocating storage for the object, you allocate storage for another pointer:

cudaMalloc((void**)&obj, sizeof(T*));

so we're headed down the wrong path here. (This is a logical C programming error at this point.) Next, in the allocate kernel, the obj parameter (which now points to some location in GPU memory space) is passed by value:

__global__ void cudaAllocateGPUObj(T* obj)
                                      ^^^ pass-by-value: local copy is made

Now, when you do this:

        obj = new T;

You create a new pointer, and overwrite the local copy of obj with that new pointer. So of course that works locally, but the copy of obj in the calling environment is not updated with that new pointer.
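
The same thing happens in ordinary host C++, which may make it easier to see (a host-only analogy, not the fix itself):

#include <cstdio>

void allocate(int* p) {  // p is a local copy of the caller's pointer
    p = new int(5);      // only the local copy is changed (and leaked here)
}

int main() {
    int* q = nullptr;
    allocate(q);
    printf("%p\n", (void*)q);  // still a null pointer: q was never updated
    return 0;
}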

One possible method to fix this is to create a proper pointer-to-pointer methodology:

$ cat t5.cu
#include <stdio.h>

class T
{
public:
    int v;
public:
    __device__ T() { v = 10; }
    __device__ ~T() {}
    __device__ int compute() { return v; }
};

__global__ void kernel(T** obj, int* out)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        out[0] = (*obj)->compute();
    }
}

__global__ void cudaAllocateGPUObj(T** obj)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        *obj = new T;
    }
}

int main()
{
    int cpu, *gpu;
    cudaMalloc((void**)&gpu, sizeof(int));
    T** obj;
    cudaMalloc(&obj, sizeof(T*));
    cudaAllocateGPUObj<<<1,1>>>(obj);
    kernel<<<1,1>>>(obj, gpu);
    cudaMemcpy(&cpu, gpu, sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    printf("cudaMemcpy\nresult: %d\n", cpu);
    return 0;
}

$ nvcc -arch=sm_35 -o t5 t5.cu
$ cuda-memcheck ./t5
========= CUDA-MEMCHECK
cudaMemcpy
result: 10
========= ERROR SUMMARY: 0 errors
$

CUDA - Dynamic Shared Memory with Derived Classes

Based on my experience, an object copy such as:

b_array[0] = B();

does not copy the virtual function pointer table. Therefore it is necessary for the virtual function pointer table to be set properly in whatever object you are accessing a virtual function from.

Declaring the dynamic shared memory array with the class type allows for that:

extern __shared__ B shared[];

Declaring it as an int array and casting does not:

extern __shared__ int shared[];

AFAIK, aspects of this are implementation-specific and not required by the C++ standard.

As a proof point, we can do something like this in your failing kernel:

__global__
void kernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    extern __shared__ int shared[];
    B* b_array = (B *) &shared[0];

    if (idx == 0) {
        B temp = B();
        memcpy(b_array, &temp, sizeof(B));

        printf("%i", b_array[0].foo());
    }

    __syncthreads();

    return;
}

which will then work. I'm not suggesting this is the right way to code it; I'm simply using it to suggest that at least one problem here is the handling of the virtual function pointer table. As Jerome Richard points out in the comments, using an underlying int array for type-punning to something else may be illegal; however, as you point out, the CUDA docs seem to suggest it.
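
One alternative that is sometimes used in this kind of situation (a sketch only, under the assumption of device-callable virtual functions on classes shaped like the question's A and B; it is not something validated above) is to treat the dynamic shared memory as raw storage and construct the objects in place with placement new, since construction is what establishes the virtual function pointer table:

#include <cstdio>
#include <new>   // placement new is usable in device code

class A {
public:
    __device__ virtual int foo() const = 0;
};

class B : public A {
public:
    __device__ B() {}
    __device__ virtual int foo() const override { return 3; }
};

__global__ void kernel_pn() {
    extern __shared__ unsigned char smem[];  // raw dynamic shared storage
    B* b_array = reinterpret_cast<B*>(smem);

    if (threadIdx.x + blockIdx.x * blockDim.x == 0) {
        new (&b_array[0]) B();               // in-place construction sets the vtable
        printf("%i\n", b_array[0].foo());
    }
    __syncthreads();
}

A launch would then look something like kernel_pn<<<1,32,sizeof(B)>>>();.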

We can also construct a host code test case following your failing example:

$ cat t131.cpp
#include <cstdio>

class A {
public:
    virtual int foo() const = 0;
};

class B : public A {
public:
    B() {}
    virtual int foo() const override {
        return 3;
    }
};

void k1() {

    int sh1[100];
    B* b_array = (B *) &sh1[0];
    b_array[0] = B();

    printf("k1 %i\n", b_array[0].foo());

    return;
}

int main(){
    k1();
}

$ g++ t131.cpp -o t131
$ ./t131
Segmentation fault (core dumped)
$

Which also fails.

You're welcome to file a bug if you find fault with my description or simply wish this case to be handled.

The exact code matters here, so slight changes to my test cases above may result in working or failing code.


