Why Does cudaMalloc() Use a Pointer to Pointer

Why does cudaMalloc() use a pointer to a pointer?

In C, data can be passed to functions by value or via simulated pass-by-reference (i.e. by a pointer to the data). Passing by value is a one-way methodology; passing by pointer allows two-way data flow between the function and its calling environment.

When a data item is passed to a function via the function parameter list, and the function is expected to modify the original data item so that the modified value shows up in the calling environment, the correct C method for this is to pass the data item by pointer. In C, when we pass by pointer, we take the address of the item to be modified, creating a pointer (perhaps a pointer to a pointer in this case) and hand the address to the function. This allows the function to modify the original item (via the pointer) in the calling environment.

Normally malloc returns a pointer, and we can use assignment in the calling environment to assign this returned value to the desired pointer. In the case of cudaMalloc, the CUDA designers chose to use the returned value to carry an error status rather than a pointer. Therefore the setting of the pointer in the calling environment must occur via one of the parameters passed to the function, by reference (i.e. by pointer). Since it is a pointer value that we want to set, we must take the address of the pointer (creating a pointer to a pointer) and pass that address to the cudaMalloc function.

How to use a pointer to pointer in CUDA

The following modification will "fix" your code (fully worked example, including host and device verification):

$ cat t583.cu
#include <stdio.h>

__global__ void testkernel(char **data, unsigned n){
  for (int i = 0; i < 100; i++)
    if (data[n][i] != 1) printf("kernel error\n");
}

int main(){
  char **d_ppcPtr, *d_pcPtr, *h_pcPtr;
  cudaMalloc(&d_ppcPtr, sizeof(char*) * 10);

  h_pcPtr = (char*)malloc(sizeof(char) * 100);
  for (int i = 0; i < 10; i++)
  {
    cudaMalloc(&d_pcPtr, sizeof(char) * 100);
    cudaMemset(d_pcPtr, 1, sizeof(char) * 100);
    cudaMemcpy(&d_ppcPtr[i], &d_pcPtr, sizeof(char*), cudaMemcpyHostToDevice);
    memset(h_pcPtr, 0, sizeof(char) * 100);
    testkernel<<<1,1>>>(d_ppcPtr, i);
    cudaMemcpy(h_pcPtr, d_pcPtr, sizeof(char) * 100, cudaMemcpyDeviceToHost);
    cudaFree(d_pcPtr);
    for (int j = 0; j < 100; j++)
      if (h_pcPtr[j] != 1) printf("Error!");
  }
  cudaFree(d_ppcPtr);
}
$ nvcc -arch=sm_20 -o t583 t583.cu
$ cuda-memcheck ./t583
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

Note that, conceptually, there is no difference between my code and yours: the pointer you were attempting to read from the location d_ppcPtr[i] (which crashes, because that location resides on the device) is already available in d_pcPtr, which is on the host.

Use of cudaMalloc(). Why the double pointer?

All CUDA API functions return an error code (or cudaSuccess if no error occurred). All other outputs are returned through parameters, by reference. However, plain C has no references, which is why you have to pass the address of the variable where you want the returned information stored. Since cudaMalloc returns a pointer, you need to pass a double pointer.

Another well-known function that operates on addresses for the same reason is scanf. How many times have you forgotten to write the & before the variable you want the value stored in? ;)

int i;
scanf("%d",&i);

Why is it necessary to cast to void** (e.g. in cudaMalloc calls)?

Casting to void** is always wrong as this type is not a generic pointer.

Thus when a function has a parameter of type void**, the only argument type that can be passed to it is void** itself, making any cast either wrong or unnecessary.

The correct way (ignoring error checking) of getting memory from cudaMalloc is:

void* mem;
cudaMalloc( &mem , num_int_bytes );
int* array = mem;        /* implicit conversion from void*: valid in C
                            (C++ would require an explicit cast) */

cudaMalloc( &mem , num_double_bytes );
double* floating = mem;

Pointer to Array of Pointers to Objects in CUDA

First of all, it would be helpful if you provided complete code, including a definition for the Obj class. I've provided one based on inspection of your code and some guesswork.

Second, much of your confusion here seems to be a less-than-crisp facility with pointers in C (or C++). Using the CUDA API with double-pointer constructions (**) between host and device requires a crisp understanding and ability to visualize what is happening.

If the line indicated by SEE QUESTION 1 above allocates host memory for the pointers, and once I have used cudaMalloc in the subsequent loop to allocate device memory, the pointers stored in h_d_obj get overwritten with device addresses, does that mean I have allocated host memory for 3 Obj* that now have no pointer pointing to them?

No. h_d_obj is established (i.e. given a meaningful value) by the malloc operation. Nothing you have done subsequent to that modifies the value of h_d_obj.

Why is the cudaMemcpy succeeding when I test the status returned, but clearly does not copy the addresses correctly? I was expecting the "arrays" of memory addresses in h_d_obj and d_d_obj to be the same, since they should point to the same Obj objects in the device address space.

I don't see anything wrong with your code up to this point. The value of h_d_obj was established (previously) by malloc, and the numerical value of it is an address in host memory. The value of d_d_obj was established by cudaMalloc, and the numerical value of it is an address in device memory. Numerically, I would expect them to be different.

At the line SEE QUESTION 3, assuming I'm correct in question 2, I also expect to be able to use either h_d_obj or d_d_obj to retrieve the Obj objects from the device, since the only difference would be whether I dereference a host pointer or a device pointer to access a device pointer to Obj, both of which I should be able to do in a cudaMemcpy call, right? If I use what is written, the copy succeeds, but the pointer at h_obj[0] is corrupted and I cannot write out the data.

NO. You cannot dereference a device pointer in host code, even when it appears as a parameter to cudaMemcpy. This is legal as a source or destination in a cudaMemcpy operation:

h_d_obj[i]

This is not legal:

d_d_obj[i]

The reason is that, in order to get the actual transfer address, the first case dereferences a host pointer (i.e. reads a memory location on the host), while the second case dereferences a device pointer. From host code, I can retrieve the contents of h_d_obj[i]. I am not allowed to retrieve the contents of d_d_obj[i] in host code (and the parameter setup for cudaMemcpy is host code). The value of d_d_obj can be used as a source or destination from host code; d_d_obj[i] cannot.

At the line SEE QUESTION 4, why can I not dereference an Obj** to get an Obj*, then use the -> operator to call a device method? The compiler complains that it is not a pointer to class type, but the fact that it is an Obj* tells me it is.

The compiler is barking at you because the order of operations between the operators (*, ->) you are using is not what you expect. If you add parentheses to force the correct order:

(*d_array_of_objs)->changeToFive(); 

Then the compiler won't object to that (although I would do it slightly differently, as below).

Here's a modified version of your code with the Obj definition addition, a slight change to the kernel so that independent threads work on independent objects, and a few other fixes. Your code was mostly correct:

$ cat t1231.cu
#include <iostream>

class Obj{
public:
  int scalar;
  __host__ __device__
  void changeToFive() {scalar = 5;}
};

// Kernel
__global__ void myKernel(Obj** d_array_of_objs)
{
  // Change the scalar of each object to 5
  // by dereferencing the device array to get
  // the appropriate object pointer.
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  // (*d_array_of_objs)->changeToFive(); // <--------- SEE QUESTION 4 (add parentheses)
  d_array_of_objs[idx]->changeToFive();
}

// Entry point
int main()
{
  /********************************/
  /* INITIALISE OBJ ARRAY ON HOST */
  /********************************/

  // Array of 3 pointers to Objs
  Obj* h_obj[3];
  for (int i = 0; i < 3; i++) {
    h_obj[i] = new Obj();      // Create
    h_obj[i]->scalar = i * 10; // Initialise
  }

  // Write out
  for (int i = 0; i < 3; i++) {
    std::cout << h_obj[i]->scalar << std::endl;
  }

  /**************************************************/
  /* CREATE DEVICE VERSIONS AND STORE IN HOST ARRAY */
  /**************************************************/

  // Create host pointer to array-like storage of device pointers
  Obj** h_d_obj = (Obj**)malloc(sizeof(Obj*) * 3); // <--------- SEE QUESTION 1
  for (int i = 0; i < 3; i++) {
    // Allocate space for an Obj and assign
    cudaMalloc((void**)&h_d_obj[i], sizeof(Obj));
    // Copy the object to the device (only has a single scalar field to keep it simple)
    cudaMemcpy(h_d_obj[i], h_obj[i], sizeof(Obj), cudaMemcpyHostToDevice);
  }

  /**************************************************/
  /* CREATE DEVICE ARRAY TO PASS POINTERS TO KERNEL */
  /**************************************************/

  // Create a pointer which will point to device memory
  Obj** d_d_obj = NULL;
  // Allocate space for 3 pointers on device at above location
  cudaMalloc((void**)&d_d_obj, sizeof(Obj*) * 3);
  // Copy the pointers from the host memory to the device array
  cudaMemcpy(d_d_obj, h_d_obj, sizeof(Obj*) * 3, cudaMemcpyHostToDevice);

  /**********
   * After the above, VS2013 shows the memory pointed to by d_d_obj
   * to be NULL <------- SEE QUESTION 2.
   **********/

  // Launch Kernel
  myKernel<<<1, 3>>>(d_d_obj);

  // Synchronise and pass back to host
  cudaDeviceSynchronize();
  for (int i = 0; i < 3; i++) {
    cudaMemcpy(h_obj[i], h_d_obj[i], sizeof(Obj), cudaMemcpyDeviceToHost); // <--------- SEE QUESTION 3 (remove parentheses)
  }

  // Write out
  for (int i = 0; i < 3; i++) {
    std::cout << h_obj[i]->scalar << std::endl;
  }

  return 0;
}
$ nvcc -o t1231 t1231.cu
$ cuda-memcheck ./t1231
========= CUDA-MEMCHECK
0
10
20
5
5
5
========= ERROR SUMMARY: 0 errors
$

A diagram of h_d_obj and d_d_obj might help:

HOST                               |    DEVICE
h_d_obj-->(Obj *)--------------------------->Obj0<---(Obj *)<----|
          (Obj *)--------------------------->Obj1<---(Obj *)     |
          (Obj *)--------------------------->Obj2<---(Obj *)     |
                                   |                             |
d_d_obj--------------------------------------------------------->|
HOST                               |    DEVICE

You're allowed to access any quantity (location) on the left-hand side (HOST) of the above diagram in host code, or in a cudaMemcpy operation. You're not allowed to access any quantity (location) on the right-hand side in host code.

How to use cudaMalloc / cudaMemcpy for a pointer to a structure containing pointers?

You have to be aware of where your memory resides. malloc allocates host memory; cudaMalloc allocates memory on the device and returns a pointer to that memory. However, this pointer is only valid in device functions.

What you want could be achieved as follows:

typedef struct {
  int width;
  int height;
  float* elements;
} Matrix;

int main(void) {
  int rows, cols, numMat = 2; // These are actually determined at run-time
  Matrix* data = (Matrix*)malloc(numMat * sizeof(Matrix));

  // ... Successfully read from file into "data" ...
  Matrix* h_data = (Matrix*)malloc(numMat * sizeof(Matrix));
  memcpy(h_data, data, numMat * sizeof(Matrix));

  for (int i = 0; i < numMat; i++) {
    cudaMalloc(&(h_data[i].elements), rows * cols * sizeof(float));
    cudaMemcpy(h_data[i].elements, data[i].elements, rows * cols * sizeof(float), cudaMemcpyHostToDevice);
  } // matrix data is now on the gpu; now copy the "meta" data to the gpu

  Matrix* d_data;
  cudaMalloc(&d_data, numMat * sizeof(Matrix));
  cudaMemcpy(d_data, h_data, numMat * sizeof(Matrix), cudaMemcpyHostToDevice);
  // ... Do other things ...
}

To make things clear:
Matrix* data contains the data on the host.
Matrix* h_data lives on the host, but each elements field points to device memory; those pointers can be passed to kernels as parameters.
Matrix* d_data is completely on the GPU and can be used like data is on the host.

In your kernel code you can now access the matrix values, e.g.:

__global__ void doThings(Matrix* matrices)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  matrices[i].elements[0] = 42;
}

CUDA device pointers

You are right: cudaMalloc allocates memory on the device. You can't use this pointer directly on the host; you can only pass it as an argument to functions like cudaMemcpy and to kernel calls.

More recent CUDA versions support unified memory addressing: there you can use cudaMallocManaged to allocate managed memory, and access it on the host directly through the same pointer.

For the second question: C++ doesn't allow implicit conversions between pointer types, so leaving out the explicit cast (void**)&x_device will result in a compiler error.

Handling Image pointer to pointer in CUDA

I'm not going to try to sort out your complex matrix allocation scheme. The purpose of my suggestion was to let you simplify things down to simple one-line allocations.

Furthermore, I don't think you really grasped the example I gave. It was a 3D example, and its typedefs had two subscripts. A 2D version would have typedefs with a single subscript.

Really, none of this has anything to do with CUDA. It revolves around an understanding of C arrays and pointers.

These were the major changes I made to get your code working:

#include <stdio.h>
#include <stdlib.h>
#define hsize 256
#define vsize 256

#define IMAGE_TYPE unsigned char

__global__ void kernel(IMAGE_TYPE matrixin[][hsize], IMAGE_TYPE matrixout[][hsize]) {
  int tid = threadIdx.x;
  int bid = blockIdx.x;

  matrixout[bid][tid] = matrixin[bid][tid];
}

int fatal(char* s) {
  fprintf(stderr, "%s\n", s);
  return 1;
}

int main() {
  typedef IMAGE_TYPE IMarray[hsize];
  IMarray *hin_image, *hout_image;

  IMarray *din_image, *dout_image;

  // allocate host memory
  hin_image = (IMarray *)malloc(hsize*vsize*sizeof(IMAGE_TYPE));
  hout_image = (IMarray *)malloc(hsize*vsize*sizeof(IMAGE_TYPE));

  for (int i = 0; i < vsize; i++)
    for (int j = 0; j < hsize; j++)
      hin_image[i][j] = 'a';

  // allocate device memory
  cudaMalloc((void**)&din_image, (vsize*hsize)*sizeof(IMAGE_TYPE));
  cudaMalloc((void**)&dout_image, (vsize*hsize)*sizeof(IMAGE_TYPE));
  cudaMemset(dout_image, 0, (vsize*hsize)*sizeof(IMAGE_TYPE));
  cudaMemcpy(din_image, hin_image, (vsize*hsize)*sizeof(IMAGE_TYPE), cudaMemcpyHostToDevice);

  dim3 threads(hsize,1,1);
  dim3 blocks(vsize,1,1);

  kernel<<<blocks,threads>>>(din_image, dout_image);

  cudaMemcpy(hout_image, dout_image, (vsize*hsize)*sizeof(IMAGE_TYPE), cudaMemcpyDeviceToHost);

  for (int i = 0; i < 10; i++) {
    printf("\n");
    for (int j = 0; j < 10; j++)
      printf("%c\t", hout_image[i][j]);
  }
  printf("\n");

  cudaFree(din_image);
  cudaFree(dout_image);

  free(hin_image);
  free(hout_image);

  return 0;
}

