Why does cudaMalloc() use pointer to pointer?
In C, data can be passed to functions by value or via simulated pass-by-reference (i.e. by a pointer to the data). Pass-by-value is one-way: the function receives a copy and cannot affect the caller's original. Pass-by-pointer allows two-way data flow between the function and its calling environment.
When a function is expected to modify a data item so that the modified value shows up in the calling environment, the correct C method is to pass the item by pointer: we take the address of the item to be modified (in this case, the address of a pointer, yielding a pointer to a pointer) and hand that address to the function. The function can then modify the original item in the calling environment through the pointer.
Normally malloc returns a pointer, and we can use assignment in the calling environment to assign this returned value to the desired pointer. In the case of cudaMalloc, the CUDA designers chose to use the return value to carry an error status rather than a pointer. Therefore the setting of the pointer in the calling environment must occur via one of the parameters passed to the function, by reference (i.e. by pointer). Since it is a pointer value that we want to set, we must take the address of the pointer (creating a pointer to a pointer) and pass that address to the cudaMalloc function.
How to use pointer to pointer in CUDA
The following modification will "fix" your code (fully worked example, including host and device verification):
$ cat t583.cu
#include <stdio.h>
#include <string.h>

__global__ void testkernel(char **data, unsigned n){
  for (int i = 0; i < 100; i++)
    if (data[n][i] != 1) printf("kernel error\n");
}

int main(){
  char **d_ppcPtr, *d_pcPtr, *h_pcPtr;
  cudaMalloc(&d_ppcPtr, sizeof(char*) * 10);
  h_pcPtr = (char*)malloc(sizeof(char) * 100);
  for (int i = 0; i < 10; i++)
  {
    cudaMalloc(&d_pcPtr, sizeof(char) * 100);
    cudaMemset(d_pcPtr, 1, sizeof(char) * 100);
    cudaMemcpy(&d_ppcPtr[i], &d_pcPtr, sizeof(char*), cudaMemcpyHostToDevice);
    memset(h_pcPtr, 0, sizeof(char) * 100);
    testkernel<<<1,1>>>(d_ppcPtr, i);
    cudaMemcpy(h_pcPtr, d_pcPtr, sizeof(char) * 100, cudaMemcpyDeviceToHost);
    cudaFree(d_pcPtr);
    for (int j = 0; j < 100; j++)
      if (h_pcPtr[j] != 1) printf("Error!");
  }
  cudaFree(d_ppcPtr);
}
$ nvcc -arch=sm_20 -o t583 t583.cu
$ cuda-memcheck ./t583
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
Note that conceptually there is no difference between my code and yours, because the pointer that you are attempting to use at location d_ppcPtr[i] (which is crashing, because that location is on the device) is already contained in d_pcPtr, which is on the host.
Use of cudaMalloc(). Why the double pointer?
All CUDA API functions return an error code (or cudaSuccess if no error occurred), so any output value must be passed back through a parameter, by reference. However, plain C has no references, which is why you have to pass the address of the variable in which you want the return information stored. Since the value being returned is a pointer, you need to pass a double pointer.
Another well-known function which operates on addresses for the same reason is the scanf function. How many times have you forgotten to write the & before the variable you want to store the value in? ;)
int i;
scanf("%d",&i);
Why is it necessary to cast to void** (e.g. in cudaMalloc calls)?
Casting to void** is always wrong, as this type is not a generic pointer. Thus when a function has a parameter of type void**, the only type of argument that can be passed to it is void**, making any cast either wrong or unnecessary.
The correct way (ignoring error checking) of getting memory from cudaMalloc is:
void* mem;
cudaMalloc( &mem , num_int_bytes );   // &mem is exactly void**, no cast needed
int* array = (int*)mem;               // cast required in C++; implicit in plain C
cudaMalloc( &mem , num_double_bytes );
double* floating = (double*)mem;
Pointer to Array of Pointers to Objects in CUDA
First of all, it's convenient if you provide a complete code, including a definition for the Obj class. I've provided one based on inspection of your code and some guesswork.
Second, much of your confusion here seems to be a less-than-crisp facility with pointers in C (or C++). Using the CUDA API with double-pointer constructions (**) between host and device requires a crisp understanding and an ability to visualize what is happening.
If the line indicated by SEE QUESTION 1 above allocates host memory for the pointers, and once I have used cudaMalloc in the subsequent loop to allocate device memory, the pointers pointed to by h_d_obj get overwritten with device addresses, does that mean I have allocated host memory for 3 Obj* that now have no pointer pointing to them?
No. h_d_obj is established (i.e. given a meaningful value) by the malloc operation. Nothing you have done subsequent to that modifies the value of h_d_obj.
Why is the cudaMemcpy succeeding when I test the status returned, but clearly does not copy the addresses correctly? I was expecting the "arrays" of memory addresses of both h_d_obj and d_d_obj to be the same, since they should point to the same Obj in the device address space.
I don't see anything wrong with your code up to this point. The value of h_d_obj was established (previously) by malloc, and its numerical value is an address in host memory. The value of d_d_obj was established by cudaMalloc, and its numerical value is an address in device memory. Numerically, I would expect them to be different.
At the line SEE QUESTION 3, assuming I'm correct in question 2, I also expect to be able to use either h_d_obj or d_d_obj to retrieve the Obj objects from the device, since the difference would be only whether I dereference a host pointer to access a device pointer to Obj, or a device pointer, both of which I can do in a cudaMemcpy call, right? If I use what is written, the copy succeeds but the pointer at h_obj[0] is corrupted and I cannot write out the data.
No. You cannot dereference a device pointer in host code, even if it is a parameter in cudaMemcpy. This is legal as a source or destination in a cudaMemcpy operation:
h_d_obj[i]
This is not legal:
d_d_obj[i]
The reason is that in order to get the actual target address, I must dereference a host pointer (i.e. access a memory location on the host) in the first case, but a device pointer in the second case. From host code, I can retrieve the contents of h_d_obj[i]. I am not allowed to try to retrieve the contents of d_d_obj[i] in host code (and the parameter manipulation for cudaMemcpy is host code). The value of d_d_obj can be used as a source or destination from host code; d_d_obj[i] cannot.
At the line SEE QUESTION 4, why can I not dereference an Obj** to get an Obj*, then use the -> operator to call a device method? The compiler moans that it is not a pointer to class type, which the fact that it is an Obj* tells me it is.
The compiler is barking at you because you don't understand the order of operations between the various operators (*, ->) that you are using. If you add parentheses to enforce the correct order:
(*d_array_of_objs)->changeToFive();
then the compiler won't object to that (although I would do it slightly differently, as below).
Here's a modified version of your code with the Obj definition added, a slight change to the kernel so that independent threads work on independent objects, and a few other fixes. Your code was mostly correct:
$ cat t1231.cu
#include <iostream>

class Obj{
public:
  int scalar;
  __host__ __device__
  void changeToFive() {scalar = 5;}
};

// Kernel
__global__ void myKernel(Obj** d_array_of_objs)
{
  // Change the scalar of each object to 5 by dereferencing the
  // device array to get the appropriate object pointer.
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  // (*d_array_of_objs)->changeToFive(); // <--------- SEE QUESTION 4 (add parentheses)
  d_array_of_objs[idx]->changeToFive();
}

// Entry point
int main()
{
  /********************************/
  /* INITIALISE OBJ ARRAY ON HOST */
  /********************************/
  // Array of 3 pointers to Objs
  Obj* h_obj[3];
  for (int i = 0; i < 3; i++) {
    h_obj[i] = new Obj();      // Create
    h_obj[i]->scalar = i * 10; // Initialise
  }
  // Write out
  for (int i = 0; i < 3; i++) {
    std::cout << h_obj[i]->scalar << std::endl;
  }
  /**************************************************/
  /* CREATE DEVICE VERSIONS AND STORE IN HOST ARRAY */
  /**************************************************/
  // Create host pointer to array-like storage of device pointers
  Obj** h_d_obj = (Obj**)malloc(sizeof(Obj*) * 3); // <--------- SEE QUESTION 1
  for (int i = 0; i < 3; i++) {
    // Allocate space for an Obj on the device and store its address
    cudaMalloc((void**)&h_d_obj[i], sizeof(Obj));
    // Copy the object to the device (only has a single scalar field to keep it simple)
    cudaMemcpy(h_d_obj[i], h_obj[i], sizeof(Obj), cudaMemcpyHostToDevice);
  }
  /**************************************************/
  /* CREATE DEVICE ARRAY TO PASS POINTERS TO KERNEL */
  /**************************************************/
  // Create a pointer which will point to device memory
  Obj** d_d_obj = NULL;
  // Allocate space for 3 pointers on the device at the above location
  cudaMalloc((void**)&d_d_obj, sizeof(Obj*) * 3);
  // Copy the pointers from the host memory to the device array
  cudaMemcpy(d_d_obj, h_d_obj, sizeof(Obj*) * 3, cudaMemcpyHostToDevice);
  /**********
   * After the above, VS2013 shows the memory pointed to by d_d_obj
   * to be NULL  <------- SEE QUESTION 2.
   **********/
  // Launch Kernel
  myKernel<<<1, 3>>>(d_d_obj);
  // Synchronise and pass back to host
  cudaDeviceSynchronize();
  for (int i = 0; i < 3; i++) {
    cudaMemcpy(h_obj[i], h_d_obj[i], sizeof(Obj), cudaMemcpyDeviceToHost); // <--------- SEE QUESTION 3 (remove parentheses)
  }
  // Write out
  for (int i = 0; i < 3; i++) {
    std::cout << h_obj[i]->scalar << std::endl;
  }
  return 0;
}
$ nvcc -o t1231 t1231.cu
$ cuda-memcheck ./t1231
========= CUDA-MEMCHECK
0
10
20
5
5
5
========= ERROR SUMMARY: 0 errors
$
A diagram of h_d_obj and d_d_obj might help:
HOST | DEVICE
h_d_obj-->(Obj *)-------------------------->Obj0<---(Obj *)<----|
(Obj *)-------------------------->Obj1<---(Obj *) |
(Obj *)-------------------------->Obj2<---(Obj *) |
| |
d_d_obj---------------------------------------------------------|
HOST | DEVICE
You're allowed to access any quantity (location) on the left-hand side (HOST) of the above diagram in host code, or in a cudaMemcpy operation. You're not allowed to access any quantity (location) on the right-hand side in host code.
How to use cudaMalloc / cudaMemcpy for a pointer to a structure containing pointers?
You have to be aware of where your memory resides. malloc allocates host memory; cudaMalloc allocates memory on the device and returns a pointer to that memory. However, this pointer is only valid in device functions.
What you want could be achieved as follows:
typedef struct {
  int width;
  int height;
  float* elements;
} Matrix;

int main(void) {
  int rows, cols, numMat = 2; // These are actually determined at run-time
  Matrix* data = (Matrix*)malloc(numMat * sizeof(Matrix));
  // ... Successfully read from file into "data" ...
  Matrix* h_data = (Matrix*)malloc(numMat * sizeof(Matrix));
  memcpy(h_data, data, numMat * sizeof(Matrix));
  for (int i = 0; i < numMat; i++) {
    cudaMalloc(&(h_data[i].elements), rows * cols * sizeof(float));
    cudaMemcpy(h_data[i].elements, data[i].elements, rows * cols * sizeof(float), cudaMemcpyHostToDevice);
  } // matrix data is now on the gpu, now copy the "meta" data to gpu
  Matrix* d_data;
  cudaMalloc(&d_data, numMat * sizeof(Matrix));
  cudaMemcpy(d_data, h_data, numMat * sizeof(Matrix), cudaMemcpyHostToDevice);
  // ... Do other things ...
}
To make things clear:
- Matrix* data contains the data on the host.
- Matrix* h_data contains, in its elements members, pointers to device memory, which can be passed to kernels as parameters. The element memory is on the GPU.
- Matrix* d_data is completely on the GPU and can be used like data is used on the host.
In your kernel code you can now access the matrix values, e.g.:
__global__ void doThings(Matrix* matrices)
{
  int i = blockIdx.x; // e.g. one matrix per block
  matrices[i].elements[0] = 42;
}
CUDA device pointers
You are right: cudaMalloc allocates memory on the device. You can't use this pointer directly on the host, but only as an argument to functions like cudaMemcpy, and as an argument to kernel calls.
More recent CUDA versions support unified memory addressing; there you can use cudaMallocManaged to allocate device memory and access it on the host directly via the device pointer.
For the second question: C++ doesn't allow implicit casts between pointer types, so leaving out the explicit cast (void**)&x_device will result in a compiler error.
Handling Image pointer to pointer in CUDA
I'm not going to try and sort out your complex matrix allocation scheme. The purpose of my suggestion was so that you can simplify things to simple 1-line allocations.
Furthermore, I don't think you really grasped the example I gave. It was a 3D example, and the typedefs had 2 subscripts. A 2D version would have typedefs with a single subscript.
Really none of this has to do with CUDA. It revolves around understanding of C arrays and pointers.
Here are the major changes I made to get your code working:
#include <stdio.h>
#include <stdlib.h>

#define hsize 256
#define vsize 256
#define IMAGE_TYPE unsigned char

__global__ void kernel(IMAGE_TYPE matrixin[][hsize], IMAGE_TYPE matrixout[][hsize]) {
  int tid = threadIdx.x;
  int bid = blockIdx.x;
  matrixout[bid][tid] = matrixin[bid][tid];
}

int fatal(char* s) {
  fprintf(stderr, "%s\n", s);
  return 1;
}

int main() {
  typedef IMAGE_TYPE IMarray[hsize];
  IMarray *hin_image, *hout_image;
  IMarray *din_image, *dout_image;

  // allocate host memory
  hin_image = (IMarray *)malloc(hsize*vsize*sizeof(IMAGE_TYPE));
  hout_image = (IMarray *)malloc(hsize*vsize*sizeof(IMAGE_TYPE));
  for (int i = 0; i < vsize; i++)
    for (int j = 0; j < hsize; j++)
      hin_image[i][j] = 'a';

  // allocate device memory
  cudaMalloc((void**)&din_image, (vsize*hsize)*sizeof(IMAGE_TYPE));
  cudaMalloc((void**)&dout_image, (vsize*hsize)*sizeof(IMAGE_TYPE));
  cudaMemset(dout_image, 0, (vsize*hsize)*sizeof(IMAGE_TYPE));

  cudaMemcpy(din_image, hin_image, (vsize*hsize)*sizeof(IMAGE_TYPE), cudaMemcpyHostToDevice);

  dim3 threads(hsize,1,1);
  dim3 blocks(vsize,1,1);
  kernel<<<blocks,threads>>>(din_image, dout_image);

  cudaMemcpy(hout_image, dout_image, (vsize*hsize)*sizeof(IMAGE_TYPE), cudaMemcpyDeviceToHost);

  for (int i = 0; i < 10; i++) {
    printf("\n");
    for (int j = 0; j < 10; j++)
      printf("%c\t", hout_image[i][j]);
  }
  printf("\n");

  cudaFree(din_image);
  cudaFree(dout_image);
  free(hin_image);
  free(hout_image);
  return 0;
}