Cudamemcpy Segmentation Fault

cudaMemcpy segmentation fault

I believe I know what the problem is, but to confirm it, it would be useful to see the code that you are using to set up the Grid_dev classes on the device.

When a class or other data structure is to be used on the device, and that class has pointers in it which refer to other objects or buffers in memory (presumably in device memory, for a class that will be used on the device), then the process of making this top-level class usable on the device becomes more complicated.

Suppose I have a class like this:

class myclass{
  int myval;
  int *myptr;
  }

I could instantiate the above class on the host, and then malloc an array of int and assign that pointer to myptr, and everything would be fine. To make this class usable on the device and the device only, the process could be similar. I could:

cudaMalloc a pointer to device memory that will hold myclass
(optionally) copy an instantiated object of myclass on the host to the device pointer from step 1 using cudaMemcpy
on the device, use malloc or new to allocate device storage for myptr

The above sequence is fine if I never want to access the storage allocated for myptr on the host. But if I do want that storage to be visible from the host, I need a different sequence:

cudaMalloc a pointer to device memory that will hold myclass, let's call this mydevobj
(optionally) copy an instantiated object of myclass on the host to the device pointer mydevobj from step 1 using cudaMemcpy
Create a separate int pointer on the host, let's call it myhostptr
cudaMalloc int storage on the device for myhostptr
cudaMemcpy the pointer value of myhostptr from the host to the device pointer &(mydevobj->myptr)

After that, you can cudaMemcpy the data pointed to by the embedded pointer myptr to the region allocated (via cudaMalloc) on myhostptr

Note that in step 5, because I am taking the address of this pointer location, this cudaMemcpy operation only requires the mydevobj pointer on the host, which is valid in a cudaMemcpy operation (only).

The value of the device pointer myint will then be properly set up to do the operations you are trying to do. If you then want to cudaMemcpy data to and from myint to the host, you use the pointer myhostptr in any cudaMemcpy calls, not mydevobj->myptr. If we tried to use mydevobj->myptr, it would require dereferencing mydevobj and then using it to retrieve the pointer that is stored in myptr, and then using that pointer as the copy to/from location. This is not acceptable in host code. If you try to do it, you will get a seg fault. (Note that by way of analogy, my mydevobj is like your Grid_dev and my myptr is like your cdata)

Overall it is a concept that requires some careful thought the first time you run into it, and so questions like this come up with some frequency on SO. You may want to study some of these questions to see code examples (since you haven't provided your code that sets up Grid_dev):

example 1
example 2
example 3

Segmentation fault on cudaMalloc or cudaMemcpy

For someone wondering what went wrong, I was able to fix it. I am not exactly sure what exactly was wrong but I had improper memory allocations at some places and in other cases I didn't even needed to use cudaMalloc or cudaMemcpy. Also, using What is the canonical way to check for errors using the CUDA runtime API? for checking errors instead of my own implementation worked. What I have now:

/***** KERNEL CONFIGURATION & MEMORY MANAGEMENT ******/
/***** GENERATE HASHED PASSWORD LIBRARY FOR COMPARE **/
unsigned int threads_per_block = 1024;
dim3  grid(1024, 1, 1);
dim3  threads(threads_per_block, 1, 1);

password* d_pwds;
ERROR_CHECK( cudaMalloc((void**) &d_pwds, pwds_size));
ERROR_CHECK( cudaMemcpy( d_pwds, h_pwds, pwds_size, cudaMemcpyHostToDevice));

libEntry* d_library;
ERROR_CHECK( cudaMalloc( (void**) &d_library, sizeof(libEntry) * count));

// generateLibraryKernel(int numPwds, password* pwds, libEntry* library)
generateLibraryKernel<<<grid, threads>>>(i, d_pwds, d_library);
ERROR_CHECK( cudaPeekAtLastError() );
ERROR_CHECK( cudaDeviceSynchronize() );

Where ERROR_CHECK is defined from the link above.

#define ERROR_CHECK(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

I still don't fully understand memory management in CUDA (device and host allocations) but my code works now! Thank you all.

What causes this segmentation fault (core dumped) error at cudaMemcpy when copying to GPU?

The problem here relates to your usage of OpenCV. An item like CV_8U is not a type, it is a compiler #define. Therefore sizeof(CV_8U) is not doing what you think it is doing. Your intended usage should be to capture the size of the underlying type (e.g. unsigned char, i.e. a type size of 1). However, sizeof(CV_8U) returns evidently the size of an integer, which is 4.

As a result of that, your calculation of size is wrong (4x too large). As a result of that, when the cudaMemcpy operation attempts to access &image.data[0] for size bytes, it will attempt to copy past the end of the buffer. For small images, the overrun doesn't trigger the run time check/limit. For a large enough size calculation (large enough image) you will hit a seg fault. Although the failure is triggered within a CUDA call, the origin of the error is outside of CUDA.

One possible solution is to replace your usage of sizeof(CV_8U) with something like sizeof(unsigned char). Since that size is 1, you can also just delete the multiplication by sizeof(CV_8U) and get the same behavior.

You can also avoid this sort of allocation and let OpenCV do the allocation (and host-device data copying) work for you as demonstrated in the answer here and here

Getting segmentation fault with device malloc() in CUDA

You are passing incorrect arguments to cudaMemcpy. This:

cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

should be

cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_a, &a[0], size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b[0], size, cudaMemcpyHostToDevice);

and similarly this:

cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

should be

cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

cudaMemcpy(&c[0], d_c, size, cudaMemcpyDeviceToHost);

It will be the device to host copy which will be the source of your problem - it will overwrite the stack and cause the segfault you are seeing.

Simple operation on Structure in CUDA : Segmentation fault

There are several invalid memory access in the provided code.

Accessing device memory (allocated using cudaMalloc) from host like d_data->a will cause undefined behavior (segmentation fault etc.).
cudaMemcpy takes pointers as arguments, not address of pointer. So cudaMemcpy(&d_data, &h_data... should be replaced with cudaMemcpy(d_data, h_data....

Allocating a device object with a device pointer as a member is a bit tricky. It can be achieved as follows:

Allocate a temporary host object (MyStruct temp).
Allocate device memory to the member we want on device (cudaMalloc(&temp.a, bytes)).
Allocate device object (cudaMalloc(&d_data, sizeof(MyStruct)).
Copy temporary host object to the device object (cudaMemcpy(d_data, &temp, sizeof(MyStruct), cudaMemcpyHostToDevice)).

Keep in mind that when you modify the contents of d_data->a on the device, temp.a will also be modified because they are actually pointing to same memory location on device.

Your final main function will look something like this:

int main(){
    MyStruct *h_data, *d_data, *out_data;

    size_t structSize = sizeof(MyStruct);
    size_t intSize = sizeof(int);


    h_data = (MyStruct *) malloc(structSize * 1);
    h_data->b = 32;
    h_data->a = (int *)malloc(intSize * h_data->b);

    out_data = (MyStruct *) malloc(structSize * 1);
    out_data->b = 32;
    out_data->a = (int *)malloc(intSize * out_data->b);

    for(int i = 0; i<32; i++){
        h_data->a[i] = i;   
    }

    //Create temporary MyStruct object on host and allocate memory to its member "a" on device
    MyStruct temp;
    temp.b = h_data->b;
    checkCuda(cudaMalloc(&temp.a, 32 * sizeof(int)));

    //Copy host data to temp.a
    checkCuda(cudaMemcpy(temp.a, h_data->a, 32 * sizeof(int), cudaMemcpyHostToDevice));

    //Memory allocation for the device MyStruct
    checkCuda(cudaMalloc(&d_data, sizeof(MyStruct) * 1));
    //Copy actual object to device
    checkCuda(cudaMemcpy(d_data, &temp, sizeof(MyStruct) * 1, cudaMemcpyHostToDevice));


    structOperation<<<1,32>>>(d_data);

    //temp.a will be updated after kernel launch
    checkCuda(cudaMemcpy(out_data->a, temp.a, 32 * sizeof(int), cudaMemcpyDeviceToHost)); 

    printf("\nDataElements : ");
    for(int i = 0; i<32; i++)
    {
        printf("    %d",out_data->a[i]);
    }
    printf("\n");

    checkCuda(cudaFree(temp.a));
    checkCuda(cudaFree(d_data));

    free(h_data->a);
    free(out_data->a);
    free(h_data); 
    free(out_data);
}

Cudamemcpy Segmentation Fault