cudaMemcpy segmentation fault
I believe I know what the problem is, but to confirm it, it would be useful to see the code that you are using to set up the Grid_dev
classes on the device.
When a class or other data structure is to be used on the device, and that class has pointers in it which refer to other objects or buffers in memory (presumably in device memory, for a class that will be used on the device), then the process of making this top-level class usable on the device becomes more complicated.
Suppose I have a class like this:
class myclass{
int myval;
int *myptr;
}
I could instantiate the above class on the host, and then malloc
an array of int
and assign that pointer to myptr
, and everything would be fine. To make this class usable on the device and the device only, the process could be similar. I could:
- cudaMalloc a pointer to device memory that will hold
myclass
- (optionally) copy an instantiated object of
myclass
on the host to the device pointer from step 1 using cudaMemcpy - on the device, use
malloc
ornew
to allocate device storage formyptr
The above sequence is fine if I never want to access the storage allocated for myptr
on the host. But if I do want that storage to be visible from the host, I need a different sequence:
- cudaMalloc a pointer to device memory that will hold
myclass
, let's call thismydevobj
- (optionally) copy an instantiated object of
myclass
on the host to the device pointermydevobj
from step 1 using cudaMemcpy - Create a separate int pointer on the host, let's call it
myhostptr
- cudaMalloc
int
storage on the device formyhostptr
- cudaMemcpy the pointer value of
myhostptr
from the host to the device pointer&(mydevobj->myptr)
After that, you can cudaMemcpy
the data pointed to by the embedded pointer myptr
to the region allocated (via cudaMalloc
) on myhostptr
Note that in step 5, because I am taking the address of this pointer location, this cudaMemcpy operation only requires the mydevobj
pointer on the host, which is valid in a cudaMemcpy operation (only).
The value of the device pointer myint
will then be properly set up to do the operations you are trying to do. If you then want to cudaMemcpy data to and from myint
to the host, you use the pointer myhostptr
in any cudaMemcpy calls, not mydevobj->myptr
. If we tried to use mydevobj->myptr
, it would require dereferencing mydevobj
and then using it to retrieve the pointer that is stored in myptr
, and then using that pointer as the copy to/from location. This is not acceptable in host code. If you try to do it, you will get a seg fault. (Note that by way of analogy, my mydevobj
is like your Grid_dev
and my myptr
is like your cdata
)
Overall it is a concept that requires some careful thought the first time you run into it, and so questions like this come up with some frequency on SO. You may want to study some of these questions to see code examples (since you haven't provided your code that sets up Grid_dev
):
- example 1
- example 2
- example 3
Segmentation fault on cudaMalloc or cudaMemcpy
For someone wondering what went wrong, I was able to fix it. I am not exactly sure what exactly was wrong but I had improper memory allocations at some places and in other cases I didn't even needed to use cudaMalloc
or cudaMemcpy
. Also, using What is the canonical way to check for errors using the CUDA runtime API? for checking errors instead of my own implementation worked. What I have now:
/***** KERNEL CONFIGURATION & MEMORY MANAGEMENT ******/
/***** GENERATE HASHED PASSWORD LIBRARY FOR COMPARE **/
unsigned int threads_per_block = 1024;
dim3 grid(1024, 1, 1);
dim3 threads(threads_per_block, 1, 1);
password* d_pwds;
ERROR_CHECK( cudaMalloc((void**) &d_pwds, pwds_size));
ERROR_CHECK( cudaMemcpy( d_pwds, h_pwds, pwds_size, cudaMemcpyHostToDevice));
libEntry* d_library;
ERROR_CHECK( cudaMalloc( (void**) &d_library, sizeof(libEntry) * count));
// generateLibraryKernel(int numPwds, password* pwds, libEntry* library)
generateLibraryKernel<<<grid, threads>>>(i, d_pwds, d_library);
ERROR_CHECK( cudaPeekAtLastError() );
ERROR_CHECK( cudaDeviceSynchronize() );
Where ERROR_CHECK
is defined from the link above.
#define ERROR_CHECK(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
I still don't fully understand memory management in CUDA (device and host allocations) but my code works now! Thank you all.
What causes this segmentation fault (core dumped) error at cudaMemcpy when copying to GPU?
The problem here relates to your usage of OpenCV. An item like CV_8U
is not a type, it is a compiler #define
. Therefore sizeof(CV_8U)
is not doing what you think it is doing. Your intended usage should be to capture the size of the underlying type (e.g. unsigned char
, i.e. a type size of 1). However, sizeof(CV_8U)
returns evidently the size of an integer, which is 4.
As a result of that, your calculation of size
is wrong (4x too large). As a result of that, when the cudaMemcpy
operation attempts to access &image.data[0]
for size
bytes, it will attempt to copy past the end of the buffer. For small images, the overrun doesn't trigger the run time check/limit. For a large enough size
calculation (large enough image) you will hit a seg fault. Although the failure is triggered within a CUDA call, the origin of the error is outside of CUDA.
One possible solution is to replace your usage of sizeof(CV_8U)
with something like sizeof(unsigned char)
. Since that size is 1, you can also just delete the multiplication by sizeof(CV_8U)
and get the same behavior.
You can also avoid this sort of allocation and let OpenCV do the allocation (and host-device data copying) work for you as demonstrated in the answer here and here
Getting segmentation fault with device malloc() in CUDA
You are passing incorrect arguments to cudaMemcpy
. This:
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
should be
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
or
cudaMemcpy(d_a, &a[0], size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b[0], size, cudaMemcpyHostToDevice);
and similarly this:
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
should be
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
or
cudaMemcpy(&c[0], d_c, size, cudaMemcpyDeviceToHost);
It will be the device to host copy which will be the source of your problem - it will overwrite the stack and cause the segfault you are seeing.
Simple operation on Structure in CUDA : Segmentation fault
There are several invalid memory access in the provided code.
- Accessing device memory (allocated using
cudaMalloc
) from host liked_data->a
will cause undefined behavior (segmentation fault etc.). cudaMemcpy
takes pointers as arguments, not address of pointer. SocudaMemcpy(&d_data, &h_data...
should be replaced withcudaMemcpy(d_data, h_data...
.
Allocating a device object with a device pointer as a member is a bit tricky. It can be achieved as follows:
- Allocate a temporary host object (
MyStruct temp
). - Allocate device memory to the member we want on device (
cudaMalloc(&temp.a, bytes)
). - Allocate device object (
cudaMalloc(&d_data, sizeof(MyStruct)
). - Copy temporary host object to the device object (
cudaMemcpy(d_data, &temp, sizeof(MyStruct), cudaMemcpyHostToDevice)
).
Keep in mind that when you modify the contents of d_data->a
on the device, temp.a
will also be modified because they are actually pointing to same memory location on device.
Your final main function will look something like this:
int main(){
MyStruct *h_data, *d_data, *out_data;
size_t structSize = sizeof(MyStruct);
size_t intSize = sizeof(int);
h_data = (MyStruct *) malloc(structSize * 1);
h_data->b = 32;
h_data->a = (int *)malloc(intSize * h_data->b);
out_data = (MyStruct *) malloc(structSize * 1);
out_data->b = 32;
out_data->a = (int *)malloc(intSize * out_data->b);
for(int i = 0; i<32; i++){
h_data->a[i] = i;
}
//Create temporary MyStruct object on host and allocate memory to its member "a" on device
MyStruct temp;
temp.b = h_data->b;
checkCuda(cudaMalloc(&temp.a, 32 * sizeof(int)));
//Copy host data to temp.a
checkCuda(cudaMemcpy(temp.a, h_data->a, 32 * sizeof(int), cudaMemcpyHostToDevice));
//Memory allocation for the device MyStruct
checkCuda(cudaMalloc(&d_data, sizeof(MyStruct) * 1));
//Copy actual object to device
checkCuda(cudaMemcpy(d_data, &temp, sizeof(MyStruct) * 1, cudaMemcpyHostToDevice));
structOperation<<<1,32>>>(d_data);
//temp.a will be updated after kernel launch
checkCuda(cudaMemcpy(out_data->a, temp.a, 32 * sizeof(int), cudaMemcpyDeviceToHost));
printf("\nDataElements : ");
for(int i = 0; i<32; i++)
{
printf(" %d",out_data->a[i]);
}
printf("\n");
checkCuda(cudaFree(temp.a));
checkCuda(cudaFree(d_data));
free(h_data->a);
free(out_data->a);
free(h_data);
free(out_data);
}
Related Topics
Variable Initialization in C++
Is Ncurses Available For Windows
Why Use Std::Bind Over Lambdas in C++14
How to Append Text to a Text File in C++
Overloading Member Access Operators -≫, .*
Py_Initialize Fails - Unable to Load the File System Codec
Unresolved External Symbol _Imp_Fprintf and _Imp_Iob_Func, Sdl2
How to Check If Input Is Numeric in C++
How to Use Stringstream to Separate Comma Separated Strings
Selectively Disable Gcc Warnings For Only Part of a Translation Unit
Tellg() Function Give Wrong Size of File
How Does Generic Lambda Work in C++14
Implementing Comparison Operators Via 'Tuple' and 'Tie', a Good Idea
Running My Program Says "Bash: ./Program Permission Denied"
How to Assume (Bool)True == (Int)1 for Any C++ Compiler