Why Are NVIDIA Pascal GPUs Slow Running CUDA Kernels When Using cudaMallocManaged

Why are NVIDIA Pascal GPUs slow running CUDA kernels when using cudaMallocManaged?

Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see the note at the end about updated CUDA 9 behavior on Windows.)

With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called cudaMemcpy to move the data yourself.

With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device(*). When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to:

  1. Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)
  2. Cause that page of memory to be migrated from the CPU to the GPU

This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent to actually move the data. Since this process will move data a page at a time, it may be significantly less efficient than moving all the data at once, either using cudaMemcpy or else via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it).

Both approaches have their pros and cons, and I don't wish to debate the merits or various opinions or viewpoints. The demand-paging process enables a great many important features and capabilities for Pascal GPUs.

This particular code example, however, does not benefit. This was anticipated, and so the recommended approach to bring the behavior in line with previous (e.g. Maxwell) behavior/performance is to precede the kernel launch with a cudaMemPrefetchAsync() call.

You would use the CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter, to select the default stream). I believe the other parameters for this function call are pretty self-explanatory.

With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case.
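
For concreteness, here is a minimal sketch of that pattern (the kernel, sizes, and names here are hypothetical; device 0 and the default stream are assumed): initialize the data on the host, prefetch it to the device, then launch the kernel on the same stream.

#include <cuda_runtime.h>

// hypothetical kernel: doubles each element in place
__global__ void doubleElements(float *data, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main(){
  const int n = 1048576;
  float *data;
  cudaMallocManaged(&data, n * sizeof(float));

  for (int i = 0; i < n; i++) data[i] = 1.0f;  // "touch" on host: pages now resident in CPU memory

  // migrate the whole buffer to device 0 on the default (NULL) stream,
  // so the subsequent kernel launch sees no page faults for this buffer
  cudaMemPrefetchAsync(data, n * sizeof(float), 0, NULL);

  doubleElements<<<(n + 255) / 256, 256>>>(data, n);  // default stream: ordered after the prefetch
  cudaDeviceSynchronize();

  cudaFree(data);
  return 0;
}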

As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched.

Note that the "stall" referred to in step 1 above is possibly misleading. A memory access by itself does not trigger a stall. But if the data requested is actually needed for an operation, e.g. a multiply, then the warp will stall at the multiply operation, until the necessary data becomes available. A related point, then, is that demand-paging of data from host to device in this fashion is just another "latency" that the GPU can possibly hide in its latency-hiding architecture, if there is sufficient other available "work" to attend to.

As an additional note, in CUDA 9, the demand-paging regime for Pascal and beyond is only available on Linux; the previous support for Windows advertised in CUDA 8 has been dropped. See here. On Windows, even for Pascal devices and beyond, as of CUDA 9, the UM regime is the same as on Maxwell and prior devices; data is migrated to the GPU en masse, at kernel launch.

(*) The assumption here is that data is "resident" on the host, i.e. already "touched" or initialized in CPU code, after the managed allocation call. The managed allocation itself creates data pages associated with the device, and when CPU code "touches" these pages, the CUDA runtime will demand-page the necessary pages to be resident in host memory, so that the CPU can use them. If you perform an allocation but never "touch" the data in CPU code (an odd situation, probably) then it will actually already be "resident" in device memory when the kernel runs, and the observed behavior will be different. But that is not the case in view for this particular example/question.

Additional information is available in this blog article.

Is cudaMallocManaged slower than cudaMalloc?

cudaMallocManaged() is not about speeding up your application (with a few exceptions or corner cases, some are suggested below).

Today's implementation of Unified Memory and cudaMallocManaged will not be faster than intelligent code written by a proficient CUDA programmer to do the same thing. The machine (the CUDA runtime) is not smarter than you are as a programmer. cudaMallocManaged does not magically make the PCIE bus or general machine architectural limitations disappear.

Fast prototyping refers to the time it takes you to write the code, not the speed of the code.

cudaMallocManaged may be of interest to a proficient CUDA programmer in the following situations:

  1. You're interested in quickly getting a prototype together, i.e. you don't care about the last ounce of performance.

  2. You are dealing with a complicated data structure which you use infrequently (e.g. a doubly linked list) which would otherwise be a chore to port to CUDA (deep copies using ordinary CUDA code tend to be tedious). It's necessary for your application to work, but not part of the performance path. (A sketch of this case follows the list.)

  3. You would ordinarily use zero-copy. There may be situations where using cudaMallocManaged could be faster than a naive or inefficient zero-copy approach.

  4. You are working on a Jetson device.
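
As promised in point 2, here is a minimal sketch of that case (the Node type, kernel, and list contents are invented for illustration): managed memory lets ordinary host code build a doubly linked list whose pointers are directly valid in device code, with no deep copy.

#include <cuda_runtime.h>
#include <cstdio>

struct Node {
  int   val;
  Node *prev;
  Node *next;
};

__global__ void sumList(Node *head, int *result){
  int s = 0;
  for (Node *p = head; p != nullptr; p = p->next)
    s += p->val;       // pointers built by host code are valid here
  *result = s;
}

int main(){
  Node *head = nullptr, *tail = nullptr;
  for (int i = 0; i < 10; i++){          // build the list in ordinary CPU code
    Node *n;
    cudaMallocManaged(&n, sizeof(Node));
    n->val = i; n->next = nullptr; n->prev = tail;
    if (tail) tail->next = n; else head = n;
    tail = n;
  }
  int *result;
  cudaMallocManaged(&result, sizeof(int));
  sumList<<<1,1>>>(head, result);
  cudaDeviceSynchronize();
  printf("sum = %d\n", *result);         // 0+1+...+9 = 45
  // (per-node cudaFree cleanup omitted for brevity)
  return 0;
}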

cudaMallocManaged may be of interest to a non-proficient CUDA programmer in that it allows you to get your feet wet with CUDA along a possibly simpler learning curve. (However, note that naive usage of cudaMallocManaged may result in CUDA kernels running slower than expected, see here and here.)

Although Maxwell is mentioned in the comments, CUDA UM will offer major new features with the Pascal generation of GPUs, in some settings, for some GPUs. In particular, Unified Memory in these settings will no longer be limited to the available GPU device memory, and the memory handling granularity will drop to the page level even when the kernel is running. You can read more about it here.

Is cudaMalloc slower than cudaMemcpy?

Is it possible that the large delay you are seeing (nearly 1s) is due to driver initialisation? It seems rather long for a cudaMalloc. Also check your driver is up-to-date.

The delay for the first kernel launch can be due to a number of factors:

  1. Driver initialisation
  2. PTX compilation
  3. Context creation

The first of these is only applicable if you are running on a Linux system without X. In that case the driver is only loaded when required and unloaded afterwards. Running nvidia-smi -pm 1 as root will run the driver in persistent mode to avoid such delays; check out man nvidia-smi for details, and remember to add this to an init script since it won't persist across a reboot.
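
For example (assuming a standard nvidia-smi installation):

$ sudo nvidia-smi -pm 1                     # enable persistence mode (root required)
$ nvidia-smi -q | grep "Persistence Mode"   # verify; shows "Enabled" when active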

The second delay is in compiling the PTX for the specific device architecture in your system. This is easily avoided by embedding the binary for your device architecture (or architectures if you want to support multiple archs without compiling PTX) into the executable. See the CUDA C Programming Guide (available on NVIDIA website) for more information, section 3.1.1.2 talks about JIT compilation.
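
For example, a sketch of the relevant nvcc options (the architectures are chosen for illustration): embedding SASS for each target architecture, plus PTX for forward compatibility, avoids the JIT step at first launch.

$ nvcc -gencode arch=compute_60,code=sm_60 \
       -gencode arch=compute_70,code=sm_70 \
       -gencode arch=compute_70,code=compute_70 \
       -o app app.cu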

The third point, context creation, is unavoidable but NVIDIA have gone to great effort to reduce the cost. Context creation involves copying the executable code to the device, copying any data objects, setting up the memory system etc.

Does cudaMallocManaged() create a synchronized buffer in RAM and VRAM?

So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?

Yes, more or less. The "synchronization" is referred to in the managed memory model as migration of data. Virtual address carveouts are made for all visible processors, and the data is migrated to (i.e. moved to, and given a physical allocation on) the processor that attempts to access it.

If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?

If you never need to touch the buffer on the CPU, then what will happen is that the VA carveout will be made in the CPU VA space, but no physical allocation will be made for it. When the GPU attempts to actually access the data, it will cause the allocation to "appear" and use up GPU memory. Although there are "costs" to be sure, there is no usage of CPU (physical) memory in this case. Furthermore, once instantiated in GPU memory, there should be no ongoing additional cost for the GPU to access it; it should run at "full" speed. The instantiation/migration process is a complex one, and what I am describing here is what I would consider the "principal" modality or behavior. There are many factors that could affect this.
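
A minimal sketch of that case (the kernel and buffer size are hypothetical): the buffer is only ever touched by GPU code, so its physical pages are instantiated in device memory and no CPU (physical) memory is consumed.

#include <cuda_runtime.h>

// hypothetical kernel performing the first touch on the GPU
__global__ void fill(float *buf, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 1.0f;
}

int main(){
  const int n = 1048576;
  float *buf;
  cudaMallocManaged(&buf, n * sizeof(float)); // VA carveout only; no physical pages yet
  fill<<<(n + 255) / 256, 256>>>(buf, n);     // pages instantiated in GPU memory on first access
  cudaDeviceSynchronize();
  cudaFree(buf);
  return 0;
}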

Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?

No, this is managed by the runtime at run time, not by the compiler at compile time.

Or do I have it all wrong? Are we not even talking VRAM? How does this work?

No, you don't have it all wrong. Yes, we are talking about VRAM.

The blog you reference barely touches on managed memory, which is a fairly involved subject. There are numerous online resources to learn more about it. You might want to review some of them. Here is one. There are good GTC presentations on managed memory, including here. There is also an entire section of the CUDA programming guide covering managed memory.

Is cudaDeviceSynchronize() required before cudaFree()?

Some CUDA API calls, such as cudaMalloc(), cudaFree(), cudaHostAlloc(), device-to-device copies, etc., change the virtual memory address mapping of the GPU. These calls cause device synchronization, so you don't need to call cudaDeviceSynchronize(); the synchronization is already implicit.

The same applies to cudaMallocManaged().
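
A minimal sketch of the point above (the kernel is hypothetical): per this answer, the cudaFree() call already synchronizes with the in-flight kernel, so no explicit cudaDeviceSynchronize() is needed before it.

#include <cuda_runtime.h>

// hypothetical kernel: doubles 256 values
__global__ void work(int *data){ data[threadIdx.x] *= 2; }

int main(){
  int *data;
  cudaMallocManaged(&data, 256 * sizeof(int));
  for (int i = 0; i < 256; i++) data[i] = i;
  work<<<1, 256>>>(data);
  cudaFree(data);  // per the answer's claim: synchronizes with the device before releasing the mapping
  return 0;
}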

CUDA unified memory vs cudaMalloc

One problem with your posted code is that you are not doing a cudaMemPrefetchAsync on the output data from the FFT. According to my testing, this makes a significant difference. There were a few other problems with your code; for example, we don't call cudaFree on a pointer allocated with malloc.

Here's a complete code built around what you have shown. When I run it on CentOS 7.4, CUDA 9.1, and a Tesla P100, I get comparable times for the FFT performed in the managed memory case (3.52ms) vs. the FFT performed in the non-managed memory case (3.45ms):

$ cat t43.cu
#include <cufft.h>
#include <iostream>
#include <string>

const int dataSize = 1048576*32;

// fill the input buffer with a constant waveform on the host
void setupWave(const int ds, cufftComplex *d){
  for (int i = 0; i < ds; i++){
    d[i].x = 1.0f;
    d[i].y = 0.0f;
  }
}

int main(){

  cufftComplex *inData, *outData;

  cufftHandle plan;
  cufftPlan1d(&plan, dataSize, CUFFT_C2C, 1);

  // ----- managed-memory (UM) case -----
  cudaMallocManaged(&inData, dataSize * sizeof(cufftComplex));
  cudaMallocManaged(&outData, dataSize * sizeof(cufftComplex));

  cudaEvent_t start_before_memHtoD, start_kernel, stop_kernel, stop_after_memDtoH;
  cudaEventCreate(&start_kernel);
  cudaEventCreate(&start_before_memHtoD);
  cudaEventCreate(&stop_kernel);
  cudaEventCreate(&stop_after_memDtoH);

  setupWave(dataSize, inData);

  cudaEventRecord(start_before_memHtoD);
  // prefetch both the input and the output buffers to the device (device 0)
  cudaMemPrefetchAsync(inData,  dataSize * sizeof(cufftComplex), 0);
  cudaMemPrefetchAsync(outData, dataSize * sizeof(cufftComplex), 0);
  cudaDeviceSynchronize();

  cudaEventRecord(start_kernel);
  cufftExecC2C(plan, inData, outData, CUFFT_FORWARD);
  cudaEventRecord(stop_kernel);
  cudaEventSynchronize(stop_kernel);

  // touching outData on the host demand-migrates the results back from the device
  float sum = 0;
  for (int i = 0; i < dataSize; i++) {
    sum += outData[i].x + outData[i].y;
  }
  cudaEventRecord(stop_after_memDtoH);
  cudaEventSynchronize(stop_after_memDtoH);

  std::cout << "sum for UM is " << sum << std::endl;

  float umTime = 0;
  float overallUmTime = 0;
  cudaEventElapsedTime(&umTime, start_kernel, stop_kernel);
  cudaEventElapsedTime(&overallUmTime, start_before_memHtoD, stop_after_memDtoH);

  std::string resultString_um = std::to_string(dataSize) + " samples took "
      + std::to_string(umTime) + "ms, Overall: "
      + std::to_string(overallUmTime) + "\n";
  std::cout << resultString_um;

  cudaEventDestroy(start_kernel);
  cudaEventDestroy(stop_kernel);
  cudaFree(inData);
  cudaFree(outData);
  cudaEventDestroy(start_before_memHtoD);
  cudaEventDestroy(stop_after_memDtoH);
  cufftDestroy(plan);

  // ----- non-managed (cudaMalloc + cudaMemcpy) case -----
  cufftComplex *d_inData;
  cufftComplex *d_outData;
  inData  = (cufftComplex*) malloc(sizeof(cufftComplex) * dataSize);
  outData = (cufftComplex*) malloc(sizeof(cufftComplex) * dataSize);
  cudaMalloc((void**)(&d_inData),  dataSize * sizeof(cufftComplex));
  cudaMalloc((void**)(&d_outData), dataSize * sizeof(cufftComplex));
  cufftPlan1d(&plan, dataSize, CUFFT_C2C, 1);  // reuse the plan handle from above

  cudaEventCreate(&start_kernel);              // reuse the event handles from above
  cudaEventCreate(&start_before_memHtoD);
  cudaEventCreate(&stop_kernel);
  cudaEventCreate(&stop_after_memDtoH);

  setupWave(dataSize, inData);

  cudaEventRecord(start_before_memHtoD);
  cudaMemcpy(d_inData, inData, dataSize * sizeof(cufftComplex), cudaMemcpyHostToDevice);
  cudaEventRecord(start_kernel);
  cufftExecC2C(plan, d_inData, d_outData, CUFFT_FORWARD);
  cudaEventRecord(stop_kernel);
  cudaEventSynchronize(stop_kernel);

  cudaMemcpy(outData, d_outData, dataSize * sizeof(cufftComplex), cudaMemcpyDefault);

  sum = 0;
  for (int i = 0; i < dataSize; i++) {
    sum += outData[i].x + outData[i].y;
  }
  cudaEventRecord(stop_after_memDtoH);
  cudaEventSynchronize(stop_after_memDtoH);

  std::cout << "sum for non-UM is " << sum << std::endl;

  cudaEventElapsedTime(&umTime, start_kernel, stop_kernel);
  cudaEventElapsedTime(&overallUmTime, start_before_memHtoD, stop_after_memDtoH);

  resultString_um = std::to_string(dataSize) + " samples took "
      + std::to_string(umTime) + "ms, Overall: "
      + std::to_string(overallUmTime) + "\n";
  std::cout << resultString_um;

  free(outData);
  free(inData);
  cudaFree(d_outData);
  cudaFree(d_inData);
  cudaEventDestroy(start_kernel);
  cudaEventDestroy(stop_kernel);
  cudaEventDestroy(start_before_memHtoD);
  cudaEventDestroy(stop_after_memDtoH);
  cufftDestroy(plan);

}
$ nvcc -std=c++11 -arch=sm_60 -o t43 t43.cu -lcufft
$ ./t43
sum for UM is 3.35544e+07
33554432 samples took 3.520640ms, Overall: 221.909988
sum for non-UM is 3.35544e+07
33554432 samples took 3.456160ms, Overall: 278.099426
$

