Why Is CUDA Pinned Memory So Fast

Why is CUDA pinned memory so fast?

The CUDA driver checks whether the memory range is page-locked or not, and uses a different code path if it is. Locked memory is kept in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, a.k.a. async copy; the device only needs a list of the physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily resident in RAM (it may be in swap, for example), so the driver has to access every page of the non-locked memory, copy it into a pinned staging buffer, and pass that to DMA (a synchronous, page-by-page copy).

As described here: http://forums.nvidia.com/index.php?showtopic=164661

host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.

I can also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc documentation says that the CUDA driver can detect pinned memory:

The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().
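For illustration, here is a minimal sketch (not from the linked thread) of allocating pinned memory and issuing a truly asynchronous copy from it; the buffer size and stream usage are assumptions made for the example:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t N = 1 << 20;                 // 1M floats, arbitrary size for the sketch
  float *h_pinned = nullptr, *d_buf = nullptr;
  cudaStream_t stream;

  cudaStreamCreate(&stream);
  cudaMalloc(&d_buf, N * sizeof(float));

  // Page-locked host allocation: the driver records this range and can hand
  // its physical pages directly to the DMA engine.
  cudaHostAlloc(&h_pinned, N * sizeof(float), cudaHostAllocDefault);

  // Because h_pinned is pinned, this copy can be done by DMA and the call returns
  // immediately; with a malloc'd buffer it would fall back to a staged, blocking copy.
  cudaMemcpyAsync(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice, stream);

  cudaStreamSynchronize(stream);            // wait for the transfer to complete

  cudaFreeHost(h_pinned);
  cudaFree(d_buf);
  cudaStreamDestroy(stream);
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}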

CUDA - Pinned Memory vs. Pageable Memory Tradeoffs

Pinned memory is required if you want to overlap copy and compute.

In some situations, pinned memory may also provide a performance benefit. This is often noticeable if we can reuse the buffers that are used to transfer data between host and device.

You still have to copy content from pageable memory to pinned memory yourself, which creates a lot of overhead.

I don't think you have to transfer data from pageable memory to pinned memory in every conceivable case.

Based on what appears to be dialog on your cross-posting here, I'll provide the following worked example showing a comparison between pinned and non-pinned memory:

$ cat t113.cu
#include <stdio.h>
#include <stdlib.h>

typedef double my_T;
const int ds = 1024;
const int num_iter = 100;
const int block_dim = 16;

// C = A * B
// naive!!
template <typename T>
__global__ void mm(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, size_t d)
{
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int idy = threadIdx.y+blockDim.y*blockIdx.y;

  if ((idx < d) && (idy < d)){
    T temp = 0;
    for (int i = 0; i < d; i++)
      temp += A[idy*d + i]*B[i*d + idx];
    C[idy*d + idx] = temp;
    }
}

int main(int argc, char *argv[]){

  int use_pinned = 0;
  if (argc > 1) use_pinned = atoi(argv[1]);
  if (use_pinned) printf("Using pinned memory\n");
  else printf("Using pageable memory\n");
  my_T *d_A, *d_B, *d_C, *h_A, *h_B, *h_C;
  int bs = ds*ds*sizeof(my_T);
  cudaMalloc(&d_A, bs);
  cudaMalloc(&d_B, bs);
  cudaMalloc(&d_C, bs);
  if (use_pinned){
    cudaHostAlloc(&h_A, bs, cudaHostAllocDefault);
    cudaHostAlloc(&h_B, bs, cudaHostAllocDefault);
    cudaHostAlloc(&h_C, bs, cudaHostAllocDefault);}
  else {
    h_A = (my_T *)malloc(bs);
    h_B = (my_T *)malloc(bs);
    h_C = (my_T *)malloc(bs);}
  cudaMemset(d_A, 0, bs);
  cudaMemset(d_B, 0, bs);
  memset(h_C, 0, bs);
  dim3 block(block_dim,block_dim);
  dim3 grid((ds+block.x-1)/block.x, (ds+block.y-1)/block.y);
  for (int iter = 0; iter<num_iter; iter++){
    // launch the kernel, then update the host inputs while the kernel runs
    mm<<<grid, block>>>(d_A, d_B, d_C, ds);
    if (iter > 1) if (h_C[0] != (my_T)((iter-2)*(iter-2)*ds)) printf("validation failure at iteration %d, was %f, should be %f\n", iter, h_C[0], (my_T) ((iter-2)*(iter-2)*ds));
    for (int i = 0; i < ds*ds; i++) {h_A[i] = iter; h_B[i] = iter;}
    // retrieve the previous results, then send the new inputs for the next iteration
    cudaMemcpy(h_C, d_C, bs, cudaMemcpyDeviceToHost);
    cudaMemcpy(d_A, h_A, bs, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bs, cudaMemcpyHostToDevice);}
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
}
$ nvcc -arch=sm_60 -o t113 t113.cu
$ time ./t113
Using pageable memory
no error

real 0m1.987s
user 0m1.414s
sys 0m0.571s
$ time ./t113 1
Using pinned memory
no error

real 0m1.487s
user 0m0.903s
sys 0m0.579s
$

CUDA 9.1, CentOS 7.4, Tesla P100

Briefly, this code is doing 100 "naive" matrix-multiply operations on the GPU. At each iteration, we are launching the matrix-multiply on the GPU, and while that is being done we are updating the host (input) data. When the matrix multiply is complete, we transfer the results to the host, then transfer the new input data to the device, then perform another iteration.

I'm not suggesting that this code is perfectly optimized. The kernel, for example, is a naive implementation (if you wanted a fast matrix multiply, you should use CUBLAS). And if you were serious about optimization, you would probably want to overlap the data transfers in this example with device code execution; in that case, you would be forced to use pinned buffers anyway. But it's not possible to achieve overlap of copy and compute in every application, and in certain cases (such as the provided example) using pinned buffers can help, performance-wise.
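For illustration only, here is a rough sketch of what such an overlapped version could look like, using two streams, chunked transfers, and a trivial placeholder kernel. The chunk count, sizes, and kernel are my own assumptions, not part of the example above:

#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel; stands in for real per-chunk work.
__global__ void scale(float *data, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const size_t N = 1 << 22;            // total elements, arbitrary for the sketch
  const int nChunks = 4;               // assume N divides evenly into chunks
  const size_t chunk = N / nChunks;
  float *h_buf, *d_buf;
  cudaHostAlloc(&h_buf, N * sizeof(float), cudaHostAllocDefault);  // pinned: required for overlap
  cudaMalloc(&d_buf, N * sizeof(float));
  for (size_t i = 0; i < N; i++) h_buf[i] = 1.0f;

  cudaStream_t streams[2];
  for (int i = 0; i < 2; i++) cudaStreamCreate(&streams[i]);

  // Pipeline: while one chunk is being computed, another chunk's copies are in flight.
  for (int c = 0; c < nChunks; c++) {
    cudaStream_t s = streams[c % 2];
    size_t off = c * chunk;
    cudaMemcpyAsync(d_buf + off, h_buf + off, chunk * sizeof(float), cudaMemcpyHostToDevice, s);
    scale<<<(chunk + 255) / 256, 256, 0, s>>>(d_buf + off, chunk);
    cudaMemcpyAsync(h_buf + off, d_buf + off, chunk * sizeof(float), cudaMemcpyDeviceToHost, s);
  }
  cudaDeviceSynchronize();
  printf("h_buf[0] = %f (%s)\n", h_buf[0], cudaGetErrorString(cudaGetLastError()));
  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  for (int i = 0; i < 2; i++) cudaStreamDestroy(streams[i]);
  return 0;
}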

If you insist on comparing to a case where you must first copy data from a non-pinned buffer to a pinned buffer, then there may be no benefit. But without a concrete example of what you have in mind, it's not obvious to me that you can't do all of your work using only pinned host buffers (for data you intend to send to/from the GPU). If you are reading data in from disk or network, you could read it into pinned buffers. If you are doing some host calculations first, you could be using pinned buffers. Then send those pinned buffer data to the GPU.

Understanding memory transfer performance (CUDA)

Many CUDA operations can be crudely modeled as an "overhead" and a "duration". The duration is often predictable from the operation characteristics - e.g. the size of the transfer divided by the bandwidth. The "overhead" can be crudely modeled as a fixed quantity - e.g. 5 microseconds.

Your graph consists of several measurements:

  1. The "overhead" associated with initiating a transfer or "cycle". CUDA async ops generally have a minimum duration on the order of 5-50 microseconds. This is indicated in the "flat" left hand side of the blue curve. A "cycle" here represents two transfers, plus, in the case of the "kernel" version, the kernel launch overhead. The combination of these "overhead" numbers represents the y-intercept of the blue and orange curves. The distance from the blue curve to the orange curve represents the addition of the kernel ops (which you haven't shown). On the left hand side of the curve, the operation sizes are so small that the contribution from the "duration" portion is small compared to the "overhead" contribution. This explains the approximate flatness of the curves on the left hand side.

  2. The "duration" of operations. On the right hand side of the curves, the approximately linear region corresponds to the "duration" contribution as it becomes large and dwarfs the "overhead" cost. The slope of the blue curve should correspond to the PCIE transfer bandwidth. For a Gen4 system that should be approximately 20-24GB/s per direction (it has no connection to the 600GB/s of GPU memory bandwidth - it is limited by the PCIE bus.) The slope of the orange curve is also related to PCIE bandwidth, as this is the dominant contributor to the overall operation.

  3. The "kernel" contribution. The distance between the blue and orange curves represents the contribution of the kernel ops, over and above just the PCIE data transfers.

What I don't understand is why the memory transfer only tests start ramping up exponentially at nearly the same data size point as the core limitations. The memory bandwidth for my device is advertised as 600 GB/s. Transferring 10 MB here takes on average ~1.5 milliseconds which isn't what napkin math would suggest given bandwidth.

The dominant transfer here is governed by the PCIE bus. That bandwidth is not 600GB/s but something like 20-24GB/s per direction. Furthermore, unless you are using pinned memory as the host memory for your transfers, the actual bandwidth will be about half of maximum achievable. This lines up pretty well with your measurement: 10MB/1.5ms = 6.6GB/s. Why does this make sense? You are transferring 10MB at a rate of ~10GB/s on the first transfer. Unless you are using pinned memory, the operation will block and will not execute concurrently with the 2nd transfer. Then you transfer 10MB at a rate of ~10GB/s on the second transfer. This is 20MB at 10GB/s, so we would expect to witness about a 2ms transfer time. Your actual transfer speeds might be closer to 12GB/s which would put the expectation very close to 1.5ms.

My expectation was that time would be nearly constant around the memory transfer latency, but that doesn't seem to be the case.

I'm not sure what that statement means, exactly, but for a reasonably large transfer size, the time is not expected to be constant independent of the transfer size. The time should scale with the transfer size (the transfer size divided by the bandwidth).

I ran the memory only version with NSight Compute and confirmed that going from N=1000 KB to N=10000 KB increased average async transfer time from ~80 us to around ~800 us.

That is the expectation. Transferring more data takes more time. This is generally what you would observe if the "duration" contribution is significantly larger than the "overhead" contribution, which is true on the right hand side of your graph.

Here is a spreadsheet showing a specific example, using 12GB/s for PCIE bandwidth and 5 microseconds for the fixed operation overhead. The "total for 2 ops" column tracks your blue curve pretty closely:

[Spreadsheet image: modeled per-operation and "total for 2 ops" transfer times, assuming 12 GB/s PCIE bandwidth and 5 microsecond overhead]
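The arithmetic behind that table is just the two-term model described above; a small host-only sketch (reusing the assumed 12 GB/s and 5 microsecond figures) could reproduce it:

#include <cstdio>

int main() {
  const double bw = 12e9;        // assumed PCIE bandwidth, bytes/sec
  const double overhead = 5e-6;  // assumed fixed per-operation overhead, seconds
  // Model: time per op = overhead + bytes / bandwidth; a "cycle" here is 2 ops.
  for (double kb = 1; kb <= 10000; kb *= 10) {
    double bytes = kb * 1000.0;
    double t_op = overhead + bytes / bw;
    printf("%8.0f KB: per-op %9.2f us, 2 ops %9.2f us\n", kb, t_op * 1e6, 2 * t_op * 1e6);
  }
  return 0;
}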

Default Pinned Memory Vs Zero-Copy Memory

I think it depends on your application (otherwise, why would they provide both ways?)

Mapped, pinned memory (zero-copy) is useful when:

  • The GPU has no memory of its own and uses RAM anyway

  • You load the data exactly once, but you have a lot of computation to perform on it and you want to hide the memory transfer latency behind it.

  • The host side wants to change/add more data, or read the results, while the kernel is still running (e.g. communication)

  • The data does not fit into GPU memory

Note that you can also use multiple streams to copy data and run kernels in parallel.

Pinned, but not mapped, memory is better when:

  • You load or store the data multiple times. For example: you have multiple subsequent kernels performing the work in steps - there is no need to load the data from the host every time.

  • There is not much computation to perform, and loading latencies are not going to be hidden well
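To make the two options above concrete, here is a minimal sketch contrasting a mapped (zero-copy) allocation with a default pinned allocation plus explicit copies; the kernel and sizes are invented for the illustration:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *p, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] += 1;
}

int main() {
  const size_t N = 1024;   // arbitrary size for the sketch

  // Zero-copy: pinned AND mapped into the device address space. The kernel
  // reads/writes host RAM directly over PCIE -- no explicit cudaMemcpy needed.
  int *h_mapped, *d_alias;
  cudaHostAlloc(&h_mapped, N * sizeof(int), cudaHostAllocMapped);
  cudaHostGetDevicePointer(&d_alias, h_mapped, 0);
  for (size_t i = 0; i < N; i++) h_mapped[i] = 0;
  increment<<<(N + 255) / 256, 256>>>(d_alias, N);
  cudaDeviceSynchronize();

  // Default pinned (not mapped): fast explicit transfers into device memory,
  // the better choice when the data is reused by several kernels.
  int *h_pinned, *d_buf;
  cudaHostAlloc(&h_pinned, N * sizeof(int), cudaHostAllocDefault);
  cudaMalloc(&d_buf, N * sizeof(int));
  for (size_t i = 0; i < N; i++) h_pinned[i] = 0;
  cudaMemcpy(d_buf, h_pinned, N * sizeof(int), cudaMemcpyHostToDevice);
  increment<<<(N + 255) / 256, 256>>>(d_buf, N);
  cudaMemcpy(h_pinned, d_buf, N * sizeof(int), cudaMemcpyDeviceToHost);

  printf("mapped: %d, pinned+copied: %d (%s)\n", h_mapped[0], h_pinned[0],
         cudaGetErrorString(cudaGetLastError()));
  cudaFreeHost(h_mapped);
  cudaFreeHost(h_pinned);
  cudaFree(d_buf);
  return 0;
}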

Is CUDA pinned memory zero-copy?

Congratulations! You're encountering a 2.x compute capability + TCC + 64-bit OS feature with newer CUDA versions :)

Read the rest to find out more!

First a small theory summary as CUDA taught us:

  • Pinned memory is not zero-copy, since the GPU cannot access it (it's not mapped into the GPU's address space); it's used to transfer data from the host to the GPU efficiently. It's page-locked memory (a valuable kernel resource) and has some performance advantages over normal pageable memory.

  • Pinned zero-copy memory is page-locked memory (usually allocated with the cudaHostAllocMapped flag) which can also be used by the GPU, since it is mapped into the GPU's address space.

Why are you able to access memory allocated on the host from the device without explicitly specifying it?

Take a look at the release notes for CUDA 4.0 (and higher):

  • (Windows and Linux) Added support for unified virtual address space.

Devices supporting 64-bit and compute 2.0 and higher capability now
share a single unified address space between the host and all devices.
This means that the pointer used to access memory on the host is the
same as the pointer used to access memory on the device. Therefore,
the location of memory may be queried directly from its pointer value;
the direction of a memory copy need not be specified.

To summarize: if your card is compute capability 2.0+ (and it is: https://developer.nvidia.com/cuda-gpus), you are running a 64-bit OS, and on Windows you have TCC mode on, you're automatically using UVA (Unified Virtual Addressing) between host and device. That means your code is automatically enhanced with zero-copy-like accesses.

This is also covered in the CUDA documentation for the current version, in the paragraph "Automatic Mapping of Host Allocated Host Memory".
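As a small illustrative sketch (assuming a UVA-capable setup as described above, i.e. a 64-bit OS, compute capability 2.0+, and TCC mode on Windows), memory from cudaHostAlloc can be handed straight to a kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int *p) { p[0] = 42; }

int main() {
  // Under UVA, memory allocated with cudaHostAlloc is mapped into the device
  // address space automatically, so the host pointer itself can be passed to a kernel.
  int *h;
  cudaHostAlloc(&h, sizeof(int), cudaHostAllocDefault);
  h[0] = 0;
  touch<<<1, 1>>>(h);      // zero-copy-like access: no explicit cudaMemcpy
  cudaDeviceSynchronize();
  printf("h[0] = %d (%s)\n", h[0], cudaGetErrorString(cudaGetLastError()));
  cudaFreeHost(h);
  return 0;
}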

Behavior and performance of unified memory vs pinned host memory

Smart pointers for GPU memory

(Part of) your original motivation had been the possibility of using smart pointers for (global) GPU memory; and your Professor suggested using unified memory to that end (although it's not exactly clear to me how that would help). Well, the thing is, you don't have to reinvent the wheel for that - you can already have unique_ptr's for (different kinds of) CUDA GPU memory, as part of the cuda-api-wrappers library.

These unique pointers are actually std::unique_ptr instances, but with custom deleters (and you create them with appropriate methods). You can find a listing of the methods for creating them on this Doxygen page (although the documentation is very partial at this point).
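If you prefer not to pull in the library, the underlying idea is easy to sketch by hand. Note that this is not cuda-api-wrappers' actual API, just an illustration with invented names:

#include <cstdio>
#include <memory>
#include <cuda_runtime.h>

// Hand-rolled illustration: a unique_ptr whose deleter calls cudaFree, so
// device memory is released automatically when the pointer goes out of scope.
struct CudaFreeDeleter {
  void operator()(void *p) const { cudaFree(p); }
};

template <typename T>
using device_unique_ptr = std::unique_ptr<T, CudaFreeDeleter>;

template <typename T>
device_unique_ptr<T> make_device_unique(size_t n) {
  T *raw = nullptr;
  cudaMalloc(&raw, n * sizeof(T));
  return device_unique_ptr<T>(raw);
}

int main() {
  auto d_vec = make_device_unique<float>(1024);   // freed automatically at scope exit
  cudaMemset(d_vec.get(), 0, 1024 * sizeof(float));
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}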

For an example of use, consider the CUDA samples example vectorAdd, which performs elementwise addition of two vectors to produce a third. Here is the same sample, using smart pointers for both the host and the device memory (and the API wrappers more generally).

Caveat: I'm the author of the API wrapper library, so I'm biased in favor of using it :-)

(Partial) answers to your specific questions

Q1: What kind of CPU memory are we talking about [for unified memory allocations]? Is it pinned memory... Or... standard paged system memory?

I don't know, but you can easily find out by writing a small program that:

  1. Allocates some managed memory.
  2. Writes into it on the host side.
  3. Prefetches it to the GPU, then exits.

... and profiling it to determine the PCIe bandwidth. With PCIe 3.0 and no intervening traffic, I usually get ~12 GB/sec from pinned memory and about half that from unpinned memory.
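A minimal sketch of such a program might look like the following; the 256 MB size is arbitrary, and cudaMemPrefetchAsync requires a device that supports managed-memory prefetching:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t N = 256 << 20;   // 256 MB, arbitrary but large enough to measure
  char *p;
  cudaMallocManaged(&p, N);

  // Touch the allocation on the host so the pages are resident in system memory.
  for (size_t i = 0; i < N; i++) p[i] = 1;

  // Migrate the pages to device 0, then wait; profiling this run shows the
  // host-to-device bandwidth actually achieved for managed memory.
  cudaMemPrefetchAsync(p, N, 0, 0);
  cudaDeviceSynchronize();

  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(p);
  return 0;
}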

Q2: ... in CUDA 8.0 ... can I expect acceleration on the Maxwell architecture (with respect to host pinned memory)?

In my very limited experience, the performance of unified memory accesses does not improve in CUDA 8.0 relative to CUDA 6.0 (but there may be under-the-hood changes in prefetching logic or general code optimizations which do show improvements in some cases). Regardless of that, remember that CUDA 6.0 doesn't support sm_52 targets, so your question is a bit moot.

Q3: ... I can see that NVIDIA is putting a lot of work into developing unified memory. Therefore one might think that using unified memory is a better idea in long term perspective. Am I right?

I believe that you're wrong. As the CUDA Programming Guide suggests, unified memory is a mechanism intended to simplify memory accesses and programming; it sacrifices some speed for more uniform, simpler code. While nVIDIA's efforts may reduce the overhead of using it somewhat, there's no mad optimization dash which would make that go away. On Kepler Teslas, using unified memory is typically up to 1.8x-2x slower on various benchmarks; and even though I don't have figures for Maxwell or Pascal, I doubt this will drop so much as to make you prefer using unified memory across the board.

Q4: Is it true that each time I want to access single element of an array on host (while data reside on device) the whole array will be copied to host?

No, managed memory is paged, so only a single page will be copied across the PCIe bus. But if the array is small, that single page could contain the entire array.

Where is pinned memory allocated using cudaHostAlloc?

"Page-Locked Host Memory" for CUDA (and other DMA-capable external hardware like PCI-express cards) is allocated in physical memory of the Host computer. The allocation is marked as not-swappable (not-pageable) and not-movable (locked, pinned). This is similar to the action of mlock syscall "lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area."

This allocation can be accessed through the kernel's virtual address space (as the kernel has a full view of physical memory), and it is also added to the user process's virtual address space so the process can access it.

When you do an ordinary malloc, the actual physical memory allocation may (and will) be postponed until the first (write) access to the pages. With mlocked/pinned memory, all physical pages are allocated inside the locking or pinning call (like MAP_POPULATE in mmap: "Populate (prefault) page tables for a mapping"), and the physical addresses of the pages will not change (no swapping, no moving, no compacting...).
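As a related illustration (not from the quoted docs), an existing malloc'd buffer can also be pinned after the fact with cudaHostRegister, which faults in and locks its pages much like mlock; the buffer size here is arbitrary:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 64 << 20;   // 64 MB, arbitrary
  // Ordinary malloc: pageable, and the physical pages may not even exist yet.
  char *buf = (char *)malloc(bytes);

  // Pin the existing allocation: the pages are faulted in, locked in RAM,
  // and registered with the CUDA driver (conceptually similar to mlock).
  cudaHostRegister(buf, bytes, cudaHostRegisterDefault);

  char *d_buf;
  cudaMalloc(&d_buf, bytes);
  // This copy can now use DMA directly from buf's (fixed) physical pages.
  cudaMemcpy(d_buf, buf, bytes, cudaMemcpyHostToDevice);

  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  cudaHostUnregister(buf);
  cudaFree(d_buf);
  free(buf);
  return 0;
}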

CUDA docs:
http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902

__host__ ​cudaError_t cudaHostAlloc ( void** pHost, size_t size, unsigned int flags )

Allocates page-locked memory on the host.
...

Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.

...

Memory allocated by this function must be freed with cudaFreeHost().

Pinned and not-pinned memory compared: https://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html "Choosing Between Pinned and Non-Pinned Memory"

Pinned memory is memory allocated using the cudaMallocHost function, which prevents the memory from being swapped out and provides improved transfer speeds. Non-pinned memory is memory allocated using the malloc function. As described in Memory Management Overhead and Memory Transfer Overhead, pinned memory is much more expensive to allocate and deallocate but provides higher transfer throughput for large memory transfers.

A CUDA forums post with advice from moderator txbob: https://devtalk.nvidia.com/default/topic/899020/does-cudamemcpyasync-require-pinned-memory-/ "Does cudaMemcpyAsync require pinned memory?"

If you want truly asynchronous behavior (e.g. overlap of copy and compute) then the memory must be pinned. If it is not pinned, there won't be any runtime errors, but the copy will not be asynchronous - it will be performed like an ordinary cudaMemcpy.

The usable size may vary by system and OS. Pinning 4GB of memory on a 64GB system on Linux should not have a significant effect on CPU performance, after the pinning operation is complete. Attempting to pin 60GB on the other hand might cause significant system responsiveness issues.


