CUDA performance penalty when running in Windows

There is a fair amount of overhead in sending GPU hardware commands through the WDDM stack.

As you've discovered, this means that under WDDM (only) GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.

The best solution under Windows is to switch the operating mode of the GPU from WDDM to TCC, which can be done via the nvidia-smi command, but it is only supported on Tesla GPUs and certain members of the Quadro family of GPUs -- i.e. not GeForce. (It also has the side effect of preventing the device from being used as a Windows accelerated display adapter, which might be relevant for a Quadro device or a few specific older Fermi Tesla GPUs.)

AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver. Unofficially, however, according to Greg@NV in this link, the command to issue after the CUDA kernel call is cudaEventQuery(0), which may/should cause the WDDM batch queue to "flush" to the GPU.

As Greg points out, extensive use of this mechanism will wipe out the amortization benefit, and may do more harm than good.

EDIT: Moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);
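As an illustration (not an official mechanism), here is a minimal sketch of that pattern: launch work into a stream, issue a cheap non-blocking query to nudge the WDDM command queue, and only synchronize when the result is actually needed. The kernel name and launch configuration below are hypothetical.

#include <cuda_runtime.h>

// Hypothetical kernel, used only to illustrate the launch-then-flush pattern.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launch_and_flush(float *d_data, int n, cudaStream_t stream)
{
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    // Non-blocking query; under WDDM this may cause the batched command
    // queue to be flushed to the GPU sooner than the driver's own heuristic.
    cudaStreamQuery(stream);
    // ... overlap independent CPU work here ...
    cudaStreamSynchronize(stream);  // block only when the result is needed
}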

EDIT2: Using recent drivers on Windows, you should be able to place Titan family GPUs in TCC mode, assuming you have some other GPU set up as the primary display. The nvidia-smi tool will allow you to switch modes (use nvidia-smi --help for more info).

Additional info about the TCC driver model can be found in the Windows installation guide, including the note that it may reduce the latency of kernel launches.

The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.

Performance penalty when invoking a CUDA kernel

The host side overhead of a kernel launch using the runtime API is only about 15-30 microseconds on non-WDDM Windows platforms. On WDDM platforms (which I don't use), I understand it can be much, much higher, plus there is some sort of batching mechanism in the driver which tries to amortise the cost by doing multiple operations in a single driver side operation.

Generally, there will be a performance increase in "fusing" multiple data operations which would otherwise be done in separate kernels into a single kernel, where the algorithms allow it. The GPU has much higher arithmetic peak performance than peak memory bandwidth, so the more FLOPs which can be executed per memory transaction (and per kernel "setup code"), the better the performance of the kernel will be. On the other hand, trying to write a "swiss army knife" style kernel which crams completely disparate operations into a single piece of code is never a particularly good idea, because it increases register pressure and reduces the efficiency of things like L1, constant memory and texture caches.
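As a minimal sketch of the fusion idea (the kernel names and the scale/add operations are purely illustrative, not taken from the question), compare two separate element-wise kernels with a single fused one:

#include <cuda_runtime.h>

// Two separate element-wise kernels: each launch reads and writes every
// element of x in global memory.
__global__ void scaleKernel(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

__global__ void addKernel(float *x, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Fused version: one launch, one read and one write per element, so the
// extra arithmetic comes essentially for free relative to the memory traffic.
__global__ void scaleAddKernel(float *x, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * a + b;
}

The fused kernel also pays the launch overhead once instead of twice.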

Which way you choose to go should really be guided by the nature of the code/algorithms. I don't believe there is a single "correct" answer to this question that can be applied in all circumstances.

Is there a performance penalty for CUDA method not running in sync?

CUDA runs threads in groups called warps. On all CUDA architectures that have been implemented so far (up to compute capability 3.5), the size of a warp is 32 threads. Only threads in different warps can truly be at different locations in the code. Within a warp, threads are always in the same location. Any threads that should not be executing the code in a given location are disabled as that code is executed. The disabled threads are then just taking up room in the warp and cause their corresponding processing cycles to be lost.

In your algorithm, you get warp divergence because the exit condition in the inner loop is not satisfied at the same time for all the threads in the warp. The GPU must keep executing the inner loop until the exit condition is satisfied for ALL the threads in the warp. As more threads in a warp reach their exit condition, they are disabled by the machine and represent lost processing cycles.
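A contrived sketch of that situation (the kernel below is hypothetical, not the code from the question): each thread runs its inner loop a data-dependent number of times, so the warp keeps issuing the loop body until its slowest thread has exited.

__global__ void divergentLoop(const int *iters, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // The exit condition differs per thread; threads that finish early are
    // disabled but still occupy their warp's issue slots.
    for (int k = 0; k < iters[i]; ++k)
        acc += 1.0f / (k + 1);
    out[i] = acc;
}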

In some situations, the lost processing cycles may not impact performance, because disabled threads do not issue memory requests. This is the case if your algorithm is memory bound and the memory that would have been required by the disabled thread was not included in the read done by one of the other threads in the warp. In your case, though, the data is arranged in such a way that accesses are coalesced (which is a good thing), so you do end up losing performance in the disabled threads.

Your algorithm is very simple and, as it stands, the algorithm does not fit that well on the GPU. However, I think the same calculation can be dramatically sped up on both the CPU and GPU with a different algorithm that uses an approach more like that used in parallel reductions. I have not considered how that might be done in a concrete way though.

A simple thing to try, for a potentially dramatic increase in speed on the CPU, would be to alter your algorithm in such a way that the inner loop iterates forwards instead of backwards. This is because CPUs do cache prefetches. These only work when you iterate forwards through your data.
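Purely as an illustration of that change (the actual code from the question is not reproduced here), a host-side loop rewritten to walk forwards through the array might look like this:

// Illustrative only: the same accumulation written with a forward traversal,
// which is friendlier to the hardware prefetcher than iterating backwards.
float accumulate_forward(const float *data, int count)
{
    float sum = 0.0f;
    for (int j = 0; j < count; ++j)
        sum += data[j];
    return sum;
}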

Pros and cons of CUDA on Linux vs Windows?

The NVIDIA tools on Windows are nice.

The reason supercomputers use Linux is that the Windows client license on 10,000 nodes gets pricey! There are also, traditionally, a bunch of better tools for managing Linux clusters.

There is a general performance hit on Windows simply because there is a lot of GUI stuff you can't turn off. We measured 10-15% lower performance for a CPU-bound task versus Linux running from the command line.

The actual performance inside the CUDA task on the GPU 'should' be the same.

CUDA kernel execution delayed by CPU code


I cannot understand how that fits into the statement that CUDA kernels execute asynchronously with respect to the host?

You're experiencing WDDM Command Batching as described here.

In a nutshell, on Windows, when in the WDDM driver model, GPU commands (e.g. anything from the CUDA runtime API, plus kernel launches) will get sent to a command queue. Every so often, according to an unpublished heuristic, and with no explicit user controls provided, the command queue will be "flushed", i.e. sent to the GPU, at which point (if it is not currently busy) the GPU will begin processing those commands.

So, on a WDDM setup, the dispatch of kernels to the command queue is non-blocking (control is returned to the CPU thread immediately). The dispatch of work from the command queue to the GPU follows some other heuristic. (The kernel execution is asynchronous to the host thread, in any event)
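To make those two stages concrete, here is a small self-contained sketch (the kernel is just a made-up busy loop): the launch returns control to the host thread immediately, and only the final synchronize waits for the GPU.

#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel that spins for a while so the asynchrony is observable.
__global__ void busyKernel(float *x)
{
    float v = *x;
    for (int i = 0; i < (1 << 20); ++i)
        v = v * 1.0000001f + 1e-7f;
    *x = v;
}

int main()
{
    float *d_x;
    cudaMalloc((void **)&d_x, sizeof(float));
    cudaMemset(d_x, 0, sizeof(float));

    busyKernel<<<1, 1>>>(d_x);   // returns to the CPU thread right away
    printf("host code runs while the kernel is queued or executing\n");

    cudaDeviceSynchronize();     // blocks until the GPU has finished
    cudaFree(d_x);
    return 0;
}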

You have at least a few options if this is a problem:

  1. On Windows, switch to a GPU that is in the TCC driver model.
  2. On Windows, attempt to use one of the "hacks" described in the linked answer.
  3. Switch to Linux.

Why is this code ten times slower on the GPU than CPU?

What follows is likely to be embarrassingly obvious to most developers working with CUDA, but may be of value to others - like myself - who are new to the technology.

The GPU code is ten times slower than the CPU equivalent because the GPU code exhibits a perfect storm of performance-wrecking characteristics.

The GPU code spends most of its time allocating memory on the GPU, copying data to the device, performing a very, very simple calculation (that is supremely fast irrespective of the type of processor it's running on) and then copying data back from the device to the host.

As noted in the comments, if an upper bound exists on the size of the data structures being processed, then a buffer on the GPU can be allocated exactly once and reused. In the code above, this takes the GPU to CPU runtime down from 10:1 to 4:1.
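A sketch of that allocate-once pattern (the buffer struct and function names are hypothetical, and the kernel launch is elided), assuming an upper bound on the element count is known:

#include <cuda_runtime.h>

// Hypothetical reusable device buffer: allocated once (or grown rarely)
// instead of being cudaMalloc'd and cudaFree'd on every call.
struct GpuBuffer {
    float *d_data = nullptr;
    size_t capacity = 0;
};

void process(GpuBuffer &buf, const float *h_in, float *h_out, size_t n)
{
    if (n > buf.capacity) {
        cudaFree(buf.d_data);    // freeing nullptr is a harmless no-op
        cudaMalloc((void **)&buf.d_data, n * sizeof(float));
        buf.capacity = n;
    }
    cudaMemcpy(buf.d_data, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch the kernel on buf.d_data here ...
    cudaMemcpy(h_out, buf.d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
}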

The remaining performance disparity is down to the fact that the CPU is able to perform the required calculations, in serial, millions of times in a very short time span due to their simplicity. In the code above, the calculation involves reading a value from an array, some multiplication, and finally an assignment to an array element. Something this simple must be performed millions of times before the benefits of doing so in parallel outweigh the necessary time penalty of transferring the data to the GPU and back. On my test system, a million array elements is the break-even point, where GPU and CPU perform in (approximately) the same amount of time.

ArrayFire CUDA application is extremely slow in the first minute

ArrayFire uses JIT compilation at runtime to fuse multiple calls to functions. So when you perform an addition or any other element-wise operation, ArrayFire will create a custom kernel and execute this kernel. This has some overhead when you first generate this kernel but these kernels are cached and additional calls do not need to be compiled. Usually, it should only require a couple of iterations before additional compilations are not necessary. It's odd that the kernels are slow even after 60 or so iterations.

JIT kernels are evaluated using an internal heuristic based on memory usage and the size of the kernels. Perhaps your application is not triggering the kernels optimally, causing additional kernel compilations. You could get around this by forcing evaluation with a call to the eval function on a variable. Here is a contrived example:

#include <arrayfire.h>
using namespace af;

int main()
{
    array a = randu(10, 10);
    array b = randu(10, 10);
    for (int i = 0; i < 100; i++) {
        a += b / 4;
        b *= i;
        eval(a, b);  // force evaluation of the JIT trees for a and b
    }
    return 0;
}

Here you are evaluating the JIT tree for variable a and b at each iteration. This will reuse the same kernel at each iteration instead of creating a kernel for different multiples of iterations.

One thing to note is that element-wise operations and some conditional functions like select and shift are JITed. Other functions force evaluation of their parameters before they are used. Also, if you evaluate too often, you will decrease the performance of your application.

CUDA timing kernels - how many launches?

If your code has variable execution paths (data-dependent, perhaps, and you're feeding it varying data), then nobody can really answer this for you.

If your code has a relatively constant execution path, I usually have pretty good results by timing things twice and throwing away the first set of results.
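A minimal sketch of that "time the second run" approach using CUDA events (the kernel and sizes are made up for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel to time.
__global__ void workKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // First launch: warm-up only; its timing is thrown away.
    workKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    // Second launch: the one that gets timed.
    cudaEventRecord(start);
    workKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}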

Various GPUs do have power management features, but the first time you run a kernel, any relevant features will be promoted to their highest state, and they won't change in the short time (microseconds) it takes to run that kernel again, for timing.

Benchmarking traditionalists would tell you to run a code hundreds or thousands of times and average the result. I'm rarely interested in that level of clarity. I can usually get a pretty good answer to how fast something is by timing the second run.

As an experiment, you might actually try and plot the data of the timing from each run for 500 runs. This might give you much more insight than any answer on SO can provide. If you see a big spike at the beginning, rather than try and average it out over a large number of runs, I'm usually more interested in discarding it - because it's not representative of the rest of my data.

Also, be aware that GPUs running under WDDM are just wacky in terms of timing. The OS is actually managing a WDDM GPU to a much finer degree than is really desirable for computing tasks, so that might be a situation where you just have to give up and time lots of runs. You'll likely have much more consistent and predictable results run-to-run if you can run your GPU in TCC mode on Windows (won't work with a GeForce GPU), or else on Linux without X running on that GPU. (X can be running, just keep it off the compute GPUs, if you can.) In my opinion, timing is considerably more challenging under WDDM.


