CUDA How to Get Grid, Block, Thread Size and Parallelize Non-Square Matrix Calculation

CUDA how to get grid, block, thread size and parallelize non-square matrix calculation

As you have written it, that kernel is completely serial. Every thread launched to execute it is going to perform the same work.

The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" type programming models) is that you take a "data parallel" operation - so one where the same, largely independent, operation must be performed many times - and write a kernel which performs that operation. A large number of (semi)autonomous threads are then launched to perform that operation across the input data set.

In your array addition example, the data parallel operation is

C[k] = A[k] + B[k];

for all k between 0 and 128 * 1024. Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;

    if (k < n)
        C[k] = A[k] + B[k];
}

[disclaimer: code written in browser, not tested, use at own risk]

Here, the inner and outer loop from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that in cases where more threads are launched than required operations, no buffer overflow can occur. If the kernel is then launched like this:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work

mAdd<<<nblocks, blocksize>>>(A, B, C, n);

Then 256 blocks, each containing 512 threads, will be launched onto the GPU hardware to perform the array addition operation in parallel. Note that if the input data size were not expressible as a nice round multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.

All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives enough insight for you to continue yourself. CUDA is rather mature these days and there is a lot of good, free educational material floating around the web that you can probably use to further illuminate many of the aspects of the programming model I have glossed over in this answer.

#blocks, #threads

The number of blocks is the number of instances divided by the size of each block. However, this may result in a non-integer answer, so you have to round up to make sure every instance gets executed, at the expense of wasting some resources (the excess threads in the last block do no useful work).

So what you really want to do is apply this little integer arithmetic trick:

int nblocks = (n+blocksize-1)/blocksize;
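
As a quick illustration with hypothetical numbers (n = 1000 and blocksize = 512 are not taken from the question), plain integer division loses the tail of the data, while the round-up form covers it:

#include <cstdio>

int main()
{
    // Hypothetical sizes, chosen only to show the rounding behaviour.
    const int n = 1000;        // total number of work items
    const int blocksize = 512; // threads per block

    int truncated = n / blocksize;                   // 1 block  -> 512 threads, 488 items missed
    int rounded   = (n + blocksize - 1) / blocksize; // 2 blocks -> 1024 threads, extras idle behind if (k < n)

    printf("truncated: %d block(s), rounded up: %d block(s)\n", truncated, rounded);
    return 0;
}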

How do I choose grid and block dimensions for CUDA kernels?

There are two parts to that answer (I wrote it). One part is easy to quantify, the other is more empirical.

Hardware Constraints:

This is the easy to quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits which limit how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as:

  1. Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x and later, respectively)
  2. The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x or later)
  3. Each block cannot consume more than 8k/16k/32k/64k/32k/64k/32k/64k/32k/64k registers total (Compute 1.0,1.1/1.2,1.3/2.x/3.0/3.2/3.5-5.2/5.3/6-6.1/6.2/7.0)
  4. Each block cannot consume more than 16kb/48kb/96kb of shared memory (Compute 1.x/2.x-6.2/7.0)

If you stay within those limits, any kernel you can successfully compile will launch without error.
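
If you prefer not to keep track of these numbers per architecture, the limits of the device you are actually running on can be queried at runtime. A minimal sketch using the CUDA runtime API (device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0); // query device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Compute capability      : %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block   : %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions    : [%d, %d, %d]\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Registers per block     : %d\n", prop.regsPerBlock);
    printf("Shared memory per block : %zu bytes\n", prop.sharedMemPerBlock);
    printf("Warp size               : %d\n", prop.warpSize);
    return 0;
}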

Performance Tuning:

This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does affect the performance of code running on the hardware. How each code behaves will be different and the only real way to quantify it is by careful benchmarking and profiling. But again, very roughly summarized:

  1. The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware.
  2. Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latency of the architecture and achieve maximum throughput. The orthodox approach here is to try achieving optimal hardware occupancy (what Roger Dahl's answer is referring to).

The second point is a huge topic which I doubt anyone is going to try to cover in a single StackOverflow answer. There are people writing PhD theses around the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkeley and this paper by Henry Wong from the University of Toronto for examples of how complex the question really is).

At the entry level, you should mostly be aware that the block size you choose (within the range of legal block sizes defined by the constraints above) can and does have an impact on how fast your code will run, but it depends on the hardware you have and the code you are running. By benchmarking, you will probably find that most non-trivial code has a "sweet spot" in the 128-512 threads per block range, but it will require some analysis on your part to find where that is. The good news is that because you are working in multiples of the warp size, the search space is very finite and the best configuration for a given piece of code relatively easy to find.
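
As a starting point for that search (not a replacement for benchmarking), the runtime can suggest a block size that maximises theoretical occupancy for a given kernel. A sketch, assuming CUDA 6.5 or later and reusing the mAdd kernel from above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k < n)
        C[k] = A[k] + B[k];
}

int main()
{
    int minGridSize = 0, blockSize = 0;

    // Ask the runtime for the block size giving the highest theoretical
    // occupancy for this kernel on the current device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, mAdd, 0, 0);

    printf("suggested block size: %d (minimum grid size for full occupancy: %d)\n",
           blockSize, minGridSize);
    return 0;
}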

Cuda/PyCuda - Large matrix traversal and block/grid size

Fix the block size, and keep the grid size dynamic. In this way, the kernel will cover each element of the matrix no matter what the values of M and N are.

block = (8,8)
grid = ((N + 7) // 8, (M + 7) // 8)

Launch the kernel with this grid and block configuration. Keeping within the limits of the device, you may change the block size if desired.
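
Because the grid is rounded up, the kernel has to ignore threads that fall outside the matrix. A sketch of the kernel side of this strategy (written in CUDA C rather than PyCUDA; the kernel name, the element-wise operation, and the assumption that M is the number of rows and N the number of columns are mine, not from the question):

__global__ void process(float* mat, int M, int N) // M rows, N columns
{
    int col = blockIdx.x * blockDim.x + threadIdx.x; // x walks the columns
    int row = blockIdx.y * blockDim.y + threadIdx.y; // y walks the rows

    // The rounded-up grid launches more threads than there are elements,
    // so each thread checks that it maps to a real element before touching memory.
    if (row < M && col < N)
        mat[row * N + col] *= 2.0f;
}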

Some CUDA computations fail with larger block dimension (< 1024)

The number of blocks is wrong. Indeed, COLS and threads.x are both integers, so the result of the division is a truncated integer. ceil<int> cannot round the result up because it has already been truncated. This causes some blocks not to be computed: 15000 is divisible by 8 but not by 16, so with threads.x = 16 the truncated division gives 937 blocks, covering only 14992 of the 15000 columns, while 938 are needed. You need to either cast COLS to a floating-point number or compute the ceil result manually (safer). Here is an example:

dim3 blocks((COLS + threads.x - 1) / threads.x, (ROWS + threads.y - 1) / threads.y);

As pointed out in the comment, note that row * A.height + col is wrong: it should be row * A.width + col instead. This causes issues for non-square matrices.
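
To see why the width is the right stride, here is a minimal sketch of row-major 2D indexing with flat pointers (the kernel and parameter names are placeholders, not the code from the question):

// Row-major layout: element (row, col) of a matrix with `width` columns
// lives at offset row * width + col, regardless of the matrix height.
__global__ void copyElements(const float* A, float* B, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < height && col < width)
        B[row * width + col] = A[row * width + col]; // the stride is the row width, not the height
}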

CUDA Matrix Addition Timing with varying block size

I think your main problem here is that the differences aren't statistically significant in the first place. For such small amounts of data, it is quite likely that the overhead of actually performing the kernel launch is dominating execution time. Note that all of your times are right around 3 ms regardless of the block size being used.

You could probably get somewhat more precise results by launching the kernel many times in a loop and averaging the execution time, but with such a small kernel invocation, that would probably only serve to confirm that all of the launches are executing in about the same amount of time due to launch and block scheduling overhead dominating actual kernel execution time.
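
If you do want to measure it that way, a common pattern is to time a loop of launches with CUDA events and divide by the repetition count. A rough sketch (the kernel, problem size, and repetition count are placeholders, and error checking is omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyAdd(int* c, const int* a, const int* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 24;  // placeholder problem size
    const int reps = 100;   // average over many launches

    int *a, *b, *c;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));
    cudaMalloc(&c, n * sizeof(int));
    cudaMemset(a, 0, n * sizeof(int)); // contents irrelevant, only timing matters
    cudaMemset(b, 0, n * sizeof(int));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        dummyAdd<<<grid, block>>>(c, a, b, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average kernel time: %f ms\n", ms / reps);

    cudaFree(a); cudaFree(b); cudaFree(c);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}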

In order to see any statistically significant results from the use of different block sizes, you're probably going to need to do something (much) more significant than just 16 million integer additions.

How to achieve memory coalescing when iterating over non-square 2D arrays?

When you make your array non-square, this calculation is no longer correct, for one of your two kernels:

blocks = ((nrows + (tpb - 1))//tpb, (ncols + (tpb - 1))//tpb)

For both thread and block dimensions, the x index dimension comes first, followed by the y index dimension (referring to x and y as they appear in the in-kernel built-in variables).

For the usage in your first kernel:

out[x][y] = a[x][y] + b[x][y]

we want x to index through the rows. That is consistent with your grid definition.

For the usage in your second kernel:

out[y][x] = a[y][x] + b[y][x]

we want y to index through the rows. That is not consistent with your grid definition.

The result is out-of-bounds access on the second kernel call. The orientation of your rectangular grid does not match the orientation of your rectangular data.

In the square case, such reversal was immaterial, as both dimensions were the same.

Here is a possible "fix":

$ cat t62.py
from numba import cuda
import numpy as np

@cuda.jit
def uncoalesced_matrix_add(a, b, out):
    x, y = cuda.grid(2)
    out[x][y] = a[x][y] + b[x][y]

@cuda.jit
def coalesced_matrix_add(a, b, out):
    x, y = cuda.grid(2)
    out[y][x] = a[y][x] + b[y][x]

nrows, ncols = 512, 1024
tpb = 32

threads_per_block = (tpb, tpb)
blocksu = ((nrows + (tpb - 1))//tpb, (ncols + (tpb - 1))//tpb)
blocksc = ((ncols + (tpb - 1))//tpb, (nrows + (tpb - 1))//tpb)

size = nrows*ncols

a = np.arange(size).reshape(nrows, ncols).astype(np.int32)
b = np.ones(shape=a.shape, dtype=np.int32)
out = np.empty_like(a).astype(np.int32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(out)

uncoalesced_matrix_add[blocksu, threads_per_block](d_a, d_b, d_out)
slow = d_out.copy_to_host()

coalesced_matrix_add[blocksc, threads_per_block](d_a, d_b, d_out)
fast = d_out.copy_to_host()

print(np.array_equal(slow, fast))
# True
$ python t62.py
True
$

Also note that this grid sizing strategy only works for dimensions that are whole-number divisible by the block size, because the kernels above contain no bounds checks to guard the extra threads a rounded-up grid would create.


