How to Perform Multiple Matrix Multiplications in CUDA

How do you perform multiple matrix multiplications in CUDA?

I think it's likely that the fastest performance will be achieved by using the CUBLAS batched GEMM function, which was specifically designed for this purpose (performing a large number of "relatively small" matrix-matrix multiply operations).

Even though you want to multiply your array of matrices (M[]) by a single matrix (N), the batched GEMM function requires you to also pass an array of matrices for N (i.e. N[]), which in your case will all be the same.

EDIT: Now that I have worked through an example, it seems clear to me that with a modification to the example below, we can pass a single N matrix and have the GPU_Multi function send just that one N matrix to the device, while still passing an array of pointers for N (i.e. d_Narray in the example below), with all of the pointers pointing to the same N matrix on the device.
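
For illustration, here is a minimal sketch of that modification (the names d_N and h_Nptrs are mine, and N is assumed here to be a single mc x pc host matrix rather than an array): allocate one copy of N on the device, then fill the host-side pointer array with that one device pointer before copying it to d_Narray:

    mytype *d_N;
    mytype *h_Nptrs[num_mat];
    cudaMalloc((void**)&d_N, n_size);                      // a single copy of N on the device
    cudaMemcpy(d_N, N, n_size, cudaMemcpyHostToDevice);
    for (int i = 0; i < num_mat; i++)
        h_Nptrs[i] = d_N;                                  // every batch entry aliases the same N
    cudaMemcpy(d_Narray, h_Nptrs, num_mat*sizeof(mytype *), cudaMemcpyHostToDevice);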

Here is a fully worked batch GEMM example:

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <assert.h>

#define ROWM 4
#define COLM 3
#define COLN 5

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

typedef float mytype;
// Pi = Mi x Ni
// pr = P rows = M rows
// pc = P cols = N cols
// mc = M cols = N rows
void GPU_Multi(mytype **M, mytype **N, mytype **P,
               size_t pr, size_t pc, size_t mc,
               size_t num_mat, mytype alpha, mytype beta)
{
    mytype *devM[num_mat];
    mytype *devN[num_mat];
    mytype *devP[num_mat];
    size_t p_size = sizeof(mytype) * pr * pc;
    size_t m_size = sizeof(mytype) * pr * mc;
    size_t n_size = sizeof(mytype) * mc * pc;
    const mytype **d_Marray, **d_Narray;
    mytype **d_Parray;
    cublasHandle_t myhandle;
    cublasStatus_t cublas_result;

    for (int i = 0; i < num_mat; i++)
    {
        cudaMalloc((void**)&devM[i], m_size);
        cudaMalloc((void**)&devN[i], n_size);
        cudaMalloc((void**)&devP[i], p_size);
    }
    cudaMalloc((void**)&d_Marray, num_mat*sizeof(mytype *));
    cudaMalloc((void**)&d_Narray, num_mat*sizeof(mytype *));
    cudaMalloc((void**)&d_Parray, num_mat*sizeof(mytype *));
    cudaCheckErrors("cudaMalloc fail");
    for (int i = 0; i < num_mat; i++)
    {
        cudaMemcpy(devM[i], M[i], m_size, cudaMemcpyHostToDevice);
        cudaMemcpy(devN[i], N[i], n_size, cudaMemcpyHostToDevice);
        cudaMemcpy(devP[i], P[i], p_size, cudaMemcpyHostToDevice);
    }
    cudaMemcpy(d_Marray, devM, num_mat*sizeof(mytype *), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Narray, devN, num_mat*sizeof(mytype *), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Parray, devP, num_mat*sizeof(mytype *), cudaMemcpyHostToDevice);
    cudaCheckErrors("cudaMemcpy H2D fail");
    cublas_result = cublasCreate(&myhandle);
    assert(cublas_result == CUBLAS_STATUS_SUCCESS);
    // change to cublasDgemmBatched for double
    cublas_result = cublasSgemmBatched(myhandle, CUBLAS_OP_N, CUBLAS_OP_N,
                                       pr, pc, mc,
                                       &alpha, d_Marray, pr, d_Narray, mc,
                                       &beta, d_Parray, pr,
                                       num_mat);
    assert(cublas_result == CUBLAS_STATUS_SUCCESS);

    for (int i = 0; i < num_mat; i++)
    {
        cudaMemcpy(P[i], devP[i], p_size, cudaMemcpyDeviceToHost);
        cudaFree(devM[i]);
        cudaFree(devN[i]);
        cudaFree(devP[i]);
    }
    cudaFree(d_Marray);
    cudaFree(d_Narray);
    cudaFree(d_Parray);
    cudaCheckErrors("cudaMemcpy D2H fail");
}

int main(){

    mytype h_M1[ROWM][COLM], h_M2[ROWM][COLM];
    mytype h_N1[COLM][COLN], h_N2[COLM][COLN];
    mytype h_P1[ROWM][COLN], h_P2[ROWM][COLN];
    mytype *h_Marray[2], *h_Narray[2], *h_Parray[2];
    for (int i = 0; i < ROWM; i++)
        for (int j = 0; j < COLM; j++){
            h_M1[i][j] = 1.0f; h_M2[i][j] = 2.0f;}
    for (int i = 0; i < COLM; i++)
        for (int j = 0; j < COLN; j++){
            h_N1[i][j] = 1.0f; h_N2[i][j] = 1.0f;}
    for (int i = 0; i < ROWM; i++)
        for (int j = 0; j < COLN; j++){
            h_P1[i][j] = 0.0f; h_P2[i][j] = 0.0f;}

    h_Marray[0] = &(h_M1[0][0]);
    h_Marray[1] = &(h_M2[0][0]);
    h_Narray[0] = &(h_N1[0][0]);
    h_Narray[1] = &(h_N2[0][0]);
    h_Parray[0] = &(h_P1[0][0]);
    h_Parray[1] = &(h_P2[0][0]);

    GPU_Multi(h_Marray, h_Narray, h_Parray, ROWM, COLN, COLM, 2, 1.0f, 0.0f);
    for (int i = 0; i < ROWM; i++)
        for (int j = 0; j < COLN; j++){
            if (h_P1[i][j] != COLM*1.0f)
            {
                printf("h_P1 mismatch at %d,%d was: %f should be: %f\n",
                       i, j, h_P1[i][j], COLM*1.0f); return 1;
            }
            if (h_P2[i][j] != COLM*2.0f)
            {
                printf("h_P2 mismatch at %d,%d was: %f should be: %f\n",
                       i, j, h_P2[i][j], COLM*2.0f); return 1;
            }
        }
    printf("Success!\n");
    return 0;
}

matrix multiplication in cuda

As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.

If you are using one thread to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then take part in some logarithmic number of adds. That's three memory accesses for a multiply and an add and a bit - the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which needs only a tiny bit of memory/registers, which is good for occupancy; but the memory-access-to-work ratio is poor.

The simple one-thread-doing-one-dot-product approach has the same sort of problem - each multiplication still requires two memory accesses to load its operands. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are far fewer threads now - plentiful threads being the one thing the (b) approach had working for you.

Now you know that there should be some way of doing more operations per memory access than this; for square NxN matrices, there is N^3 work to do in the multiplication, but only 3 x N^2 elements - so you should be able to find a way to do far more than one computation per two-ish memory accesses.
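
As a rough worked example (my numbers, not part of the original answer), take N = 1024:

    work = 2 * N^3          ~ 2.1e9 floating-point operations
    data = 3 * N^2 elements ~ 3.1e6 values (about 12 MB in single precision)

so on the order of 2N/3 ~ 680 operations are available per element touched; the tiled approach below is about capturing some of that reuse instead of re-reading operands from global memory for every multiply-add.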

The approach taken in the CUDA SDK is the best way - the matrices are broken into tiles, and your (b) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling entire small sub-matrices from slow global memory into shared memory, and doing the calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful one in lots of applications, because getting data - whether over a network, from main memory for a CPU, or from off-chip memory for a GPU - often takes much longer than processing it.
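
As a rough sketch of that tiling pattern (a minimal illustration, not the SDK code itself; the TILE size, the row-major layout and the assumption that all dimensions are multiples of TILE are mine):

    #define TILE 16

    // C = A * B, with A (hA x wA) and B (wA x wB), all row-major.
    // Launch with block = dim3(TILE, TILE) and grid = dim3(wB/TILE, hA/TILE).
    __global__ void tiledMatMul(const float *A, const float *B, float *C, int wA, int wB)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;   // output row this thread owns
        int col = blockIdx.x * TILE + threadIdx.x;   // output column this thread owns
        float acc = 0.0f;

        for (int t = 0; t < wA / TILE; t++) {
            // one global load per operand tile per thread ...
            As[threadIdx.y][threadIdx.x] = A[row * wA + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * wB + col];
            __syncthreads();

            // ... then TILE multiply-adds out of fast shared memory
            for (int k = 0; k < TILE; k++)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * wB + col] = acc;
    }

Each thread issues one global load per operand tile but performs TILE multiply-adds out of shared memory, which is exactly the reuse described above.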

There are documents on NVIDIA's CUDA pages (especially http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.

Doing multiple matrix-matrix multiplications in one operation

Have you looked at the latest CUBLAS (version 4.1)? It includes a new batched GEMM mode specifically intended for large batches of small matrix-matrix multiplies. I would suggest doing a pairwise multiplication tree as Jonathan Dursi suggested in his answer, using the CUBLAS batched API to accelerate it, rather than writing your own custom kernel as he suggests.

CUBLAS 4.1 is included with the CUDA Toolkit v4.1.

cuBLAS batched GEMM API improves performance of batches of small matrices

CUDA tiled matrix multiplication explanation

Here is a drawing to help understand the values assigned to the first variables of the CUDA kernel and the overall computation performed:
[Figure: CUDA matrix computation]

Matrices are stored using row-major ordering. The CUDA code assumes the matrix sizes are divisible by BLOCK_SIZE.

The matrices A, B and C are virtually split into blocks according to the CUDA kernel grid. All blocks of C can be computed in parallel. For a given dark-grey block of C, the main loop walks through the several light-grey blocks of A and B (in lockstep). Each block is computed in parallel using BLOCK_SIZE * BLOCK_SIZE threads.

bx and by are the block-based position of the current block within the CUDA grid.
tx and ty are the cell-based position, within the currently computed block, of the cell computed by the current thread.

Here is a detailed analysis of the aBegin variable:
aBegin refers to the memory location of the first cell of the first computed block of matrix A. It is set to wA * BLOCK_SIZE * by because each block contains BLOCK_SIZE * BLOCK_SIZE cells, there are wA / BLOCK_SIZE blocks horizontally, and there are by block-rows above the current computed block of A. Thus, (BLOCK_SIZE * BLOCK_SIZE) * (wA / BLOCK_SIZE) * by = BLOCK_SIZE * wA * by.

The same logic applies to bBegin:
it is set to BLOCK_SIZE * bx because there are bx blocks of width BLOCK_SIZE in memory before the first cell of the first computed block of matrix B.

a is incremented by aStep = BLOCK_SIZE in the loop, so that the next computed block of A is the one to the right (in the drawing) of the current one. b is incremented by bStep = BLOCK_SIZE * wB in the same loop, so that the next computed block of B is the one below (in the drawing) the current one.
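
Put together, the index setup being described corresponds roughly to the following lines at the top of the kernel (paraphrased from the SDK-style kernel, with the loop body elided):

    int bx = blockIdx.x,  by = blockIdx.y;     // block position within the grid
    int tx = threadIdx.x, ty = threadIdx.y;    // cell position within the block

    int aBegin = wA * BLOCK_SIZE * by;         // first cell of the first block of A for this block-row
    int aEnd   = aBegin + wA - 1;              // last block of A in the same block-row
    int aStep  = BLOCK_SIZE;                   // move one block to the right in A
    int bBegin = BLOCK_SIZE * bx;              // first cell of the first block of B for this block-column
    int bStep  = BLOCK_SIZE * wB;              // move one block down in B

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // load the BLOCK_SIZE x BLOCK_SIZE tiles at a and b into shared memory,
        // __syncthreads(), accumulate, __syncthreads() ... (as in the sketch above)
    }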

CUDA Matrix Multiplication using Parallel Reduction

  1. Why the code is returning incorrect results.

In the last steps of your reduction (e < 32) you break with the scheme you used before. This leads to race conditions on the same element of result, e.g.

result[e] += result[e + 16];

For e==0 this operation reads result[16], while at the same time, for e==16, it writes result[16].

To avoid the race condition you have two options:

  • Use a pointer that is declared volatile, as in the document you linked (p. 78) [edited]
  • Keep the if( e < ...) guards as before, or transform all of the ifs into a loop (a fuller sketch follows below):

    for (int size = blockDim.x/2; size > 0; size /= 2)
        if (e < size)
            ...
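
For the second option, a minimal sketch of the corrected loop (assuming e is the thread index within the block and result is the shared-memory buffer; the sketch is mine, not the original code):

    for (int size = blockDim.x / 2; size > 0; size /= 2)
    {
        if (e < size)
            result[e] += result[e + size];
        __syncthreads();   // every step must complete before any thread reads the next half
    }
    // result[0] now holds the reduced value for this block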

  2. Why it is taking much longer to run than the method using shared memory described in Chapter 3, page 25 of the CUDA C Programming Guide.

Accessing shared memory is much faster than accessing global memory. You are storing your intermediate result in shared memory, whereas the example you reference stores in shared memory the parts of the matrices to be read. In the example this is combined with loop tiling: every thread loads only one element of the whole tile from global memory, but later reads TILE_WIDTH * 2 elements from shared memory.

The higher performance of the example comes directly from the reduced time spent waiting for data to be loaded from global memory.

Matrix Multiplication using CUDA

I figured out what was wrong. Let's analyze it:

Point 1: The quest to remove the ever-monotonic "zero value"

As noted, you must replace printf("%d \n", P[i]); with printf("%f \n", P[i]); since P contains floating-point values.

Point 2: Why does the program fail with a Width of 512?

Actually it will fail even for a small value such as 23. Why? Because 23*23 = 529, which is > 512 (the maximum number of threads that a GPU can have per block as of today!).
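
A minimal sketch of the usual fix (the kernel and pointer names here are placeholders, not the question's code): keep the block dimensions small and fixed, and grow the grid with Width, so the per-block thread count never exceeds the limit:

    dim3 block(16, 16);                               // 256 threads per block, safely under 512
    dim3 grid((Width + block.x - 1) / block.x,
              (Width + block.y - 1) / block.y);       // enough blocks to cover the whole matrix
    MatrixMulKernel<<<grid, block>>>(d_M, d_N, d_P, Width);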

cuda matrix multiplication by columns

In order to fix this issue, I used atomicAdd to do the += operation. When a thread performs an operation such as a += 3*1, it does three things.

  1. It gets the previous value of a
  2. It updates the value by doing 3*1 + previous value of a
  3. It then stores the new value into a

Using atomicAdd guarantees that these steps happen for one thread without interruption from other threads. If atomicAdd is not used, thread0 could read the previous value of a, and while thread0 is computing its update, thread1 could read the same previous value of a and perform its own update. One of the two += operations is then lost, because both threads started from the same old value and one overwrites the other's result.

If a += 3*1 is used instead of atomicAdd(&a, 3*1), then it is possible for thread1 to interfere and change the value of a before thread0 finishes what it's doing. That creates a race condition.

atomicAdd is an atomic += operation. You would use code along these lines to perform it (note that for the atomic to be meaningful, a has to live in memory that other threads can actually see, e.g. global memory, rather than being a thread-local variable):

__global__ void kernel(int *a){
    atomicAdd(a, 3*1);   // same effect as *a += 3*1, but safe when many threads update a
}

