CUDA, Using 2D and 3D Arrays

Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.

cudaMallocPitch/cudaMemcpy2D:

First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes: the src and dst parameters are single-pointer parameters. They could not be doubly-subscripted or doubly dereferenced. Instead, the correct way to think about these functions is that they work with pitched allocations. For additional example usage, here is one of many questions on this, here is a fully worked example, and another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop; that sort of host data allocation construction is particularly ill-suited to working with the data on the device.
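For reference, here is a minimal sketch of a pitched allocation and 2D copy; the names (h_data, d_data) and the sizes are invented for illustration:

#include <cuda_runtime.h>

int main(){
    const int width = 64, height = 64;
    float h_data[height][width];                 // contiguous host array
    float *d_data;                               // note: a single pointer, not float**
    size_t pitch;                                // row stride in bytes, chosen by the runtime
    cudaMallocPitch(&d_data, &pitch, width*sizeof(float), height);
    cudaMemcpy2D(d_data, pitch,                  // dst pointer, dst row stride
                 h_data, width*sizeof(float),    // src pointer, src row stride
                 width*sizeof(float), height,    // width in bytes, number of rows
                 cudaMemcpyHostToDevice);
    // in device code, row r starts at (float*)((char*)d_data + r*pitch)
    cudaFree(d_data);
    return 0;
}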

general, dynamically allocated 2D case:

If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here (a brief sketch of the mechanics also appears below, after the note). The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:

  • there is additional, non-trivial complexity
  • the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1.

(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
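For orientation, here is a minimal sketch of those mechanics; this is not the canonical code from the linked answer, and H, W, and the names are invented for illustration:

#include <cstdlib>

#define H 4
#define W 5

int main(){
    // build an array of device row pointers, staged through host memory
    int **h_rows = (int **)malloc(H*sizeof(int *));
    for (int r = 0; r < H; r++)
        cudaMalloc(&h_rows[r], W*sizeof(int));   // one device allocation per row
    int **d_data;                                // device copy of the row-pointer array
    cudaMalloc(&d_data, H*sizeof(int *));
    cudaMemcpy(d_data, h_rows, H*sizeof(int *), cudaMemcpyHostToDevice);
    // a kernel receiving int **data can now use data[y][x] (two dereferences per access)
    return 0;
}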

Also, here is a thrust method for building a general dynamically allocated 2D array.

flattening:

If you think you must use the general 2D method, then go ahead; it's not impossible (although sometimes people struggle with the process!). However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
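As a brief illustration of what "flattening" with "simulated" 2D access looks like in a kernel (the kernel name and parameters here are invented for the example):

// "simulated" 2D access into a flat 1D allocation: index = row*width + col
__global__ void scale(float *data, int width, int height, float factor){
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    if ((row < height) && (col < width))
        data[row*width + col] *= factor;   // one pointer dereference, coalesced along rows
}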

general, dynamically allocated 3D case:

As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so it is even less efficient. Here is a fully worked example (2nd code example).

special case: array width known at compile time:

Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and we can use doubly-subscripted access with considerably less complexity than the general case and no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The first code example in the already-mentioned answer here gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
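In outline, the auxiliary type definition for the 2D flavor of this special case might look like the following sketch (W and the identifiers are placeholders; fully worked versions appear in the linked answers and later on this page):

#define W 16                      // width known at compile time
typedef float row_t[W];           // auxiliary type: one row of the 2D array

__global__ void kernel(row_t *data, int height){
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    if ((row < height) && (col < W))
        data[row][col] = 0.0f;    // doubly-subscripted, but only one pointer dereference
}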

doubly-subscripted host code, singly-subscripted device code:

Finally, another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device, based off the flat allocation and a manually-created pointer "tree"; however, this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so it is less efficient, and there is some complexity associated with building the pointer "tree" for use in device code (e.g. it would probably necessitate an additional cudaMemcpy operation).
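Here is a minimal sketch of that host-side pointer "tree" built over a contiguous allocation (the sizes and names are illustrative):

#include <cstdlib>

int main(){
    int width = 4, height = 3;
    double *flat = (double *)malloc(width*height*sizeof(double));  // one contiguous block
    double **rows = (double **)malloc(height*sizeof(double *));    // host-side pointer "tree"
    for (int r = 0; r < height; r++)
        rows[r] = flat + r*width;   // each row pointer aims into the flat block
    rows[2][3] = 1.0;               // doubly-subscripted access on the host
    // flat (a single pointer) is what gets cudaMemcpy'ed and passed to the kernel
    free(rows); free(flat);
    return 0;
}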

From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.

How to use 2D Arrays in CUDA?

How to allocate 2D array:

#define BLOCK_SIZE 16
#define GRID_SIZE 1

__global__ void YourKernel(int d_A[BLOCK_SIZE][BLOCK_SIZE], int d_B[BLOCK_SIZE][BLOCK_SIZE]); // defined in the next snippet

int main(){
    // allocate the 2D arrays in device memory; a pointer-to-row type keeps the [][] syntax
    int (*d_A)[BLOCK_SIZE];
    int (*d_B)[BLOCK_SIZE];
    cudaMalloc(&d_A, BLOCK_SIZE*BLOCK_SIZE*sizeof(int));
    cudaMalloc(&d_B, BLOCK_SIZE*BLOCK_SIZE*sizeof(int));

    /* d_A initialization */

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); // so your threads are BLOCK_SIZE*BLOCK_SIZE, 256 in this case
    dim3 dimGrid(GRID_SIZE, GRID_SIZE);    // 1*1 blocks in a grid

    YourKernel<<<dimGrid, dimBlock>>>(d_A, d_B); // kernel invocation
}

How to traverse that array:

__global__ void YourKernel(int d_A[BLOCK_SIZE][BLOCK_SIZE], int d_B[BLOCK_SIZE][BLOCK_SIZE]){
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= BLOCK_SIZE || col >= BLOCK_SIZE) return; // bounds check against the array extents
    /* whatever you wanna do with d_A[][] and d_B[][] */
}

I hope this is helpful.

You can also refer to the CUDA Programming Guide, page 22, about Matrix Multiplication.

cudaMallocManaged for 2D and 3D array

You got a lot wrong in your attempt, so much that it was faster to write a working version than list out all the individual problems in the code in your question. So here is a working version of what it appears you were trying to do:

#include <algorithm>
#include <iostream>

const int N = 3;

__global__ void MatAdd(float A[][N], float B[][N], float C[][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    float* A; cudaMallocManaged(&A, N*N*sizeof(float));
    float* B; cudaMallocManaged(&B, N*N*sizeof(float));
    float* C; cudaMallocManaged(&C, N*N*sizeof(float));

    const float A_vals[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
    const float B_vals[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
    float (*C_vals)[N] = reinterpret_cast<float (*)[N]>(C);

    std::copy(&A_vals[0][0], &A_vals[0][0] + N*N, A);
    std::copy(&B_vals[0][0], &B_vals[0][0] + N*N, B);

    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(1, 1);
    MatAdd<<<numBlocks, threadsPerBlock>>>( reinterpret_cast<float (*)[N]>(A),
                                            reinterpret_cast<float (*)[N]>(B),
                                            C_vals );

    cudaDeviceSynchronize();

    for(int i=0; i<N; i++) {
        for(int j=0; j<N; j++) {
            std::cout << C_vals[i][j] << " ";
        }
        std::cout << std::endl;
    }

    return 0;
}

Some important points:

  1. Managed memory allocation replaces standard host memory allocation and produces memory which is directly accessible on both the host and the device.
  2. All arrays decay to a pointer when passed as arguments to a function by value. That decay is not recursive. See here for more details.
  3. You can (and will need to) cast in order to use the [][] access syntax on linear memory allocated dynamically at runtime (this applies to malloc, new, or any of the CUDA host memory allocation APIs; see here for more details).
  4. Initialization syntax and assignment syntax for arrays are not interchangeable.

All I can suggest is that you study it thoroughly until you understand how it works.

sending 3d array to CUDA kernel

First of all, I think talonmies, when he posted the response to the previous question you mention, was not intending it to be representative of good coding. So figuring out how to extend it to 3D might not be the best use of your time. For example, why do we want to write programs which use exactly one thread? While there might be legitimate uses of such a kernel, this is not one of them. Your kernel has the possibility to do a bunch of independent work in parallel, but instead you are forcing it all onto one thread and serializing it. The definition of the parallel work is:

a[i][j][k]=i+j+k;

Let's figure out how to handle that in parallel on the GPU.

Another introductory observation I would make is that since we are dealing with problems that have sizes that are known ahead of time, let's use C to tackle them with as much benefit as we can get from the language. Nested loops to do cudaMalloc may be needed in some cases, but I don't think this is one of them.

Here's a code that accomplishes the work in parallel:

#include <stdio.h>
#include <stdlib.h>
// set a 3D volume
// To compile it with nvcc execute: nvcc -O2 -o set3d set3d.cu
// define the data set size (cubic volume)
#define DATAXSIZE 100
#define DATAYSIZE 100
#define DATAZSIZE 20
// define the chunk sizes that each threadblock will work on
#define BLKXSIZE 32
#define BLKYSIZE 4
#define BLKZSIZE 4

// for cuda error checking
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            return 1; \
        } \
    } while (0)

// device function to set the 3D volume
__global__ void set(int a[][DATAYSIZE][DATAXSIZE])
{
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned idy = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned idz = blockIdx.z*blockDim.z + threadIdx.z;
    if ((idx < (DATAXSIZE)) && (idy < (DATAYSIZE)) && (idz < (DATAZSIZE))){
        a[idz][idy][idx] = idz+idy+idx;
    }
}

int main(int argc, char *argv[])
{
    typedef int nRarray[DATAYSIZE][DATAXSIZE];
    const dim3 blockSize(BLKXSIZE, BLKYSIZE, BLKZSIZE);
    const dim3 gridSize(((DATAXSIZE+BLKXSIZE-1)/BLKXSIZE), ((DATAYSIZE+BLKYSIZE-1)/BLKYSIZE), ((DATAZSIZE+BLKZSIZE-1)/BLKZSIZE));
    // overall data set sizes
    const int nx = DATAXSIZE;
    const int ny = DATAYSIZE;
    const int nz = DATAZSIZE;
    // pointers for data set storage via malloc
    nRarray *c;   // storage for result stored on host
    nRarray *d_c; // storage for result computed on device
    // allocate storage for data set
    if ((c = (nRarray *)malloc((nx*ny*nz)*sizeof(int))) == 0) {fprintf(stderr,"malloc1 Fail \n"); return 1;}
    // allocate GPU device buffers
    cudaMalloc((void **) &d_c, (nx*ny*nz)*sizeof(int));
    cudaCheckErrors("Failed to allocate device buffer");
    // compute result
    set<<<gridSize,blockSize>>>(d_c);
    cudaCheckErrors("Kernel launch failure");
    // copy output data back to host
    cudaMemcpy(c, d_c, ((nx*ny*nz)*sizeof(int)), cudaMemcpyDeviceToHost);
    cudaCheckErrors("CUDA memcpy failure");
    // and check for accuracy
    for (unsigned i=0; i<nz; i++)
        for (unsigned j=0; j<ny; j++)
            for (unsigned k=0; k<nx; k++)
                if (c[i][j][k] != (i+j+k)) {
                    printf("Mismatch at x= %d, y= %d, z= %d  Host= %d, Device = %d\n", i, j, k, (i+j+k), c[i][j][k]);
                    return 1;
                }
    printf("Results check!\n");
    free(c);
    cudaFree(d_c);
    cudaCheckErrors("cudaFree fail");
    return 0;
}

Since you've asked for it in the comments, here is the smallest number of changes I could make to your code to get it to work. Let's also remind ourselves of some of talonmies' comments from the previous question you reference:

"For code complexity and performance reasons, you really don't want to do that, using arrays of pointers in CUDA code is both harder and slower than the alternative using linear memory."

"it is such a poor idea compared to using linear memory."

I had to diagram this out on paper to make sure I got all my pointer copying correct.

#include <cstdio>
#include <cstdlib>

inline void GPUassert(cudaError_t code, const char *file, int line, bool Abort=true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (Abort) exit(code);
    }
}

#define GPUerrchk(ans) { GPUassert((ans), __FILE__, __LINE__); }

__global__ void doSmth(int*** a) {
    for(int i=0; i<2; i++)
        for(int j=0; j<2; j++)
            for(int k=0; k<2; k++)
                a[i][j][k] = i+j+k;
}

int main() {
    // host array of host arrays of device pointers
    int*** h_c = (int***) malloc(2*sizeof(int**));
    for(int i=0; i<2; i++) {
        h_c[i] = (int**) malloc(2*sizeof(int*));
        for(int j=0; j<2; j++)
            GPUerrchk(cudaMalloc((void**)&h_c[i][j], 2*sizeof(int)));
    }
    // host array of device arrays of device pointers
    int ***h_c1 = (int ***) malloc(2*sizeof(int **));
    for (int i=0; i<2; i++){
        GPUerrchk(cudaMalloc((void***)&(h_c1[i]), 2*sizeof(int*)));
        GPUerrchk(cudaMemcpy(h_c1[i], h_c[i], 2*sizeof(int*), cudaMemcpyHostToDevice));
    }
    // device array of pointers to the device arrays of device pointers
    int*** d_c;
    GPUerrchk(cudaMalloc((void****)&d_c, 2*sizeof(int**)));
    GPUerrchk(cudaMemcpy(d_c, h_c1, 2*sizeof(int**), cudaMemcpyHostToDevice));
    doSmth<<<1,1>>>(d_c);
    GPUerrchk(cudaPeekAtLastError());
    // fetch results row by row, using the host copies of the device pointers
    int res[2][2][2];
    for(int i=0; i<2; i++)
        for(int j=0; j<2; j++)
            GPUerrchk(cudaMemcpy(&res[i][j][0], h_c[i][j], 2*sizeof(int), cudaMemcpyDeviceToHost));

    for(int i=0; i<2; i++)
        for(int j=0; j<2; j++)
            for(int k=0; k<2; k++)
                printf("[%d][%d][%d]=%d\n", i, j, k, res[i][j][k]);
}

In a nutshell, we have to do a successive sequence of:

  1. malloc a multidimensional array of pointers (on the host), one dimension less than the problem size, with the last dimension being a set of pointers to regions cudaMalloc'ed onto the device rather than the host.
  2. create another multidimensional array of pointers, of the same class as that created in the previous step, but one dimension less than that created in the previous step. This array must also have its final ranks cudaMalloc'ed on the device.
  3. copy the last set of host pointers from two steps back into the area cudaMalloc'ed on the device in the previous step.
  4. repeat steps 2-3 until we end up with a single (host) pointer pointing to the multidimensional array of pointers, all of which are now resident on the device.

two-dimensional arrays in CUDA

You must not use the double** type in this case. Instead, you should use a flattened array that contains all the values of a given matrix in a double*-typed variable.

The heart of the problem is located in the following line (and the similar next ones):

cudaMemcpy(a_d, a, d_size, cudaMemcpyHostToDevice);

Here you assume that a and a_d are compatible types, but they are not. A double**-typed variable is a pointer that refers to one or more pointers in memory (typically an array of pointers referencing many different double-typed arrays), while a double*-typed variable or a static 2D C array refers to a contiguous location in memory.

Note that you can access a given (i,j) cell of a matrix using matrix[N*i+j], where N is the number of columns, assuming matrix is a flattened matrix of type double* using row-major ordering.

Passing a 2D Array to CUDA Device and Using it

My understanding is that CUDA accepts only linearized 2D arrays, so really a 1D array.

int *my_array = new int[height*width];
for (int h = 0; h < height; h++){
    for (int w = 0; w < width; w++)
        my_array[width * h + w] = value;  // 'value' stands for whatever you want to store
}

You can then copy that to device memory as in the other answer.
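For completeness, that copy might look something like this sketch, continuing the snippet above (d_array is an invented name):

int *d_array;
cudaMalloc(&d_array, height*width*sizeof(int));
cudaMemcpy(d_array, my_array, height*width*sizeof(int), cudaMemcpyHostToDevice);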

Also, this question has more info:
Allocate 2D Array on Device Memory in CUDA.

2D multidimensional array passing to kernel, CUDA

As @jarod42 has pointed out, for an "automatic", "non-variable-length" C-style array as you have shown:

int values[2][3];

the storage format of such an array is identical to:

int values[2*3];

This means that we could treat that array as a linear singly-subscripted array (even though it is not):

  1. for purpose of transfer from host to device:

    #define W 3
    #define H 2
    int values[H][W];
    int *d_values;
    cudaMalloc(&d_values, H*W*sizeof(int));
    cudaMemcpy(d_values, values, H*W*sizeof(int), cudaMemcpyHostToDevice);
  2. and for purposes of access in device code, using "simulated" 2D access:

    __global__ void kernel(int *values, int width, ...){
        int col = threadIdx.x+blockDim.x*blockIdx.x;
        int row = threadIdx.y+blockDim.y*blockIdx.y;
        int my_value = values[row*width+col];
        ...
    }

    int main(){
        ...
        kernel<<<...>>>(d_values, W, ...);
        ...
    }

But based on the wording in your question:

Now I know that CUDA accepts 2D arrays in a linear form but how do I pass an already built array?

it seems you may be aware of the above approach, which I would generally refer to as "flattening" a 2D array to treat it in a linear fashion (perhaps with "simulated" 2D access).

In general, handling a 2D array of a width that is not known at compile time, while still allowing doubly-subscripted access in device code, is rather involved, and I would not recommend it, especially for CUDA beginners. But that is not actually the case you have presented:

a predefined 2D array to a kernel.

int values[2][3];
              ^
              the "width"

I take this to mean the "width" (i.e. the range of the 2nd, i.e. the last, subscript) of the array is known at compile time. In that case we can leverage the compiler to generate the necessary array indexing for us, making the transfer and usage process only slightly more complicated than the "flattened" case, while still allowing doubly-subscripted access in the kernel:

$ cat t1023.cu
#include <stdio.h>

#define W 3
#define H 2
#define BSIZE 8

typedef int arrtype[W];

__global__ void kernel(arrtype *values, int width, int height){

    int col = threadIdx.x+blockDim.x*blockIdx.x;
    int row = threadIdx.y+blockDim.y*blockIdx.y;

    if ((row < height)&&(col < width)){

        int my_val = values[row][col]; // doubly-subscripted access
        printf("row: %d, col: %d, value: %d\n", row, col, my_val);
    }
}

int main(){

    int values[H][W];
    for (int i = 0; i < H; i++)
        for (int j = 0; j < W; j++)
            values[i][j] = i+j;
    arrtype *d_values;
    cudaMalloc(&d_values, H*W*sizeof(int));
    cudaMemcpy(d_values, values, H*W*sizeof(int), cudaMemcpyHostToDevice);
    dim3 block(BSIZE,BSIZE);
    dim3 grid((W+block.x-1)/block.x, (H+block.y-1)/block.y);
    kernel<<<grid,block>>>(d_values, W, H);
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -o t1023 t1023.cu
$ ./t1023
row: 0, col: 0, value: 0
row: 0, col: 1, value: 1
row: 0, col: 2, value: 2
row: 1, col: 0, value: 1
row: 1, col: 1, value: 2
row: 1, col: 2, value: 3
$

For a fully worked 3D (i.e. triply-subscripted) example, see here.

Looping over 3 dimensional arrays in CUDA to sum their elements

There is an example on page 21 of the CUDA 4.0 programming guide for looping over a 2D array of floats:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}

Rewriting it to sum up the elements should be easy; a sketch follows below. Additionally, you can refer to this thread. When efficiency is a concern, you might also look at the parallel reduction approach in CUDA. This is used, for example, when implementing Monte Carlo simulation (see the Multi Monte Carlo example).
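A hedged sketch of that rewrite, using one thread per row and a float atomicAdd for the final accumulation (this partitioning is illustrative, not the only option; float atomicAdd requires compute capability 2.0+):

__global__ void SumKernel(float* devPtr, size_t pitch, int width, int height, float* sum)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (r < height) {
        float* row = (float*)((char*)devPtr + r * pitch);
        float rowSum = 0.0f;
        for (int c = 0; c < width; ++c)
            rowSum += row[c];                       // serial sum along the row
        atomicAdd(sum, rowSum);                     // accumulate across rows
    }
}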

Cuda 2d or 3d arrays

I would recommend reading the book "CUDA by Example". It goes through all these things that aren't documented as well, and it explains the "how and why".

If you're rendering the result of the CUDA kernel, I think you should use OpenGL interop. This way, your code processes the image on the GPU and leaves the processed data there, making it much faster to render. There's a good example of doing this in the book.

If each CUDA thread needs to read only one pixel from the first frame and one pixel from the next frame, you don't need to use textures. Textures only benefit you if each thread is reading in a bunch of consecutive pixels. So you're best off using a 3D array.

efficiently transferring multidimensional array to CUDA GPU

Since you've edited your question, I'll edit my response. Such an array (********A) is rather difficult to create. It requires nested loops with malloc, where the nesting level is equal to the array dimensionality. Having said that, the response is similar to what I have already posted below: either you have a parallel set of nested loops that are doing the cudaMalloc and cudaMemcpy along the way, or else you linearize the whole thing and transfer it in one step. For a two-dimensional array, I could possibly consider suggesting either approach. For an N-dimensional array, the first method is simply madness, as illustrated in this sequence of SO questions. Therefore, I think you should certainly linearize a large-dimensional varying-row array before trying to transfer it to the device. The method of linearization is asked about in the previous question you refer to, and is outside the scope of my answer here. Once linearized, the transfer operation is straightforward and can be done with a single cudaMalloc/cudaMemcpy operation.


Presumably you are referring to arrays where the individual rows have different sizes (and are therefore malloc'ed independently). I think you have 2 choices:

  1. Transfer the rows independently, with a corresponding cudaMalloc for
    each row malloc, and a cudaMemcpy for each cudaMalloc.
  2. Combine (pack) the rows in host memory, so as to create one contiguous
    block that is the size of the overall data set (the sum of the row
    sizes). Then, using a single cudaMemcpy, transfer this "packed"
    array to the device in one step (see the sketch after this list). From
    a transfer efficiency standpoint, this will be the most efficient.
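Here is a minimal sketch of the packing approach from option 2; the row count, row lengths, and names are invented for illustration:

#include <cstring>
#include <cstdlib>

int main(){
    const int nrows = 3;
    int row_len[nrows] = {5, 2, 7};            // varying row sizes (illustrative)
    int *row_ptr[nrows];                       // the independently malloc'ed rows
    for (int r = 0; r < nrows; r++)
        row_ptr[r] = (int *)malloc(row_len[r]*sizeof(int));
    int total = 5 + 2 + 7;
    int *packed = (int *)malloc(total*sizeof(int));
    int offset = 0;
    for (int r = 0; r < nrows; r++){           // pack the rows back-to-back
        memcpy(packed + offset, row_ptr[r], row_len[r]*sizeof(int));
        offset += row_len[r];
    }
    // then a single transfer of the whole data set:
    // cudaMalloc(&d_packed, total*sizeof(int));
    // cudaMemcpy(d_packed, packed, total*sizeof(int), cudaMemcpyHostToDevice);
    return 0;
}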

In either case, you will have to carefully consider the access mechanism to make the array conveniently available on the GPU. The first method may be easier in this respect, since you will automatically have pointers for each row. For the second method, you may need to create a set of pointers on the device to match your row pointers on the host. Beyond that, your access mechanism on the device should be similar to the host, since either will use a set of row pointers to access your array.

If instead you are referring to the ordinary multidimensional array (a[dim1][dim2][dim3]...) that is straightforward since it is already all contiguous in memory and accessible with a single pointer. If you remake the original varying-rows array as an ordinary multidimensional array whose number of columns is equal to the longest row (therefore leaving some elements unused in other rows), you could take advantage of this technique instead. This will have some inefficiency because you are transferring unused elements, but accessing the array would be straightforward.
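As a sketch of that padding idea (ROWS and MAX_COLS are invented for the example):

#define ROWS 4
#define MAX_COLS 8   // length of the longest row

int main(){
    int padded[ROWS][MAX_COLS] = {};   // contiguous; shorter rows leave trailing elements unused
    int *d_padded;
    cudaMalloc(&d_padded, ROWS*MAX_COLS*sizeof(int));
    cudaMemcpy(d_padded, padded, ROWS*MAX_COLS*sizeof(int), cudaMemcpyHostToDevice);
    // access element (i,j) on the device as d_padded[i*MAX_COLS + j]
    return 0;
}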

If you have truly sparse matrices, you might also want to consider sparse matrix representation methods. cusp would be one method for handling and manipulating these on the GPU.

This answer may also be of interest.



Related Topics



Leave a reply



Submit