What Is the Fastest Way to Transpose a Matrix in C++

What is the fastest way to transpose a matrix in C++?

This is a good question. There are many reason you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.

First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)

void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
        dst[n] = src[M*j + i];
    }
}

Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.

for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}

That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first

transpose(B);
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*j+l];
        }
        C[K*i + j] = tmp;
    }
}
transpose(B);

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.

I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case but I could image gathering a column and writing out a row. Maybe it will make the transpose unnecessary.

For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is

Smear image horizontally
transpose output 
Smear output horizontally
transpose output

Here is a paper by Intel explaining that
http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use

void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n=0; n<M*N; n++) {
        int k = vec_size*(n/N/vec_size);
        int i = (n/vec_size)%N;
        int j = n%vec_size;
        B[n] = A[M*i + k + j];
    }
}

EDIT:

I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).

inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<block_size; i++) {
        for(int j=0; j<block_size; j++) {
            B[j*ldb + i] = A[i*lda +j];
        }
    }
}

inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}

The values lda and ldb are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this

#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);

float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000x1001 this returns ldb = 3008 and lda = 1008

Edit:

I found an even faster solution using SSE intrinsics:

inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
     _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
     _mm_store_ps(&B[0*ldb], row1);
     _mm_store_ps(&B[1*ldb], row2);
     _mm_store_ps(&B[2*ldb], row3);
     _mm_store_ps(&B[3*ldb], row4);
}

inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}

How to transpose a matrix in C? - error

The main problem is that you confused with sizes of matrices.

When you fill workaround matrix your loop should be like this, since size of original matrix is (height x width).

for(i=0;i<width;i++)
{
    for(j=0;j<height;j++)
    {
        result = (*((array2+i*width)+j));
        workaround[i][j] = result;
    }
}

When transposing matrix

for(i=0;i<height2;i++)
{
    for(j=0;j<width2;j++)
    {
       *((output+i*width2)+j) = workaround[j][i];
       printf("%f\t",(*((output+i*width2)+j)));
    }
}

In memory allocation you also got wrong sizes, it should be

output = malloc(heightOutput * sizeof(double *)); //rows from 1
for (i = 0; i < heightOutput; i++)
{
    output[i] = malloc(widthOutput * sizeof(double)); //columns
}

With this changes your program will run without errors, but output is still wil be wrong

input matrix:                                                                                                                                                                    
1.000000        2.000000        3.000000                                                                                                                                         
4.000000        5.000000        6.000000                                                                                                                                         

1.000000        2.000000        3.000000                                                                                                                                         
4.000000        5.000000        6.000000                                                                                                                                         


output matrix                                                                                                                                                                    
1.000000        4.000000                                                                                                                                                         
5.000000        0.000000                                                                                                                                                         
0.000000        0.000000

The last problem is with argument passing. You dynamically allocate memory, pointers to rows first

output = malloc(heightOutput * sizeof(double *)); //rows from 1

With this you got array of pointers

* -> NULL
* -> NULL
* -> NULL

and with loop you assign values to them

for (i = 0; i < heightOutput; i++)
{
    output[i] = malloc(widthOutput * sizeof(double)); //columns
}

* -> [e00, e01]
* -> [e10, e11]
* -> [e20, e21]

But there is no guarantee that they will be allocated one after another, still you operate with output like with linear allocated data. To correctly work with this you need to pass double pointer.

void transposeMatrix(double* array2, int height, int width, double ** output, int height2, int width2)
{
    ...
    for(i=0;i<width2;i++)
    {
        for(j=0;j<height2;j++)
        {
            output[j][i] = workaround[i][j];
            printf("%f\t",output[j][i]);
        }
        printf("\n");
    }
}

If you want to allocate memory linear do it like follows

output = malloc(heightOutput * widthOutput * sizeof(double));

But to me all of it looks kind of complicated, simply it would looks like

void transposeMatrix(double* src, double* dst, int n, int m)
{
    int i, j;
    for(i = 0; i < n; ++i)
        for(j = 0; j < m; ++j)
            dst[j * n + i] = src[i * m + j];
}

int main(void)
{
    double array[2][3] = {{1,2,3},{4,5,6}};
    int height = 2;
    int width = 3;
    int i, j;

    double *output = malloc(height * width * sizeof(double));

    printf("input matrix\n");
    for(i=0;i<height;i++)
    {
        for(j=0;j<width;j++)
        {
            printf("%f\t",array[i][j]);
        }
        printf("\n");
    }

    transposeMatrix(&array[0][0], output, height,width);

    printf("output matrix\n");
    for(i=0;i<width;i++)
    {
        for(j=0;j<height;j++)
        {
            printf("%f\t",output[i*height + j]);
        }
        printf("\n");
    }
}

Edit:
To answer to your comment: Let' say we have matrix 4 * 5

| a00 a01 a02 a03 a04 |
| a10 a11 a12 a13 a14 |
| a20 a21 a22 a23 a24 |
| a30 a31 a32 a33 a34 |

In memory allocated with

// n = 4, m = 5
double* A = malloc(n * m * sizeof(double));

it will look like

| a00 a01 a02 a03 a04 a10 a11 a12 a13 a14 a20 a21 a22 a23 a24 a30 a31 a32 a33 a34 |

So to get element let's say (2, 3) we need to skip 2 * 5 elements (it's two rows each of 5 elements) and 3 elements from the beginning of the third row, so - 13 elements to skip in array.
And to generalize for matrix m * n to get (i, j) element we need to skip (i * m) + j elements in array.

Matrix Transpose in C

The indexing should be done in the following way.

double B[column * row];

for (int j = 0; j < row; j++){
    for (int i = 0; i < column; i++){
        B[j*column + i] = A[i*row + j];
    }
}

A Cache Efficient Matrix Transpose Program?

You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:

for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}

An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.

Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.

Taking the transpose of a matrix in C with 1D arrays

Use couple of for loops to make the code easier to follow.

void transpose(int *array, int m, int n){

    int new_array[12];
    for (int i = 0; i < m; ++i )
    {
       for (int j = 0; j < n; ++j )
       {
          // Index in the original matrix.
          int index1 = i*n+j;

          // Index in the transpose matrix.
          int index2 = j*m+i;

          new_array[index2] = array[index1];
       }
    }

    for (int i=0; i<m*n; i++) {
        array[i] = new_array[i];
    }
}

Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.

For terms and detailed info, see the wiki entry on caching, I'm gonna narrow it down here.

A cache is organized in sets and lines. At a time, only one set is used, out of which any of the lines it contains can be used. The memory a line can mirror times the number of lines gives us the cache size.

For a particular memory address, we can calculate which set should mirror it with the formula:

set = ( address / lineSize ) % numberOfsets

This sort of formula ideally gives a uniform distribution across the sets, because each memory address is as likely to be read (I said ideally).

It's clear that overlaps can occur. In case of a cache miss, the memory is read in the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

I'll try to somewhat follow the example from Agner:

Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).

The critical stride can also be calculated:

criticalStride = numberOfSets * lineSize

Variables spaced criticalStride or a multiple apart contend for the same cache lines.

This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):

Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8kb cache, 4 lines per set * line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit int).

The critical stride would be 2048 bytes, which correspond to 4 rows of the matrix (which is continuous in memory).

Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they'll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).

When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble) the ex-0 element will be evicted from the cache. When we reach the end of the column, all previous cache lines would have been lost and needed reloading on access to the next element (the whole line is overwritten).

Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we're no longer dealing with elements that are critical stride apart on the vertical, so the number of cache reloads is severely reduced.

Another disclaimer - I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)

Fast image (or matrix) transpose implementation in C++

Without touching the malloc/free part, the copying part could go like this:

    size_t len = image.bufferSize_,
           len1 = len - 1;

    unsigned char *src = image.buffer_,
                  *dest = mem,
                  *end = dest + len;

    for(size_t i = 0; i < len; i++)
    {
        *dest++ = *src;  // dest moves to next row
        src += height;   // src moves to next column

        // src wraps around and moves to next row
        if (src > end) src -= len1;
    }

This is a equivalent to having a column-wise destination iterator and a row-wise source iterator.

Without actually testing it, I feel this will be faster: it has 3 operations for offset computation in the inner loop vs. 4 in your version (plus 2 dereferencing operations in both versions).

EDIT

One more improvement, and a correction:

    //...
    unsigned char *src = image.buffer_,
                  *src_end = src + len,
                  *dest = mem,
                  *dest_end = dest + len;

    while (dest != dest_end)
    {
        *dest++ = *src;  // dest moves to next row
        src += height;   // src moves to next column

        // src wraps around and moves to next row
        if (src > src_end) src -= len1;
    }

This saves one more operation per iteration (i++ in the for loop). Also src was compared to the wrong end previously.

What is the fastest way to transpose the bits in an 8x8 block on bits?

This code is cribbed directly from "Hacker's Delight" - Figure 7-2 Transposing an 8x8-bit matrix, I take no credit for it:

void transpose8(unsigned char A[8], int m, int n, 
                unsigned char B[8]) {
   unsigned x, y, t; 

   // Load the array and pack it into x and y. 

   x = (A[0]<<24)   | (A[m]<<16)   | (A[2*m]<<8) | A[3*m]; 
   y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m]; 

   t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7); 
   t = (y ^ (y >> 7)) & 0x00AA00AA;  y = y ^ t ^ (t << 7); 

   t = (x ^ (x >>14)) & 0x0000CCCC;  x = x ^ t ^ (t <<14); 
   t = (y ^ (y >>14)) & 0x0000CCCC;  y = y ^ t ^ (t <<14); 

   t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F); 
   y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F); 
   x = t; 

   B[0]=x>>24;    B[n]=x>>16;    B[2*n]=x>>8;  B[3*n]=x; 
   B[4*n]=y>>24;  B[5*n]=y>>16;  B[6*n]=y>>8;  B[7*n]=y; 
}

I didn't check if this rotates in the direction you need, if not you might need to adjust the code.

Also, keep in mind datatypes & sizes - int & unsigned (int) might not be 32 bits on your platform.

BTW, I suspect the book (Hacker's Delight) is essential for the kind of work you're doing... check it out, lots of great stuff in there.

Fastest way to transpose-conjugate

If you goal is to do this fast then you should not bother writing your own matrix multiplication algorithm: use a library such as Eigen. It's true that there are matrix multiplication algorithms with better asymptotic time complexity than O(n^3) but it's also true that many people put too much faith in asymptotic time complexity.

Also, from experience working with large matrices in scientific research they have been quite sparse so I think the practical cases for large dense matrix multiplication are fewer than for sparse matrix multiplication. The algorithms for sparse matrix multiplication are very different than for dense matrix multiplication.

To answer your question about multiplying three matrices you should do matrix multiplication twice but the order can matter. Look into Matrix_chain_multiplication. Matrix multiplication is associative. Let's use the example from wikipedia. A is a 10 × 30 matrix, B is a 30 × 5 matrix, and C is a 5 × 60 matrix. Then,

(AB)C = (10×30×5) + (10×5×60) = 1500 + 3000 = 4500 operations
A(BC) = (30×5×60) + (10×30×60) = 9000 + 18000 = 27000 operations.

When all the matrices have the same size (as in your question) it does not matter though.

If you plan to continue optimizing dense matrix multiplication on the CPU you will need to use loop tiling, SIMD, threading, and maybe assembly. After several weeks you can probably write something that competes with Eigen.

What Is the Fastest Way to Transpose a Matrix in C++