Why Is Transposing a Matrix of 512x512 Much Slower Than Transposing a Matrix of 513x513

Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.

For terms and detailed info, see the wiki entry on caching; I'm going to narrow it down here.

A cache is organized into sets and lines. Any given memory address maps to exactly one set, and within that set it can occupy any of the lines. The line size times the total number of lines gives us the cache size.

For a particular memory address, we can calculate which set should mirror it with the formula:

set = ( address / lineSize ) % numberOfSets

This sort of formula ideally gives a uniform distribution across the sets, because each memory address is equally likely to be read (I said ideally).

It's clear that overlaps can occur. In case of a cache miss, the memory is read into the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

I'll try to somewhat follow the example from Agner:

Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).

The critical stride can also be calculated:

criticalStride = numberOfSets * lineSize

Variables spaced criticalStride or a multiple apart contend for the same cache lines.
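
To make the two formulas concrete, here is a tiny standalone check (my own sketch, using the example cache above: 64-byte lines, 4 lines per set, 32 sets) that the five example addresses all land in set 28 and that the critical stride is 0x800 bytes:

#include <cstdio>

int main() {
    const unsigned lineSize = 64;
    const unsigned numberOfSets = 32;                         // 8 KB / 64 B lines / 4 lines per set
    const unsigned criticalStride = numberOfSets * lineSize;  // 2048 = 0x800 bytes

    const unsigned addresses[] = {0x2710, 0x2F00, 0x3700, 0x3F00, 0x4700};
    for (unsigned a : addresses)
        printf("address 0x%04X -> set %u\n", a, (a / lineSize) % numberOfSets);
    printf("critical stride = %u bytes\n", criticalStride);
    return 0;
}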

This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):

Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8 KB cache, 4 lines per set and a line size of 64 bytes (that is, 32 sets). Each line can hold 8 of the elements in the matrix (64-bit int).

The critical stride would then be 2048 bytes, which corresponds to 4 rows of the matrix (which is contiguous in memory).

Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they'll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).

When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble), the line that held element 0 will be evicted from the cache. By the time we reach the end of the column, all the earlier cache lines will have been lost and will need reloading the next time they're accessed (the whole line is overwritten).

Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we're no longer dealing with elements that are a critical stride apart vertically, so the number of cache reloads is severely reduced.
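
To connect this back to the question in the title, here is a minimal sketch of the naive in-place transpose (my own illustration, assuming a row-major matrix of 64-bit elements and a typical 32 KB, 8-way L1 cache with 64 sets, i.e. a 4096-byte critical stride). For N = 512 the column stride is exactly 512 * 8 = 4096 bytes, so every column access hits the same set; for N = 513 it is 4104 bytes and the accesses spread across sets:

#include <cstddef>

// Naive in-place transpose of an N x N matrix of 64-bit elements.
void transpose_inplace(long long *m, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = i + 1; j < N; ++j) {
            long long tmp = m[i * N + j];
            m[i * N + j] = m[j * N + i];   // row walk: contiguous
            m[j * N + i] = tmp;            // column walk: stride of N * 8 bytes
        }
}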

Another disclaimer - I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)

What is the fastest way to transpose a matrix in C++?

This is a good question. There are many reasons you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.

First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)

void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
        dst[n] = src[M*j + i];
    }
}

Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.

for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}

That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first

transpose(B);
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*j+l];
        }
        C[K*i + j] = tmp;
    }
}
transpose(B);

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.

I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case but I could imagine gathering a column and writing out a row. Maybe it will make the transpose unnecessary.

For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is

Smear image horizontally
transpose output
Smear output horizontally
transpose output
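
In code, that pipeline looks roughly like this (a sketch using the transpose function above; smear_horizontal is a hypothetical placeholder for the 1D horizontal blur, taking the row length and the number of rows):

// img/out are height x width, row-major; tmp1..tmp3 are scratch buffers of
// width*height floats; smear_horizontal(src, dst, rowLength, numRows) is a
// hypothetical 1-D horizontal blur.
void gaussian_smear_2d(const float *img, float *out,
                       float *tmp1, float *tmp2, float *tmp3,
                       int width, int height) {
    smear_horizontal(img,  tmp1, width,  height);   // blur along rows
    transpose(tmp1, tmp2, height, width);           // tmp2 is width x height
    smear_horizontal(tmp2, tmp3, height, width);    // blurs what were columns
    transpose(tmp3, out,  width,  height);          // back to height x width
}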

Here is a paper by Intel explaining that
http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use

void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n=0; n<M*N; n++) {
        int k = vec_size*(n/N/vec_size);
        int i = (n/vec_size)%N;
        int j = n%vec_size;
        B[n] = A[M*i + k + j];
    }
}

EDIT:

I tried several functions to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).

inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<block_size; i++) {
        for(int j=0; j<block_size; j++) {
            B[j*ldb + i] = A[i*lda +j];
        }
    }
}

inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}

The values lda and ldb are the widths (leading dimensions) of the matrices. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this

#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);

float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000x1001 this returns ldb = 3008 and lda = 1008
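
A usage sketch for that allocation (my reading of the code above; I pass the rounded-up dimensions so every full block stays inside the padded buffers):

// A is n x m with row stride lda; after the call, B holds the transpose
// with row stride ldb, i.e. B[j*ldb + i] == A[i*lda + j].
transpose_block(A, B, ROUND_UP(n, 16), ROUND_UP(m, 16), lda, ldb, 16);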

Edit:

I found an even faster solution using SSE intrinsics:

inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
    _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
    _mm_store_ps(&B[0*ldb], row1);
    _mm_store_ps(&B[1*ldb], row2);
    _mm_store_ps(&B[2*ldb], row3);
    _mm_store_ps(&B[3*ldb], row4);
}

inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}

Why is filling a row of a matrix much slower than filling a vector of the same size

Because you are filling a row, not a column. You get more cache misses, as well as address-arithmetic overhead (locating the base address of each column).

In R, a matrix, or a general array, is essentially a long vector; dimension is just an attribute. The leftmost index is the fastest-varying one, so elements are contiguous when you scan a matrix by its columns, or an array by its last margin. This means, on a typical machine with a 64-byte L1 cache line, which holds 8 double-precision values, you get 1 cache miss per 8 elements scanned. But if you fill a two-row matrix by row, for example, you get 1 cache miss per 4 elements.

Another issue, which gives you an unfair comparison, is that you are comparing overwriting with assignment, not overwriting with overwriting. Using x[] <- u forces an overwrite of the preallocated vector x.

Consider the following benchmarking:

library(microbenchmark)
n <- 10000
u <- 1:n
m1 <- matrix(0, n, 2)
m2 <- matrix(0, 2, n)
x <- numeric(n)
microbenchmark (m1[,1] <- u, m2[1,] <- u, x[] <- u)

2D blocking with unique matrix transpose problem

TL;DR: the 2D blocking code is not faster mainly because of cache thrashing (and a bit due to prefetching). Both the naive implementation and the 2D blocking are inefficient. A similar effect is described in this post. You can mitigate this problem using a small temporary array and improve performance using a register-tiling optimization. Additional padding may also be helpful.



CPU Caches

First of all, modern CPU caches are set-associative. An m-way associative cache can be seen as an n x m matrix with n sets, each containing m cache lines. Memory blocks are first mapped onto a set and then placed into any of the m cache lines of the target set (depending on the cache replacement policy). When an algorithm (like a transposition) performs strided memory accesses, the processor often accesses the same set, and the number of target cache lines that can be used is much smaller. This is especially true with strides that are a big power of two (or, more precisely, divisible by a big power of two). The number of possible cache lines that can be accessed can be computed using modular arithmetic or basic simulations. In the worst case, all accesses are done on the same cache-line set (containing m cache lines). In this case, each access causes the processor to evict one cache line in the set to load a new one into it, using a replacement policy which is generally not perfect. Your Intel(R) Xeon(R) Platinum 8168 processor has the following cache properties:

  Level       Size (KiB/core)   Associativity   # Sets
  ------------------------------------------------------
  L1D Cache          32              8-way          64
  L2 Cache         1024             16-way        1024
  L3 Cache         1408             11-way        2048

* All caches have 64-byte cache lines

This means that if you perform accesses with a stride divisible by 4096 bytes (i.e. 256 complex numbers), then all accesses will be mapped to the same L1D cache set, resulting in conflict misses since only 8 cache lines can be loaded in that set simultaneously. This causes an effect called cache thrashing, which considerably decreases performance. A more complete explanation is available in this post: How does the CPU cache affect the performance of a C program?.
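
As a quick sanity check of that claim (my own snippet, using the L1D geometry from the table above: 64 sets of 64-byte lines), every access with a 4096-byte stride maps to the same set:

#include <cstdio>

int main() {
    const unsigned lineSize = 64, numSets = 64, stride = 4096;
    for (unsigned k = 0; k < 8; ++k)
        printf("offset %5u -> L1D set %u\n", k * stride,
               (k * stride / lineSize) % numSets);   // prints set 0 every time
    return 0;
}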



Impact of caches on the transposition

You mentioned that block_size can be up to 256, so n1 and n3 are divisible by 256 (the provided code already makes the implicit assumption that n1 and n3 are divisible by block_size), and I expect n2 to be divisible by 256 too. This means that the size of a line along dimension 3 is p * 256 * (2 * 8) = 4096p bytes = 64p cache lines, where p = n3 / block_size. This means that the ith item of all lines will be mapped to the exact same L1D cache set.

Since you perform a transposition between dimensions 1 and 3, the space between lines is even bigger. The gap between two ith items of two subsequent lines is G = 64 * p * n2 cache lines. Assuming n2 is divisible by 16, then G = 1024 * p * q cache lines, where q = n2 / 16.

Having such a gap is a huge problem. Indeed, your algorithm reads/writes to many ith items of many lines in the same block. Thus, such accesses will be mapped to the same set in both the L1D and L2 caches, resulting in cache thrashing. If block_size > 16, cache lines will be nearly systematically reloaded to the L2 (4 times). I expect the L3 not to help much in this case, since it is mostly designed for data shared between cores, its latency is quite big (50-70 cycles), and p * q is certainly divisible by a power of two. The latency cannot be mitigated by the processor due to the lack of concurrency (i.e. of available cache lines that could be prefetched simultaneously). This causes the bandwidth to be wasted, not to mention that the non-contiguous accesses already decrease the throughput. Such an effect should already be visible with smaller power-of-two block_size values, as shown in the previous related post (linked above).



Impact of prefetching

Intel Skylake SP processors like yours prefetch at least 2 cache lines (128 bytes) per access in this case. This means that a block_size < 8 is not enough to make full use of the prefetched cache lines. As a result, a too-small block_size wastes bandwidth because of prefetching, and a too-big one also wastes bandwidth, but because of cache thrashing. I expect the best block_size to be close to 8.



How to address this problem

One solution is to store each block in a small temporary 2D array, then transpose it, and then write it. At first glance, it looks like it will be slower due to more memory accesses, but it is generally significantly faster. Indeed, if the block size is reasonably small (e.g. <=32), then the small temporary array can completely fit in the L1 cache and is thus not subject to any cache thrashing during the transposition. The block can be read just as efficiently, but it can be stored much more efficiently (i.e. with more contiguous accesses).
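
A minimal sketch of that idea for a plain 2D case (a hypothetical helper of mine, not the question's code; it assumes row-major std::complex<double> data and handles edge blocks):

#include <algorithm>
#include <complex>
#include <vector>

// Transpose a row-major src (rows x cols) into dst (cols x rows) through a
// small tile that fits in the L1 cache, so the strided accesses hit the tile
// instead of the big, power-of-two-strided arrays.
void transpose_tiled(const std::complex<double> *src, std::complex<double> *dst,
                     int rows, int cols, int bs /* e.g. 32 */) {
    std::vector<std::complex<double>> tile(bs * bs);
    for (int i = 0; i < rows; i += bs)
        for (int j = 0; j < cols; j += bs) {
            const int bi = std::min(bs, rows - i);
            const int bj = std::min(bs, cols - j);
            // contiguous reads from src, strided writes into the tiny tile
            for (int ii = 0; ii < bi; ++ii)
                for (int jj = 0; jj < bj; ++jj)
                    tile[jj * bs + ii] = src[(i + ii) * cols + (j + jj)];
            // contiguous writes to dst
            for (int jj = 0; jj < bj; ++jj)
                for (int ii = 0; ii < bi; ++ii)
                    dst[(j + jj) * rows + (i + ii)] = tile[jj * bs + ii];
        }
}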

Adding another blocking level should help to improve performance a bit more, thanks to the L2 cache being used more efficiently (e.g. with block_size set to 128~256). A Lebesgue (Z-order) curve can be used to implement a fast cache-oblivious algorithm, though it makes the code more complex.

Another optimization consists in adding yet another blocking level to perform an optimization called register tiling. The idea is to use 2 nested loops operating on a tile whose size is a small compile-time constant, so the compiler can unroll the loops and generate better instructions. For example, with a tile of size 4x4, this enables the compiler to generate the following code (see on Godbolt):

..B3.7:
vmovupd xmm0, XMMWORD PTR [rdx+r8]
vmovupd XMMWORD PTR [r15+rdi], xmm0
inc r14
vmovupd xmm1, XMMWORD PTR [16+rdx+r8]
vmovupd XMMWORD PTR [r15+r10], xmm1
vmovupd xmm2, XMMWORD PTR [32+rdx+r8]
vmovupd XMMWORD PTR [r15+r12], xmm2
vmovupd xmm3, XMMWORD PTR [48+rdx+r8]
vmovupd XMMWORD PTR [r15+r13], xmm3
vmovupd xmm4, XMMWORD PTR [rdx+r9]
vmovupd XMMWORD PTR [16+r15+rdi], xmm4
vmovupd xmm5, XMMWORD PTR [16+rdx+r9]
vmovupd XMMWORD PTR [16+r15+r10], xmm5
vmovupd xmm6, XMMWORD PTR [32+rdx+r9]
vmovupd XMMWORD PTR [16+r15+r12], xmm6
vmovupd xmm7, XMMWORD PTR [48+rdx+r9]
vmovupd XMMWORD PTR [16+r15+r13], xmm7
vmovupd xmm8, XMMWORD PTR [rdx+r11]
vmovupd XMMWORD PTR [32+r15+rdi], xmm8
vmovupd xmm9, XMMWORD PTR [16+rdx+r11]
vmovupd XMMWORD PTR [32+r15+r10], xmm9
vmovupd xmm10, XMMWORD PTR [32+rdx+r11]
vmovupd XMMWORD PTR [32+r15+r12], xmm10
vmovupd xmm11, XMMWORD PTR [48+rdx+r11]
vmovupd XMMWORD PTR [32+r15+r13], xmm11
vmovupd xmm12, XMMWORD PTR [rdx+rbp]
vmovupd XMMWORD PTR [48+r15+rdi], xmm12
vmovupd xmm13, XMMWORD PTR [16+rdx+rbp]
vmovupd XMMWORD PTR [48+r15+r10], xmm13
vmovupd xmm14, XMMWORD PTR [32+rdx+rbp]
vmovupd XMMWORD PTR [48+r15+r12], xmm14
vmovupd xmm15, XMMWORD PTR [48+rdx+rbp]
vmovupd XMMWORD PTR [48+r15+r13], xmm15
add r15, rsi
add rdx, 64
cmp r14, rbx
jb ..B3.7

Instead of this (repeated 8 times more):

..B2.12:
vmovupd xmm0, XMMWORD PTR [rsi+r14]
vmovupd XMMWORD PTR [rbx+r15], xmm0
inc rax
vmovupd xmm1, XMMWORD PTR [16+rsi+r14]
vmovupd XMMWORD PTR [rbx+r13], xmm1
add rbx, r9
add rsi, 32
cmp rax, rcx
jb ..B2.12
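
For reference, a register-tiled transposition loop that gives the compiler this opportunity looks roughly like the sketch below (a hypothetical helper of mine, not the question's code): the two innermost loops have compile-time bounds of 4, so the compiler can fully unroll them into the 16 load/store pairs shown in the first listing.

#include <complex>

// Register-tiling sketch: transpose a row-major (rows x cols) array of
// complex<double> with a 4x4 tile. Assumes rows and cols are multiples of 4.
void transpose_reg_tiled(const std::complex<double> *src, std::complex<double> *dst,
                         int rows, int cols) {
    constexpr int TILE = 4;
    for (int i = 0; i < rows; i += TILE)
        for (int j = 0; j < cols; j += TILE)
            for (int ii = 0; ii < TILE; ++ii)      // compile-time bound: unrolled
                for (int jj = 0; jj < TILE; ++jj)  // compile-time bound: unrolled
                    dst[(j + jj) * rows + (i + ii)] = src[(i + ii) * cols + (j + jj)];
}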

Finally, one can use AVX/AVX-2/AVX-512 intrinsics to implement a faster tile transposition for x86-64 only processors.

Note that adding some padding at the end of each line, so that the line stride is not divisible by a large power of two, should also significantly help to reduce cache thrashing. That being said, this may not be needed once the above optimizations have been applied.
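
For instance, a padded allocation could look like this (a sketch with names of my own; the point is only that the per-line stride stops being divisible by a large power of two):

#include <complex>
#include <cstddef>
#include <vector>

// Pad each line of n3 complex<double> values (16 bytes each) with a few
// unused elements so successive lines no longer map to the same cache sets.
std::vector<std::complex<double>> alloc_padded(int n1, int n2, int n3, int pad = 4) {
    const std::size_t line_stride = n3 + pad;   // elements per stored line
    return std::vector<std::complex<double>>(
        static_cast<std::size_t>(n1) * n2 * line_stride);
    // index as data[(i1 * n2 + i2) * line_stride + i3]
}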

Calculating matrix product is much slower with SSE than with straight-forward-algorithm

You have the right idea by taking the transpose in the scalar code but you don't want exactly the transpose when using SSE.

Let's stick to float (SGEMM). What you want to do with SSE is do four dot products at once. You want C = A*B. Let's look at an 8x8 matrix. Let's assume B is:

( 0  1  2  3) ( 4  5  6  7)
( 8  9 10 11) (12 13 14 15)
(16 17 18 19) (20 21 22 23)
(24 25 26 27) (28 29 30 31)
(32 33 34 35) (36 37 38 39)
(40 41 42 43) (44 45 46 47)
(48 49 50 51) (52 53 54 55)
(56 57 58 59) (60 61 62 63)

So with SSE you do:

C[0][0] C[0][1] C[0][2] C[0][3] = 
A[0][0]*(0 1 2 3) + A[0][1]*(8 9 10 11) + A[0][2]*(16 17 18 19)...+ A[0][7]*(56 57 58 59)

That gets you four dot products at once. The problem is that you have to move down a column in B and the values are not in the same cache line. It would be better if each column of width four was contiguous in memory. So instead of taking the transpose of each element you transpose strips with a width of 4 like this:

(0  1  2  3)( 8  9 10 11)(16 17 18 19)(24 25 26 27)(32 33 34 35)(40 41 42 43)(48 49 50 51)(56 57 58 59)
(4  5  6  7)(12 13 14 15)(20 21 22 23)(28 29 30 31)(36 37 38 39)(44 45 46 47)(52 53 54 55)(60 61 62 63)

If you think of each of the four values in parentheses as one unit, this is equivalent to transposing an 8x2 matrix to a 2x8 matrix. Notice now that the columns of width four of B are contiguous in memory. This is far more cache friendly. For an 8x8 matrix this is not really an issue, but for example with a 1024x1024 matrix it is. See the code below for how to do this. For AVX you transpose strips of width 8 (which means you have nothing to do for an 8x8 matrix). For doubles the width is two with SSE and four with AVX.

This should be four times faster than the scalar code assuming the matrices fit in the cache. However, for large matrices this method is still going to be memory bound and so your SSE code may not be much faster than scalar code (but it should not be worse).

However, if you use loop tiling and rearrange the matrix in tiles (which fit in the L2 cache) rather than operating on the whole matrix, matrix multiplication becomes computation bound and not memory bound, even for very large matrices that don't fit in the L3 cache. That's another topic.

Edit: some (untested) code to compare to your scalar code. I unrolled the loop by 2.

// float4 is assumed to be a 4-wide SIMD float wrapper class (e.g. Vec4f from
// a vector class library) with load/store, broadcast from a scalar, +=, and
// scalar*vector operators.
void SGEMM_SSE(const float *A, const float *B, float *C, const int sizeX) {
    const int simd_width = 4;
    const int unroll = 2;
    const int strip_width = simd_width*unroll;
    float *D = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
    transpose_matrix_strip(B, D, sizeX, strip_width); //transpose B in strips of width eight
    for(int i = 0; i < sizeX; i++) {
        for(int j = 0; j < sizeX; j+=strip_width) {
            float4 out_v1 = 0; //broadcast (0,0,0,0)
            float4 out_v2 = 0;
            //now calculate eight dot products
            for(int g = 0; g < sizeX; g++) {
                //load eight values from D into two SSE registers
                float4 vec4_1, vec4_2;
                vec4_1.load(&D[j*sizeX + strip_width*g]);
                vec4_2.load(&D[j*sizeX + strip_width*g + simd_width]);
                out_v1 += A[i*sizeX + g]*vec4_1;
                out_v2 += A[i*sizeX + g]*vec4_2;
            }
            //store eight dot products into C
            out_v1.store(&C[i*sizeX + j]);
            out_v2.store(&C[i*sizeX + j + simd_width]);
        }
    }
    _mm_free(D);
}

void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width) {
    //#pragma omp parallel for
    for(int n=0; n<N*N; n++) {
        int k = strip_width*(n/N/strip_width);
        int i = (n/strip_width)%N;
        int j = n%strip_width;
        B[n] = A[N*i + k + j];
    }
}

Notice that j increments by eight now. More unrolling may help. If you want to use intrinsics you can use _mm_load_ps, _mm_store_ps, _mm_set1_ps (for the broadcasts, e.g. _mm_set1_ps(A[i*sizeX + g])), _mm_add_ps, and _mm_mul_ps. That's it.
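
For example, the same kernel written with those intrinsics instead of the float4 wrapper might look like this (an untested sketch under the same assumptions: sizeX divisible by 8, D aligned via _mm_malloc; C is written with an unaligned store since its alignment isn't guaranteed):

#include <xmmintrin.h>

void SGEMM_SSE_intrin(const float *A, const float *B, float *C, const int sizeX) {
    const int simd_width = 4;
    const int strip_width = 2*simd_width;
    float *D = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
    transpose_matrix_strip(B, D, sizeX, strip_width);
    for(int i = 0; i < sizeX; i++) {
        for(int j = 0; j < sizeX; j += strip_width) {
            __m128 out_v1 = _mm_setzero_ps();
            __m128 out_v2 = _mm_setzero_ps();
            for(int g = 0; g < sizeX; g++) {
                __m128 a  = _mm_set1_ps(A[i*sizeX + g]);   // broadcast A(i,g)
                __m128 b1 = _mm_load_ps(&D[j*sizeX + strip_width*g]);
                __m128 b2 = _mm_load_ps(&D[j*sizeX + strip_width*g + simd_width]);
                out_v1 = _mm_add_ps(out_v1, _mm_mul_ps(a, b1));
                out_v2 = _mm_add_ps(out_v2, _mm_mul_ps(a, b2));
            }
            _mm_storeu_ps(&C[i*sizeX + j], out_v1);
            _mm_storeu_ps(&C[i*sizeX + j + simd_width], out_v2);
        }
    }
    _mm_free(D);
}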

Why is str.replace so much slower with a single outlier?

str.replace() is implemented at https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L10604 in the C function replace(). Here's an excerpt from that function. It shows that if the size of the returned string (new_size) is going to be 0, we stop early and return an empty string. Otherwise, we perform the long task of looping through the input string and performing the replacements one by one.

new_size = slen + n * (len2 - len1);
if (new_size == 0) {
    u = unicode_new_empty();
    goto done;
}
if (new_size > (PY_SSIZE_T_MAX / rkind)) {
    PyErr_SetString(PyExc_OverflowError,
                    "replace string is too long");
    goto error;
}
u = PyUnicode_New(new_size, maxchar);
if (!u)
    goto error;
assert(PyUnicode_KIND(u) == rkind);
res = PyUnicode_DATA(u);
ires = i = 0;
if (len1 > 0) {
    while (n-- > 0) {
        /* look for next match */
        j = anylib_find(rkind, self,
                        sbuf + rkind * i, slen-i,
                        str1, buf1, len1, i);
        if (j == -1)
            break;
        else if (j > i) {
            /* copy unchanged part [i:j] */
            memcpy(res + rkind * ires,
                   sbuf + rkind * i,
                   rkind * (j-i));
            ires += j - i;
        }
        /* copy substitution string */
        if (len2 > 0) {
            memcpy(res + rkind * ires,
                   buf2,
                   rkind * len2);
            ires += len2;
        }
        i = j + len1;
    }
    if (i < slen)
        /* copy tail [i:] */
        memcpy(res + rkind * ires,
               sbuf + rkind * i,
               rkind * (slen-i));
}

Matrix operations using code vectorization

I'm not sure how to do an in-place transpose for arbitrary matrices efficiently using SIMD, but I do know how to do it out-of-place. Let me describe both.

In place transpose

For the in-place transpose you should see Agner Fog's Optimizing software in C++ manual. See section 9.10 "Cache contentions in large data structures", example 9.5a. For certain matrix sizes you will see a large drop in performance due to cache aliasing. See table 9.1 for examples, and this question: Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?. Agner gives a way to fix this using loop tiling (similar to what Paul R described) in Example 9.5b.

Out of place transpose

See my answer here (the one with the most votes) What is the fastest way to transpose a matrix in C++?. I have not looked into this in ages but let me just repeat my code here:

inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
    _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
    _mm_store_ps(&B[0*ldb], row1);
    _mm_store_ps(&B[1*ldb], row2);
    _mm_store_ps(&B[2*ldb], row3);
    _mm_store_ps(&B[3*ldb], row4);
}

inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}

Why does blocking show no performance benefit in matrix multiplication

Blocking only improves the execution time if caches are actually a bottleneck. The thing is, the current code should be compute-bound. Indeed, GCC does not vectorize the code because floating-point operations are not associative and it does not make this assumption by default (it can break some codes). You can fix that by enabling -ffast-math, which also enables other flags useful for auto-vectorization (but which are also even more unsafe: for example, NaN values are assumed not to be used).
In fact, the generated assembly code of the hot loop of matmulBlock is very inefficient:

.L81:
vmovupd ymm4, YMMWORD PTR [rdx+rax]
vmulpd ymm2, ymm4, YMMWORD PTR [rcx+rax]
add rsi, 1
add rax, 32
vaddsd xmm0, xmm2, xmm0
vunpckhpd xmm3, xmm2, xmm2
vextractf128 xmm1, ymm2, 0x1
vaddsd xmm3, xmm3, xmm0
vaddsd xmm0, xmm1, xmm3
vunpckhpd xmm1, xmm1, xmm1
vaddsd xmm0, xmm0, xmm1
cmp rsi, r13
jne .L81

With -ffast-math it is much better but still sub-optimal:

.L79:
vmovupd ymm4, YMMWORD PTR [rdx+rax]
vmulpd ymm0, ymm4, YMMWORD PTR [rcx+rax]
add rsi, 1
add rax, 32
vaddpd ymm1, ymm1, ymm0
cmp rsi, r13
jne .L79

For better performance, you can enable the FMA instruction set, which is AFAIK generally available on machines supporting AVX2 (especially on recent processors). Then unrolling can be used to make the code even more performant.
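
As an illustration (a sketch of my own, not the question's code), unrolling the inner dot-product loop with independent accumulators lets the compiler emit back-to-back FMA instructions once FMA is enabled (e.g. with -mfma or -march=native on top of -O3 -ffast-math):

// Compute C[i][j] for an n x n row-major double matrix product.
// Two independent accumulators break the addition dependency chain,
// and each a += b*c is contracted into a vfmadd when -mfma is enabled.
void matmul_element(const double *A, const double *B, double *C,
                    int i, int j, int n) {
    double acc0 = 0.0, acc1 = 0.0;
    int k = 0;
    for (; k + 2 <= n; k += 2) {
        acc0 += A[i*n + k]     * B[k*n + j];
        acc1 += A[i*n + k + 1] * B[(k + 1)*n + j];
    }
    for (; k < n; ++k)          // handle an odd trip count
        acc0 += A[i*n + k] * B[k*n + j];
    C[i*n + j] = acc0 + acc1;
}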


