Why Is There Huge Performance Hit in 2048x2048 Versus 2047x2047 Array Multiplication


This probably has to do with conflicts in your L2 cache.

Cache misses on matice1 are not the problem, because it is accessed sequentially.
For matice2, if a full column fits in L2 (i.e. when you access matice2[0, 0], matice2[1, 0], matice2[2, 0], ... nothing gets evicted), then there is no problem with cache misses for matice2 either.

Now, to go deeper into how caches work: if the byte address of your variable is X, then the cache line index for it is (X >> 6) & (L - 1), where L is the total number of cache lines in your cache. L is always a power of 2.
The 6 comes from the fact that 2^6 == 64 bytes is the standard cache-line size.

What does this mean? It means that if I have addresses X and Y, and (X >> 6) - (Y >> 6) is divisible by L (i.e. by some large power of 2), they will be stored in the same cache line.
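
Here is a small sketch of that arithmetic. The cache size is an illustrative assumption (L = 4096 lines, i.e. a 256 KB cache with 64-byte lines), and the matrix is assumed to start at address 0; the program just counts how many distinct cache lines one column of a float matrix touches under the model line(X) = (X >> 6) & (L - 1).

#include <stdio.h>
#include <stdint.h>

// Count the distinct cache-line indices hit by walking down one column,
// assuming the matrix starts at address 0 and each row is row_bytes long.
static int distinct_lines(uint64_t row_bytes, int rows, uint64_t L) {
    char used[4096] = {0};   // big enough for L = 4096
    int count = 0;
    for (int k = 0; k < rows; k++) {
        uint64_t line = ((k * row_bytes) >> 6) & (L - 1);  // index for element [k][0]
        if (!used[line]) { used[line] = 1; count++; }
    }
    return count;
}

int main() {
    const uint64_t L = 4096;  // total number of cache lines, always a power of 2
    printf("row width 2048 floats: column touches %d distinct lines\n",
           distinct_lines(2048 * 4, 2048, L));
    printf("row width 2049 floats: column touches %d distinct lines\n",
           distinct_lines(2049 * 4, 2048, L));
    return 0;
}

With the power-of-2 row width, the whole 2048-element column cycles through only 32 distinct lines; with the padded width, every element of the column gets its own line.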

Now, back to your problem: what is the difference between a size of 2048 and a size of 2049?

When 2048 is your size:

If you take &matice2[x, k] and &matice2[y, k], the addresses differ by a multiple of 2048 * 4 bytes (each row is 2048 floats of 4 bytes each), so (&matice2[x, k] >> 6) - (&matice2[y, k] >> 6) is divisible by 2048 * 4 / 64 = 128, still a power of 2.

Thus, depending on the size of your L2, you will get a lot of cache-line conflicts and use only a small portion of your L2 to store the column. You won't actually be able to keep a full column in cache, so you get bad performance.

When the size is 2049, the address difference is a multiple of 2049 * 4, which is not a power of 2, so you will have far fewer conflicts and the column will fit safely into your cache.

To test this theory, there are a couple of things you can do:

Allocate matice2 as matice2[razmor, 4096] and run with razmor = 1024, 1025 or any size; you should see very bad performance compared to what you had before. This is because you forcibly align all column elements so that they conflict with each other.

Then try matice2[razmor, 4097] and run it with any size; you should see much better performance.

Can't get over 50% max. theoretical performance on matrix multiply

Packing

You appear to be packing the block of the A matrix too often. You do

rpack(locA, A + ii*n + kk, kc, mc, mr, n);

But this only depends on ii and kk and not on jj, yet it's inside the inner loop over jj, so you repack the same block for every iteration of jj. I don't think that's necessary. In my code I do the packing before the matrix multiplication. It's probably more efficient to pack inside the matrix multiplication while the values are still in cache, but that's trickier to do. Packing is an O(n^2) operation and matrix multiplication is an O(n^3) operation, so for large matrices it's not very inefficient to pack outside the multiplication (I know that from testing as well; commenting out the packing only changes the efficiency by a few percent). However, by repacking with rpack on every jj iteration you have effectively made the packing an O(n^3) operation.
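
Here is a rough, self-contained sketch of that restructuring. None of this is your code: pack_a, kernel and the block sizes are hypothetical stand-ins. The only point is that the pack of the A block depends on (ii, kk) alone, so it belongs outside the jj loop.

#include <stdio.h>
#include <stdlib.h>

enum { N = 256, MC = 64, KC = 64, NC = 64 };

// Hypothetical packing routine: copy an MC x KC block of A into contiguous storage.
static void pack_a(float *dst, const float *a, int n) {
    for (int i = 0; i < MC; i++)
        for (int k = 0; k < KC; k++)
            dst[i*KC + k] = a[i*n + k];
}

// Hypothetical block kernel: C_block += packed_A_block * B_block.
static void kernel(const float *pa, const float *b, float *c, int n) {
    for (int i = 0; i < MC; i++)
        for (int j = 0; j < NC; j++) {
            float s = c[i*n + j];
            for (int k = 0; k < KC; k++)
                s += pa[i*KC + k] * b[k*n + j];
            c[i*n + j] = s;
        }
}

// The A block is packed once per (ii, kk) pair, not once per jj iteration,
// which keeps the total packing cost at O(n^2).
static void gemm_blocks(const float *A, const float *B, float *C, int n) {
    float *locA = (float*)malloc(sizeof(float) * MC * KC);
    for (int ii = 0; ii < n; ii += MC)
        for (int kk = 0; kk < n; kk += KC) {
            pack_a(locA, A + ii*n + kk, n);
            for (int jj = 0; jj < n; jj += NC)
                kernel(locA, B + kk*n + jj, C + ii*n + jj, n);
        }
    free(locA);
}

int main() {
    static float A[N*N], B[N*N], C[N*N];   // C is zero-initialized
    for (int i = 0; i < N*N; i++) { A[i] = 1.0f; B[i] = 1.0f; }
    gemm_blocks(A, B, C, N);
    printf("C[0] = %.0f (expected %d)\n", C[0], N);
    return 0;
}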

Wall Time

You want the wall time. On Unix the clock() function does not return the wall time (though it does on Windows with MSVC); it returns the CPU time accumulated across all threads. This is one of the most common OpenMP errors I have seen on SO.

Use omp_get_wtime() to get the wall time.

Note that I don't know how the clock() function works with MinGW or MinGW-w64 (they are separate projects). MinGW links to MSVCRT so I would guess that clock() with MinGW returns the wall time as it does with MSVC. However, MinGW-w64 does not link to MSVCRT (as far as I understand it links to something like glibc). It's possible that clock() in MinGW-w64 performs the same as clock() does with Unix.
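
A small demonstration of the difference (the loop is just arbitrary busy work; compile with OpenMP enabled, e.g. gcc -O2 -fopenmp): on Unix, clock() sums the CPU time over all threads, so with four threads it reports roughly four times what omp_get_wtime() reports.

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
    clock_t c0 = clock();
    double  w0 = omp_get_wtime();
    #pragma omp parallel
    {
        volatile double x = 0.0;
        for (long i = 0; i < 200000000L; i++) x += i;  // busy work on every thread
    }
    double wall = omp_get_wtime() - w0;
    double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;
    printf("omp_get_wtime(): %.2f s, clock(): %.2f s\n", wall, cpu);
    return 0;
}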

Hyper Threading

Hyper threading works well for code that stalls the CPU often. That's actually the majority of code because it's very difficult to write code that does not stall the CPU. That's why Intel invented Hyper Threading. It's easier to task switch and give the CPU something else to do than to optimize the code. However, for code that is highly optimized hyper-threading can actually give worse results. In my own matrix multiplication code that's certainly the case. Set the number of threads to the number of physical cores you have (two in your case).

My Code

Below is my code. I did not include the inner64 function here. You can find it at Difference in performance between MSVC and GCC for highly optimized matrix multiplication code (with the obnoxious and misleading name of AddDot4x4_vec_block_8wide).

I wrote this code before reading the Goto paper and also before reading Agner Fog's optimization manuals. You appear to reorder/pack the matrices in the main loop; that probably makes more sense. I don't think I reorder them the same way you do, and I only reorder one of the input matrices (B), not both as you do.

The performance of this code on my system (Xeon E5-1620 @ 3.6 GHz) with Linux and GCC is about 75% of the peak for this matrix size (4096x4096). Intel's MKL gets about 94% of the peak on my system for this matrix size, so there is clearly room for improvement.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <immintrin.h>

extern "C" void inner64(const float *a, const float *b, float *c);
void (*fp)(const float *a, const float *b, float *c) = inner64;

// Copy a into b block by block, so that each bs x bs block is contiguous in memory.
void reorder(float * __restrict a, float * __restrict b, int n, int bs) {
    int nb = n/bs;
    #pragma omp parallel for
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    b[bs*bs*(nb*i+j) + bs*i2+j2] = a[bs*(i*n+j) + i2*n + j2];
                }
            }
        }
    }
}

// Multiply one bs x bs block: the micro-kernel handles one row of the block at a time.
inline void gemm_block(float * __restrict a, float * __restrict b, float * __restrict c, int n, int n2) {
    for(int i=0; i<n2; i++) {
        fp(&a[i*n], b, &c[i*n]);
    }
}

void gemm(float * __restrict a, float * __restrict b, float * __restrict c, int n, int bs) {
    int nb = n/bs;
    float *b2 = (float*)_mm_malloc(sizeof(float)*n*n,64);
    reorder(b,b2,n,bs);  // pack B once, before the main loop
    #pragma omp parallel for
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int k=0; k<nb; k++) {
                gemm_block(&a[bs*(i*n+k)], &b2[bs*bs*(k*nb+j)], &c[bs*(i*n+j)], n, bs);
            }
        }
    }
    _mm_free(b2);
}

int main() {
    float peak = 1.0f*8*4*2*3.69f;   // theoretical peak GFLOPS: 8 floats/AVX op * 4 cores * 2 ops/cycle * 3.69 GHz
    const int n = 4096;
    float flop = 2.0f*n*n*n*1E-9f;   // GFLOP per gemm call (2*n^3 operations)
    omp_set_num_threads(4);

    float *a = (float*)_mm_malloc(sizeof(float)*n*n,64);
    float *b = (float*)_mm_malloc(sizeof(float)*n*n,64);
    float *c = (float*)_mm_malloc(sizeof(float)*n*n,64);
    for(int i=0; i<n*n; i++) {
        a[i] = 1.0f*rand()/RAND_MAX;
        b[i] = 1.0f*rand()/RAND_MAX;
    }

    gemm(a,b,c,n,64); //warm OpenMP up
    while(1) {
        for(int i=0; i<n*n; i++) c[i] = 0;
        double dtime = omp_get_wtime();
        gemm(a,b,c,n,64);
        dtime = omp_get_wtime() - dtime;
        printf("time %.2f s, efficiency %.2f%%\n", dtime, 100*flop/dtime/peak);
    }
}

Why is there a significant difference in this C++ for loop's execution time?

It's an issue of memory cache.

matrix[i][j] has better cache hit rates than matrix[j][i], because iterating over matrix[i][j] accesses memory contiguously.

For example, when we access matrix[i][0], the cache loads a contiguous segment of memory containing matrix[i][0], so accessing matrix[i][1], matrix[i][2], ... benefits from the cache, since those elements are adjacent to matrix[i][0].

However, when we access matrix[j][0], it is a full row away from matrix[j - 1][0], so it is unlikely to be in cache and cannot benefit from it. Moreover, a matrix is normally stored as one large contiguous block of memory, and the hardware prefetcher predicts sequential access patterns and caches the data ahead of time, which further favors the row-wise order.

That's why iterating as matrix[i][j] is faster. This is a typical consideration in CPU-cache-based performance optimization.
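
A minimal sketch of the two access orders (the 4096x4096 size is an illustrative choice; compile with something like gcc -O2): the first loop touches memory in address order, the second jumps a full row's worth of bytes on every access.

#include <stdio.h>
#include <time.h>

#define N 4096

static float matrix[N][N];

int main() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) matrix[i][j] = 1.0f;

    volatile float sum = 0.0f;   // volatile keeps the loops from being optimized away or reordered

    clock_t t = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) sum += matrix[i][j];   // consecutive addresses
    printf("matrix[i][j] order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) sum += matrix[i][j];   // stride of N * 4 bytes per access
    printf("matrix[j][i] order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    return (int)sum;
}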

Why is MATLAB so fast in matrix multiplication?

Here's my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:

>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.

MATLAB uses highly optimized libraries for matrix multiplication which is why the plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.

Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:

>> A = rand(1024); gA = gpuArray(A);
>> timeit(@()A*A)
ans =
0.0324
>> gputimeit(@()gA*gA)
ans =
0.0022

Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:

>> timeit(@()A*A)
ans =
0.0229
>> gputimeit(@()gA*gA)
ans =
4.8019e-04

(NB: at some point (I forget when exactly) gpuArray switched from MAGMA to cuBLAS - MAGMA is still used for some gpuArray operations though)

Update using R2022a on a WIN64 machine with 32 physical cores and an A100 GPU:

>> timeit(@()A*A)
ans =
0.0076
>> gputimeit(@()gA*gA)
ans =
2.5344e-04

Matlab matrix multiplication speed

It's a combination of several things:

  • Matlab does indeed multi-thread.
  • The core is heavily optimized with vector instructions.

Here are the numbers on my machine: Core i7 920 @ 3.5 GHz (4 cores)

>> a = rand(10000);
>> b = rand(10000);
>> tic;a*b;toc
Elapsed time is 52.624931 seconds.

Task Manager shows 4 cores of CPU usage.

Now for some math:

Number of multiplies = 10000^3 = 1,000,000,000,000 = 10^12

Max multiplies in 52.6 secs =
(3.5 GHz) * (4 cores) * (2 multiplies/cycle via SSE) * (52.6 secs) = 1.47 * 10^12

So Matlab is achieving about 10^12 / (1.47 * 10^12) = 68% of the maximum possible CPU throughput.
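
The same back-of-the-envelope estimate as a tiny program (the clock speed, core count and multiplies per cycle are just the figures quoted above):

#include <stdio.h>

int main() {
    double n          = 10000.0;
    double multiplies = n * n * n;                 // 10^12 scalar multiplies
    double capacity   = 3.5e9 * 4 * 2 * 52.6;      // Hz * cores * multiplies/cycle * seconds ~ 1.47e12
    printf("efficiency = %.0f%%\n", 100.0 * multiplies / capacity);   // prints 68%
    return 0;
}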

I see nothing out of the ordinary.

Comparison of Speed of matrix multiplication in Matlab and C

Because matrix multiplication in Matlab is highly optimised, using BLAS/LAPACK libraries under the hood.

You will find it hard to beat the performance of those libraries. Certainly, simple nested loops don't take cache effects into account, and will thus exhibit terrible performance.
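
A rough sketch of what taking cache effects into account buys you, even before vectorization or threading (the size N and the loop orders are illustrative; compile with something like gcc -O2): the textbook ijk triple loop strides down a column of B in its inner loop, while simply reordering the loops to ikj walks B and C row-wise. The gap grows further once B no longer fits in the last-level cache.

#include <stdio.h>
#include <time.h>

#define N 1024

static float A[N][N], B[N][N], C[N][N];

// Textbook ijk order: the inner loop strides down a column of B,
// jumping N*4 bytes per iteration and wasting most of each cache line.
static void mul_ijk(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < N; k++) s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

// ikj order: the inner loop walks B and C row-wise, so every loaded
// cache line is fully used (and the loop vectorizes well).
static void mul_ikj(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) C[i][j] = 0.0f;
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            float a = A[i][k];
            for (int j = 0; j < N; j++) C[i][j] += a * B[k][j];
        }
}

// Read all of C so neither multiplication can be optimized away.
static double checksum(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) s += C[i][j];
    return s;
}

int main() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 2.0f; }

    clock_t t = clock();
    mul_ijk();
    printf("ijk order: %.2f s (checksum %.0f)\n", (double)(clock() - t) / CLOCKS_PER_SEC, checksum());

    t = clock();
    mul_ikj();
    printf("ikj order: %.2f s (checksum %.0f)\n", (double)(clock() - t) / CLOCKS_PER_SEC, checksum());

    return 0;
}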


