Why Is a Naïve C++ Matrix Multiplication 100 Times Slower Than BLAS?

Here are three factors responsible for the performance difference between your code and BLAS (plus a note on Strassen’s algorithm).

In your inner loop, on k, you have y[k*dim + col]. Because of the way the memory cache is arranged, consecutive values of k with the same dim and col map to the same cache set: each memory address has exactly one cache set where its contents may be held while it is in cache. Each cache set has several lines (four is a typical number), and each of those lines can hold any of the memory addresses that map to that particular set.

Because your inner loop iterates through y in this way, each time it uses an element from y, it must load the memory for that element into the same set as the previous iteration did. This forces one of the previous cache lines in the set to be evicted. Then, in the next iteration of the col loop, all of the elements of y have been evicted from cache, so they must be reloaded again.

Thus, every time your loop loads an element of y, it must be loaded from memory, which takes many CPU cycles.

High-performance code avoids this in two ways. One, it divides the work into smaller blocks. The rows and the columns are partitioned into smaller sizes, and processed with shorter loops that are able to use all the elements in a cache line and to use each element several times before they go on to the next block. Thus, most of the references to elements of x and elements of y come from cache, often in a single processor cycle. Two, in some situations, the code will copy data out of a column of a matrix (which thrashes cache due to the geometry) into a row of a temporary buffer (which avoids thrashing). This again allows most of the memory references to be served from cache instead of from memory.
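The blocking idea can be sketched in plain C like this (the block size and all names are illustrative, not taken from any particular BLAS; real libraries tune the tile sizes to the cache hierarchy):

```c
#include <stddef.h>

#define BLOCK 32  /* illustrative block size; real libraries tune this to the cache */

/* Naive triple loop for reference: z = x * y, all dim x dim, row-major. */
void matmul_naive(const float *x, const float *y, float *z, size_t dim)
{
    for (size_t row = 0; row < dim; row++)
        for (size_t col = 0; col < dim; col++) {
            float acc = 0.0f;
            for (size_t k = 0; k < dim; k++)
                acc += x[row*dim + k] * y[k*dim + col];
            z[row*dim + col] = acc;
        }
}

/* Blocked version: process BLOCK x BLOCK tiles so the tiles of x and y
   stay in cache while each of their elements is reused several times.
   Note the innermost loop walks y and z along a row (unit stride). */
void matmul_blocked(const float *x, const float *y, float *z, size_t dim)
{
    for (size_t i = 0; i < dim*dim; i++)
        z[i] = 0.0f;
    for (size_t i0 = 0; i0 < dim; i0 += BLOCK)
        for (size_t k0 = 0; k0 < dim; k0 += BLOCK)
            for (size_t j0 = 0; j0 < dim; j0 += BLOCK)
                for (size_t i = i0; i < i0+BLOCK && i < dim; i++)
                    for (size_t k = k0; k < k0+BLOCK && k < dim; k++) {
                        float xik = x[i*dim + k];
                        for (size_t j = j0; j < j0+BLOCK && j < dim; j++)
                            z[i*dim + j] += xik * y[k*dim + j];
                    }
}
```

Both functions compute the same product; the blocked one merely reorders the work so most references hit cache.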

Another factor is the use of Single Instruction Multiple Data (SIMD) features. Many modern processors have instructions that load multiple elements (four float elements is typical, but some now do eight) in one instruction, store multiple elements, add multiple elements (e.g., for each of these four, add it to the corresponding one of those four), multiply multiple elements, and so on. Simply using such instructions immediately makes your code four times faster, provided you are able to arrange your work to use those instructions.

These instructions are not directly accessible in standard C. Some optimizers now try to use such instructions when they can, but this optimization is difficult, and it is not common to gain much benefit from it. Many compilers provide extensions to the language that give access to these instructions. Personally, I usually prefer to write in assembly language to use SIMD.
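For illustration, here is what one such language extension looks like (this sketch uses the GCC/Clang vector_size attribute rather than x86 intrinsics such as _mm_add_ps; add4 and v4sf are names invented here):

```c
#include <string.h>

/* Add two float arrays four elements at a time using the GCC/Clang
   "vector_size" extension; each += on the vector type below compiles
   to a single SIMD addition where the hardware supports it. */
typedef float v4sf __attribute__((vector_size(16)));   /* 4 packed floats */

void add4(const float *a, const float *b, float *out, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        v4sf va, vb;
        memcpy(&va, a + i, sizeof va);      /* load 4 elements */
        memcpy(&vb, b + i, sizeof vb);
        va += vb;                           /* one SIMD addition */
        memcpy(out + i, &va, sizeof va);    /* store 4 elements */
    }
    for (; i < n; i++)                      /* scalar tail for leftovers */
        out[i] = a[i] + b[i];
}
```

This is not standard C; it is a compiler extension, which is exactly the situation described above.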

Another factor is using instruction-level parallel execution features on a processor. Observe that in your inner loop, acc is updated. The next iteration cannot add to acc until the previous iteration has finished updating acc. High-performance code will instead keep multiple sums running in parallel (even multiple SIMD sums). The result of this will be that while the addition for one sum is executing, the addition for another sum will be started. It is common on today’s processors to support four or more floating-point operations in progress at a time. As written, your code cannot do this at all. Some compilers will try to optimize the code by rearranging loops, but this requires the compiler to be able to see that iterations of a particular loop are independent from each other or can be commuted with another loop, et cetera.
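A hedged sketch of the multiple-accumulator idea in C (the names and the factor of four are illustrative; note that reassociating floating-point sums can change the result in the last bits):

```c
/* Dot product with one accumulator: each addition must wait for the
   previous one, so only one floating-point add is in flight at a time. */
float dot_serial(const float *x, const float *y, int n)
{
    float acc = 0.0f;
    for (int k = 0; k < n; k++)
        acc += x[k] * y[k];
    return acc;
}

/* Dot product with four independent partial sums: while one addition
   executes, the processor can start the others, since no sum depends
   on another until the final combination. */
float dot_unrolled(const float *x, const float *y, int n)
{
    float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    int k;
    for (k = 0; k + 4 <= n; k += 4) {
        a0 += x[k]   * y[k];
        a1 += x[k+1] * y[k+1];
        a2 += x[k+2] * y[k+2];
        a3 += x[k+3] * y[k+3];
    }
    for (; k < n; k++)          /* scalar tail */
        a0 += x[k] * y[k];
    return (a0 + a1) + (a2 + a3);
}
```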

It is quite feasible that using cache effectively provides a factor of ten performance improvement, SIMD provides another four, and instruction-level parallelism provides another four, giving 160 altogether.

Here is a very crude estimate of the effect of Strassen’s algorithm, based on this Wikipedia page. The Wikipedia page says Strassen is slightly better than direct multiplication around n = 100. This suggests the ratio of the constant factors of the execution times is 100^3 / 100^2.807 ≈ 2.4. Obviously, this will vary tremendously depending on processor model, matrix sizes interacting with cache effects, and so on. However, simple extrapolation shows that Strassen is about twice as good as direct multiplication at n = 4096 ((4096/100)^(3−2.807) ≈ 2.05). Again, that is just a ballpark estimate.

As for the later optimizations, consider this code in the inner loop:

bufz[trow][tcol] += B * bufy[tk][tcol];

One potential issue with this is that bufz could, in general, overlap bufy. Since you use global definitions for bufz and bufy, the compiler likely knows they do not overlap in this case. However, if you move this code into a subroutine that is passed bufz and bufy as parameters, and especially if you compile that subroutine in a separate source file, then the compiler is less likely to know that bufz and bufy do not overlap. In that case, the compiler cannot vectorize or otherwise reorder the code, because the bufz[trow][tcol] in this iteration might be the same as bufy[tk][tcol] in another iteration.

Even if the compiler can see that the subroutine is called with different bufz and bufy in the current source module, if the routine has extern linkage (the default), then the compiler has to allow for the routine to be called from an external module, so it must generate code that works correctly if bufz and bufy overlap. (One way the compiler can deal with this is to generate two versions of the routine, one to be called from external modules and one to be called from the module currently being compiled. Whether it does that depends on your compiler, the optimization switches, et cetera.) If you declare the routine as static, then the compiler knows it cannot be called from an external module (unless you take its address and there is a possibility the address is passed outside of the current module).
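In C, one way to make that no-overlap promise explicit is the C99 restrict qualifier (C++ compilers typically spell it __restrict). The function below is an illustrative sketch, not code from the original program:

```c
/* The restrict qualifiers promise the compiler that bufz and bufy do
   not overlap during the call, so it is free to vectorize and reorder
   the loads and stores. Calling it with overlapping buffers is
   undefined behavior -- the promise is yours to keep. */
void saxpy_row(float *restrict bufz, const float *restrict bufy,
               float b, int n)
{
    for (int i = 0; i < n; i++)
        bufz[i] += b * bufy[i];
}
```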

Another potential issue is that, even if the compiler vectorizes this code, it does not necessarily generate the best code for the processor you execute on. Looking at the generated assembly code, it appears the compiler is using only %ymm1 repeatedly. Over and over again, it multiplies a value from memory into %ymm1, adds a value from memory to %ymm1, and stores a value from %ymm1 to memory. There are two problems with this.

One, you do not want these partial sums stored to memory frequently. You want many additions accumulated into a register, and the register will be written to memory only infrequently. Convincing the compiler to do this likely requires rewriting the code to be explicit about keeping partial sums in temporary objects and writing them to memory after a loop has completed.

Two, these instructions are nominally serially dependent. The add cannot start until the multiply completes, and the store cannot write to memory until the add completes. The Core i7 has great capabilities for out-of-order execution. So, while it has that add waiting to start execution, it looks at the multiply later in the instruction stream and starts it. (Even though that multiply also uses %ymm1, the processor remaps the registers on the fly, so that it uses a different internal register to do this multiply.) Even though your code is filled with consecutive dependencies, the processor tries to execute several instructions at once. However, a number of things can interfere with this. You can run out of the internal registers the processor uses for renaming. The memory addresses you use might run into false conflicts. (The processor looks at a dozen or so of the low bits of memory addresses to see if the address might be the same as another one that it is trying to load or store from an earlier instruction. If the bits are equal, the processor has to delay the current load or store until it can verify the entire address is different. This delay can bollux up more than just the current load or store.) So, it is better to have instructions that are overtly independent.

That is one more reason I prefer to write high-performance code in assembly. To do it in C, you have to convince the compiler to give you instructions like this, by doing things such as writing some of your own SIMD code (using the language extensions for them) and manually unrolling loops (writing out multiple iterations).

When copying into and out of buffers, there might be similar issues. However, you report 90% of the time is spent in calc_block, so I have not looked at this closely.

How does BLAS get such extreme performance?

A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.

BLAS is divided into three levels:

  • Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SSE).

  • Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level 1 functions. However, you can boost the performance of these functions if you can provide a dedicated implementation that makes use of a multiprocessor architecture with shared memory.

  • Level 3 functions are operations like the matrix-matrix product. Again, you could implement them in terms of Level 2 functions. But Level 3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy, you can boost performance if you provide a dedicated implementation that is cache optimized/cache friendly. This is nicely described in the book. The main boost of Level 3 functions comes from cache optimization. This boost significantly exceeds the second boost from parallelism and other hardware optimizations.
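To make the three levels concrete, here are unoptimized plain-C reference versions of one routine from each level (the ref_ names are invented for this sketch; they fix the shapes involved, nothing more):

```c
/* Level 1: y = alpha*x + y (the BLAS axpy). O(n) work on O(n) data. */
void ref_axpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2: y = A*x for a row-major n x n matrix A.
   O(n^2) work on O(n^2) data. */
void ref_gemv(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];
    }
}

/* Level 3: C = A*B. O(n^3) work on O(n^2) data -- this surplus of
   work over data is exactly what cache blocking exploits. */
void ref_gemm(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i*n + j] = 0.0;
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
        }
}
```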

By the way, most (or even all) of the high-performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C, with its performance-critical parts in assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface so that they can be linked against LAPACK (LAPACK gains all its performance from BLAS).

Optimized compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).

IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. The likely reasons are:

  • Maybe it's not possible to provide a cache-optimized implementation of these algorithms (i.e. you would lose more than you would win)
  • These algorithms are not numerically stable. As BLAS is the computational kernel of LAPACK, this is a no-go.
  • Although these algorithms have a nice time complexity on paper, the Big O notation hides a large constant, so it only starts to become viable for extremely large matrices.

Edit/Update:

The new and groundbreaking papers on this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. Actually, I implemented several variants of the matrix-matrix product. The simplest variant is written entirely in plain C and has less than 450 lines of code. All the other variants merely optimize the loops:

    for (l=0; l<MR*NR; ++l) {
        AB[l] = 0;
    }
    for (l=0; l<kc; ++l) {
        for (j=0; j<NR; ++j) {
            for (i=0; i<MR; ++i) {
                AB[i+j*MR] += A[i]*B[j];
            }
        }
        A += MR;
        B += NR;
    }

The overall performance of the matrix-matrix product only depends on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:

ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)

Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can achieve such performance. And why it does not matter whether you use row- or column-major storage!

The final benchmarks are here (we called our project ulmBLAS):

Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen

Another Edit/Update:

I also wrote some tutorial on how BLAS gets used for numerical linear algebra problems like solving a system of linear equations:

High Performance LU Factorization

(This LU factorization is for example used by Matlab for solving a system of linear equations.)

I hope to find time to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization like in PLASMA.

Ok, here you go: Coding a Cache Optimized Parallel LU Factorization

P.S.: I also did some experiments on improving the performance of uBLAS. It actually is pretty simple to boost (yeah, play on words :) ) the performance of uBLAS:

Experiments on uBLAS.

Here is a similar project with BLAZE:

Experiments on BLAZE.

LAPACK/BLAS sgemm() slower than custom matrix multiplication

The BLAS routine is used correctly.
The only difference is that BLAS is performing

C = 0.0*C + 1.0*A*B

and your loop

C = C + A*B

In your loop, you are trying to improve the usage of the CPU's cache memory.
There are variants of BLAS that perform similar actions.
I suggest you try the OpenBLAS, ATLAS, or MKL (Intel) libraries. You will get great time improvements.
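In loop form, the general operation that BLAS gemm routines perform is C = beta*C + alpha*A*B, with alpha and beta passed as arguments. A plain-C sketch (gemm_like is an illustrative name, not a BLAS routine):

```c
/* C = beta*C + alpha*A*B for row-major n x n matrices.
   With alpha = 1 and beta = 0 this overwrites C with A*B;
   with alpha = 1 and beta = 1 it accumulates, like the loop C = C + A*B. */
void gemm_like(int n, double alpha, const double *A, const double *B,
               double beta, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++)
                acc += A[i*n + k] * B[k*n + j];
            C[i*n + j] = beta * C[i*n + j] + alpha * acc;
        }
}
```

So the two formulas above differ only in the value of beta.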

Benchmarking matrix multiplication performance: C++ (eigen) is much slower than Python

After long and painful installations and compilations I've performed benchmarks in Matlab, C++ and Python.

My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 4096 MB, but no GPU was involved in these tests (and I never configured OpenCL and didn't see it in np.show_config()).

Software:
Matlab 2018a, Python 3.6, C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2), g++-8 (Homebrew GCC 8.2.0) 8.2.0

1) Matlab performance: time = (14.3 +- 0.7) ms over 10 runs

a=rand(1000,1000);
b=rand(1000,1000);
c=rand(1000,1000);
tic
for i=1:100
    c=a*b;
end
toc/100

2) Python performance (%timeit a.dot(b,out=c)): (15.5 +- 0.8) ms

I've also installed the MKL libraries for Python. With numpy linked against MKL: (14.4 +- 0.7) ms - it helps, but only a little.

3) C++ performance. The following changes to the original (see the question) code were applied:

  • The noalias function to avoid unnecessary temporary matrix creation.

  • Time was measured with the C++11 chrono library

Here I used a bunch of different options and two different compilers:

3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3

Execution time ~ 146 ms

3.2 Added -march=native option:

Execution time ~ 46 +-2 ms

3.3 Changed compiler to GNU g++ (on my Mac it is called gpp via a custom-defined alias):

gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3

Execution time 222 ms

3.4 Added -march=native option:

Execution time ~ 45.5 +- 1 ms

At this point I realized that Eigen does not use multiple threads. I installed OpenMP and added the -fopenmp flag. Note that on the latest clang version OpenMP does not work, so I had to use g++ from then on. I also made sure I was actually using all available threads by monitoring the value of Eigen::nbThreads() and by using the macOS Activity Monitor.

3.5  gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3 -march=native -fopenmp

Execution time: 16.5 +- 0.7 ms

3.6 Finally, I installed the Intel MKL libraries. In the code it is quite easy to use them: I just added the #define EIGEN_USE_MKL_ALL macro and that's it. It was hard to link all the libraries, though:

gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl   main.cpp -o my_exec_intel -O3 -fopenmp  -march=native

Execution time: 14.33 +-0.26 ms. (Editor's note: this answer originally claimed to have used -DMKL_ILP64 which is not supported. Maybe it used to be, or happened to work.)

Conclusion:

  • Matrix-matrix multiplication in Python/Matlab is highly optimized. It is not possible (or, at least, very hard) to do significantly better (on a CPU).

  • C++ code (at least on the Mac platform) can only achieve similar performance if fully optimized, which includes the full set of optimization options and the Intel MKL libraries.
    I could have installed an old clang compiler with OpenMP support, but since the single-thread performance is similar (~46 ms), it looks like this would not help.

  • It would be great to try it with the native Intel compiler, icc. Unfortunately, this is proprietary software, unlike the Intel MKL libraries.

Thanks for useful discussion,

Mikhail

Edit: For comparison, I've also benchmarked my GTX 980 GPU using the cublasDgemm function. Computational time = 12.6 ms, which is comparable with the other results. The reason CUDA is only about as good as the CPU is the following: my GPU is poorly optimized for doubles. With floats, the GPU time = 0.43 ms, while Matlab's is 7.2 ms.

Edit 2: to gain significant GPU acceleration, I would need to benchmark matrices with much bigger sizes, e.g. 10k x 10k

Edit 3: changed the interface from MKL_ILP64 to MKL_LP64 since ILP64 is not supported.

Improving the performance of Matrix Multiplication

There are many, many things you can do to improve the efficiency of matrix multiplication.

To examine how to improve the basic algorithm, let's first take a look at our current options. The naive implementation, of course, has 3 loops with a time complexity of the order of O(n^3). There is another method called Strassen's Method which achieves an appreciable asymptotic speedup, of the order of O(n^2.807) (but in practice it offers little advantage, since its constant factors and memory behavior make it hard to optimize on real hardware).

This is in theory. Now consider how matrices are stored in memory. Row-major is the standard, but you find column-major too. Depending on the scheme, transposing your matrix might improve speed due to fewer cache misses. Matrix multiplication in theory is just a bunch of vector dot products and additions. The same vector is reused across multiple dot products, so it makes sense to keep that vector in cache for faster access.
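A hedged sketch of that transpose idea for row-major storage (illustrative code, not taken from any library):

```c
#include <stdlib.h>

/* Multiply row-major n x n matrices as dot products of rows of A with
   rows of Bt, where Bt is B transposed. Both operands are then walked
   with unit stride, so cache misses drop; the one-time transpose costs
   only O(n^2) against the O(n^3) multiplication. */
void matmul_via_transpose(int n, const float *A, const float *B, float *C)
{
    float *Bt = malloc((size_t)n * n * sizeof *Bt);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Bt[j*n + i] = B[i*n + j];          /* transpose once */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)        /* both accesses unit stride */
                acc += A[i*n + k] * Bt[j*n + k];
            C[i*n + j] = acc;
        }
    free(Bt);
}
```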

Now, with the introduction of multiple cores, parallelism and the cache concept, the game changes. If we look a little closely, we see that a dot product is nothing but a bunch of multiplications followed by summations. These multiplications can be done in parallel. Hence, we can now look at parallel loading of numbers.

Now let's make things a little more complicated. When talking about matrix multiplication, there is a distinction between single-precision and double-precision floating point in their size. Often the former is 32 bits while the latter is 64 (of course, this depends on the CPU). Each CPU only has a fixed number of registers, meaning the bigger your numbers, the fewer you can fit in the CPU. The moral of the story is: stick to single precision unless you really need double.

Now that we've gone through the basics of how we can tune matrix multiplication, worry not. You need do nothing of what has been discussed above since there are already subroutines to do it. As mentioned in the comments, there's GotoBLAS, OpenBLAS, Intel's MKL, and Apple's Accelerate framework. MKL/Accelerate are proprietary, but OpenBLAS is a very competitive alternative.

Here's a nice little example that multiplies two 8k x 8k matrices in a few milliseconds on my Macintosh:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <Accelerate/Accelerate.h>

int SIZE = 8192;

typedef float point_t;

point_t* transpose(point_t* A) {
    point_t* At = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
    vDSP_mtrans(A, 1, At, 1, SIZE, SIZE);

    return At;
}

point_t* dot(point_t* A, point_t* B) {
    point_t* C = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
    int step = (SIZE * SIZE / 4);

    /* Split the product into four blocks of SIZE/4 rows each. */
    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[0], SIZE, B, SIZE, 0.0, &C[0], SIZE);

    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step], SIZE, B, SIZE, 0.0, &C[step], SIZE);

    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step * 2], SIZE, B, SIZE, 0.0, &C[step * 2], SIZE);

    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step * 3], SIZE, B, SIZE, 0.0, &C[step * 3], SIZE);

    return C;
}

void print(point_t* A) {
    int i, j;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            printf("%f ", A[i * SIZE + j]);
        }
        printf("\n");
    }
}

int main() {
    for (; SIZE <= 8192; SIZE *= 2) {
        point_t* A = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
        point_t* B = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));

        srand(getpid());

        int i;
        for (i = 0; i < SIZE * SIZE; i++) {
            A[i] = ((point_t)rand() / (double)RAND_MAX);
            B[i] = ((point_t)rand() / (double)RAND_MAX);
        }

        struct timeval t1, t2;
        double elapsed_time;

        gettimeofday(&t1, NULL);
        point_t* C = dot(A, B);
        gettimeofday(&t2, NULL);

        elapsed_time = (t2.tv_sec - t1.tv_sec) * 1000.0;    // sec to ms
        elapsed_time += (t2.tv_usec - t1.tv_usec) / 1000.0; // us to ms

        printf("Time taken for %d size matrix multiplication: %lf\n", SIZE, elapsed_time/1000.0);

        free(A);
        free(B);
        free(C);
    }
    return 0;
}

At this point I should also mention SSE (Streaming SIMD Extensions), which is basically something you shouldn't do unless you've worked with assembly. Basically, you're vectorising your C code to work with vectors instead of scalars, which means you can operate on blocks of data instead of single values. When you drop down to intrinsics, the compiler largely translates your code as is, without doing its own optimizations. Done right, this can speed up your code like nothing before, though only by a constant factor: the O(n^3) complexity does not change. But SSE is easy to abuse, and most people unfortunately do, making the end result worse than before.

I hope this motivates you to dig deeper. The world of matrix multiplication is a large and fascinating one. Below, I attach links for further reading.

  1. OpenBLAS
  2. More about SSE
  3. Intel Intrinsics

