Difference in performance between MSVC and GCC for highly optimized matrix multiplication code

Since we've covered the alignment issue, I would guess it's this: http://en.wikipedia.org/wiki/Out-of-order_execution

Since g++ issues a standalone load instruction, your processor can reorder the instructions so that it is pre-fetching the next data it will need while also adding and multiplying. Because MSVC throws a memory operand at mul, the load and the multiply are tied to the same instruction, so changing the execution order of the instructions doesn't help anything.

EDIT: Intel's server(s) with all the docs are less angry today, so here's more research on why out of order execution is (part of) the answer.

First of all, it looks like your comment is completely right about the MSVC version of the multiplication instruction being able to decode into separate µ-ops that can be optimized by a CPU's out-of-order engine. The fun part here is that modern microcode sequencers are programmable, so the actual behavior is both hardware and firmware dependent.

The differences in the generated assembly seem to come from GCC and MSVC each trying to fight different potential bottlenecks. The GCC version tries to give leeway to the out-of-order engine (as we've already covered). The MSVC version, however, ends up taking advantage of a feature called "micro-op fusion". This exists because of the µ-op retirement limitations: the end of the pipeline can only retire 3 µ-ops per tick. Micro-op fusion, in specific cases, takes two µ-ops that must be done on two different execution units (i.e. memory read and arithmetic) and ties them to a single µ-op for most of the pipeline. The fused µ-op is only split into the two real µ-ops right before execution unit assignment. After the execution, the ops are fused again, allowing them to be retired as one.

The out of order engine only sees the fused µ-op, so it can't pull the load op away from the multiplication. This causes the pipeline to hang while waiting for the next operand to finish its bus ride.

ALL THE LINKS!!!:
http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

http://www.agner.org/optimize/microarchitecture.pdf

http://www.agner.org/optimize/optimizing_assembly.pdf

http://www.agner.org/optimize/instruction_tables.ods
(NOTE: Excel complains that this spreadsheet is partially corrupted or otherwise sketchy, so open it at your own risk. It doesn't seem to be malicious, though, and according to the rest of my research, Agner Fog is awesome. After I opted in to Excel's recovery step, I found it full of great data.)

http://cs.nyu.edu/courses/fall13/CSCI-GA.3033-008/Microprocessor-Report-Sandy-Bridge-Spans-Generations-243901.pdf

http://www.syncfusion.com/Content/downloads/ebook/Assembly_Language_Succinctly.pdf


MUCH LATER EDIT:
Wow, there have been some interesting updates to the discussion here. I guess I was mistaken about how much of the pipeline is actually affected by micro-op fusion. Maybe there is more perf gain than I expected from the differences in the loop condition check, where the unfused instructions allow GCC to interleave the compare and jump with the last vector load and arithmetic steps?

vmovups ymm9, YMMWORD PTR [rax-32]
cmp esi, edx
vmulps ymm0, ymm0, ymm9
vaddps ymm1, ymm1, ymm0
jg .L4

Is there any performance difference between Cygwin's GCC and the MSVC compiler on Windows?

I am just curious: is there any performance difference between them?

Sure, there are points where compilers compete:

  1. Compile speed
  2. Memory usage
  3. Generated code efficiency
  4. etc.

In my experience, the first 3 points go to MSVC. GCC on Windows (especially the Cygwin distribution) is damn slow at compiling, but I guess this is expected: GCC is cross-platform, has about 5 middle phases (transforming from one tree representation to another), has a pluggable architecture, and many other things that trade compile speed for flexibility. I don't have enough data on MSVC's architecture.

Memory usage is not a very significant difference, but MSVC still does a better job; I don't really have arguments for why, I'm just looking at Task Manager values.

Generated code efficiency is quite a fight. In many cases MSVC wins, but in some others GCC wins. Both are old compilers and have been improved with a LOT of optimizations. One big thing where GCC loses against MSVC is WPO (Whole Program Optimization). MSVC's has been mature for quite a long time, while AFAIK GCC's is still moving toward maturity (the 4.x series is getting better and better, but is not yet comparable to MSVC's).

... the Cygwin package creates a Linux-like environment on Windows and GCC then uses that, so is there any drawback to that?

Actually, not really. You can use GCC without any Unix emulation environment. The MinGW distribution can be used standalone, without MSYS. MSYS is still required, however, if you want to compile programs written with the GNU style in mind (e.g. the "./configure && make && make install" style).

Do GCC on Linux and GCC with Cygwin on Windows have any performance difference?

Yes. GCC seems to run a lot faster on Linux than on Windows, but it's not always GCC itself that is to blame. Process creation on Windows is a lot more complicated than on Linux (compare CreateProcess from the WinAPI with exec from Unix), and in general it causes slower execution on Windows than on Linux for ANY program.
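
Not from the original answer, just a rough sketch of the two APIs being compared: spawning a child process is a fork()/exec() pair on Unix, while CreateProcess() on Windows sets up the whole process in one heavier call. The commands are placeholders and error handling is omitted.

// Minimal sketch of spawning a child process on each platform.
#ifdef _WIN32
#include <windows.h>
int main() {
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    char cmd[] = "cmd.exe /c echo hello";          // mutable buffer required by CreateProcess
    CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi);
    WaitForSingleObject(pi.hProcess, INFINITE);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
}
#else
#include <unistd.h>
#include <sys/wait.h>
int main() {
    pid_t pid = fork();                            // duplicate the current process
    if (pid == 0) {
        execlp("echo", "echo", "hello", (char*)NULL);  // replace the child's image
        _exit(127);
    }
    waitpid(pid, NULL, 0);                         // parent waits for the child
}
#endif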

Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

My colleague helped me find a plausible answer to my question. He noticed the importance of the 256 byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).


Short answer:

Is it the padding that is the culprit in this case? Why and how?

It all boils down to alignment. Alignments can have a significant impact on the performance, that is why we have the -falign-* flags in the first place.

I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 byte by default but try to align it to 16 byte if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment and the generated code does not show this weird behavior.

Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.

The noise it makes pretty much makes timing micro-optimizations
impossible.

How can I make sure that such accidental lucky / unlucky alignments
are not interfering when I do micro-optimizations (unrelated to stack
alignment) on C or C++ source codes?

Simply by telling gcc to do the right alignment:

g++ -O2 -falign-functions=16 -falign-loops=16


Long answer:

The code will run slower if:

  • an XX byte boundary cuts add() in the middle (XX being machine dependent).

  • if the call to add() has to jump over an XX byte boundary and the target is not aligned.

  • if add() is not aligned.

  • if the loop is not aligned.

The first two are beautifully visible in the code and results that Marat Dukhan kindly posted. In the case of gcc-4.8.1 -Os (executes in 0.994 secs):

00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret

a 256 byte boundary cuts add() right in the middle and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!

In the case of gcc-4.7.3 -Os (executes in 0.822 secs), the 256 byte boundary only cuts into a cold section (neither the loop nor add() is cut):

00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

[...]

  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>

Nothing is aligned, and the call to add() has to jump over the 256 byte boundary. This code is the second slowest.

In the case of gcc-4.6.4 -Os (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256 byte boundary and the target is exactly 32 bytes away:

  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>

This is the fastest of all three. Why the 256 byte boundary is special on his machine, I will leave up to him to figure out. I don't have such a processor.

Now, on my machine I don't get this 256 byte boundary effect. Only the function and loop alignment kick in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16; the code is not sensitive to that either.

I first noticed in 2009 that gcc (at least on my projects and on my
machines) has the tendency to generate noticeably faster code if I
optimize for size (-Os) instead of speed (-O2 or -O3), and I have been
wondering ever since why.

A likely explanation is that I had hotspots which were sensitive to the alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.

Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of such a tiny function like add() fail?

Consider this:

// add.cpp
int add(const int& x, const int& y) {
    return x + y;
}

and in a separate file:

// main.cpp
int add(const int& x, const int& y);

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

and compiled as: g++ -O2 add.cpp main.cpp.

      gcc won't inline add()!

That's all, it's that easy to unintentionally create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link time optimization, the code runs in 0.19s!

(Inlining is artificially disabled in the OP, hence, the code in the OP was 2x slower).

Can't get over 50% max. theoretical performance on matrix multiply

Packing

You appear to be packing the block of the A matrix too often. You do

rpack(locA, A + ii*n + kk, kc, mc, mr, n);

But this only depends on ii and kk and not on jj, yet it's inside the inner loop over jj, so you repack the same thing for each iteration of jj. I don't think that's necessary. In my code I do the packing before the matrix multiplication. It's probably more efficient to pack inside the matrix multiplication while the values are still in the cache, but it's trickier to do. Packing is an O(n^2) operation and matrix multiplication is an O(n^3) operation, so it's not very inefficient to pack outside of the matrix multiplication for large matrices (I know that from testing as well: commenting out the packing only changes the efficiency by a few percent). However, by repacking with rpack on each jj iteration you have effectively made packing an O(n^3) operation.
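
For illustration, a structural sketch of hoisting the pack out of the jj loop. Only the rpack call and the ii/jj/kk loop variables come from the question; the rpack signature, the blocking parameters mc/kc/nc/mr, and the elided inner kernel are my assumptions.

// Sketch only: pack the A block once per (ii, kk) pair instead of once per jj.
void rpack(float *dst, const float *src, int kc, int mc, int mr, int n); // signature assumed from the call

void blocked_gemm_sketch(const float *A, float *locA, int n,
                         int mc, int kc, int nc, int mr) {
    for (int ii = 0; ii < n; ii += mc) {              // mc: assumed row block size
        for (int kk = 0; kk < n; kk += kc) {          // kc: assumed depth block size
            rpack(locA, A + ii*n + kk, kc, mc, mr, n);   // depends only on ii and kk
            for (int jj = 0; jj < n; jj += nc) {      // nc: assumed column block size
                // pack the B block and run the inner kernel here;
                // locA is reused unchanged for every jj iteration
            }
        }
    }
}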

Wall Time

You want the wall time. On Unix the clock() function does not return the wall time (though it does on Windows with MSVC); it returns the CPU time accumulated across all threads. This is one of the most common errors I have seen on SO for OpenMP.

Use omp_get_wtime() to get the wall time.

Note that I don't know how the clock() function works with MinGW or MinGW-w64 (they are separate projects). MinGW links to MSVCRT, so I would guess that clock() with MinGW returns the wall time as it does with MSVC. However, MinGW-w64 does not link to MSVCRT (as far as I understand it links to something like glibc). It's possible that clock() in MinGW-w64 behaves the same as clock() does on Unix.
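
A minimal sketch of the difference (mine, not from the original answer): on Linux with GCC, the clock() value grows roughly with the number of threads, while omp_get_wtime() measures elapsed wall time.

// Compile with: g++ -O2 -fopenmp timing_sketch.cpp
#include <omp.h>
#include <cstdio>
#include <ctime>

int main() {
    double wall0 = omp_get_wtime();     // wall-clock start
    std::clock_t cpu0 = std::clock();   // CPU time start (summed over threads on Unix)

    #pragma omp parallel
    {
        volatile double sink = 0.0;     // busy work so the timers have something to measure
        for (long i = 0; i < 200000000L; ++i) sink += i * 0.5;
    }

    double wall = omp_get_wtime() - wall0;
    double cpu  = double(std::clock() - cpu0) / CLOCKS_PER_SEC;
    // On Linux, cpu is roughly wall * (number of threads); with MSVC, clock()
    // reports elapsed wall time instead.
    std::printf("wall %.2f s, clock() %.2f s\n", wall, cpu);
}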

Hyper Threading

Hyper-threading works well for code that stalls the CPU often. That's actually the majority of code, because it's very difficult to write code that does not stall the CPU. That's why Intel invented Hyper-Threading: it's easier to task switch and give the CPU something else to do than to optimize the code. However, for code that is highly optimized, hyper-threading can actually give worse results. In my own matrix multiplication code that's certainly the case. Set the number of threads to the number of physical cores you have (two in your case).
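
For example (a trivial sketch, not from the original answer), you can override OpenMP's default thread count, which usually equals the number of logical (hyper-threaded) cores:

// Compile with: g++ -O2 -fopenmp threads_sketch.cpp
#include <omp.h>
#include <cstdio>

int main() {
    std::printf("default threads: %d\n", omp_get_max_threads()); // typically logical cores
    omp_set_num_threads(2);   // physical core count in the question's case
    std::printf("using threads:   %d\n", omp_get_max_threads());
}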

My Code

Below is my code. I did not include the inner64 function here. You can find it at Difference in performance between MSVC and GCC for highly optimized matrix multiplication code (where it has the obnoxious and misleading name AddDot4x4_vec_block_8wide).

I wrote this code before reading the Goto paper and also before reading Agner Fog's optimization manuals. You appear to reorder/pack the matrices in the main loop; that probably makes more sense. I don't think I reorder them the same way you do, and I also only reorder one of the input matrices (B), not both as you do.

The performance of this code on my system (Xeon E5-1620 @ 3.6 GHz) with Linux and GCC is about 75% of the peak for this matrix size (4096x4096). Intel's MKL gets about 94% of the peak on my system for this matrix size, so there is clearly room for improvement.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <immintrin.h>

extern "C" void inner64(const float *a, const float *b, float *c);
void (*fp)(const float *a, const float *b, float *c) = inner64;

void reorder(float * __restrict a, float * __restrict b, int n, int bs) {
    int nb = n/bs;
    #pragma omp parallel for
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    b[bs*bs*(nb*i+j) + bs*i2+j2] = a[bs*(i*n+j) + i2*n + j2];
                }
            }
        }
    }
}

inline void gemm_block(float * __restrict a, float * __restrict b, float * __restrict c, int n, int n2) {
    for(int i=0; i<n2; i++) {
        fp(&a[i*n], b, &c[i*n]);
    }
}

void gemm(float * __restrict a, float * __restrict b, float * __restrict c, int n, int bs) {
    int nb = n/bs;
    float *b2 = (float*)_mm_malloc(sizeof(float)*n*n,64);
    reorder(b,b2,n,bs);
    #pragma omp parallel for
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int k=0; k<nb; k++) {
                gemm_block(&a[bs*(i*n+k)],&b2[bs*bs*(k*nb+j)],&c[bs*(i*n+j)], n, bs);
            }
        }
    }
    _mm_free(b2);
}

int main() {
    float peak = 1.0f*8*4*2*3.69f;
    const int n = 4096;
    float flop = 2.0f*n*n*n*1E-9f;
    omp_set_num_threads(4);

    float *a = (float*)_mm_malloc(sizeof(float)*n*n,64);
    float *b = (float*)_mm_malloc(sizeof(float)*n*n,64);
    float *c = (float*)_mm_malloc(sizeof(float)*n*n,64);
    for(int i=0; i<n*n; i++) {
        a[i] = 1.0f*rand()/RAND_MAX;
        b[i] = 1.0f*rand()/RAND_MAX;
    }

    gemm(a,b,c,n,64); //warm OpenMP up
    while(1) {
        for(int i=0; i<n*n; i++) c[i] = 0;
        double dtime = omp_get_wtime();
        gemm(a,b,c,n,64);
        dtime = omp_get_wtime() - dtime;
        printf("time %.2f s, efficiency %.2f%%\n", dtime, 100*flop/dtime/peak);
    }
}

qmake: handling options for both gcc and msvc

Jean, to be precise, you should use this based on your description:

msvc:QMAKE_CXXFLAGS_RELEASE += /O2 /openmp /arch:AVX
gcc:QMAKE_CXXFLAGS_RELEASE += -O3 -march=native -fopenmp -D_GLIBCXX_PARALLEL

How to optimize my C++ OpenMP Matrix Multiplication code

The following things can be done for a speedup:

  1. Use OpenMP to parallelize the outer loop, like you did (and like I also did in my code below). Alternatively, use std::async for multi-threading, as was done in another answer.

  2. Transpose the B matrix. This helps to increase L1 cache hits, because you then read each B column (a row in the transposed variant) from sequential memory.

  3. Use vectorized SIMD instructions. This allows several multiplications (and additions) to be done within one CPU cycle. Compilers often auto-vectorize your loops well enough with SIMD instructions without your help, but I did it explicitly in my code.

  4. Run several independent SIMD instructions within the loop. This helps to keep the CPU's whole SIMD pipeline occupied. I did so in my code by using four SIMD registers r0, r1, r2, r3. In compilers this is usually called loop unrolling. (A compact sketch of points 1, 2 and 4 follows this list.)

  5. Align your matrix's starting address on a 64-byte boundary. This helps SIMD instructions to do fast aligned reads/writes.

  6. Align the starting address of each matrix row on a 64-byte boundary. I did this in my code by padding each row with zeros up to a multiple of 64 bytes. This also helps SIMD instructions to do fast aligned reads/writes.
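
Before the full templated SIMD implementation further below, here is a compact sketch (mine, not part of the original answer) of points 1, 2 and 4 only: OpenMP over the outer loop, a transposed copy of B, and four independent accumulators, with vectorization left to the compiler. The function name MulSketch is mine, the matrices are assumed square and n divisible by 4 for brevity.

#include <vector>
#include <cstddef>
#include <omp.h>

// C = A * B for square n x n row-major matrices, n assumed divisible by 4.
void MulSketch(const float* A, const float* B, float* C, std::size_t n) {
    std::vector<float> Bt(n * n);                 // point 2: transposed copy of B
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            Bt[j * n + i] = B[i * n + j];

    #pragma omp parallel for                      // point 1: parallelize the outer loop
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float s0 = 0, s1 = 0, s2 = 0, s3 = 0; // point 4: four independent accumulators
            const float* a = &A[i * n];
            const float* b = &Bt[j * n];
            for (std::size_t k = 0; k < n; k += 4) {
                s0 += a[k + 0] * b[k + 0];
                s1 += a[k + 1] * b[k + 1];
                s2 += a[k + 2] * b[k + 2];
                s3 += a[k + 3] * b[k + 3];
            }
            C[i * n + j] = s0 + s1 + s2 + s3;
        }
    }
}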

In the code below I did all of steps 1-6 above. The 64-byte memory alignment I did through implementing an AlignmentAllocator that is used with std::vector. I also did time measurements for float/double/int.

On my old 4-core laptop I got the following time measurements for the case of multiplying a 1000x1000 matrix by a 1000x1000 matrix:

 float: time 0.1569 sec
double: time 0.3168 sec
int: time 0.1565 sec

To compare my hardware's capabilities, I measured another answer, by @doug, for the int case:

Threads w transpose   0.2164 secs.

As one can see, my solution is 1.4x faster than the other answer, I guess due to the 64-byte memory alignment and maybe due to using explicit SIMD (instead of relying on compiler auto-vectorization of a loop).

To compile my program, don't forget to add the -fopenmp -lgomp options (for OpenMP support) and the -march=native -O3 -std=c++20 options (for SIMD support, optimizations and the standard) if you're compiling under GCC/Clang, while MSVC, I guess, adds OpenMP automatically and doesn't need any special options (use /O2 /GL /std:c++latest for optimizations and the standard in MSVC).

In my code I only implemented SSE2/SSE4/AVX/AVX2 instructions for SIMD; if you have a more powerful machine you may tell me and I'll also implement FMA/AVX-512, which will give roughly another 2x speed boost.
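
As a rough illustration of the FMA variant mentioned above (my sketch, not part of the original answer), the AVX MulAddReg overloads in the code below could be replaced by fused multiply-adds when __FMA__ is available, assuming the same contract (64-byte-aligned pointers, products accumulated into c):

#include <immintrin.h>

#ifdef __FMA__
// Sketch only: drop-in FMA replacements for the AVX MulAddReg overloads below.
inline void MulAddReg(float const * a, float const * b, __m256 & c) {
    c = _mm256_fmadd_ps(_mm256_load_ps(a), _mm256_load_ps(b), c);
}
inline void MulAddReg(double const * a, double const * b, __m256d & c) {
    c = _mm256_fmadd_pd(_mm256_load_pd(a), _mm256_load_pd(b), c);
}
#endif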

My Mul() function is quite generic: it is templated, and you just pass pointers to the matrices and the row/column counts, so your matrices may be stored on the calling side in any way (through std::vector or std::array or a plain 2D array).

At the start of the Run() function you may change the number of rows and columns if you need a bigger test. Note that all my functions support any number of rows and columns; you may even multiply a matrix of size 1234x2345 by one of size 2345x3456.
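
For reference, a minimal usage sketch (mine, not part of the original answer), assuming the Mul() template defined in the code below and square float matrices stored in std::vector:

#include <cstddef>
#include <vector>

// Forward declaration mirroring the templated Mul() defined in the code below.
template <typename T>
void Mul(T const * A0, std::size_t A_rows, std::size_t A_cols,
         T const * B0, std::size_t B_rows, std::size_t B_cols, T * C);

int main() {
    const std::size_t n = 256;
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
    Mul(A.data(), n, n,     // A and its row/column counts
        B.data(), n, n,     // B and its row/column counts
        C.data());          // result C, n x n, row-major
}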

Try it online!

#include <cstdint>
#include <cstdlib>   // std::aligned_alloc / std::free used by AlignmentAllocator
#include <cstring>
#include <stdexcept>
#include <iostream>
#include <iomanip>
#include <vector>
#include <memory>
#include <string>

#include <immintrin.h>

#define USE_OPENMP 1
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#if defined(_MSC_VER)
#define IS_MSVC 1
#else
#define IS_MSVC 0
#endif

#if USE_OPENMP
#include <omp.h>
#endif

template <typename T, std::size_t N>
class AlignmentAllocator {
public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;

public:
    inline AlignmentAllocator() throw() {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
    inline ~AlignmentAllocator() throw() {}
    inline pointer adress(reference r) { return &r; }
    inline const_pointer adress(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & wert);
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
    bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};

template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
#if IS_MSVC
    auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
#else
    auto p = (pointer)std::aligned_alloc(N, n * sizeof(value_type));
#endif
    ASSERT(p);
    return p;
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
#if IS_MSVC
    _aligned_free(p);
#else
    std::free(p);
#endif
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
    new (p) value_type(wert);
}

template <typename T>
using AlignedVector = std::vector<T, AlignmentAllocator<T, 64>>;

template <typename T>
struct RegT;

#ifdef __AVX__
template <> struct RegT<float> { static size_t constexpr bisize = 256; using type = __m256; static type zero() { return _mm256_setzero_ps(); } };
template <> struct RegT<double> { static size_t constexpr bisize = 256; using type = __m256d; static type zero() { return _mm256_setzero_pd(); } };

inline void MulAddReg(float const * a, float const * b, __m256 & c) {
    c = _mm256_add_ps(c, _mm256_mul_ps(_mm256_load_ps(a), _mm256_load_ps(b)));
}
inline void MulAddReg(double const * a, double const * b, __m256d & c) {
    c = _mm256_add_pd(c, _mm256_mul_pd(_mm256_load_pd(a), _mm256_load_pd(b)));
}

inline void StoreReg(float * dst, __m256 const & src) { _mm256_store_ps(dst, src); }
inline void StoreReg(double * dst, __m256d const & src) { _mm256_store_pd(dst, src); }
#else // SSE2
template <> struct RegT<float> { static size_t constexpr bisize = 128; using type = __m128; static type zero() { return _mm_setzero_ps(); } };
template <> struct RegT<double> { static size_t constexpr bisize = 128; using type = __m128d; static type zero() { return _mm_setzero_pd(); } };

inline void MulAddReg(float const * a, float const * b, __m128 & c) {
    c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
}
inline void MulAddReg(double const * a, double const * b, __m128d & c) {
    c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(a), _mm_load_pd(b)));
}

inline void StoreReg(float * dst, __m128 const & src) { _mm_store_ps(dst, src); }
inline void StoreReg(double * dst, __m128d const & src) { _mm_store_pd(dst, src); }
#endif

#ifdef __AVX2__
template <> struct RegT<int32_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
//template <> struct RegT<int64_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };

inline void MulAddReg(int32_t const * a, int32_t const * b, __m256i & c) {
    c = _mm256_add_epi32(c, _mm256_mullo_epi32(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
}
//inline void MulAddReg(int64_t const * a, int64_t const * b, __m256i & c) {
//    c = _mm256_add_epi64(c, _mm256_mullo_epi64(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
//}

inline void StoreReg(int32_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
//inline void StoreReg(int64_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
#else // SSE2
template <> struct RegT<int32_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
//template <> struct RegT<int64_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };

inline void MulAddReg(int32_t const * a, int32_t const * b, __m128i & c) {
    c = _mm_add_epi32(c, _mm_mullo_epi32(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
}
//inline void MulAddReg(int64_t const * a, int64_t const * b, __m128i & c) {
//    c = _mm_add_epi64(c, _mm_mullo_epi64(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
//}

inline void StoreReg(int32_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
//inline void StoreReg(int64_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
#endif

template <typename T>
void Mul(T const * A0, size_t A_rows, size_t A_cols, T const * B0, size_t B_rows, size_t B_cols, T * C) {
    size_t constexpr reg_cnt = RegT<T>::bisize / 8 / sizeof(T), block = 4 * reg_cnt;
    ASSERT(A_cols == B_rows);
    size_t const A_cols_aligned = (A_cols + block - 1) / block * block, B_rows_aligned = (B_rows + block - 1) / block * block;

    // Copy aligned A
    AlignedVector<T> Av(A_rows * A_cols_aligned);
    for (size_t i = 0; i < A_rows; ++i)
        std::memcpy(&Av[i * A_cols_aligned], &A0[i * A_cols], sizeof(Av[0]) * A_cols);
    T const * A = Av.data();
    // Transpose B
    AlignedVector<T> Bv(B_cols * B_rows_aligned);
    for (size_t j = 0; j < B_cols; ++j)
        for (size_t i = 0; i < B_rows; ++i)
            Bv[j * B_rows_aligned + i] = B0[i * B_cols + j];
    T const * Bt = Bv.data();
    ASSERT(uintptr_t(A) % 64 == 0 && uintptr_t(Bt) % 64 == 0);
    ASSERT(uintptr_t(&A[A_cols_aligned]) % 64 == 0 && uintptr_t(&Bt[B_rows_aligned]) % 64 == 0);

    // Multiply
    #pragma omp parallel for
    for (size_t i = 0; i < A_rows; ++i) {
        // Aligned Reg storage
        AlignedVector<T> Regs(block);

        for (size_t j = 0; j < B_cols; ++j) {
            T const * Arow = &A[i * A_cols_aligned + 0], * Btrow = &Bt[j * B_rows_aligned + 0];

            using Reg = typename RegT<T>::type;
            Reg r0 = RegT<T>::zero(), r1 = RegT<T>::zero(), r2 = RegT<T>::zero(), r3 = RegT<T>::zero();

            size_t const k_hi = A_cols - A_cols % block;

            for (size_t k = 0; k < k_hi; k += block) {
                MulAddReg(&Arow[k + reg_cnt * 0], &Btrow[k + reg_cnt * 0], r0);
                MulAddReg(&Arow[k + reg_cnt * 1], &Btrow[k + reg_cnt * 1], r1);
                MulAddReg(&Arow[k + reg_cnt * 2], &Btrow[k + reg_cnt * 2], r2);
                MulAddReg(&Arow[k + reg_cnt * 3], &Btrow[k + reg_cnt * 3], r3);
            }

            StoreReg(&Regs[reg_cnt * 0], r0);
            StoreReg(&Regs[reg_cnt * 1], r1);
            StoreReg(&Regs[reg_cnt * 2], r2);
            StoreReg(&Regs[reg_cnt * 3], r3);

            T sum1 = 0, sum2 = 0, sum3 = 0;
            for (size_t k = 0; k < Regs.size(); ++k)
                sum1 += Regs[k];

            //for (size_t k = 0; k < A_cols - A_cols % block; ++k) sum3 += Arow[k] * Btrow[k];

            for (size_t k = k_hi; k < A_cols; ++k)
                sum2 += Arow[k] * Btrow[k];

            C[i * B_cols + j] = sum2 + sum1;   // row stride of C is B_cols (C is A_rows x B_cols)
        }
    }
}

#include <random>
#include <thread>
#include <chrono>
#include <type_traits>

template <typename T>
void Test(T const * A, size_t A_rows, size_t A_cols, T const * B, size_t B_rows, size_t B_cols, T const * C, T eps) {
    for (size_t i = 0; i < A_rows / 16; ++i)
        for (size_t j = 0; j < B_cols / 16; ++j) {
            T sum = 0;
            for (size_t k = 0; k < A_cols; ++k)
                sum += A[i * A_cols + k] * B[k * B_cols + j];
            // C is indexed with row stride B_cols, consistent with Mul() above.
            ASSERT_MSG(std::abs(C[i * B_cols + j] - sum) <= eps * A_cols, "i " + std::to_string(i) + " j " + std::to_string(j) +
                " C " + std::to_string(C[i * B_cols + j]) + " ref " + std::to_string(sum));
        }
}

double Time() {
    static auto const gtb = std::chrono::high_resolution_clock::now();
    // Seconds elapsed since the first call to Time().
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

