Why Is This SIMD Multiplication Not Faster Than Non-SIMD Multiplication

Why is this SIMD multiplication not faster than non-SIMD multiplication?

There was a major bug in the timing function I used for previous benchmarks. This grossly underestimated the bandwidth without vectorization, as well as other measurements. Additionally, the bandwidth was being overestimated because of copy-on-write (COW) on an array that was read but never written. Finally, the maximum bandwidth I used was incorrect. I have updated my answer with the corrections and left the old answer at the end.


Your operation is memory bandwidth bound. This means the CPU is spending most of its time waiting on slow memory reads and writes. An excellent explanation for this can be found here: Why vectorizing the loop does not have performance improvement.

However, I have to disagree slightly with one statement in that answer.

So regardless of how it's optimized, (vectorized, unrolled, etc...) it isn't gonna get much faster.

In fact, vectorization, unrolling, and multiple threads can significantly increase the bandwidth even in memory bandwidth bound operations. The reason is that it is difficult to obtain the maximum memory bandwidth. A good explanation for this can be found here: https://stackoverflow.com/a/25187492/2542702.

The rest of my answer will show how vectorization and multiple threads can get closer to the maximum memory bandwidth.

My test system: Ubuntu 16.10, Skylake (i7-6700HQ @ 2.60 GHz), 32 GB RAM, dual-channel DDR4-2400. The maximum bandwidth of my system is 38.4 GB/s.

From the code below I produce the following tables. I set the number of threads using OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=4. The efficiency is bandwidth/max_bandwidth.

-O2 -march=native -fopenmp
Threads Efficiency
1 59.2%
2 76.6%
4 74.3%
8 70.7%

-O2 -march=native -fopenmp -funroll-loops
Threads Efficiency
1 55.8%
2 76.5%
4 72.1%
8 72.2%

-O3 -march=native -fopenmp
Threads Efficiency
1 63.9%
2 74.6%
4 63.9%
8 63.2%

-O3 -march=native -fopenmp -mprefer-avx128
Threads Efficiency
1 67.8%
2 76.0%
4 63.9%
8 63.2%

-O3 -march=native -fopenmp -mprefer-avx128 -funroll-loops
Threads Efficiency
1 68.8%
2 73.9%
4 69.0%
8 66.8%

After several runs, given the uncertainty in the measurements, I have formed the following conclusions:

  • single threaded scalar operations get more than 50% of the bandwidth.
  • two threaded scalar operations get the highest bandwidth.
  • single threaded vector operations are faster than single threaded scalar operations.
  • single threaded SSE operations are faster than single threaded AVX operations.
  • unrolling is not helpful.
  • unrolling single-threaded operations is slower than without unrolling.
  • more threads than cores (Hyper-Threading) gives a lower bandwidth.

The solution that gives the best bandwidth is scalar operations with two threads.

The code I used to benchmark:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>

#define N 10000000
#define R 100

void mul(double *a, double *b) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] *= b[i];
}

int main() {
    double maxbw = 2.4*2*8;                  // 2.4 GHz * 2 channels * 64 bits * 1 byte/8 bits = 38.4 GB/s
    double mem = 3*sizeof(double)*N*R*1E-9;  // GB moved: read a, read b, write a

    double *a = (double*)malloc(sizeof *a * N);
    double *b = (double*)malloc(sizeof *b * N);

    //due to copy-on-write b must be initialized to get the correct bandwidth
    //also, GCC will convert malloc + memset(0) to calloc so use memset(1)
    memset(b, 1, sizeof *b * N);

    double dtime = -omp_get_wtime();
    for (int i = 0; i < R; i++) mul(a, b);
    dtime += omp_get_wtime();
    printf("%.2f s, %.1f GB/s, %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);

    free(a), free(b);
}

The old solution with the timing bug

The modern solution for inline assembly is to use intrinsics. There are still cases where one needs inline assembly but this is not one of them.

One intrinsics solution for your inline assembly approach is simply:

void mul_SSE(double* a, double* b) {
    for (int i = 0; i < N/2; i++)
        _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]), _mm_load_pd(&b[2*i])));
}
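For comparison, a 256-bit AVX variant would look like the sketch below. This is my addition, not code from the original answer; it assumes the arrays are 32-byte aligned, reuses the same N macro as the test code, and assumes N is divisible by 4.

#include <immintrin.h>

// Hedged sketch: the same in-place multiply with 256-bit AVX intrinsics.
// Assumes a and b are 32-byte aligned and N is a multiple of 4.
void mul_AVX(double *a, double *b) {
    for (int i = 0; i < N/4; i++)
        _mm256_store_pd(&a[4*i],
            _mm256_mul_pd(_mm256_load_pd(&a[4*i]), _mm256_load_pd(&b[4*i])));
}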

Let me define some test code

#include <x86intrin.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>

#define N 1000000
#define R 1000

typedef __attribute__(( aligned(32))) double aligned_double;
void (*fp)(aligned_double *a, aligned_double *b);

void mul(aligned_double* __restrict a, aligned_double* __restrict b) {
    for (int i = 0; i < N; i++) a[i] *= b[i];
}

void mul_SSE(double* a, double* b) {
    for (int i = 0; i < N/2; i++)
        _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]), _mm_load_pd(&b[2*i])));
}

void mul_SSE_NT(double* a, double* b) {
    for (int i = 0; i < N/2; i++)
        _mm_stream_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]), _mm_load_pd(&b[2*i])));
}

void mul_SSE_OMP(double* a, double* b) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] *= b[i];
}

void test(aligned_double *a, aligned_double *b, const char *name) {
    double dtime;
    const double mem = 3*sizeof(double)*N*R/1024/1024/1024;
    const double maxbw = 34.1;
    dtime = -omp_get_wtime();
    for (int i = 0; i < R; i++) fp(a, b);
    dtime += omp_get_wtime();
    printf("%s \t time %.2f s, %.1f GB/s, efficiency %.1f%%\n", name, dtime, mem/dtime, 100*mem/dtime/maxbw);
}

int main() {
    double *a = (double*)_mm_malloc(sizeof *a * N, 32);
    double *b = (double*)_mm_malloc(sizeof *b * N, 32);

    //b must be initialized to get the correct bandwidth!!!
    memset(a, 1, sizeof *a * N);
    memset(b, 1, sizeof *b * N);

    fp = mul,         test(a, b, "mul        ");
    fp = mul_SSE,     test(a, b, "mul_SSE    ");
    fp = mul_SSE_NT,  test(a, b, "mul_SSE_NT ");
    fp = mul_SSE_OMP, test(a, b, "mul_SSE_OMP");

    _mm_free(a), _mm_free(b);
}

Now the first test

g++ -O2 -fopenmp test.cpp
./a.out
mul time 1.67 s, 13.1 GB/s, efficiency 38.5%
mul_SSE time 1.00 s, 21.9 GB/s, efficiency 64.3%
mul_SSE_NT time 1.05 s, 20.9 GB/s, efficiency 61.4%
mul_SSE_OMP time 0.74 s, 29.7 GB/s, efficiency 87.0%

So with -O2, which does not vectorize loops, we see that the intrinsic SSE version is much faster than the plain C solution mul. efficiency = bandwidth_measured/max_bandwidth, where the max is 34.1 GB/s for my system.

Second test

g++ -O3 -fopenmp test.cpp
./a.out
mul time 1.05 s, 20.9 GB/s, efficiency 61.2%
mul_SSE time 0.99 s, 22.3 GB/s, efficiency 65.3%
mul_SSE_NT time 1.01 s, 21.7 GB/s, efficiency 63.7%
mul_SSE_OMP time 0.68 s, 32.5 GB/s, efficiency 95.2%

With -O3 GCC vectorizes the loop, and the intrinsic function offers essentially no advantage.

Third test

g++ -O3 -fopenmp -funroll-loops test.cpp
./a.out
mul time 0.85 s, 25.9 GB/s, efficiency 76.1%
mul_SSE time 0.84 s, 26.2 GB/s, efficiency 76.7%
mul_SSE_NT time 1.06 s, 20.8 GB/s, efficiency 61.0%
mul_SSE_OMP time 0.76 s, 29.0 GB/s, efficiency 85.0%

With -funroll-loops GCC unrolls the loop eight times and we see a significant improvement, except for the non-temporal store solution, and no real advantage for the OpenMP solution.

Before unrolling the loop, the assembly for mul with -O3 is

    xor     eax, eax
.L2:
movupd xmm0, XMMWORD PTR [rsi+rax]
mulpd xmm0, XMMWORD PTR [rdi+rax]
movaps XMMWORD PTR [rdi+rax], xmm0
add rax, 16
cmp rax, 8000000
jne .L2
rep ret

With -O3 -funroll-loops the assembly for mul is:

   xor     eax, eax
.L2:
movupd xmm0, XMMWORD PTR [rsi+rax]
movupd xmm1, XMMWORD PTR [rsi+16+rax]
mulpd xmm0, XMMWORD PTR [rdi+rax]
movupd xmm2, XMMWORD PTR [rsi+32+rax]
mulpd xmm1, XMMWORD PTR [rdi+16+rax]
movupd xmm3, XMMWORD PTR [rsi+48+rax]
mulpd xmm2, XMMWORD PTR [rdi+32+rax]
movupd xmm4, XMMWORD PTR [rsi+64+rax]
mulpd xmm3, XMMWORD PTR [rdi+48+rax]
movupd xmm5, XMMWORD PTR [rsi+80+rax]
mulpd xmm4, XMMWORD PTR [rdi+64+rax]
movupd xmm6, XMMWORD PTR [rsi+96+rax]
mulpd xmm5, XMMWORD PTR [rdi+80+rax]
movupd xmm7, XMMWORD PTR [rsi+112+rax]
mulpd xmm6, XMMWORD PTR [rdi+96+rax]
movaps XMMWORD PTR [rdi+rax], xmm0
mulpd xmm7, XMMWORD PTR [rdi+112+rax]
movaps XMMWORD PTR [rdi+16+rax], xmm1
movaps XMMWORD PTR [rdi+32+rax], xmm2
movaps XMMWORD PTR [rdi+48+rax], xmm3
movaps XMMWORD PTR [rdi+64+rax], xmm4
movaps XMMWORD PTR [rdi+80+rax], xmm5
movaps XMMWORD PTR [rdi+96+rax], xmm6
movaps XMMWORD PTR [rdi+112+rax], xmm7
sub rax, -128
cmp rax, 8000000
jne .L2
rep ret

Fourth test

g++ -O3 -fopenmp -mavx test.cpp
./a.out
mul time 0.87 s, 25.3 GB/s, efficiency 74.3%
mul_SSE time 0.88 s, 24.9 GB/s, efficiency 73.0%
mul_SSE_NT time 1.07 s, 20.6 GB/s, efficiency 60.5%
mul_SSE_OMP time 0.76 s, 29.0 GB/s, efficiency 85.2%

Now the non-intrinsic function is the fastest (excluding the OpenMP version).

So there is no reason to use intrinsics or inline assembly in this case because we can get the best performance with appropriate compiler options (e.g. -O3, -funroll-loops, -mavx).

Test system: Ubuntu 16.10, Skylake (i7-6700HQ@2.60GHz), 32GB RAM. Maximum memory bandwidth (34.1 GB/s) https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz


Here is another solution worth considering. The cmp instruction is not necessary if we count from -N up to zero and access the arrays as N+i. GCC should have fixed this a long time ago. It eliminates one instruction (though due to macro-op fusion the cmp and jmp often count as one micro-op).

void mul_SSE_v2(double* a, double* b) {
    for (ptrdiff_t i = -N; i < 0; i += 2)
        _mm_store_pd(&a[N + i], _mm_mul_pd(_mm_load_pd(&a[N + i]), _mm_load_pd(&b[N + i])));
}

Assembly with -O3

mul_SSE_v2(double*, double*):
mov rax, -1000000
.L9:
movapd xmm0, XMMWORD PTR [rdi+8000000+rax*8]
mulpd xmm0, XMMWORD PTR [rsi+8000000+rax*8]
movaps XMMWORD PTR [rdi+8000000+rax*8], xmm0
add rax, 2
jne .L9
rep ret

This optimization is likely only helpful if the arrays fit in e.g. the L1 cache, i.e. when not reading from main memory.


I finally found a way to get the plain C solution to not generate the cmp instruction.

void mul_v2(aligned_double* __restrict a, aligned_double* __restrict b) {
    for (int i = -N; i < 0; i++) a[i] *= b[i];
}

And then call the function from a separate object file like this: mul_v2(&a[N], &b[N]). This is perhaps the best solution. However, if you call the function from the same object file (translation unit) as the one it's defined in, GCC generates the cmp instruction again.

Also,

void mul_v3(aligned_double* __restrict a, aligned_double* __restrict b) {
    for (int i = -N; i < 0; i++) a[N+i] *= b[N+i];
}

still generates the cmp instruction and generates the same assembly as the mul function.


The function mul_SSE_NT is silly. It uses non-temporal stores, which are only useful when you only write to memory. Since the function reads and writes the same addresses, non-temporal stores are not just useless here, they give inferior results.
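For contrast, here is a hedged sketch (mine, not from the original code) of the kind of pattern where non-temporal stores do pay off: a pure streaming write to a separate destination that will not be read again soon.

#include <emmintrin.h>  // _mm_stream_pd, _mm_mul_pd, _mm_load_pd, _mm_set1_pd
#include <xmmintrin.h>  // _mm_sfence

// Hedged sketch: scale src into a fresh buffer dst with streaming stores.
// Assumes dst and src are 16-byte aligned and n is a multiple of 2.
void scale_to_new_buffer_NT(double *dst, const double *src, double s, long n) {
    __m128d vs = _mm_set1_pd(s);
    for (long i = 0; i < n; i += 2)
        _mm_stream_pd(&dst[i], _mm_mul_pd(_mm_load_pd(&src[i]), vs));
    _mm_sfence();  // order the streaming stores before returning
}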


Previous versions of this answer were reporting the wrong bandwidth. The reason was that the arrays were not initialized.

INTEL SIMD: why is inplace multiplication so slow?

Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to require slow FP assists.

(Turns out, yes that was the problem for the OP).

Repeated in-place multiply makes the numbers smaller and smaller with a factor below 1.0. Copy-and-scale to a different buffer uses the same inputs every time.

It doesn't take extra time to produce a +-Inf or NaN result, but it does for gradual underflow to subnormal, on Intel CPUs at least. That's one reason -ffast-math sets DAZ/FTZ: denormals-are-zero and flush-to-zero on underflow.
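For reference, a minimal sketch of setting those MXCSR bits by hand (my illustration; this is roughly what -ffast-math arranges at program startup):

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// Call once per thread before the hot loop if subnormal results are
// triggering FP assists and you can accept flushing them to zero.
static void set_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // subnormal results -> 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // subnormal inputs  -> 0
}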


I think I've read that AMD doesn't have FP-assist microcoded handling of subnormals, but Intel does.

There's a performance counter on Intel CPUs, fp_assist.any, which counts when a subnormal result requires extra microcode uops to handle the special case. (I think the assist is fairly intrusive for the front-end and OoO exec. It's definitely slow, in any case.)


Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?

Why is icc generating weird assembly for a simple main? (shows how ICC sets FTZ/DAZ at the start of main, with its default fast-math setting.)

Why my AVX2 horizontal addition function is not faster than non-SIMD addition?

The intuition might be that this step...

temp8[0]+temp8[1]+temp8[2]+temp8[3]+temp8[4]+temp8[5]+temp8[6]+temp8[7]

is the expensive part that vectorization should accelerate, but that is probably wrong. Addition is a single uop, and you can execute 4 of them per cycle on a recent x64 machine as long as you work on registers (not memory). So, in theory, your processor can do this...

cycle 1.

temp8[0]+temp8[1]
temp8[2]+temp8[3]
temp8[4]+temp8[5]
temp8[6]+temp8[7]

cycle 2

(temp8[0]+temp8[1])+(temp8[2]+temp8[3]) 
(temp8[4]+temp8[5])+(temp8[6]+temp8[7])

and get the answer on cycle 3, with capacity to spare. (Our processors are superscalar and have an out-of-order pipeline so this will happen magically.)

How much faster can a vectorized approach be? You gave us the answer...

a = _mm256_hadd_epi32(a, a_hi);
a = _mm256_hadd_epi32(a, a);
a = _mm256_hadd_epi32(a, a);

We can recognize the 3 cycles... Of course, it looks cheaper, maybe... but what is probably underneath the _mm256_hadd_epi32 intrinsic is the PHADD instruction (~3 uops, 1 instruction every two cycles). The important point is that the processor cannot execute several _mm256_hadd_epi32 intrinsics simultaneously, while it can do several of the scalar additions simultaneously. Thus, which is faster becomes a technical matter.

Anyhow, to sum up my answer: you should not expect vectorization to help in this instance (at least not help a lot) because it is going up against superscalar execution of cheap instructions (add).

Appendix. This code

  _mm256_store_si256((__m256i *)&temp8[0] , vec);
c_result[i][j]= temp8[0]+temp8[1]+temp8[2]+temp8[3]+temp8[4]+temp8[5]+temp8[6]+temp8[7];

probably does not get compiled to what you think. Let us flesh it out as a function

uint32_t hadd32(__m256i vector) {
    uint32_t buffer[sizeof(__m256i)/sizeof(uint32_t)];
    _mm256_store_si256((__m256i *)buffer, vector);
    uint32_t answer = buffer[0]+buffer[1]+buffer[2]+buffer[3]+buffer[4]+buffer[5]+buffer[6]+buffer[7];
    return answer;
}

Several compilers (clang, GCC 7) compile this down to

    vpextrd edx, xmm0, 1
vmovd eax, xmm0
add eax, edx
vpextrd edx, xmm0, 2
add eax, edx
vpextrd edx, xmm0, 3
vextracti128 xmm0, ymm0, 0x1
add eax, edx
vmovd edx, xmm0
add eax, edx
vpextrd edx, xmm0, 1
add eax, edx
vpextrd edx, xmm0, 2
add eax, edx
vpextrd edx, xmm0, 3
add eax, edx

where we recognize the additions, but see that the temporary buffer has been ignored entirely in favor of vpextrd instructions. The lesson here is to always look at the generated assembly.
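If you do want a SIMD horizontal sum anyway, a cheaper shape (my sketch, not from the answer above) reduces 256 bits to 128 and then uses shuffles plus adds instead of hadd:

#include <immintrin.h>
#include <stdint.h>

// Hedged sketch: horizontal sum of eight 32-bit integers without PHADD.
static inline uint32_t hsum_epi32(__m256i v) {
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                              _mm256_extracti128_si256(v, 1));          // 4 partial sums
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2))); // 2 partial sums
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1))); // total in lane 0
    return (uint32_t)_mm_cvtsi128_si32(s);
}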

SIMD matmul program gives different numerical results

This looks normal; adding numbers in a different order produces different rounding in the temporaries.

FP math is not associative; optimizing as if it is will change the results.¹ See: Is floating point addition and multiplication associative? / Are floating point operations in C associative?

The amount of change depends on the data. Differences only in the 5th decimal place seem reasonable for float.


Unless you were taking special numerical precautions like adding up the small numbers first, the sequential-order result isn't "more correct", they just have different errors.

In fact, using multiple accumulators generally increases precision for large lists, assuming your numbers all have similar magnitude. (Ideally multiple SIMD vectors each composed of multiple elements, to hide FP-add or FMA latency).
https://en.wikipedia.org/wiki/Pairwise_summation is a numerical technique that takes this to the next level: summing subsets of the list in a tree, to avoid adding single array elements to a much larger value. See for example, How to avoid less precise sum for numpy-arrays with multiple columns

Using a fixed number of accumulators (e.g. 8x __m256 = 64 float accumulators) might reduce expected error by a factor of 64, instead of from N to log N for full pairwise summation.
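As a hedged illustration (my sketch, not from the question): summing a float array with two __m256 accumulators, i.e. 16 partial sums, plus a cheap shuffle-based reduction at the end. It assumes AVX and that n is a multiple of 16; remainder handling is omitted.

#include <immintrin.h>
#include <stddef.h>

// Hedged sketch: multiple accumulators hide FP-add latency in the loop,
// and the reduction happens once at the end (no hadd).
float sum_multi_acc(const float *a, size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);                // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));        // 2 partial sums
    s = _mm_add_ss(s, _mm_movehdup_ps(s));         // total in lane 0
    return _mm_cvtss_f32(s);
}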


Footnote 1: Associativity is necessary for parallelization, SIMD, and multiple accumulators. See also: Associativity gives us parallelizability. But what does commutativity give?

On a machine with, for example, a 4-cycle latency, 2-per-clock throughput FMA and a SIMD width of 8 floats (i.e. a Skylake system with AVX2), the potential speedup is 4*2 = 8 from multiple accumulators, times 8 from SIMD width, times the number of cores, versus a pure sequential version, even for problems where the result might be less accurate instead of just different.

Most people consider a factor of 8*8 = 64 worth it! (And you can in theory also parallelize for another factor of maybe 4 on a quad-core, assuming perfect scaling for large matrices).

You're already using float instead of double for performance.

See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about using multiple accumulators to hide FMA latency in a reduction, exposing that other factor of 8 speedup.

Also, do not use hadd inside an inner-most loop. Sum vertically, and use an efficient reduction at the end of the loop (Fastest way to do horizontal float vector sum on x86). You really, really want to avoid having the compiler extract your vectors to scalar at every step; that defeats most of the benefit of SIMD. Besides, hadd is not worth using even for horizontal sums of a single vector; it costs 2 shuffles plus a regular add on all existing CPUs.

Is there a faster way to multiply by 2 on SIMD (without using multiplication)?

I'm still trying to find an example of where this would make a difference. My gut feeling is that if latency is an issue there are cases where x+x would be better but if latency is not an issue and only throughput matters then it could be worse. But first let's discuss some hardware.

Let me stick to Intel x86 processors since that's what I know best. Let's consider the following generations of hardware: Core2/Nehalem, SandyBridge/IvyBridge, and Haswell/Broadwell.

Latency and throughput for SIMD floating point arithmetic operations:

  • The latency for addition is 3.
  • Except for Broadwell the latency for multiplication is 5.
  • On Broadwell multiplication has a latency of 3.
  • The throughput for addition is 1.
  • Except for Haswell and Broadwell the throughput for multiplication is 1.
  • On Haswell and Broadwell the throughput for multiplication is 2.
  • The throughput for addition and multiplication without FMA is 2.
  • The latency for FMA is 5.
  • The throughput for FMA is 2. This is equivalent to an addition and multiplication throughput of 4.

Here is a case I'm actually using for generating the Mandelbrot set which has a factor of 2. In the main loop two of the most critical lines of code are:

x = x*x - y*y + x0;
y = 2*xtemp*y + y0;

All of the variables here are SIMD (SSE or AVX) registers, so I'm acting on multiple pixels at once (4 with SSE, 8 with AVX, in single precision). I'm using a SIMD class wrapped around the intrinsics for this. For y I could instead do

y = xtemp*y + xtemp*y + y0

What about with FMA?

y = fma(2*xtemp, y, y0)

or

y = xtemp*y + fma(xtemp, y, y0);

There are many variations that could be tried. I have not tried y = xtemp*y + xtemp*y + y0, but I think it would be worse. Incidentally, FMA, at least the way I have implemented it so far on my Haswell system, turns out not to help much. My frame rate only increases by about 15% with FMA, whereas when I go from 4 pixels with SSE to 8 pixels with AVX it nearly doubles.
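For concreteness, here is a hedged AVX2+FMA sketch of those two critical lines using raw intrinsics rather than my SIMD wrapper class. The function name and layout are my own; it assumes x, y, x0, y0 each hold 8 single-precision pixels and that the code is built with FMA enabled (e.g. -march=haswell).

#include <immintrin.h>

// One Mandelbrot iteration step for 8 pixels, using x+x instead of 2*x.
static inline void mandel_step(__m256 *x, __m256 *y, __m256 x0, __m256 y0) {
    __m256 xtemp = *x;
    // x = x*x - y*y + x0  ->  x*x + (x0 - y*y)
    *x = _mm256_fmadd_ps(xtemp, xtemp, _mm256_fnmadd_ps(*y, *y, x0));
    // y = 2*xtemp*y + y0  ->  (xtemp + xtemp)*y + y0
    *y = _mm256_fmadd_ps(_mm256_add_ps(xtemp, xtemp), *y, y0);
}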

Edit: here are some cases which I thought would make a difference, but either they don't in practice or they would not make sense to do.

Consider this case

for(int i=0; i<n; i++) y[i] = 2*x[i];

In this case latency does not matter; throughput matters. On Haswell and Broadwell the throughput for multiplication is twice that of addition, so it might seem that x+x would be worse, but since Haswell/Broadwell can only write 32 bytes per clock cycle it makes no difference.

Here is a case where using x+x would seem to be better.

for(int i=0; i<n; i++) prod = prod * (2*x[i]);

Instead you could do this:

for(int i=0; i<n; i++) prod = prod * (x[i]+x[i]);

In both of these cases it makes no difference, since they are dominated by the latency of the multiplication of prod. However, if you unrolled the loops enough times that latency no longer mattered, then the second case would generally be better, since all of these processors can do an addition and a multiplication every clock cycle. Although Haswell and Broadwell can do two multiplications per clock cycle, they can also do two multiplications and two additions per clock cycle using FMA, so even on them this would be better.
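To make the latency-hiding idea concrete, here is a minimal scalar sketch of mine (not from the original answer) using four independent accumulators; it assumes n is a multiple of 4 and ignores overflow/underflow of the product.

// Hedged sketch: hide multiply latency with four independent partial products.
double prod_unrolled(const double *x, int n) {
    double p0 = 1.0, p1 = 1.0, p2 = 1.0, p3 = 1.0;
    for (int i = 0; i < n; i += 4) {
        p0 *= x[i]   + x[i];   // x+x instead of 2*x
        p1 *= x[i+1] + x[i+1];
        p2 *= x[i+2] + x[i+2];
        p3 *= x[i+3] + x[i+3];
    }
    return (p0 * p1) * (p2 * p3);  // fold the partial products at the end
}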

However, in this case the smart thing to do is

for(int i=0; i<n; i++) prod *= x[i];
prod *= pow(2,n);

So it would not be necessary to do x+x instead of 2*x.

Speeding up matrix multiplication using SIMD and OpenMP

Parallelism can only give you so much. Furthermore, the more the sequential code is optimized, the smaller the gains you are likely to perceive from parallelism. Nevertheless, what you can improve -- I have done it in the past and it helped a lot -- is to divide the matrix multiplication into smaller blocks, so that it becomes a set of multiplications of smaller matrices (tiles).

By dividing the matrix multiplication into smaller blocks, where you perform the multiplication of smaller sub-matrices, you improve the use of the cache, both temporal locality and spatial locality. You need to divide the matrices according to the sizes of the cache levels (e.g., L1, L2, and L3) of the architecture you are using. You can find more detail in these slides about cache blocking and matrix multiplication. What Every Programmer Should Know About Memory? also has a vectorized cache-blocked matmul in an appendix.

For example, if you have an L3 cache, which is shared among cores, you can load multiple columns of matrix B into the L3 cache and reuse those values to perform the matrix multiplication of smaller tiles that fit in the L1 and L2 caches. You can go even further and subdivide those tiles again so that you can take advantage of the registers, as sketched below.
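As a rough illustration of the tiling idea (a sketch of mine, not code from the question), one level of cache blocking for C += A*B on row-major square matrices could look like this. BLOCK is a tuning parameter that should be chosen from the cache sizes; 64 is only a guess.

#include <stddef.h>

#define BLOCK 64  // tile size: pick so three BLOCK x BLOCK tiles fit in a cache level

// Hedged sketch: blocked matrix multiplication, C must be zeroed beforehand.
void matmul_blocked(const double *A, const double *B, double *C, size_t n) {
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                // multiply one tile of A by one tile of B into a tile of C
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        double aik = A[i*n + k];
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}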

After having optimized the memory usage of the matrix multiplication, you can try to get extra speedup from multithreading. If you have a cluster of multi-core machines, you can try to parallelize using MPI + OpenMP; of course, at that point you will encounter another bottleneck: communication among processes.

This all depends on the architecture your code runs on. If you have a NUMA architecture, then you also have to factor in things like local and non-local memory. You can also explore the GPU route: Matrix-Matrix Multiplication on the GPU with Nvidia CUDA.

Have a look at BLAS to get a good insight into efficient code.


