Measuring Memory Bandwidth from the Dot Product of Two Arrays

There are a few things going on here, which come down to:

  • You have to work fairly hard to get every last bit of performance out of the memory subsystem; and
  • Different benchmarks measure different things.

The first helps explain why you need multiple threads to saturate the available memory bandwidth. There is a lot of concurrency in the memory system, and taking advantage of it will often require some concurrency in your CPU code. One big reason that multiple threads of execution help is latency hiding - while one thread is stalled waiting for data to arrive, another thread may be able to take advantage of some other data that has just become available.

The hardware helps you a lot on a single thread in this case - because the memory access is so predictable, the hardware can prefetch the data ahead of when you need it, giving you some of the advantage of latency hiding even with one thread; but there are limits to what prefetch can do. The prefetcher won't take it upon itself to cross page boundaries, for instance. The canonical reference for much of this is What Every Programmer Should Know About Memory by Ulrich Drepper, which is now old enough that some gaps are starting to show (Intel's Hot Chips overview of your Sandy Bridge processor is here - note in particular the tighter integration of the memory management hardware with the CPU).

As to the question about comparing with memset, mbw or STREAM: comparing across benchmarks will always cause headaches, even for benchmarks that claim to be measuring the same thing. In particular, "memory bandwidth" isn't a single number - performance varies quite a bit depending on the operations. Both mbw and STREAM do some version of a copy operation, with STREAM's operations spelled out here (taken straight from the web page; all operands are double-precision floating-point values):

------------------------------------------------------------------
name       kernel                   bytes/iter    FLOPS/iter
------------------------------------------------------------------
COPY:      a(i) = b(i)                  16             0
SCALE:     a(i) = q*b(i)                16             1
SUM:       a(i) = b(i) + c(i)           24             1
TRIAD:     a(i) = b(i) + q*c(i)         24             2
------------------------------------------------------------------

so roughly 1/2-1/3 of the memory operations in these cases are writes (and everything's a write in the case of memset). While individual writes can be a little slower than reads, the bigger issue is that it's much harder to saturate the memory subsystem with writes because of course you can't do the equivalent of prefetching a write. Interleaving the reads and writes helps, but your dot-product example which is essentially all reads is going to be about the best-possible case for pegging the needle on memory bandwidth.

In addition, the STREAM benchmark is (intentionally) written completely portably, with only some compiler pragmas to suggest vectorization, so beating the STREAM benchmark isn't necessarily a warning sign, especially when what you're doing is two streaming reads.
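
For concreteness, here is a minimal sketch of what such a read-only dot-product measurement can look like (not the original poster's code; the array size and the GB/s conversion are illustrative choices):

// Minimal dot-product bandwidth sketch: two streaming reads, no stores.
// Build with: gcc -O3 -fopenmp dot_bw.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const size_t N = 1u << 26;             /* 64M doubles per array, ~512 MB each */
    double *y = malloc(N * sizeof *y);
    double *z = malloc(N * sizeof *z);
    for (size_t i = 0; i < N; i++) { y[i] = 1.0; z[i] = 2.0; }  /* touch pages first */

    double sum = 0.0;
    double t = -omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < N; i++)
        sum += y[i] * z[i];
    t += omp_get_wtime();

    /* Two arrays of doubles are read once each: 2 * N * 8 bytes. */
    double gbs = 2.0 * N * sizeof(double) / t / 1e9;
    printf("sum=%g  time=%.3f s  ~%.1f GB/s\n", sum, t, gbs);
    free(y); free(z);
    return 0;
}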

Measuring bandwidth on a ccNUMA system

Bandwidth performance depends on the type of operation you do. For a mix of reads & writes you will indeed not get the peak number; if you only do reads you will get closer.

I suggest you read the documentation for the "Stream benchmark", and take a look at the posted numbers.

Further notes: I hope you pin your threads down with OMP_PROC_BIND? Also, your architecture runs out of bandwidth before it runs out of cores, so your optimal bandwidth performance may occur with fewer than the total number of cores.
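
A quick way to find where that knee is on your machine is to sweep the thread count; a rough sketch follows (the read-only sum kernel and the array size are placeholders, and pinning via OMP_PROC_BIND/OMP_PLACES is assumed to be set in the environment before running):

// Sketch: measure a read-dominated kernel at 1..max threads to see where
// bandwidth saturates. OMP_PROC_BIND=true, OMP_PLACES=cores assumed.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const size_t N = 1u << 26;
    double *a = malloc(N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = 1.0;

    int maxt = omp_get_max_threads();
    for (int nt = 1; nt <= maxt; nt++) {
        omp_set_num_threads(nt);
        double sum = 0.0, t = -omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (size_t i = 0; i < N; i++) sum += a[i];
        t += omp_get_wtime();
        printf("%2d threads: %.1f GB/s (sum=%g)\n",
               nt, N * sizeof(double) / t / 1e9, sum);
    }
    free(a);
    return 0;
}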

OpenMP Dot Product and Pointers

There are several things wrong here.

Let's look at the loop as it stands:

#pragma omp parallel shared(y, z) private(i, tid)
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
    }
    tid = omp_get_thread_num();

    #pragma omp parallel for reduction(+:x_par)
    for (i=tid; i<N; i+=nthreads)
    {
        x_par += y[i] * z[i];
    }
}

So (1) notice that you (presumably) want x_par to be accessible outside of this region. So you'll want the reduction(+:x_par) on the outer section, not the inner. If you also add the very helpful default(none) clause, you'll also find that there's no clause describing the sharing of nthreads; let's explicitly make that shared.

So let's look again:

#pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
    }
    tid = omp_get_thread_num();

    #pragma omp parallel for
    for (i=tid; i<N; i+=nthreads)
    {
        x_par += y[i] * z[i];
    }
}

So looking more carefully, we now see that you have two omp parallel sections. That means, if nested parallelism is enabled, that you'll have nthreads tasks each launching nthreads tasks to do that loop; so the loop would end up with nthreads times the right answer if everything worked. So let's get rid of the parallel, just using the for:

#pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
    }
    tid = omp_get_thread_num();

    #pragma omp for
    for (i=tid; i<N; i+=nthreads)
    {
        x_par += y[i] * z[i];
    }
}

So that has the sharing correct and isn't nesting parallelism, but it still doesn't give the right answer; it gives a much smaller result. What's wrong? Let's take a look at the for loop. Each thread wants to start at tid and skip nthreads, fine; but why do we want the omp for there?

Let's take a look at a simpler version which works:

#pragma omp parallel shared(y, z) reduction(+:x_par) default(none)
{
    #pragma omp for
    for (i=0; i<N; i++)
    {
        x_par += y[i] * z[i];
    }
}

Note that here we aren't decomposing the loop explicitly with tid and nthreads - we don't have to, because omp for decomposes the loop for us; it assigns loop iterations to threads.

So looking back at what we have, we have a manual decomposition of the loop - which is fine, sometimes that's what you need to do; and an omp for which tries to take that loop and split it up amongst threads. But we're already doing that; the omp for simply makes us skip iterations here!

So, getting rid of the omp for:

#pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
    }
    tid = omp_get_thread_num();

    for (i=tid; i<N; i+=nthreads)
    {
        x_par += y[i] * z[i];
    }
}

That gets us the right answer.
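
For reference, wrapped into a complete (hypothetical) test program, the simple reduction form looks like this; the array size and contents are just placeholders:

// Complete sketch of the dot product with omp for + reduction.
// Build with: gcc -O3 -fopenmp dot.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000

int main(void)
{
    double *y = malloc(N * sizeof *y);
    double *z = malloc(N * sizeof *z);
    for (int i = 0; i < N; i++) { y[i] = i % 3; z[i] = i % 5; }

    double x_par = 0.0;
    #pragma omp parallel for reduction(+:x_par) default(none) shared(y, z)
    for (int i = 0; i < N; i++)
        x_par += y[i] * z[i];

    printf("dot = %f\n", x_par);
    free(y); free(z);
    return 0;
}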

What is maximum (ideal) memory bandwidth of an OpenCL device?

Memory bank quantity is useful to hint about strided access pattern performance and bank conflicts.

The cache line width presumably refers to the L2 cache line used between L2 and the CUs' L1 caches. 64 bytes per cycle means 64 GB/s per compute unit (assuming there is only 1 active cache line per CU at a time and a 1 GHz clock); there can also be multiple lines per L1, for example 4. With 20 compute units, the total "L2 to L1" bandwidth would be 1.28 TB/s, but its main advantage over global memory is the lower number of clock cycles needed to fetch data.

If you need to utilize global memory, then you need to approach bandwidth limits between L2 and main memory. That is related to memory channel width, number of memory channels and frequency.

A GDDR channel is 64 bits wide; an HBM channel is 128 bits wide. A single stack of HBM1 has 8 channels, for a total of 1024 bits or 128 bytes, and 128 bytes per cycle means 128 GB/s per GHz. More stacks mean more bandwidth: if 8 GB of memory is made of two stacks, that's 256 GB/s.

If your data-set fits inside L2 cache, then you expect more bandwidth under repeated access.

But the true performance (instead of the on-paper numbers) can be measured with a simple benchmark that does a pipelined memory copy between two arrays.
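
Such a copy benchmark can be as small as the following OpenCL C kernel sketch; launching it over a large buffer and timing it with event profiling is assumed, and the float4 width is just one reasonable choice:

// Sketch of a simple copy kernel for bandwidth measurement. Enqueue it over a
// large buffer and time it (e.g. with event profiling); bytes moved per launch
// are 2 * global_size * sizeof(float4) (one read + one write per work item).
__kernel void copy_bw(__global const float4 *src, __global float4 *dst)
{
    size_t i = get_global_id(0);
    dst[i] = src[i];
}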

The total performance achieved by 8 work items depends on the capability of the compute unit. If it allows only 32 bytes per clock per work item, then you may need more work items. The compute unit should also have some coalescing stage that packs accesses to nearby addresses into one big memory transaction, so you can reach maximum performance even with a single work group (but using multiple work items, not just 1; the number depends on how wide each work item's accesses are and on the CU's capability). You can benchmark this with an array-summation or reduction kernel. Just 1 compute unit is generally more than enough to saturate global memory bandwidth, unless its own L2-L1 bandwidth is lower than the global memory bandwidth; that may not hold for the highest-end cards.

What is the parallelism between L2 and L1 for your card? Only 1 active line at a time? Then you probably require 8 work items distributed over 8 work groups.

According to AMD's datasheet about RDNA, each shader is capable of keeping 10-20 requests in flight, so if the L1-L2 communication of a single RDNA compute unit is enough to use all the bandwidth of global memory, then even just a few work items from a single work group should be enough.

L1-L2 bandwidth:

(image: L1-L2 bandwidth figure, not reproduced here)

It says there are 4 lines active between each L1 and the L2, so it should be 256 GB/s per compute unit, and 4 work groups running on different CUs should be enough for 1 TB/s of main memory bandwidth. I guess OpenCL has no access to this information, and it can change for new cards, so the best thing would be to benchmark various settings, from 1 CU to N CUs and from 1 work item to N work items. It shouldn't take much time to measure under no contention (i.e. when the whole GPU server is dedicated to you).

Shader bandwidth:

(image: shader bandwidth figure, not reproduced here)

If these are per-shader limits, then a single shader can use all of its own CU's L1-L2 bandwidth, especially when reading.

It also says the L0-L1 cache line size is 128 bytes, so one work item could use a data type that wide.

An N-way set-associative cache (the L1 and L2 in the pictures above) and a direct-mapped cache (maybe the texture cache?) use modulo mapping, but an LRU cache (the L0 here) may not require modulo access. Since you need global memory bandwidth, you should look at the L2 cache line, which is N-way set-associative and hence modulo-mapped. Even if the data is already in L0, the OpenCL spec may not let you do non-modulo access to it. Also, you don't have to think about alignment if the array is of the type of the data you need to work with.

If you don't want to fiddle with microbenchmarking and don't know how many work items are required, then you can use the async work-group copy commands in the kernel. The async copy implementation uses just the required number of shaders (or perhaps no shaders at all, depending on the hardware). Then you can access local memory fast, even from a single work item.
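
For reference, a minimal sketch of that approach; the tile size and the summation step are placeholders, not something prescribed by the question:

// Sketch: stage a tile of global memory into local memory with
// async_work_group_copy, then read it from a single work item.
// TILE is an arbitrary choice; out must hold one float per work group.
#define TILE 1024

__kernel void staged_sum(__global const float *src, __global float *out)
{
    __local float tile[TILE];
    size_t grp = get_group_id(0);

    event_t e = async_work_group_copy(tile, src + grp * TILE, TILE, 0);
    wait_group_events(1, &e);

    if (get_local_id(0) == 0) {          /* even one work item can now read fast */
        float s = 0.0f;
        for (int i = 0; i < TILE; i++) s += tile[i];
        out[grp] = s;
    }
}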

But a single work item may need an unrolled loop to do the pipelining and use all the bandwidth of its CU. A single read/write operation will not fill the pipeline, and the latency becomes visible (not hidden behind other latencies).

Note: the L2 clock frequency can be different from the main memory frequency, not just 1 GHz. There could be an L3 cache or something else in there to bridge the different frequencies. Perhaps it runs at the GPU frequency, e.g. 2 GHz; then all of the L1 and L0 bandwidths are also higher, like 512 GB/s per L1-L2 link. You may need to query CL_DEVICE_MAX_CLOCK_FREQUENCY for this. In any case, just 1 CU looks capable of using the full bandwidth of 90% of high-end cards. An RX 6800 XT has 512 GB/s of main memory bandwidth and a 2 GHz GPU clock, so it can likely do it with only 1 CU.
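
Querying that from the host is straightforward; a sketch (error handling omitted, first GPU device only):

// Sketch: query device clock and compute-unit count from the OpenCL host API.
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    cl_uint mhz, cus;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof mhz, &mhz, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof cus, &cus, NULL);

    printf("max clock: %u MHz, compute units: %u\n", mhz, cus);
    return 0;
}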

Optimizing linear access to arrays with pre-fetching and cache in C

Your value for the peak bandwidth from main memory is off by a factor of two. Instead of 10664 MB/s it should be 21.3 GB/s (more precisely, 21333⅓ MB/s - see my derivation below). The fact that you sometimes see more than 10664 MB/s should have told you that maybe there was a problem in your peak bandwidth calculation.

In order to get the maximum bandwidth for Core2 through Sandy Bridge you need to use non-temporal stores. Additionally, you need multiple threads. You don't need AVX instructions or to unroll the loop.

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
    }
}

The arrays need to be 16-byte aligned and their size a multiple of 16. The rule of thumb for non-temporal stores is to use them when the memory you are copying is larger than half the size of the last-level cache. In your case, half the L3 cache size is 1.5 MB and the smallest array you copy is 8 MB, so that is much larger than half the last-level cache size.
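
If you want to encode that rule of thumb, it could look roughly like this; LLC_BYTES is a hypothetical constant you would set to your CPU's last-level cache size (3 MB here), and the alignment/size requirements above still apply:

#include <string.h>
#include <stddef.h>
#include <emmintrin.h>

/* Sketch: use non-temporal stores only when the copy is larger than half the
   last-level cache; dst/src assumed 16-byte aligned, n a multiple of 16. */
#define LLC_BYTES (3u * 1024 * 1024)

static void copy_auto(char *dst, const char *src, size_t n)
{
    if (n > LLC_BYTES / 2) {
        for (size_t i = 0; i < n / 16; i++)            /* bypass the cache */
            _mm_stream_ps((float *)&dst[16 * i],
                          _mm_load_ps((const float *)&src[16 * i]));
    } else {
        memcpy(dst, src, n);                           /* small copies stay cached */
    }
}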

Here is some code to test this.

//gcc -O3 -fopenmp foo.c
#include <stdio.h>
#include <x86intrin.h>
#include <string.h>
#include <omp.h>

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

void copy2(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_store_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

int main(void)
{
    unsigned n = 0x7fffffff;
    char *x = _mm_malloc(n, 16);
    char *y = _mm_malloc(n, 16);
    double dtime;

    memset(x,0,n);
    memset(y,1,n);

    dtime = -omp_get_wtime();
    copy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    copy2(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    memcpy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);
}

On my system, Core2 (before Nehalem) P9600@2.53GHz, it gives

time non temporal store 0.39
time SSE store 1.10
time memcpy 0.98

to copy 2GB.

Note that it's very important that you "touch" the memory you will write to first (I used memset to do this). Your system does not necessarily allocate your memory until you access it. The overhead of doing this can bias your results significantly if the memory has not been accessed before you do the memory copy.


According to Wikipedia, DDR3-1333 has a memory clock of 166⅔ MHz. DDR transfers data at twice the memory clock rate. Additionally, DDR3 has a bus clock multiplier of four, so DDR3 does a total of eight transfers per memory clock. Additionally, your motherboard has two memory channels. So the total transfer rate is:

 21333⅓ MB/s = (166⅔ * 1E6 clocks/s) * (8 lines/clock/channel) * (2 channels) * (64 bits/line) * (1 byte/8 bits) * (1 MB/1E6 bytes)
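
The same arithmetic as a quick sanity-check calculation (just the numbers above, nothing new):

// Sketch: peak DDR3-1333 dual-channel bandwidth from the factors above.
#include <stdio.h>

int main(void)
{
    double mem_clock_mhz = 166.0 + 2.0 / 3.0;  /* DDR3-1333 memory clock */
    double transfers     = 8.0;                /* 2 (DDR) * 4 (bus multiplier) */
    double channels      = 2.0;
    double bytes_line    = 8.0;                /* 64-bit line */
    printf("peak = %.1f MB/s\n", mem_clock_mhz * transfers * channels * bytes_line);
    return 0;
}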

How Do I Attain Peak CPU Performance With Dot Product?

To answer your overall question: you can't achieve peak performance with the dot product.

The problem is that your CPU can do one 128-bit load per clock cycle and to do the dot product you need two 128-bit loads per clock cycle.

But it's worse than that for large n. The answer to your second question is that the dot product is memory bound, not compute bound, and so it cannot parallelize efficiently for large n with fast cores. This is explained better in why-vectorizing-the-loop-does-not-have-performance-improvement. This is a big problem with parallelization on fast cores. It took me a while to figure this out, but it's very important to learn.

There are actually few basic algorithms that can fully benefit from parallelization on fast cores. In terms of BLAS algorithms it's only the Level-3 algorithms (O(n^3)) such as matrix multiplication that really benefit from parallelization. The situation is better on slow cores e.g. with GPUs and the Xeon Phi because the discrepancy between memory speed and core speed is much smaller.

If you want to find an algorithm which can get close to peak flops for small n, try e.g. scalar * vector or the sum of scalar * vector. The first case should do one load, one multiply, and one store every clock cycle, and the second case one multiply, one add, and one load every clock cycle.

I tested the following code on a Core 2 Duo P9600@2.67GHz in Knoppix 7.3 32-bit. I get about 75% of the peak for the scalar product and 75% of the peak for the sum of the scalar product. The flops/cycle for the scalar product is 2 and for the sum of the scalar product it's 4.

Compiled with g++ -msse2 -O3 -fopenmp foo.cpp -ffast-math

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <x86intrin.h>

void scalar_product(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    double k = 3.14159;
    for(int i=0; i<n; i++) {
        a[i] = k*a[i];
    }
}

void scalar_product_SSE(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    __m128d k = _mm_set1_pd(3.14159);
    for(int i=0; i<n; i+=8) {
        __m128d t1 = _mm_load_pd(&a[i+0]);
        _mm_store_pd(&a[i],_mm_mul_pd(k,t1));
        __m128d t2 = _mm_load_pd(&a[i+2]);
        _mm_store_pd(&a[i+2],_mm_mul_pd(k,t2));
        __m128d t3 = _mm_load_pd(&a[i+4]);
        _mm_store_pd(&a[i+4],_mm_mul_pd(k,t3));
        __m128d t4 = _mm_load_pd(&a[i+6]);
        _mm_store_pd(&a[i+6],_mm_mul_pd(k,t4));
    }
}

double scalar_sum(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    double sum = 0.0;
    double k = 3.14159;
    for(int i=0; i<n; i++) {
        sum += k*a[i];
    }
    return sum;
}

double scalar_sum_SSE(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    __m128d sum3 = _mm_setzero_pd();
    __m128d sum4 = _mm_setzero_pd();
    __m128d k = _mm_set1_pd(3.14159);
    for(int i=0; i<n; i+=8) {
        __m128d t1 = _mm_load_pd(&a[i+0]);
        sum1 = _mm_add_pd(_mm_mul_pd(k,t1),sum1);
        __m128d t2 = _mm_load_pd(&a[i+2]);
        sum2 = _mm_add_pd(_mm_mul_pd(k,t2),sum2);
        __m128d t3 = _mm_load_pd(&a[i+4]);
        sum3 = _mm_add_pd(_mm_mul_pd(k,t3),sum3);
        __m128d t4 = _mm_load_pd(&a[i+6]);
        sum4 = _mm_add_pd(_mm_mul_pd(k,t4),sum4);
    }
    double tmp[8];
    _mm_storeu_pd(&tmp[0],sum1);
    _mm_storeu_pd(&tmp[2],sum2);
    _mm_storeu_pd(&tmp[4],sum3);
    _mm_storeu_pd(&tmp[6],sum4);
    double sum = 0;
    for(int i=0; i<8; i++) sum+=tmp[i];
    return sum;
}

int main() {
    //_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    //_mm_setcsr(_mm_getcsr() | 0x8040);
    double dtime, peak, flops, sum;
    int repeat = 1<<18;
    const int n = 2048;
    double *a = (double*)_mm_malloc(sizeof(double)*n,64);
    double *b = (double*)_mm_malloc(sizeof(double)*n,64);
    for(int i=0; i<n; i++) a[i] = 1.0*rand()/RAND_MAX;

    dtime = omp_get_wtime();
    for(int r=0; r<repeat; r++) {
        scalar_product_SSE(a,n);
    }
    dtime = omp_get_wtime() - dtime;
    peak = 2*2.67;
    flops = 1.0*n/dtime*1E-9*repeat;
    printf("time %f, %f, %f\n", dtime,flops, flops/peak);

    //for(int i=0; i<n; i++) a[i] = 1.0*rand()/RAND_MAX;
    //sum = 0.0;
    dtime = omp_get_wtime();
    for(int r=0; r<repeat; r++) {
        scalar_sum_SSE(a,n);
    }
    dtime = omp_get_wtime() - dtime;
    peak = 2*2*2.67;
    flops = 2.0*n/dtime*1E-9*repeat;
    printf("time %f, %f, %f\n", dtime,flops, flops/peak);
    //printf("sum %f\n", sum);
}

my intrinsic function in getting the dot product of an int array is slower than the normal code, what am I doing wrong?

So with @Peter Cordes's, @Qubit's and @j6t's suggestions, I tweaked the code a little bit. I now only do the multiplication inside the loop, and I moved the horizontal addition outside the loop. It increased the performance of the intrinsic version from around 97529675 nanoseconds to around 56444187 nanoseconds, which is significantly faster than my previous implementation, with the same compilation flags and 10000000 elements of the int array.

Here is the new function from main.hpp:

#include <cstddef>
#include <immintrin.h>   // assumed to be included at the top of main.hpp for the SSSE3/SSE4.1 intrinsics

int _sse_int_dot(int* a, int* b, size_t len){

    size_t vec_loop = len/4;
    size_t non_vec = len%4;
    size_t start_non_vec_i = len-non_vec;

    int dot_product;
    __m128i vdot_product = _mm_set1_epi32(0);

    for(size_t i=0; i<vec_loop; ++i)
    {
        __m128i va = _mm_loadu_si128((__m128i*)(a+(i*4)));
        __m128i vb = _mm_loadu_si128((__m128i*)(b+(i*4)));
        __m128i vc = _mm_mullo_epi32(va,vb);
        vdot_product = _mm_add_epi32(vdot_product,vc);
    }

    vdot_product = _mm_hadd_epi32(vdot_product,vdot_product);
    vdot_product = _mm_hadd_epi32(vdot_product,vdot_product);
    dot_product = _mm_cvtsi128_si32(vdot_product);

    for(size_t i=start_non_vec_i; i<len; ++i) dot_product += a[i]*b[i];

    return dot_product;
}

If there is more to improve with this code, please point it out, for now I'm just gonna leave it here as the answer.
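
One further idea that might be worth trying (an untested sketch, not benchmarked): splitting the sum across two independent vector accumulators so the adds don't form a single dependency chain. Whether it helps depends on how memory-bound the loop already is.

// Sketch: same dot product with two accumulators to break the add dependency
// chain; the unroll factor of 8 ints per iteration is an arbitrary choice.
#include <cstddef>
#include <immintrin.h>

int _sse_int_dot2(int* a, int* b, size_t len)
{
    size_t vec_loop = len / 8;                 // 8 ints per iteration now
    size_t start_tail = vec_loop * 8;

    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    for (size_t i = 0; i < vec_loop; ++i) {
        __m128i va0 = _mm_loadu_si128((__m128i*)(a + i*8));
        __m128i vb0 = _mm_loadu_si128((__m128i*)(b + i*8));
        __m128i va1 = _mm_loadu_si128((__m128i*)(a + i*8 + 4));
        __m128i vb1 = _mm_loadu_si128((__m128i*)(b + i*8 + 4));
        acc0 = _mm_add_epi32(acc0, _mm_mullo_epi32(va0, vb0));
        acc1 = _mm_add_epi32(acc1, _mm_mullo_epi32(va1, vb1));
    }

    __m128i acc = _mm_add_epi32(acc0, acc1);
    acc = _mm_hadd_epi32(acc, acc);
    acc = _mm_hadd_epi32(acc, acc);
    int dot = _mm_cvtsi128_si32(acc);

    for (size_t i = start_tail; i < len; ++i) dot += a[i] * b[i];
    return dot;
}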


