_Builtin_Prefetch, How Much Does It Read

__builtin_prefetch, How much does it read?

I think it just emit one FETCH machine instruction, which basically fetches a line cache, whose size is processor specific.

And you could use __builtin_prefetch (con[i+3].Pfrom) for instance. By my (small) experience, in such a loop, it is better to prefetch several elements in advance.

Don't use __builtin_prefetch too often (i.e. don't put a lot of them inside a loop). Measure the performance gain if you need them, and use GCC optimization (at least -O2). If you are very lucky, manual __builtin_prefetch could increase the performance of your loop by 10 or 20% (but it could also hurt it).

If such a loop is crucial to you, you might consider running it on GPUs with OpenCL or CUDA (but that requires recoding some routines in OpenCL or CUDA language, and tuning them to your particular hardware).

Use also a recent GCC compiler (the latest release is 4.6.2) because it is making a lot of progress on these areas.

^{(added in january 2018:)}

Both hardware (processors) and compilers have made a lot of progress regarding caches, so it seems that using __builtin_prefetch is less useful today (in 2018). Be sure to benchmarck.

__builtin_prefetch making it faster in my code. What I need to do in the code

What I need to do it correctly so my __builtin_prefetch code run faster

You need to remove __builtin_prefetch. It's literally the only instruction that differs between code snippets. Compiler optimized your whole code to a no-op, as there are no side effects in your code.

Your first code snippet is compiled to:

main:
        xor     eax, eax
        ret

While your second code is compiled to:

main:
        xor     eax, eax
        prefetcht0      [rsp-24]
        ret

Even if you do return partial for example, the compiler is able to calculate the entire result at compile time and reduce the entire program to just return <constant>.

You can inspect the generated assembly of your programs with ease using https://godbolt.org/ .

Why doesn’t __builtin_prefetch have any effect here?

That's because the hardware is years ahead of you. :)

There are hardware prefetchers that are designed to recognize simple patterns and do the prefetching for you. In this case, you have a simple sequential access pattern, that's more than trivial for the hardware prefetcher.

Manual prefetching only comes handy when you have access patterns that the hardware cannot predict.

Here's one such example: Prefetching Examples?

How to fetch a memory location in C++ without waiting for its retrieval?

In portable C++, I'd solve it as follows:

result = data[i]; // Unconditional!
auto offset = offset_shifts[i];
if (offset)
   result = data[i+offset];

The rationale is that result is likely just a register, so result = data[i]; is effectively just a read. This will start the read, but not block the CPU pipeline for the next operation. offset_shifts[i] is effectively retrieved in parallel with the previous operation. (The optimizer may even swap the two operations - it knows more about CPU's than I do). If the branch is taken, you get the intended cache effect. If not taken, the operation is as effective as it can be.

When should we use prefetch?

This question isn't really about compilers as they're just providing some hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats but you can just ignore all these and (carefully) add it directly in assembly code.

Now the real question seems to be "when are prefetches useful", and the answer is - in any scenario where youre bounded on memory latency, and the access pattern isn't regular and distinguishable for the HW prefetch to capture (organized in a stream or strides), or when you suspect there are too many different streams for the HW to track simultaneously.

Most compilers would only very seldom insert their own prefetches for you, so it's basically up to you to play with your code and benchmark how prefetches could be useful.

The link by @Mysticial shows a nice example, but here's a more straight forward one that I think can't be caught by HW:

#include "stdio.h"
#include "sys/timeb.h"
#include "emmintrin.h"

#define N 4096
#define REP 200
#define ELEM int

int main() {
    int i,j, k, b;
    const int blksize = 64 / sizeof(ELEM);
    ELEM __attribute ((aligned(4096))) a[N][N];
    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            a[i][j] = 1;
        }
    }
    unsigned long long int sum = 0;
    struct timeb start, end;
    unsigned long long delta;

    ftime(&start);
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j ++) {
                sum += a[i][j];
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf ("Prefetching off: N=%d, sum=%lld, time=%lld\n", N, sum, delta); 

    ftime(&start);
    sum = 0;
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j += blksize) {
                for (b = 0; b < blksize; ++b) {
                    sum += a[i][j+b];
                }
                _mm_prefetch(&a[i+1][j], _MM_HINT_T2);
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf ("Prefetching on:  N=%d, sum=%lld, time=%lld\n", N, sum, delta); 
}

What I do here is traverse each matrix line (enjoying the HW prefetcher help with the consecutive lines), but prefetch ahead the element with the same column index from the next line that resides in a different page (which the HW prefetch should be hard pressed to catch). I sum the data just so that it's not optimized away, the important thing is that I basically just loop over a matrix, should have been pretty straightforward and simple to detect, and yet still get a speedup.

Built with gcc 4.8.1 -O3, it gives me an almost 20% boost on an Intel Xeon X5670:

Prefetching off: N=4096, sum=3355443200, time=1839
Prefetching on:  N=4096, sum=3355443200, time=1502

Note that the speedup is received even though I made the control flow more complicated (extra loop nesting level), the branch predictor should easily catch the pattern of that short block-size loop, and it saves execution of unneeded prefetches.

Note that Ivybridge and onward on should have a "next-page prefetcher", so the HW may be able to mitigate that on these CPUs (if anyone has one available and cares to try i'll be happy to know). In that case i'd modify the benchmark to sum every second line (and the prefetch would look ahead two lines everytime), that should confuse the hell out of the HW prefetchers.

Skylake results

Here are some results from a Skylake i7-6700-HQ, running at 2.6 GHz (no turbo) with gcc:

Compile flags: -O3 -march=native

Prefetching off: N=4096, sum=28147495993344000, time=896
Prefetching on:  N=4096, sum=28147495993344000, time=1222
Prefetching off: N=4096, sum=28147495993344000, time=886
Prefetching on:  N=4096, sum=28147495993344000, time=1291
Prefetching off: N=4096, sum=28147495993344000, time=890
Prefetching on:  N=4096, sum=28147495993344000, time=1234
Prefetching off: N=4096, sum=28147495993344000, time=848
Prefetching on:  N=4096, sum=28147495993344000, time=1220
Prefetching off: N=4096, sum=28147495993344000, time=852
Prefetching on:  N=4096, sum=28147495993344000, time=1253

Compile flags: -O2 -march=native

Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on:  N=4096, sum=28147495993344000, time=1813
Prefetching off: N=4096, sum=28147495993344000, time=1956
Prefetching on:  N=4096, sum=28147495993344000, time=1814
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on:  N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1961
Prefetching on:  N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1965
Prefetching on:  N=4096, sum=28147495993344000, time=1814

So using prefetch is either about 40% slower, or 8% faster depending on if you use -O3 or -O2 respectively for this particular example. The big slowdown for -O3 is actually due to a code generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch variant loop prevents vectorization on my version of gcc anyway.

So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. Still it's worth noting that you have to be careful not to change code generation such that you get a big slowdown.

This test probably isn't ideal in that by going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (that's why vectorization helped so much).

Difference between prefetch for read or write

On the CPU level, a software prefetch (as opposed to ones trigger by the hardware itself) are a convenient way to hint to the CPU that a line is about to be accessed, and you want it prefetched in advance to save the latency.

If the access will be a simple read, you would want a regular prefetch, which would behave similarly to a normal load from memory (aside from not blocking the CPU in case it misses, not faulting in case the address is bad, and all sorts of other benefits, depending on the micro architecture).

However, if you intend to write to that line, and it also exists in another core, a simple read operation would not suffice. This is due to MESI-based cache handling protocols. A core must have ownership of a line before modifying it, so that it preserves coherency (if the same line gets modified in multiple cores, you will not be able to ensure correct ordering for these changes, and may even lose some of them, which is not allowed on normal WB memory types).
Instead, a write operation will start by acquiring ownership of the line, and snooping it out of any other core / socket that may hold a copy. Only then can the write occur.
A read operation (demand or prefetch) would have left the line in other cores in a shared state, which is good if the line is read multiple times by many cores, but doesn't help you if your core later writes to it.

To allow useful prefetching for lines that will later be written to, most CPU companies support special prefetches for writing. In x86, both Intel and AMD support the prefetchW instruction, which should have the effect of a write (i.e. - acquiring sole ownership of a line, and invalidating any other copy if it). Note that not all CPUs support that (even within the same family, not all generations have it), and not all compiler versions enable it.

Here's an example (with gcc 4.8.2) - note that you need to enable it explicitly here -

#include <emmintrin.h>

int main() {
    long long int a[100];
    __builtin_prefetch (&a[0], 0, 0);
    __builtin_prefetch (&a[16], 0, 1);
    __builtin_prefetch (&a[32], 0, 2);
    __builtin_prefetch (&a[48], 0, 3);
    __builtin_prefetch (&a[64], 1, 0);
    return 0;
}

compiled with gcc -O3 -mprfchw prefetchw.c -c , :

0000000000000000 <main>:
   0:   48 81 ec b0 02 00 00    sub    $0x2b0,%rsp
   7:   48 8d 44 24 88          lea    -0x78(%rsp),%rax
   c:   0f 18 00                prefetchnta (%rax)
   f:   0f 18 98 80 00 00 00    prefetcht2 0x80(%rax)
  16:   0f 18 90 00 01 00 00    prefetcht1 0x100(%rax)
  1d:   0f 18 88 80 01 00 00    prefetcht0 0x180(%rax)
  24:   0f 0d 88 00 02 00 00    prefetchw 0x200(%rax)
  2b:   31 c0                   xor    %eax,%eax
  2d:   48 81 c4 b0 02 00 00    add    $0x2b0,%rsp
  34:   c3                      retq

If you play with the 2nd argument you'd notice that the hint levels are ignores for prefetchW, since it doesn't support temporal level hints. By the way, if you remove the -mprfchw flag, gcc will convert this into a normal read prefetch (I haven't tried different -march/mattr settings, maybe some of them include it as well).

_Builtin_Prefetch, How Much Does It Read