What are _mm_prefetch() locality hints?


Sometimes intrinsics are better understood in terms of the instruction they represent rather than in terms of the abstract semantics given in their descriptions.


The full set of locality constants, as of today, is

#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA 0
#define _MM_HINT_ENTA 4
#define _MM_HINT_ET0 5
#define _MM_HINT_ET1 6
#define _MM_HINT_ET2 7

as described in this paper about Intel Xeon Phi coprocessor prefetching capabilities.

For IA32/AMD processors, the set is reduced to

#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA 0
#define _MM_HINT_ET1 6

_mm_prefetch is compiled into different instructions depending on the architecture and the locality hint:

Hint             IA32/AMD       MIC (Xeon Phi)
_MM_HINT_T0      prefetcht0     vprefetch0
_MM_HINT_T1      prefetcht1     vprefetch1
_MM_HINT_T2      prefetcht2     vprefetch2
_MM_HINT_NTA     prefetchnta    vprefetchnta
_MM_HINT_ENTA    -              vprefetchenta
_MM_HINT_ET0     -              vprefetchet0
_MM_HINT_ET1     prefetchwt1    vprefetchet1
_MM_HINT_ET2     -              vprefetchet2

What the (v)prefetch instructions do, if all the requirements are satisfied, is bring a cache line's worth of data into the cache level specified by the locality hint.

The instruction is just a hint; it may be ignored.
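As a minimal sketch of the basic usage (my own illustration; warm_line is a made-up helper name), the intrinsic just takes an address and one of the hint constants above, and the compiler turns it into the corresponding instruction from the table:

#include <xmmintrin.h>

/* Ask for the cache line containing `p` ahead of time. With the T0 hint
   a typical compiler emits a single prefetcht0 (or vprefetch0); the CPU
   is free to drop the request entirely. */
static inline void warm_line(const void *p)
{
    _mm_prefetch((const char *)p, _MM_HINT_T0);
}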

When a line is prefetched into level X, the manuals (both Intel's and AMD's) say that it is also fetched into all the higher levels (except for the case X = 3).

I'm not sure whether this is actually true; I believe that the line is prefetched with respect to cache level X and, depending on the caching strategies of the higher levels (inclusive vs. non-inclusive), it may or may not be present there too.

Another attribute of the (v)prefetch instructions is the non-temporal attribute.

Non-temporal data is unlikely to be reused soon.

In my understanding, NT data is stored in the "streaming load buffers" for the IA32 architecture1, while for the MIC architecture it is stored in the normal cache (using the hardware thread ID to select the way) but marked by the replacement policy so that it will be the next line evicted if needed.

For AMD, the manuals read that the actual location is implementation dependent, ranging from a software-invisible buffer to a dedicated non-temporal cache.

The last attribute of the (v)prefetch instructions is the "intent" attribute or the "eviction" attribute.

Due to the MESI protocol and its variants, a Request For Ownership (RFO) must be made to bring a line into an exclusive state (in order to modify it).

An RFO is just a special read, so prefetching the line with an RFO will bring it into the Exclusive state directly (otherwise the first store to it would cancel the benefits of prefetching because of the "delayed" RFO), granted we know we will write to it later.
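As a hedged sketch (assuming a toolchain that provides _MM_HINT_ET1 and a CPU that implements prefetchwt1, which few do; on other targets an ordinary _MM_HINT_T0 prefetch is the usual fallback, and prepare_for_update is a made-up name), a prefetch with write intent issued before the store looks like this:

#include <xmmintrin.h>

void prepare_for_update(int *counter)
{
    /* Prefetch with write intent: the line is requested with an RFO so
       it arrives already in the Exclusive state. */
    _mm_prefetch((const char *)counter, _MM_HINT_ET1);

    /* ... other work overlapping with the prefetch ... */

    *counter += 1;   /* the store finds the line already owned */
}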

The IA32 and AMD architectures don't support an exclusive non-temporal hint (yet), since the non-temporal cache level is implementation-defined.

The MIC architecture allows for it with the locality code _MM_HINT_ENTA.

1 Which I understand to be the WC buffers. Peter Cordes clarified this in a comment: prefetchnta only uses the line-fill buffers when prefetching from USWC memory regions; otherwise it prefetches into L1.


For reference, here are the descriptions of the instructions involved.

PREFETCHh

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the
cache hierarchy specified by a locality hint:

• T0 (temporal data)—prefetch data into all levels of the cache hierarchy.

• T1 (temporal data with respect to first level cache misses)—prefetch data into level 2 cache and higher.

• T2 (temporal data with respect to second level cache misses)—prefetch data into level 3 cache and higher, or
an implementation-specific choice.

• NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and
into a location close to the processor, minimizing cache pollution.

PREFETCHWT1

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the
cache hierarchy specified by an intent to write hint (so that data is brought into ‘Exclusive’ state via a request for
ownership) and a locality hint:

• T1 (temporal data with respect to first level cache)—prefetch data into the second level cache.

VPREFETCHh

                Cache level   Non-temporal   Exclusive state
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES

Scenarios where manual software prefetch instructions are reasonable

The idea of prefetching is based upon these facts:

  • Accessing memory is very expensive the first time.

    The first time a memory address1 is accessed, it must be fetched from memory; it is then stored in the cache hierarchy2.
  • Accessing memory is inherently asynchronous.

    The CPU doesn't need any resource from the core to perform the lengthiest part of a load/store3, and thus it can easily be done in parallel with other tasks4.

Thanks to the above, it makes sense to start a load before the data is actually needed, so that when the code actually needs it, it won't have to wait.

It is worth noting that the CPU can look pretty far ahead for something to do, but not arbitrarily far; so sometimes it needs the help of the programmer to perform optimally.

The cache hierarchy is, by its very nature, an aspect of the micro-architecture, not the architecture (read: ISA). Neither Intel nor AMD can give strong guarantees on what these instructions do.

Furthermore, using them correctly is not easy, as the programmer must have a clear idea of how many cycles each instruction can take.
Finally, the latest CPUs are getting better and better at hiding and lowering memory latency.

So in general prefetching is a job for the skilled assembly programmer.

That said, the only possible scenario is one where the timing of a piece of code must be consistent at every invocation.

For example, if you know that an interrupt handler always updates a state and must perform as fast as possible, it is worth prefetching the state variable when setting up the hardware that uses such an interrupt.
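A sketch of that pattern (the device-setup function and the irq_state structure are made-up names; only the prefetch call is the point):

#include <xmmintrin.h>

struct irq_state { unsigned long count; unsigned long last_timestamp; };
static struct irq_state state;

void setup_device_irq(void)
{
    /* ... program the device registers, unmask the interrupt ... */

    /* Warm the handler's working set so the first interrupt does not
       pay for a cold miss on `state`. */
    _mm_prefetch((const char *)&state, _MM_HINT_T0);
}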

Regarding the different levels of prefetching, my understanding is that different levels (L1 - L4) correspond to different amounts of sharing and polluting.

For example, prefetch0 is good if the thread/core that executes the instruction is the same one that will read the variable.

However, this will take a line in all the caches, eventually evicting other, possibly useful, lines.
You can use this, for example, when you know that you'll surely need the data shortly.

prefetch1 is good for making the data quickly available to a core or a group of cores (depending on how the L2 is shared) without polluting L1.

You can use this if you know that you may need the data, or that you'll need it after having finished another task (one that takes priority in using the cache).

This is not as fast as having the data in L1 but much better than having it in memory.

prefetch2 can be used to remove most of the memory access latency, since it moves the data into the L3 cache.

It doesn't pollute L1 or L2, and since the L3 is shared among cores, it's good for data used by rare (but possible) code paths or for preparing data for other cores.

prefetchnta is the easiest to understand: it is a non-temporal move. It avoids creating an entry in every cache level for data that is accessed only once.

prefetchw/prefetchwt1 are like the others but make the line Exclusive and invalidate other cores' lines that alias this one.

Basically, it makes writing faster, since the line is already in the optimal state of the MESI protocol (for cache coherence).

Finally, a prefetch can be done incrementally: first by moving a line into L3 and then into L1 (just for the threads that need it).
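A sketch of the incremental idea (FAR_AHEAD and NEAR_AHEAD are placeholder distances, in elements, that would need tuning; for brevity a prefetch is issued every element rather than once per cache line, and a few prefetches run harmlessly past the end of the array):

#include <xmmintrin.h>

enum { FAR_AHEAD = 512, NEAR_AHEAD = 32 };

void scale(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Stage far-away data into the shared L3 only... */
        _mm_prefetch((const char *)&in[i + FAR_AHEAD], _MM_HINT_T2);
        /* ...and promote it into L1 shortly before it is used. */
        _mm_prefetch((const char *)&in[i + NEAR_AHEAD], _MM_HINT_T0);
        out[i] = in[i] * 2.0;
    }
}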

In short, each instruction lets you decide the compromise between pollution, sharing, and speed of access.

Since these all require keeping track of the use of the cache very carefully (you need to know that it's not worth creating an entry in the L1 but it is in the L2), their use is limited to very specific environments.

In a modern OS it's not possible to keep track of the cache; you can do a prefetch just to find your quantum expired and your program replaced by another one that evicts the just-loaded line.


As for a concrete example, I'm a bit out of ideas.

In the past, I had to measure the timing of some external event as consistently as possible.

I used an interrupt to periodically monitor the event; in that case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.

Another, unorthodox, use of prefetching is to move data into the cache.

This is useful if you want to test the cache system, or to unmap a device from memory while relying on the cache to keep the data around a bit longer.

In this case moving the data to L3 is enough, but not all CPUs have an L3, so we may need to move it to L2 instead.

I understand these examples are not very good, though.


1 Actually the granularity is "cache lines" not "addresses".

2 Which I assume you are familiar with. Shortly put: at present, it goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and shared by the core's threads; L2 is usually like L1, but some models may have the L2 shared across pairs of cores.

3 The lengthiest part is the data transfer from the RAM. Computing the address and initializing the transaction takes up resources (store buffer slots and TLB entries for example).

4 However, any resource used to access the memory can become a critical issue, as pointed out by @Leeor and proved by the Linux kernel developers.

What is the purpose of the `_mm_clevict` intrinsic and the corresponding clevict0, clevict1 instructions?

As far as I can tell, these instructions were added to the 1st generation Xeon Phi (Knights Corner, KNC) processors to help deal with some very specific performance issues for data motion through the cache hierarchy. It has been quite a while since I looked at the details, but my recollection is that there were some performance problems associated with cache victims, and that throughput was improved if the no-longer-needed lines were evicted from the caches before the cache miss that would cause an eviction.

Idea (1): This might have been due to memory bank conflicts on dirty evictions. E.g., consider what would happen if the address mapping made it too likely that the new item being loaded would be located in a DRAM bank that conflicted with the victim to be discarded. If there were not enough write buffers at the memory controller, the writeback might have to be committed to DRAM before the DRAM could switch banks to service the read. (Newer processors have lots and lots of write buffers in the memory controller, so this is not a problem, but this could have been a problem for KNC.)

Idea (2): Another possibility is that the cache victim processing could delay the read of the new value because of serialization at the Duplicate Tag Directories (DTDs). The coherence protocol was clearly a bit of a "hack" (so that Intel could use the existing P54C with minimal changes), but the high-level documentation Intel provided was not enough to understand the performance implications of some of the implementation details.

The CLEVICT instructions were "local" -- only the core executing the instruction performed the eviction. Dirty cache lines would be written out and locally invalidated, but the invalidation request would not be transmitted to other cores. The instruction set architecture documentation does not comment on whether the CLEVICT instruction results in an update message from the core to the DTD. (This would be necessary for idea (2) to make any change in performance.)
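For reference, a hedged sketch of how the intrinsic maps onto the two instructions (KNC-only; the level argument reuses the _MM_HINT_T0/_MM_HINT_T1 constants, and evict_consumed is a made-up helper):

#include <immintrin.h>   /* _mm_clevict is available on KNC (KNCNI) toolchains only */

/* Once a buffer has been fully consumed, push its lines out of this
   core's L1 and L2 so later demand misses do not wait on victim
   processing. The eviction is local: other cores are not invalidated. */
static void evict_consumed(const char *buf, size_t bytes)
{
    for (size_t off = 0; off < bytes; off += 64) {
        _mm_clevict(buf + off, _MM_HINT_T0);   /* clevict0: evict from L1 */
        _mm_clevict(buf + off, _MM_HINT_T1);   /* clevict1: evict from L2 */
    }
}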

The CLDEMOTE instruction appears to be intended to reduce the latency of cache-to-cache transfers in producer-consumer situations. From the instruction description:
"This may accelerate subsequent accesses to the line by other cores in the same coherence domain, especially if the line was written by the core that demotes the line."
This is very similar to my patent https://patents.google.com/patent/US8099557B2/ "Push for sharing instruction" (developed while I was at AMD).

How to use software prefetch systematically?

How can I see whether software prefetch can help my specific program?

You can check the proportion of cache misses. perf or VTune can be used to get this information thanks to hardware performance counters. You can get the list of counters with perf list, for example. The list depends on the target processor architecture, but there are some generic events, for example L1-dcache-load-misses, LLC-load-misses and LLC-store-misses. Knowing the number of cache misses is not very useful unless you also get the number of loads/stores; there are generic counters for that too, like L1-dcache-loads, LLC-loads or LLC-stores. AFAIK, for the L2 there are no generic counters (at least on Intel processors) and you need to use model-specific hardware counters (for example l2_rqsts.miss on Intel Skylake-like processors). To get the overall statistics, you can use perf stat -e an_hardware_counter,another_one your_program. Good documentation can be found here.

When the proportion of misses is big, then you should try to optimize the code, but this is just a hint. In fact, depending on your application, you can have a lot of cache hits overall but many cache misses in a critical part/phase of your application. As a result, those cache misses can be lost among all the others. This is especially true for L1 cache references, which are massive in scalar codes compared to SIMD ones. One solution is to profile only a specific portion of your application and use your knowledge of it to investigate in the right direction. Performance counters are not really a tool to automatically search for issues in your program, but a tool to assist you in validating or disproving some hypotheses, or to give some hints about what is happening. They give you evidence to solve a mysterious case, but it is up to you, the detective, to do all the work.

How I can locate the loads that suffer from cache misses the most?

Some hardware performance counters are "precise", meaning that the instruction that generated the event can be located. This is very useful since you can tell which instructions are responsible for the most cache misses (though the attribution is not always precise in practice). You can use perf record + perf report to get this information (see the previous tutorial for more details).

Note that there are many reasons that can cause cache misses, and only a few cases can be solved by using software prefetching.

How to see the cache level where misses happen to decide which prefetch(0,1,2) to use?

This is often difficult to choose in practice and very dependent on your application. Theoretically, the number is a hint to tell the processor the intended level of locality of the target cache line (e.g. whether it should be fetched into the L1, L2 or L3 cache). For example, if you know that the data should be read and reused soon, it is a good idea to put it in the L1. However, if the L1 is busy and you do not want to pollute it with data used only once (or rarely), it is better to fetch the data into lower caches. In practice, it is a bit complex, since the behavior may not be the same from one architecture to another... See What are _mm_prefetch() locality hints? for more information.

An example of usage can be found in this question: software prefetching was used to avoid a cache-thrashing issue with some specific strides. This is a pathological case where the hardware prefetcher is not very useful.

Assuming I found a particular load that suffers from the miss in a specific cache level, where should I place prefetch?

This is clearly the trickiest part. You should prefetch the cache lines sufficiently early for the latency to be significantly reduced; otherwise the instruction is useless and can actually be detrimental. Indeed, the instruction takes some space in the program, needs to be decoded, and uses load ports that could be used to execute other (more critical) load instructions, for example. However, if it is done too early, then the cache line may be evicted before it is used and will need to be reloaded...

The usual solution is to write code like this:

for (int i = 0; i < n; i++) {
    // some code
    const size_t magic_distance_guess = 200;
    __builtin_prefetch(&data[i + magic_distance_guess]);
    double x = data[i];
    // some code
}

where magic_distance_guess is a value generally set based on benchmarks (or a very deep understanding of the target platform, though practice often shows that even highly-skilled developers fail to find the best value).

The thing is, the latency is very dependent on where the data is coming from and on the target platform. In most cases developers cannot really know exactly when to do the prefetching unless they work on a single given target platform. This makes software prefetching tricky to use and often detrimental when the target platform changes (one has to consider the maintainability of the code and the overhead of the instruction). Not to mention that the built-ins are compiler-dependent, the prefetching intrinsics are architecture-dependent, and there is no standard, portable way to use software prefetching.

Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries or it will be almost free like a nop if data is already in cache?

Yes, prefetch instructions are not free, so it is better to use only one instruction per cache line (any other prefetch instruction on the same cache line would be useless).
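For example, here is a minimal sketch (my own; assuming 8-byte doubles, 64-byte cache lines, a placeholder distance DIST, and n a multiple of 8) of unrolling so that exactly one prefetch is issued per line, using the same GCC/Clang built-in as above:

#include <stddef.h>

double sum_with_prefetch(const double *a, size_t n)
{
    const size_t DIST = 256;   /* placeholder distance, in elements */
    double sum = 0.0;
    for (size_t i = 0; i < n; i += 8) {     /* 8 doubles = one 64-byte line */
        __builtin_prefetch(&a[i + DIST]);   /* one prefetch per cache line */
        for (size_t j = 0; j < 8; j++)
            sum += a[i + j];
    }
    return sum;
}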

Is it worth to use multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?

This is very dependent on the target platform. Modern mainstream x86-64 processors execute instructions out of order and in parallel, and they have a pretty huge window of instructions under analysis. They tend to execute loads as soon as possible so as to avoid misses, and they are often very good at this job.

In your example loop, I expect the hardware prefetcher to do a very good job, and using software prefetching should be slower on a (relatively recent) mainstream processor.


Software prefetching was useful a decade ago, when hardware prefetchers were not very smart, but nowadays they tend to be very good. Additionally, it is often better to guide the hardware prefetchers than to use software prefetching instructions, since the former have a lower overhead. This is why software prefetching is discouraged (e.g. by Intel and most developers) unless you really know what you are doing.

Is _mm_prefetch asynchronous? Profiling shows a lot of cycles on it

Substantially everything on an out-of-order CPU is "asynchronous" in the way you describe (really, running in parallel and out of order). In that sense, prefetch isn't really different from regular loads, which can also run out of order or "async" with respect to other instructions.

Once that is understood, the exact behavior of prefetch is hardware dependent, but it is my observation that:

  • On Intel, prefetch instructions can retire before their data arrives, so a prefetch instruction that successfully begins execution won't block the CPU pipeline after that. However, note "successfully begins execution": the prefetch instruction still requires a line-fill buffer (MSHR) if it misses in L1, and on Intel it will wait for that resource if none is available. So if you issue a lot of prefetch misses in parallel, they end up waiting for fill buffers, which makes them act quite similarly to vanilla loads in that scenario.

  • On AMD Zen [2], prefetches do not wait for a fill buffer if none is available; presumably, the prefetch is simply dropped. So a large number of prefetch misses behaves quite differently than on Intel: they will complete very quickly regardless of whether they miss or not, but many of the associated lines will not actually be fetched.

How to properly use prefetch instructions?

The short answer, which probably works for perfectly linear streaming loops like yours, is: don't use them at all; let the hardware prefetchers do the work.

Still, it's possible that you can speed things up with software prefetching, and here is the theory and some detail if you want to try...

Basically you call _mm_prefetch() on an address you'll need at some point in the future. It is similar in some respects to loading a value from memory and doing nothing with it: both bring the line into the L1 cache2, but the prefetch intrinsic, which under the covers emits specific prefetch instructions, has some advantages which make it suitable for prefetching.

It works at cache-line granularity1: you only need to issue one prefetch for each cache line; more is just a waste. That means that in general, you should try to unroll your loop enough so that you can issue only one prefetch per cache line. In the case of 16-byte __m128 values, that means unrolling by at least 4 (which you've done, so you are good there).

Then simply prefetch each of your access streams some PF_DIST distance ahead of the current calculation, something like:

for (size_t i = 0; i < 1048576; i += 4) {
    dot0 = _mm_add_ps(dot0, _mm_mul_ps(A[i + 0], B[i + 0]));
    dot1 = _mm_add_ps(dot1, _mm_mul_ps(A[i + 1], B[i + 1]));
    dot2 = _mm_add_ps(dot2, _mm_mul_ps(A[i + 2], B[i + 2]));
    dot3 = _mm_add_ps(dot3, _mm_mul_ps(A[i + 3], B[i + 3]));
    _mm_prefetch((const char *)(A + i + PF_A_DIST), HINT_A);
    _mm_prefetch((const char *)(B + i + PF_B_DIST), HINT_B);
}

Here PF_[A|B]_DIST is the distance to prefetch ahead of the current iteration and HINT_ is the temporal hint to use. Rather than trying to calculate the right distance value from first principles, I would simply determine good values of PF_[A|B]_DIST experimentally4. To reduce the search space, you can start by setting them both equal, since logically a similar distance is likely to be ideal. You might find that prefetching only one of the two streams is ideal.

It is very important to note that the ideal PF_DIST depends on the hardware configuration; not just on the CPU model, but also on the memory configuration, including details such as the snooping mode for multi-socket systems. For example, the best value could be wildly different on client and server chips of the same CPU family. So you should run your tuning experiment on the actual hardware you are targeting, as much as possible. If you target a variety of hardware, you can test on all of it and hopefully find a value that's good everywhere, or even consider compile-time or runtime dispatching depending on the CPU type (not always enough, as above) or based on a runtime test. Now just relying on hardware prefetching is starting to sound a lot better, isn't it?

You can use the same approach to find the best HINT, since the search space is small (only 4 values to try), but here you should be aware that the difference between the different hints (particularly _MM_HINT_NTA) might only show up as a performance difference in code that runs after this loop, since they affect how much data unrelated to this kernel remains in the cache.

You might also find that this prefetching doesn't help at all, since your access patterns are perfectly linear and likely to be handled well by the L2 stream prefetchers. Still, there are some additional, more hardcore things you could try or consider:

  • You might investigate whether prefetching only near 4K page boundaries helps3. This will complicate your loop structure: you'll probably need a nested loop to separate the "near edge of page" and "deep inside the page" cases in order to only issue prefetches near page boundaries. You'll also want to make your input arrays page-aligned, or else it gets even more complicated.
  • You can try disabling some or all of the hardware prefetchers. This is usually terrible for overall performance, but on a highly tuned load with software prefetching, you might see better performance by eliminating interference from hardware prefetching. Selectively disabling prefetching also gives you an important tool to help understand what's going on, even if you ultimately leave all the prefetchers enabled.
  • Make sure you are using huge pages, since for large contiguous blocks like this they are ideal.
  • There are problems with prefetching at the beginning and end of your main calculation loop: at the start, you'll miss prefetching the data at the start of each array (within the initial PF_DIST window), and at the end of the loop you'll prefetch extra data, up to PF_DIST beyond the end of your array. At best these waste fetch and instruction bandwidth, but they may also cause (ultimately discarded) page faults which may affect performance. You can fix both with special intro and outro loops that handle these cases; a sketch follows this list.
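As a sketch of that last point under stated assumptions (a single PF_DIST shared by both streams, measured in __m128 elements as in footnote 4, a multiple of 4, and n a multiple of 4 with n >= PF_DIST), the split into intro/main/outro loops could look like this:

/* Intro: warm the first PF_DIST elements of each stream, one prefetch
   per 64-byte line (4 __m128 values). */
for (size_t i = 0; i < PF_DIST; i += 4) {
    _mm_prefetch((const char *)(A + i), HINT_A);
    _mm_prefetch((const char *)(B + i), HINT_B);
}

/* Main loop: compute and prefetch ahead, never past the end. */
for (size_t i = 0; i < n - PF_DIST; i += 4) {
    dot0 = _mm_add_ps(dot0, _mm_mul_ps(A[i + 0], B[i + 0]));
    dot1 = _mm_add_ps(dot1, _mm_mul_ps(A[i + 1], B[i + 1]));
    dot2 = _mm_add_ps(dot2, _mm_mul_ps(A[i + 2], B[i + 2]));
    dot3 = _mm_add_ps(dot3, _mm_mul_ps(A[i + 3], B[i + 3]));
    _mm_prefetch((const char *)(A + i + PF_DIST), HINT_A);
    _mm_prefetch((const char *)(B + i + PF_DIST), HINT_B);
}

/* Outro: the last PF_DIST elements were already prefetched above. */
for (size_t i = n - PF_DIST; i < n; i += 4) {
    dot0 = _mm_add_ps(dot0, _mm_mul_ps(A[i + 0], B[i + 0]));
    dot1 = _mm_add_ps(dot1, _mm_mul_ps(A[i + 1], B[i + 1]));
    dot2 = _mm_add_ps(dot2, _mm_mul_ps(A[i + 2], B[i + 2]));
    dot3 = _mm_add_ps(dot3, _mm_mul_ps(A[i + 3], B[i + 3]));
}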

I also highly recommend the 5-part blog post Optimizing AMD Opteron Memory Bandwidth, which describes optimizing a problem very similar to yours, and which covers prefetching in some detail (it gave a large boost). Now this is totally different hardware (AMD Opteron) which likely behaves differently to more recent hardware (and especially to Intel hardware if that's what you're using) - but the process of improvement is key and the author is an expert in the field.


1 It may actually work at something like 2-cache-line granularity depending on how it interacts with the adjacent cache line prefetcher(s). In this case, you may be able to get away with issuing half the number of prefetches: one every 128 bytes.

2 In the case of software prefetch, you can also select some other level of cache, using the temporal hint.

3 There is some indication that even with perfect streaming loads, and despite the presence of "next page prefetchers" in modern Intel hardware, page boundaries are still a barrier to hardware prefetching that can be partially alleviated by software prefetching. Maybe because software prefetch serves as a stronger hint that "Yes, I'm going to read into this page", or because software prefetch works at the virtual address level and necessarily involves the translation machinery, while L2 prefetching works at the physical level.

4 Note that the "units" of the PF_DIST value are sizeof(__m128), i.e., 16 bytes, due to the way I calculated the address.


