Do Current X86 Architectures Support Non-Temporal Loads (From "Normal" Memory)

What is the meaning of non-temporal memory accesses in x86?

Non-temporal SSE instructions (MOVNTI, MOVNTQ, etc.) don't follow the normal cache-coherency rules. Therefore, non-temporal stores must be followed by an SFENCE instruction in order for their results to be seen by other processors in a timely fashion.

When data is produced and not (immediately) consumed again, the fact that memory store operations read a full cache line first and then modify the cached data is detrimental to performance. This operation pushes data out of the caches which might be needed again in favor of data which will not be used soon. This is especially true for large data structures, like matrices, which are filled and then used later. Before the last element of the matrix is filled the sheer size evicts the first elements, making caching of the writes ineffective.

For this and similar situations, processors provide support for non-temporal write operations. Non-temporal in this context means the data will not be reused soon, so there is no reason to cache it. These non-temporal write operations do not read a cache line and then modify it; instead, the new content is directly written to memory.

Source: http://lwn.net/Articles/255364/
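The pattern described above can be sketched with SSE2 intrinsics (the function name `fill_nt` is illustrative, not from the source): `_mm_stream_si32` compiles to movnti, and the trailing `_mm_sfence` is the ordering requirement mentioned in the first paragraph.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si32 (movnti), _mm_sfence */
#include <stddef.h>

/* Fill a buffer with non-temporal stores.  Unlike normal stores, these
 * don't read each cache line first (no read-for-ownership), and they
 * don't leave the written data in the cache. */
void fill_nt(int *dst, int value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], value);  /* movnti: NT store */
    _mm_sfence();  /* order the weakly-ordered NT stores before any
                    * later store that publishes the data */
}
```

For small buffers this is a pessimization; NT stores only pay off when the data is larger than the caches and won't be re-read soon.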

Why is PREFETCHNTA qualified by "must be write-back memory type"?

In a user-space process under a mainstream OS, all your memory will be WB (Write Back) cacheable.

Unless you use special system calls to do something like mapping video RAM into your virtual address space. If you aren't doing that, you definitely have write-back memory.

All discussion of other memory types in other answers is just for completeness / to avoid saying things that aren't true in all cases. Or to explain what stuff like SSE4.1 movntdqa NT load is actually for. It's useless on WB memory (on current hardware).

(NT prefetch is very different from NT load.)

Can we use non-temporal mov instructions on heap memory?

You can use NT stores like movntps on normal WB memory (i.e. the heap). See also Enhanced REP MOVSB for memcpy for more about NT stores vs. normal stores.

The CPU treats the memory as WC for the purposes of those NT stores, even though the MTRR and/or PAT have it set to normal WB.

The Intel docs are telling you that NT stores "work" on WB, WT, and WC memory. (But not on strongly-ordered UC uncacheable memory, and of course not on WP write-protected memory.)


You are correct that normally only video RAM (or possibly other similar device-memory regions) are mapped WC. And no, you can't easily allocate WC memory in a user-space process under a normal OS like Linux, but you wouldn't normally want to.

You can only use SSE4 NT loads on WC memory (otherwise current CPUs ignore the NT hint), but some cache pollution for loads is a small price to pay for HW prefetch and caching working. You can use NT prefetch from WB memory to reduce pollution in some levels of cache, e.g. bypassing L2. But that's hard to tune.

IIRC, normal stores like mov on WC memory have the store-merging behaviour you get from NT stores. But you don't need to use WC memory for NT stores to work.
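A minimal sketch of NT stores to ordinary heap memory (names are illustrative; assumes C11 for `aligned_alloc`). The only extra requirement versus the heap you get from malloc is alignment: movntps needs 16-byte-aligned addresses.

```c
#include <stdlib.h>     /* aligned_alloc (C11), free */
#include <xmmintrin.h>  /* SSE: _mm_stream_ps (movntps), _mm_sfence */

/* Allocate a float buffer on the heap (normal WB memory) and zero it
 * with NT vector stores.  n must be a multiple of 4. */
float *make_zeroed_nt(size_t n)
{
    float *p = aligned_alloc(16, n * sizeof(float));  /* movntps needs 16B alignment */
    if (!p)
        return NULL;
    __m128 zero = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(&p[i], zero);  /* NT store straight to a heap cache line */
    _mm_sfence();
    return p;
}
```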

x86 Non-Temporal Instructions: Is fencing ever needed for thread-local data?

Yes, they will be visible without fences. See section 8.2.2 Memory Ordering in P6 and More Recent Processor Families in the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1, which says, among other things:

for memory regions defined as write-back cacheable, [...]
Reads may be reordered with older writes to different locations but
not with older writes to the same location.

and

Writes to memory are not reordered with other writes, with the
following exceptions:
-- streaming stores (writes) executed with the non-temporal move instructions
(MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD);
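A trivial single-threaded sketch of the point above (illustrative name): within one thread, program order alone guarantees that a later load sees an earlier NT store, so no fence is needed for thread-local data.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si32 */

static int slot;

/* NT store followed by a same-thread load: no SFENCE required.
 * Fences only matter for the order that OTHER cores observe. */
int nt_store_then_load(void)
{
    _mm_stream_si32(&slot, 123);  /* movnti, deliberately unfenced */
    return slot;                  /* same thread: guaranteed to see 123 */
}
```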

Does anyone have an example where _mm256_stream_load_si256 (a non-temporal load, intended to bypass the cache) actually improves performance?

stream_load (vmovntdqa) is just a slower version of normal load (extra ALU uop) unless you use it on a WC memory region (uncacheable, write-combining).

The non-temporal hint is ignored by current CPUs, because unlike NT stores, the instruction doesn't override the memory ordering semantics. We know that's true on Intel CPUs, and your test results suggest the same is true on AMD.

Its purpose is for copying from video RAM back to main memory, as in an Intel whitepaper. It's useless unless you're copying from some kind of uncacheable device memory. (On current CPUs).

See also What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region? for more details. As my answer there points out, what can sometimes help if tuned carefully for your hardware and workload, is NT prefetch to reduce cache pollution. But tuning the prefetch distance is pretty brittle; too far and data will be fully evicted by the time you read it, instead of just missing L1 and hitting in L2.
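A sketch of that NT-prefetch idea (names and the prefetch distance are made up; the distance is exactly the brittle tuning knob described above):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA (prefetchnta) */

#define PF_DIST 16  /* elements ahead -- hypothetical value, needs per-machine tuning */

/* Stream through a large array, using NT prefetch to pull upcoming
 * lines in while limiting how much of the cache hierarchy they pollute. */
float sum_with_nt_prefetch(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_NTA);
        s += a[i];
    }
    return s;
}
```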

There wouldn't be much if anything to gain in bandwidth anyway. Normal stores cost a read plus an eventual write on eviction for each cache line. The Read For Ownership (RFO) is required for cache coherency, and because write-back caches only track dirty status on a whole-line basis. NT stores can increase bandwidth by avoiding those loads.

But plain loads aren't wasting anything; the only downside is evicting other data as you loop over huge arrays generating boatloads of cache misses, if you can't change your algorithm to have some locality.


If cache-blocking is possible for your algorithm, there's much more to gain from that, so you don't just bottleneck on DRAM bandwidth. e.g. do multiple steps over a subset of your data, then move on to the next.
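A sketch of that cache-blocking idea ("do multiple steps over a subset, then move on"), with illustrative names and an arbitrary block size:

```c
#include <stddef.h>

enum { BLOCK = 4096 };  /* elements per block; pick so a block fits in L1d or L2 */

/* Two processing steps, applied block by block: the second step finds
 * its data still hot in cache, instead of re-streaming the whole array
 * through DRAM for each pass. */
void two_steps_blocked(float *a, size_t n)
{
    for (size_t base = 0; base < n; base += BLOCK) {
        size_t end = (base + BLOCK < n) ? base + BLOCK : n;
        for (size_t i = base; i < end; i++)  /* step 1 on this block */
            a[i] *= 2.0f;
        for (size_t i = base; i < end; i++)  /* step 2 while it's still cached */
            a[i] += 1.0f;
    }
}
```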

See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - most of it; go read Ulrich Drepper's paper.

Anything you can do to increase computational intensity helps (ALU work per time the data is loaded into L1d cache, or into registers).

Even better, make a custom loop that combines multiple steps that you were going to do on each element. Avoid stuff like for(i) A[i] = sqrt(B[i]) if there is an earlier or later step that also does something simple to each element of the same array.
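For example, a fused version of that sqrt loop might look like this (illustrative; it uses the sqrtss intrinsic so the sketch is self-contained):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_sqrt_ss, _mm_set_ss, _mm_cvtss_f32 */

/* Fused loop: compute sqrt and the follow-up step in one pass, so each
 * element of b makes one trip into registers instead of the arrays
 * making two full trips through the cache hierarchy. */
void sqrt_plus_one_fused(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float r = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(b[i])));  /* sqrtss */
        a[i] = r + 1.0f;  /* the "later step", done while r is in a register */
    }
}
```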

If you're using NumPy or something, and just gluing together optimized building blocks that operate on large arrays, it's kind of expected that you'll bottleneck on memory bandwidth for algorithms with low computational intensity (like STREAM add or triad type of things).

If you're using C with intrinsics, you should be aiming higher. You might still bottleneck on memory bandwidth, but your goal should be to saturate the ALUs, or at least bottleneck on L2 cache bandwidth.

Sometimes it's hard, or you haven't gotten around to all the optimizations on your TODO list that you can think of, so NT stores can be good for memory bandwidth if nothing is going to re-read this data any time soon. But consider that a sign of failure, not success. CPUs have large fast caches, use them.



Further reading:

  • Enhanced REP MOVSB for memcpy - RFO vs. no-RFO stores (including NT stores), and how per-core memory bandwidth can be limited to the latency-bandwidth product given latency of handing off cache lines to lower levels and the number of LFBs to track them. Especially on Intel server chips.

  • Non-temporal loads and the hardware prefetcher, do they work together? - no, NT loads are only useful on WC memory, where HW prefetch doesn't work. They kind of exist to fill that gap.

What happens with a non-temporal store if the data is already in cache?

All of the behaviors you describe are sensible implementations of a non-temporal store. In practice, on modern x86 CPUs, there's no effect on the L1 cache, but the L2 (and higher-level caches, if any) will not evict a cache line to make room for the non-temporally stored data.

There is no data race, because the caches are hardware-coherent. This coherence is not affected in any way by the decision to evict a cache line.
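A tiny sketch of why that's safe (illustrative name): even when the target line is already cached, an NT store stays coherent with subsequent loads.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si32 */

static int x = 1;

/* Warm the line into cache with a read, then hit it with an NT store.
 * Hardware coherence guarantees the following load sees the new value,
 * whatever the implementation did with the cached copy. */
int nt_store_to_cached_line(void)
{
    int warm = x;             /* line is now cached (reads 1) */
    _mm_stream_si32(&x, 42);  /* NT store to the already-cached line */
    return warm + x;          /* coherent: 1 + 42 = 43 */
}
```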

Does `xchg` encompass `mfence` assuming no non-temporal instructions?

Assuming you're not writing a device-driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes xchg is as strong as mfence.

NT stores are fine.

I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg is a very strong full memory barrier.

Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).


From your quote:

(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

That's the one thing that makes xchg [mem] weaker than mfence (on Pentium 4? Probably also on Sandybridge-family).

mfence does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (Are loads and stores the only instructions that gets reordered?, and also the answer you linked on Does lock xchg have the same behavior as mfence?)

NT stores are serialized by xchg / lock, it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem] on WB memory is still strongly-ordered (and on current implementations, also ignores the NT hint instead of doing anything to reduce cache pollution).


It looks like xchg performs better for seq-cst stores than mov+mfence on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
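In C11 terms, the "xchg as a seq-cst store" choice is one the compiler makes for you; a sketch (names illustrative):

```c
#include <stdatomic.h>

static _Atomic int ready;

/* A sequentially-consistent store.  On x86, clang (and MSVC) compile
 * this to a single `xchg`, which is an implicit full barrier; gcc
 * historically emitted `mov` + `mfence` instead.  Both are correct. */
void publish(void)
{
    atomic_store_explicit(&ready, 1, memory_order_seq_cst);
}

int is_ready(void)
{
    return atomic_load_explicit(&ready, memory_order_seq_cst);
}
```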


These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there were some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.

I assume but haven't checked that the xchg vs. mfence situation is the same on AMD. I'm sure there's no correctness problem with using xchg as a seq-cst store, because that's what compilers other than gcc actually do.


