Is It Necessary to Flush Write Combine Memory Explicitly by Programmer

Is it necessary to flush write combine memory explicitly by programmer?

According to Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1 (August 2012 version, but this should not have changed), Section 11.3.1, the buffer must be flushed:

The protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for system memory coherency. When using the WC memory type, software must be sensitive to the fact that the writing of data to system memory is being delayed and must deliberately empty the WC buffers when system memory coherency is required.

If the graphics drivers did not actually flush the write-combining buffers, then they were depending on system-specific timing and/or buffer sizes (and on the assumption that subsequent WC writes will be allocated into the buffer, which is not architecturally guaranteed). This may work (or appear to work) on existing systems under ordinary workloads, but it is not architecturally guaranteed to work.

Since a broad range of serializing events will flush the write combining buffers, it is quite possible that the flush operation/event is present but not obvious (as an SFENCE would be). From Intel® 64 and IA-32 Architectures Software Developer’s Manual (version 052, September 2014), Volume 3, Section 11.3 Methods of Caching Available:

If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.

For example, a write to a GPU register (if mapped to uncached memory) would flush the write combining buffer.

Flush the write-combining buffer

In an answer to a different StackOverflow question, Paul A. Clayton includes the following quote from Section 11.3 of the Intel manual:

If the WC buffer is partially filled, the writes may be delayed until
the next occurrence of a serializing event; such as, an SFENCE or
MFENCE instruction, CPUID execution, a read or write to uncached
memory, an interrupt occurrence, or a LOCK instruction execution.

Thus, Intel guarantees that either an sfence or an mfence will flush the write-combining buffer.
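
As a minimal sketch (in C with SSE2 intrinsics; the wc_buf and doorbell pointers are hypothetical mappings, not from the quoted answer), a driver-style submit path might use sfence to drain the WC buffers before ringing a device doorbell:

    #include <emmintrin.h>   /* _mm_sfence (SSE2) */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill a WC-mapped buffer (e.g. a command ring the device will read),
     * then sfence to drain the partially filled write-combining buffers
     * before the doorbell write that tells the device to go look at it. */
    static void submit_to_device(volatile uint32_t *wc_buf, const uint32_t *src,
                                 size_t n, volatile uint32_t *doorbell)
    {
        for (size_t i = 0; i < n; i++)
            wc_buf[i] = src[i];   /* stores to WC memory combine in WC buffers */
        _mm_sfence();             /* guaranteed to flush the WC buffers */
        *doorbell = 1;            /* e.g. a UC-mapped device register */
    }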

Why does CLFLUSH exist in x86?

I think the main use-case is non-volatile DIMMs, especially Intel's Optane DC PM. It's normally mapped WB-cacheable, so it requires explicit flushes (or movnt stores) to make sure data is persisted to non-volatile storage.

(But clflush was introduced at the same time as SSE2, back in Pentium 4 days. I don't know what the idea was there; possibly explicit cache control for performance reasons, like the opposite of prefetch.)

Skylake introduced the weakly-ordered, higher-performance CLFLUSHOPT because it's useful for non-volatile storage hooked up to the memory hierarchy directly. Flushing cache makes sure data is written out to actual memory, not still dirty in the CPU.

See also this SuperUser answer for some links and background on Optane DC PM (Persistent Memory). It's non-volatile storage in physical address-space, not just in virtual address space with software tricks.

Dan Luu's article on clwb and pcommit is interesting: the benefits of taking the OS out of the way for access to storage, detailing Intel's plans at that point for clflush / clwb and their memory-ordering semantics. It was written while Intel was still planning to require an instruction called pcommit (persistent commit) as part of this process, but Intel later decided to remove that instruction: Deprecating the PCOMMIT Instruction (from Intel) has some interesting info about why, and how things work under the hood.


It potentially also matters for non-cache-coherent DMA to devices, if anything can do that in x86. (But x86 has always had cache-coherent DMA, since the first x86 CPUs with caches, to avoid breaking existing software.)

Apparently it's not possible to map MMIO / PCIe device memory regions as write-back (WB) cacheable (see "how to do mmap for cacheable PCIe BAR"). Maybe the P4 architects were considering that future possibility when they introduced clflush.

In that previous link, Dr. Bandwidth mentions a partial workaround that actually involves needing CLFLUSH to maintain correctness:

map the MMIO range twice -- once for store operations from the processor to the FPGA using the Write-Combining (WC) memory type, and once for reads from the processor to the FPGA using the Write Protect (WP) or Write Through (WT) types. You will need to maintain coherence manually by using CLFLUSH on cache lines in the "read only" region when you write to the alias of that line in the "write only" region.

So it is possible to create a situation where you might need clflush, other than for NV-DIMM.
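
A rough sketch of that workaround, assuming two hypothetical aliases of the same FPGA BAR (wc_base mapped WC for stores, rd_base mapped WT or WP for loads); the names and the 8-byte access width are illustrative, not from Dr. Bandwidth's post:

    #include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* wc_base: the BAR mapped WC, used only for stores.
     * rd_base: the same BAR mapped WT or WP, used only for loads.
     * After storing through the WC alias, evict the (possibly stale) cached
     * copy of that line in the read-side alias so the next load refetches it. */
    static void write_then_invalidate(volatile uint64_t *wc_base,
                                      volatile uint64_t *rd_base,
                                      size_t qword_index, uint64_t value)
    {
        wc_base[qword_index] = value;                      /* store via WC alias */
        _mm_sfence();                                      /* drain WC buffers */
        _mm_clflush((const void *)&rd_base[qword_index]);  /* drop stale read-side line */
    }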

Are CPU caches flushed to memory during I/O?

On x86 and x86-64, all caches are coherent and you'll get 42 on T2 even if the two threads do not share the same cache unit.

Your thought experiment can be further reduced to 2 cases: the two threads share the same cache unit (multi-core) or do not share (multi-cpu).

When they share a cache unit, both T1 and T2 will use the same cache, therefore they will both see 42 without any synchronization to memory.

In case the caches are not shared (i.e., multi-CPU), the ISA requires that the cache units be kept coherent, and that coherence is transparent to software. Both threads will see 42 at the same address. This synchronization introduces some overhead, though, which is one reason multi-core designs are preferred nowadays (besides the fact that cache is expensive).

Generating a 64-byte read PCIe TLP from an x86 CPU

Intel has a white-paper on copying from video RAM to main memory; this should be similar but a lot simpler (because the data fits in 2 or 4 vector registers).

It says that NT loads will pull a whole cache-line of data from WC memory into a LFB:

Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.

Use AVX2 _mm256_stream_load_si256() or the SSE4.1/AVX1 128-bit version.

Fill-buffers are a limited resource, so you definitely want the compiler to generate asm that does the two aligned loads of a 64-byte cache line back to back, then stores the data to regular memory.

If you're doing more than one 64-byte block at a time, see Intel's white-paper for a suggestion on using a small bounce buffer that stays hot in L1d to avoid mixing stores to DRAM with NT loads. (L1d evictions to DRAM, like NT stores, also require line-fill buffers, LFBs).
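
As a sketch, copying a single aligned 64-byte block from a WC mapping with two back-to-back 32-byte streaming loads might look like this (the src/dst pointers are hypothetical, and 64-byte alignment of src is assumed):

    #include <immintrin.h>   /* _mm256_stream_load_si256 (AVX2) */

    /* Copy one 64-byte-aligned block from a WC mapping (src) to ordinary WB
     * memory (dst) using two back-to-back streaming loads, so both 32-byte
     * halves come out of the same line-fill buffer. */
    static inline void copy64_from_wc(void *dst, void *src)
    {
        __m256i lo = _mm256_stream_load_si256((__m256i *)src);
        __m256i hi = _mm256_stream_load_si256((__m256i *)src + 1);
        _mm256_store_si256((__m256i *)dst,     lo);
        _mm256_store_si256((__m256i *)dst + 1, hi);
    }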


Note that _mm256_stream_load_si256() is not useful at all on memory types other than WC. The NT hint is ignored on current hardware, but it costs an extra ALU uop anyway vs. a regular load. There is prefetchnta, but that's a totally different beast.

Is there a way to flush the entire CPU cache related to a program?

For links to related questions about clearing caches (especially on x86), see the first answer on WBINVD instruction usage.


No, you cannot do this reliably or efficiently with pure ISO C++. It doesn't know or care about CPU caches. The best you could do is touch a lot of memory so everything else ends up getting evicted (see footnote 1), but this is not what you're really asking for. (Of course, flushing all cache is by definition inefficient...)
See Flushing the cache to prevent benchmarking fluctuations for some tips about implementation details if you go that route.

CPU cache management functions / intrinsics / asm instructions are implementation-specific extensions to the C++ language. But other than inline asm, no C or C++ implementations that I'm aware of provide a way to flush all cache, rather than a range of addresses. That's because it's not a normal thing to do.


On x86, for example, the asm instruction you're looking for is wbinvd. It writes back any dirty lines before evicting, unlike invd (which drops cache without write-back, useful when leaving cache-as-RAM mode). So in theory wbinvd has no architectural effect, only microarchitectural, but it's so slow that it's a privileged instruction. As Intel's insn ref manual entry for wbinvd points out, it will increase interrupt latency, because it is not itself interruptible and may have to wait for 8 MiB or more of dirty L3 cache to be flushed. i.e. delaying interrupts for that long can be considered an architectural effect, unlike most timing effects. It's also complicated on a multi-core system because it has to flush caches for all cores.

I don't think there's any way to use it in user-space (ring 3) on x86. Unlike cli / sti and in/out, it's not enabled by the IO-privilege level (which you can set on Linux with an iopl() system call). So wbinvd only works when actually running in ring 0 (i.e. in kernel code). See Privileged Instructions and CPU Ring Levels.

But if you're writing a kernel (or freestanding program that runs in ring0) in GNU C or C++, you could use asm("wbinvd" ::: "memory");. On a computer running actual DOS, normal programs run in real mode (which doesn't have any lower-privilege levels; everything is effectively kernel). That would be another way to run a microbenchmark that needs to run privileged instructions to avoid kernel<->userspace transition overhead for wbinvd, and also has the convenience of running under an OS so you can use a filesystem. Putting your microbenchmark into a Linux kernel module might be easier than booting FreeDOS from a USB stick or something, though. Especially if you want control of turbo frequency stuff.
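
A minimal, illustrative sketch of the kernel-module route (the module name is made up; it just runs the GNU C asm statement from above once at load time, in ring 0):

    /* wbinvd_demo.c -- illustrative kernel module; wbinvd needs ring 0. */
    #include <linux/module.h>
    #include <linux/init.h>

    static int __init wbinvd_demo_init(void)
    {
        asm volatile("wbinvd" ::: "memory");  /* write back + invalidate this CPU's caches */
        pr_info("wbinvd_demo: wbinvd executed\n");
        return 0;
    }

    static void __exit wbinvd_demo_exit(void) { }

    module_init(wbinvd_demo_init);
    module_exit(wbinvd_demo_exit);
    MODULE_LICENSE("GPL");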


The only reason I can think of that you might want this is for some kind of experiment to figure out how the internals of a specific CPU are designed. So the details of exactly how it's done are critical. It doesn't make sense to me to even want a portable / generic way to do this.

Or maybe in a kernel before reconfiguring physical memory layout, e.g. so there's now an MMIO region for an ethernet card where there used to be normal DRAM. But in that case your code is already totally arch-specific.


Normally when you want / need to flush caches for correctness reasons, you know which address range needs flushing. e.g. when writing drivers on architectures with DMA that isn't cache coherent, so write-back happens before a DMA read, and doesn't step on a DMA write. (And the eviction part is important for DMA reads, too: you don't want the old cached value). But x86 has cache-coherent DMA these days, because modern designs build the memory controller into the CPU die so system traffic can snoop L3 on the way from PCIe to memory.

The major case outside of drivers where you need to worry about caches is with JIT code-generation on non-x86 architectures with non-coherent instruction caches. If you (or a JIT library) write some machine code into a char[] buffer and cast it to a function pointer, architectures like ARM don't guarantee that code-fetch will "see" that newly-written data.

This is why gcc provides __builtin___clear_cache. It doesn't necessarily flush anything, it only makes sure it's safe to execute that memory as code. x86 has instruction caches that are coherent with data caches and supports self-modifying code without any special syncing instructions. See godbolt for x86 and AArch64, and note that __builtin___clear_cache compiles to zero instructions for x86, but has an effect on surrounding code: without it, gcc can optimize away stores to a buffer before casting to a function pointer and calling. (It doesn't realize that data is being used as code, so it thinks they're dead stores and eliminates them.)
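
Here's a minimal sketch of that JIT pattern for x86-64 Linux (the machine-code bytes encode mov eax, 42 ; ret; error handling is omitted):

    #include <sys/mman.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        /* x86-64 machine code for: mov eax, 42 ; ret */
        static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy(buf, code, sizeof(code));

        /* Emits no instructions on x86, but tells the compiler the stored
         * bytes will be executed, so the memcpy isn't treated as dead. */
        __builtin___clear_cache((char *)buf, (char *)buf + sizeof(code));

        int (*fn)(void) = (int (*)(void))buf;
        printf("%d\n", fn());   /* prints 42 */
        return 0;
    }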

Despite the name, __builtin___clear_cache is totally unrelated to wbinvd. It takes an address range as args, so it's not going to flush and invalidate the entire cache. It also doesn't use clflush, clflushopt, or clwb to actually write back (and optionally evict) data from cache.

When you need to flush some cache for correctness, you only want to flush a range of addresses, not slow the system down by flushing all the caches.


It rarely if ever makes sense to intentionally flush caches for performance reasons, at least on x86. Sometimes you can use pollution-minimizing prefetch to read data without as much cache pollution, or use NT stores to write around cache. But doing "normal" stuff and then clflushopt after touching some memory for the last time is generally not worth it in normal cases. Like a store, it has to go all the way through the memory hierarchy to make sure it finds and flushes any copy of that line anywhere.

There isn't a light-weight instruction designed as a performance hint, like the opposite of _mm_prefetch.


The only cache-flushing you can do in user-space on x86 is with clflush / clflushopt. (Or with NT stores, which also evict the cache line if it was hot beforehand.) Or of course creating conflict evictions for known L1d size and associativity, like writing to multiple lines at multiples of 4 KiB which all map to the same set in a 32 KiB / 8-way L1d.

There's an Intel intrinsic _mm_clflush(void const *p) wrapper for clflush (and another for clflushopt), but these can only flush cache lines by (virtual) address. You could loop over all the cache lines in all the pages your process has mapped... (But that can only flush your own memory, not cache lines that are caching kernel data, like the kernel stack for your process or its task_struct, so the first system-call will still be faster than if you had flushed everything).
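
Looping over a range with _mm_clflush might look like this sketch (a 64-byte line size is assumed; the _mm_mfence at the end is one way to order the flushes against later accesses):

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Flush every cache line overlapping [p, p+len); 64-byte lines assumed. */
    static void flush_range(const void *p, size_t len)
    {
        uintptr_t line = (uintptr_t)p & ~(uintptr_t)63;
        uintptr_t end  = (uintptr_t)p + len;
        for (; line < end; line += 64)
            _mm_clflush((const void *)line);
        _mm_mfence();   /* order the flushes before later loads/stores */
    }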

There's a Linux system call wrapper to portably evict a range of addresses: cacheflush(char *addr, int nbytes, int flags). Presumably the implementation on x86 uses clflush or clflushopt in a loop, if it's supported on x86 at all. The man page says it first appeared in MIPS Linux, but "nowadays, Linux provides a cacheflush() system call on some other architectures, but with different arguments."

I don't think there's a Linux system call that exposes wbinvd, but you could write a kernel module that adds one.


Recent x86 extensions introduced more cache-control instructions, but still only by address, to control specific cache lines. The use-case is for non-volatile memory attached directly to the CPU, such as Intel Optane DC Persistent Memory. If you want to commit to persistent storage without making the next read slow, you can use clwb. But note that clwb is not guaranteed to avoid eviction; it's merely allowed to. It might run the same as clflushopt, as may be the case on SKX.

See https://danluu.com/clwb-pcommit/, but note that pcommit isn't required: Intel decided to simplify the ISA before releasing any chips that need it, so clwb or clflushopt + sfence are sufficient. See https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction.
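
As a sketch of the clwb + sfence pattern (the pmem_dst pointer is assumed to be a WB-cacheable mapping of persistent memory; 64-byte lines and compiling with -mclwb are assumed; this is not a complete persistence library):

    #include <immintrin.h>   /* _mm_clwb (compile with -mclwb), _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Copy a record into a WB-cacheable persistent-memory mapping, then write
     * back each dirty line with clwb and fence so the write-backs are ordered
     * before anything that relies on the data being durable. */
    static void persist_copy(void *pmem_dst, const void *src, size_t len)
    {
        memcpy(pmem_dst, src, len);
        uintptr_t line = (uintptr_t)pmem_dst & ~(uintptr_t)63;
        uintptr_t end  = (uintptr_t)pmem_dst + len;
        for (; line < end; line += 64)
            _mm_clwb((void *)line);
        _mm_sfence();
    }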

Anyway, this is the kind of cache-control that's relevant for modern CPUs. Any experiment that needs a full cache flush requires ring 0 and assembly on x86.


Footnote 1: Touching a lot of memory: pure ISO C++17

You could maybe allocate a very large buffer and then memset it (so those writes will pollute all the (data) caches with that data), then unmap it. If delete or free actually returns the memory to the OS right away, then it will no longer be part of your process's address space, so only a few cache lines of other data will still be hot: probably a line or two of stack (assuming you're on a C++ implementation that uses a stack, as well as running programs under an OS...). And of course this only pollutes data caches, not instruction caches, and as Basile points out, some levels of cache are private per-core, and OSes can migrate processes between CPUs.

Also, beware that using an actual memset or std::fill function call, or a loop that optimizes to that, could be optimized to use cache-bypassing or pollution-reducing stores. And I also implicitly assumed that your code is running on a CPU with write-allocate caches, instead of write-through on store misses (because all modern CPUs are designed this way). x86 supports WT memory regions on a per-page basis, but mainstream OSes use WB pages for all "normal" memory.

Doing something that can't optimize away and touches a lot of memory (e.g. a prime sieve with a long array instead of a bitmap) would be more reliable, but of course still dependent on cache pollution to evict other data. Just reading large amounts of data isn't reliable either; some CPUs implement adaptive replacement policies that reduce pollution from sequential accesses, so looping over a big array hopefully doesn't evict lots of useful data. E.g. the L3 cache in Intel IvyBridge and later does this.
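
For example, a rough sketch of the sieve approach in C (the 64 MiB size is an arbitrary assumption meant to exceed a typical last-level cache; this only pollutes data caches and guarantees nothing about which lines actually get evicted):

    #include <stdlib.h>
    #include <string.h>

    /* Run a prime sieve over a buffer assumed to be much larger than the LLC.
     * The data-dependent reads and writes are hard for the compiler to turn
     * into cache-bypassing stores, so most of the buffer passes through the
     * caches -- but nothing in particular is guaranteed to be evicted. */
    static void pollute_caches(void)
    {
        const size_t n = 64 * 1024 * 1024;   /* 64 MiB: assumed > last-level cache */
        unsigned char *sieve = malloc(n);
        if (!sieve)
            return;
        memset(sieve, 1, n);
        for (size_t i = 2; i * i < n; i++)
            if (sieve[i])
                for (size_t j = i * i; j < n; j += i)
                    sieve[j] = 0;
        free(sieve);
    }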


