How Does _Builtin_Clear_Cache Work

How does builtin_clear_cache work?

It is just emitting some weird machine instruction[s] on target processors requiring them (x86 don't need that).

Think of __builtin___clear_cache as a "portable" (to GCC and compatible compilers) way to flush the instruction cache (e.g. in some JIT library).

In practice, how could one find what the right begin and end to use?

To be safe, I would use that on some page range (e.g. obtained with sysconf(_SC_PAGESIZE)....), so usually a 4Kbyte aligned memory range (multiple of 4Kbyte). Otherwise, you want some target specific trick to find the cache line width...

On Linux, you might read /proc/cpuinfo and use the cache_alignment & cache_size lines to get a more precise cache line size and alignment.

BTW, a code using __builtin__clear_cache is very likely to be (for other reasons) target machine specific, so it has or knows some machine parameters (and that should include cache size & alignment).

How to call clear cache on Android Native ARM64?

GCC provide this built-in function

__builtin___clear_cache (void* start, void* end)

that is automatically managed according to the architecture.

Parameters set the range of memory to cache, where start is inclusive and end is exclusive. Every time a new memory area with instructions to execute, the cache should be cleared for that area.

reading

ARM __clear_cache equivalent for iOS devices

As iOS is a *NIX-based platform, and you can compile code with apple's version of GCC (LLVM-GCC 4.2), you should just be able to make a function call to __clear_cache(), like this:

extern void __clear_cache(char *beg, char *end);

__clear_cache(beg, end);

Note that this will NOT work with Apple LLVM Compiler 3.1, only with GCC for some odd reason.

win32 c++ acces violation when executing shellcode

rwx at ...80000 /

rw- at ...80010 - those are in the same page.

x86 memory protection works at page granularity (4096B), so you just changed the whole page to non-executable, including the RWXBuff.

The end of the ...80000 page is 80FFF, and the start of the next is ...81000.

(12 offset-within-page address bits = 3 hex digits.)

Note that MSVC has a FlushInstructionCache function, like GNU C's __builtin____clear_cache(). In GNU C, that's not optional even on x86, despite x86 having coherent instruction cache. The reason you need it for safety after storing machine-code bytes into a buffer is so the compiler knows those stores can't be optimized away (or reordered) if it doesn't see any data accesses to them.

MSVC is the same: you need to call FlushInstructionCache after C assignment of machine code bytes to an array, before dereffing a function-pointer to that array.

It will often work without on x86 (including x86-64), unlike on ARM and other ISAs without coherent I-cache, but it's possible for it not to, especially if you assign different code byte to the buffer and call it again. (That's likely to tempt the compiler into eliminating the dead store and only doing the later store. At least if it's sure the memory is private to this function, not globally reachable from whatever unknown function it's calling.)

Is there a way to flush the entire CPU cache related to a program?

For links to related questions about clearing caches (especially on x86), see the first answer on WBINVD instruction usage.

No, you cannot do this reliably or efficiently with pure ISO C++. It doesn't know or care about CPU caches. The best you could do is touch a lot of memory so everything else ends up getting evicted¹, but this is not what you're really asking for. (Of course, flushing all cache is by definition inefficient...)
See Flushing the cache to prevent benchmarking fluctiations for some tips about implementation details if you go that route.

CPU cache management functions / intrinsics / asm instructions are implementation-specific extensions to the C++ language. But other than inline asm, no C or C++ implementations that I'm aware of provide a way to flush all cache, rather than a range of addresses. That's because it's not a normal thing to do.

On x86, for example, the asm instruction you're looking for is wbinvd. It writes-back any dirty lines before evicting, unlike invd (which drops cache without write-back, useful when leaving cache-as-RAM mode). So in theory wbinvd has no architectural effect, only microarchitectural, but it's so slow that's it's a privileged instruction. As Intel's insn ref manual entry for wbinvd points out, it will increase interrupt latency, because it is not itself interruptible and may have to wait for 8 MiB or more of dirty L3 cache to be flushed. i.e. delaying interrupts for that long can be considered an architectural effect, unlike most timing effects. It's also complicated on a multi-core system because it has to flush caches for all cores.

I don't think there's any way to use it in user-space (ring 3) on x86. Unlike cli / sti and in/out, it's not enabled by the IO-privilege level (which you can set on Linux with an iopl() system call). So wbinvd only works when actually running in ring 0 (i.e. in kernel code). See Privileged Instructions and CPU Ring Levels.

But if you're writing a kernel (or freestanding program that runs in ring0) in GNU C or C++, you could use asm("wbinvd" ::: "memory");. On a computer running actual DOS, normal programs run in real mode (which doesn't have any lower-privilege levels; everything is effectively kernel). That would be another way to run a microbenchmark that needs to run privileged instructions to avoid kernel<->userspace transition overhead for wbinvd, and also has the convenience of running under an OS so you can use a filesystem. Putting your microbenchmark into a Linux kernel module might be easier than booting FreeDOS from a USB stick or something, though. Especially if you want control of turbo frequency stuff.

The only reason I can think of that you might want this is for some kind of experiment to figure out how the internals of a specific CPU are designed. So the details of exactly how it's done are critical. It doesn't make sense to me to even want a portable / generic way to do this.

Or maybe in a kernel before reconfiguring physical memory layout, e.g. so there's now an MMIO region for an ethernet card where there used to be normal DRAM. But in that case your code is already totally arch-specific.

Normally when you want / need to flush caches for correctness reasons, you know which address range needs flushing. e.g. when writing drivers on architectures with DMA that isn't cache coherent, so write-back happens before a DMA read, and doesn't step on a DMA write. (And the eviction part is important for DMA reads, too: you don't want the old cached value). But x86 has cache-coherent DMA these days, because modern designs build the memory controller into the CPU die so system traffic can snoop L3 on the way from PCIe to memory.

The major case outside of drivers where you need to worry about caches is with JIT code-generation on non-x86 architectures with non-coherent instruction caches. If you (or a JIT library) write some machine code into a char[] buffer and cast it to a function pointer, architectures like ARM don't guarantee that code-fetch will "see" that newly-written data.

This is why gcc provides __builtin__clear_cache. It doesn't necessarily flush anything, only makes sure it's safe to execute that memory as code. x86 has instruction caches that are coherent with data caches and supports self-modifying code without any special syncing instructions. See godbolt for x86 and AArch64, and note that __builtin__clear_cache compiles to zero instructions for x86, but has an effect on surrounding code: without it, gcc can optimize away stores to a buffer before casting to a function pointer and calling. (It doesn't realize that data is being used as code, so it thinks they're dead stores and eliminates them.)

Despite the name, __builtin__clear_cache is totally unrelated to wbinvd. It needs an address-range as args so it's not going to flush and invalidate the entire cache. It also doesn't use use clflush, clflushopt, or clwb to actually write-back (and optionally evict) data from cache.

When you need to flush some cache for correctness, you only want to flush a range of addresses, not slow the system down by flushing all the caches.

It rarely if ever makes sense to intentionally flush caches for performance reasons, at least on x86. Sometimes you can use pollution-minimizing prefetch to read data without as much cache pollution, or use NT stores to write around cache. But doing "normal" stuff and then clflushopt after touching some memory for the last time is generally not worth it in normal cases. Like a store, it has to go all the way through the memory hierarchy to make sure it finds and flushes any copy of that line anywhere.

There isn't a light-weight instruction designed as a performance hint, like the opposite of _mm_prefetch.

The only cache-flushing you can do in user-space on x86 is with clflush / clflushopt. (Or with NT stores, which also evict the cache line if it was hot before hand). Or of course creating conflict evictions for known L1d size and associativity, like writing to multiple lines at multiples of 4kiB which all map to the same set in a 32k / 8-way L1d.

There's an Intel intrinsic _mm_clflush(void const *p) wrapper for clflush (and another for clflushopt), but these can only flush cache lines by (virtual) address. You could loop over all the cache lines in all the pages your process has mapped... (But that can only flush your own memory, not cache lines that are caching kernel data, like the kernel stack for your process or its task_struct, so the first system-call will still be faster than if you had flushed everything).

There's a Linux system call wrapper to portably evict a range of addresses: cacheflush(char *addr, int nbytes, int flags). Presumably the implementation on x86 uses clflush or clflushopt in a loop, if it's supported on x86 at all. The man page says it first appeared in MIPS Linux "but
nowadays, Linux provides a cacheflush() system call on some other
architectures, but with different arguments."

I don't think there's a Linux system call that exposes wbinvd, but you could write a kernel module that adds one.

Recent x86 extensions introduced more cache-control instructions, but still only by address to control specific cache lines. The use-case is for non-volatile memory attached directly to the CPU, such as Intel Optane DC Persistent Memory. If you want to commit to persistent storage without making the next read slow, you can use clwb. But note that clwb is not guaranteed to avoid eviction, it's merely allowed to. It might run the same as clflushopt, like may be the case on SKX.

See https://danluu.com/clwb-pcommit/, but note that pcommit isn't required: Intel decided to simplify the ISA before releasing any chips that need it, so clwb or clflushopt + sfence are sufficient. See https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction.

Anyway, this is the kind of cache-control that's relevant for modern CPUs. Whatever experiment you're doing requires ring0 and assembly on x86.

Footnote 1: Touching a lot of memory: pure ISO C++17

You could maybe allocate a very large buffer and then memset it (so those writes will pollute all the (data) caches with that data), then unmap it. If delete or free actually returns the memory to the OS right away, then it will no longer be part of your process's address space, so only a few cache lines of other data will still be hot: probably a line or two of stack (assuming you're on a C++ implementation that uses a stack, as well as running programs under an OS...). And of course this only pollutes data caches, not instruction caches, and as Basile points out, some levels of cache are private per-core, and OSes can migrate processes between CPUs.

Also, beware that using an actual memset or std::fill function call, or a loop that optimizes to that, could be optimized to use cache-bypassing or pollution-reducing stores. And I also implicitly assumed that your code is running on a CPU with write-allocate caches, instead of write-through on store misses (because all modern CPUs are designed this way). x86 supports WT memory regions on a per-page basis, but mainstream OSes use WB pages for all "normal" memory.

Doing something that can't optimize away and touches a lot of memory (e.g. a prime sieve with a long array instead of a bitmap) would be more reliable, but of course still dependent on cache pollution to evict other data. Just reading large amounts of data isn't reliable either; some CPUs implement adaptive replacement policies that reduce pollution from sequential accesses, so looping over a big array hopefully doesn't evict lots of useful data. E.g. the L3 cache in Intel IvyBridge and later does this.

How Does _Builtin_Clear_Cache Work