How to Disable CPU Cache (L1/L2) on a Linux System

Disable CPU caches (L1/L2) on ARMv8-A Linux

In general, this is unworkable, for several reasons.

Firstly, clearing the SCTLR.C bit only makes all data accesses non-cacheable. and prevents allocating into any caches. Any data in the caches is still there in the caches, especially dirty lines from anything recently-written; consider what happens when your function returns and the caller tries to restore a stack frame which doesn't even exist in the memory it's now accessing.

Secondly, there are very few uniprocessor ARMv8 systems; assuming you're running SMP Linux, and suddenly disable the caches on just whichever CPU the module loader happened to be scheduled on, then even disregarding the first point things are going to go downhill very fast. Linux expects all CPUs to be coherent with each other, and will typically become very broken very rapidly if that assumption is violated. Note that it's not even worth venturing into SMP cross-calling for this; suffice to say the only safe way to even attempt to run Linux with caches disabled is to make sure they are never enabled to begin with, except...

Thirdly, there is no guarantee Linux will even run with caches disabled. On current hardware, all of the locking and atomic operations in the kernel (not to mention userspace) rely on the exclusive access instructions. Whilst the CPU cluster(s) will implement the architecturally-required local and global exclusive monitors for cacheable memory (usually as part of the cache machinery itself), it is dependent on the system whether a global exclusive monitor for non-cacheable accesses is implemented, as such a thing must be external to the CPU (usually in the interconnect or memory controller). Many systems don't implement such a global monitor, in which case exclusive accesses to external memory may fault, do nothing, or other various implementation-defined behaviours which will result in Linux crashing or deadlocking. It is effectively impossible to run Linux with the cache off on such a system - the amount of hacking just to get a UP arm64 kernel to work (SMP would be literally impossible) would be impractical alone; good luck with userspace.

As it happens, though, the worst problem is none of that, it's this:

...in order to measure performance of optimized code, independent of cache access.

If the code is intended to run in deployment with caches disabled, then logically it can't be intended to run under Linux, therefore the effort spent in hacking up Linux would be better spent on benchmarking in a more realistic execution environment so that results are actually representative. On the other hand, if it is intended to run with caches enabled (under Linux or any other OS), then benchmarking with caches disabled will give meaningless results and be a waste of time. "Optimising" for e.g. an instruction-fetch-bandwidth bottleneck which won't exist in practice is not going to lead you in the right direction.

How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?

The Intel's manual 3A, Section 11.5.3, provides an algorithm to globally disable the caches:

11.5.3 Preventing Caching

To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills, perform the following steps:

Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag to 0.

Flush all caches using the WBINVD instruction.

Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the uncached memory
type (see the discussion of the discussion of the TYPE field and the E flag in Section 11.11.2.1,
“IA32_MTRR_DEF_TYPE MSR”).

The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency. If the caches are
not flushed, cache hits on reads will still occur and data will be read from valid cache lines.

The intent of the three separate steps listed above addresses three distinct requirements: (i) discontinue new data
replacing existing data in the cache (ii) ensure data already in the cache are evicted to memory, (iii) ensure subsequent memory references observe UC memory type semantics. Different processor implementation of caching
control hardware may allow some variation of software implementation of these three requirements. See note below.

NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching behaviour as indicated
in Table 11-5, but setting the CD flag alone may not be sufficient across all processor families to
force the effective memory type for all physical memory to be UC nor does it force strict memory
ordering, due to hardware implementation variations across different processor families. To force
the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either
program the MTRRs for all physical memory to be UC memory type or disable all MTRRs.

For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been
executed, the cache lines containing the code between the end of the WBINVD instruction and
before the MTRRS have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second WBINVD instruction must be executed after the
MTRRs have been disabled.

That's a long quote but it boils down to this code

;Step 1 - Enter no-fill mode
mov eax, cr0
or eax, 1<<30        ; Set bit CD
and eax, ~(1<<29)    ; Clear bit NW
mov cr0, eax

;Step 2 - Invalidate all the caches
wbinvd

;All memory accesses happen from/to memory now, but UC memory ordering may not be enforced still.  

;For Atom processors, we are done, UC semantic is automatically enforced.

xor eax, eax
xor edx, edx
mov ecx, IA32_MTRR_DEF_TYPE    ;MSR number is 2FFH
wrmsr

;P4 only, remove this code from the L1I
wbinvd

most of which is not executable from user mode.

AMD's manual 2 provides a similar algorithm in section 7.6.2

7.6.2 Cache Control Mechanisms

The AMD64 architecture provides a number of mechanisms for controlling the cacheability of memory. These are described in the following sections.

Cache Disable. Bit 30 of the CR0 register is the cache-disable bit, CR0.CD. Caching is enabled
when CR0.CD is cleared to 0, and caching is disabled when CR0.CD is set to 1. When caching is
disabled, reads and writes access main memory.

Software can disable the cache while the cache still holds valid data (or instructions). If a read or write
hits the L1 data cache or the L2 cache when CR0.CD=1, the processor does the following:

Writes the cache line back if it is in the modified or owned state.

Invalidates the cache line.

Performs a non-cacheable main-memory access to read or write the data.

If an instruction fetch hits the L1 instruction cache when CR0.CD=1, some processor models may read
the cached instructions rather than access main memory. When CR0.CD=1, the exact behavior of L2
and L3 caches is model-dependent, and may vary for different types of memory accesses.

The processor also responds to cache probes when CR0.CD=1. Probes that hit the cache cause the
processor to perform Step 1. Step 2 (cache-line invalidation) is performed only if the probe is
performed on behalf of a memory write or an exclusive read.

Writethrough Disable. Bit 29 of the CR0 register is the not writethrough disable bit, CR0.NW. In
early x86 processors, CR0.NW is used to control cache writethrough behavior, and the combination of
CR0.NW and CR0.CD determines the cache operating mode.

[...]

In implementations of the AMD64 architecture, CR0.NW is not used to qualify the cache operating
mode established by CR0.CD.

This translates to this code (very similar to the Intel's one):

;Step 1 - Disable the caches
mov eax, cr0
or eax, 1<<30
mov cr0, eax

;For some models we need to invalidated the L1I
wbinvd

;Step 2 - Disable speculative accesses
xor eax, eax
xor edx, edx
mov ecx, MTRRdefType  ;MSR number is 2FFH
wrmsr

Caches can also be selectively disabled at:

Page level, with the attribute bits PCD (Page Cache Disable) [Only for Pentium Pro and Pentium II].

When both are clear the MTTR of relevance is used, if PCD is set the aching
Page level, with the PAT (Page Attribute Table) mechanism.

By filling the IA32_PAT with caching types and using the bits PAT, PCD, PWT as a 3-bit index it's possible to select one the six caching types (UC-, UC, WC, WT, WP, WB).
Using the MTTRs (fixed or variable).

By setting the caching type to UC or UC- for specific physical areas.

Of these options only the page attributes can be exposed to user mode programs (see for example this).

How can i disable cpu cache for certain memory regions?

That comment about non-caching doesn't mean what you think it means, and where it is used, it isn't usually a user-accessible feature. That is, CPU cache control is typically a privileged operation.

That said...

-- A normal user program can be build with functions who's attributes are "hot" or "cold" to let the compiler tell the loader to group the functions in ways that will utilize the cache most usefully.

-- A normal program can use the madvise() function in linux to tell the paging function various things, including the fact that the memory just used is or is not likely to be used again soon.

-- The kernel itself uses the Memory Type Range Regesters (mtrr) and Page Attribute Table (pat) flags in later kernels, to tell the hardware that particular ranges of memory (such as the memory mapped display buffer, and the various parts of the PCI bus) are not to be cached.

"Normal Data™" such as you are likely to use in any C program will essentially never benefit from marking any of its data not cache-worthy. The performance improvement that not-cached data enjoys is the subsequent absence of the various cache-flush and memory barrier operations that memory mapped devices and display buffers would need almost constantly. Laying a cache over a memory mapped device, for example, would require a cache invalidate command before every read and a cache forced write command after every single write to make sure that the reads and writes happen at the exact moment needed. This would "poison" the cache usage, using up and instantly discarding cache lines (a physically limited resource) in a most unfriendly and unhelpful way.

In the rare case that you write a program that gains access to one of these cache harmful regions -- such as if you wrote part of the X display server on a linux system -- the kernel would have already set the registers for the device and the non-cache behavior would be transparent to you.

There is effectively no time where your normal application grade program is going to benefit from any ability to mark a variable as harmful to cache beyond the various madvise() type of usage.

Even then, the cases were you could gain any benefit are so rare that if you'd ever acutally run into one, the problem set would have included the need and methodology as part of your research and you'd have been told how and why so explicitly you'd never have needed to ask this question.

To go back to the same example again, if you'd been writing the necessary driver, when you'd been reading up on the display adapter hardware or the PCI bus the various flags and techniques would have been documented and discussed in the hardware guide.

There are ways to pull off cache ejection and such from user space with things like the CLCLEAR instruction on an intel platform. These techniques will not improve general performance.

Since it's a privileged operation on a Linux system, you could write a kernel driver that acquired and marked a region of memory as uncacheable and then let you map it into your application. But the need for such a region is so rare, and so likely to be misused, that there isn't a normal methodology for doing it in place.

So how do you do it? You don't, at least not the you that you are today. When you become a kernel driver writer with an intimate specialty knowledge of multi-threaded code and data synchronization issues, you'll know how you could do it, and at that point you'll know why you don't want to except as a last resort.

TL;DR :: because of the way linux uses and manages data and code, there is never a benefit for marking any part of a normal application as uncacheable that doesn't cause more heartbreak than it saves. As such, there is no unprivileged API for doing this.

P.S. Also, that said, someone already pointed to things that lead to this article http://lwn.net/Articles/255364/ which covers ways to make your program very cache friendly and some of the ways that you can do some cache bypass operations very cheaply. For instance use of memset() tends to go around the cache while setting memory, and some operations can "stream past" the cache. This isn't the same thing as what you ask, but once you understand all of that article you'll have a much better understanding of why marking a region of memory as uncachable is usually, as the Jedi say, not the solution you are looking for.

Disabling cache in a kernel module on a Raspberry Pi 3?

If it's just that caching shouldn't be forced here, but is allowed, you could use non-temporal hints to achieve this:

6.3.8 Non-temporal load and store pair
A new concept in ARMv8 is the non-temporal load and store. These are the LDNP and STNP instructions that perform a read or write of a pair of register values. They also give a hint to the memory system that caching is not useful for this data. The hint does not prohibit memory system activity such as caching of the address, preload, or gathering. However, it indicates that caching is unlikely to increase performance. A typical use case might be streaming data, but take note that effective use of these instructions requires an approach specific to the
microarchitecture.
From The Programmer’s Guide for ARMv8-A.

The syntax of this two instructions is

LDNP <Wt1>, <Xt2>, [<Xn|SP|{, #<imm>}]
LDNP <Xt1>, <Xt2>, [<Xn|SP|{, #<imm>}]

These are instructions on register pairs, but the immediate can also have the width of a single register (then simply use WZR or XZR as the second register). As the description says, it may still be that the cache is used here. However, this method can be used directly from user mode.

From a kernel module there would also be the option of marking the corresponding memory area as non-cacheable. This is normally used for memory-mapped I/O devices, in which current values should always be fetched directly from the memory. That should be configured in the MMU translation tables. This seems to be the best solution for your case to me, but it requires some knowledge of the MMU and the paging mechanism (and therefore I would not like to describe it further here).

Finally there would be the possibility to deactivate the entire cache, but there is already a post about this and the consequences here: Disable CPU caches (L1/L2) on ARMv8-A Linux.

How to Disable CPU Cache (L1/L2) on a Linux System

Disable CPU caches (L1/L2) on ARMv8-A Linux

How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?

How can i disable cpu cache for certain memory regions?

Disabling cache in a kernel module on a Raspberry Pi 3?

Related Topics

Leave a reply