Disable CPU Caches (L1/L2) on Armv8-A Linux

In general, this is unworkable, for several reasons.

Firstly, clearing the SCTLR.C bit only makes all data accesses non-cacheable and prevents allocation into any caches. Any data already in the caches stays there, especially dirty lines from anything recently written; consider what happens when your function returns and the caller tries to restore a stack frame which doesn't even exist in the memory it's now accessing.
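For reference, a minimal sketch of that step as seen from EL1 (e.g. in a kernel module), using inline assembly; the function name is made up, and the point is precisely that nothing in it deals with the lines already sitting in the caches:

/* Minimal sketch, assuming EL1; illustrative only. Clearing SCTLR_EL1.C
 * makes data accesses non-cacheable, but does not clean or invalidate
 * anything already in the caches. */
static inline void clear_sctlr_el1_c_bit(void)
{
    unsigned long sctlr;

    asm volatile("mrs %0, sctlr_el1" : "=r" (sctlr));
    sctlr &= ~(1UL << 2);                 /* SCTLR_EL1.C: data cacheability */
    asm volatile("msr sctlr_el1, %0\n\t"
                 "isb"                    /* make the new setting visible */
                 : : "r" (sctlr) : "memory");
}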

Secondly, there are very few uniprocessor ARMv8 systems. Assuming you're running SMP Linux and suddenly disable the caches on whichever CPU the module loader happened to be scheduled on, then even disregarding the first point, things are going to go downhill very fast. Linux expects all CPUs to be coherent with each other, and will typically become very broken very rapidly if that assumption is violated. It's not even worth venturing into SMP cross-calling for this; suffice to say the only safe way to even attempt to run Linux with caches disabled is to make sure they are never enabled to begin with, except...

Thirdly, there is no guarantee Linux will even run with caches disabled. On current hardware, all of the locking and atomic operations in the kernel (not to mention userspace) rely on the exclusive access instructions. Whilst the CPU cluster(s) will implement the architecturally-required local and global exclusive monitors for cacheable memory (usually as part of the cache machinery itself), it is system-dependent whether a global exclusive monitor for non-cacheable accesses is implemented, as such a thing must be external to the CPU (usually in the interconnect or memory controller). Many systems don't implement such a global monitor, in which case exclusive accesses to external memory may fault, do nothing, or exhibit various other implementation-defined behaviours, all of which will result in Linux crashing or deadlocking. It is effectively impossible to run Linux with the cache off on such a system; the amount of hacking needed just to get a UP arm64 kernel to work (SMP would be literally impossible) would be impractical alone, and good luck with userspace.
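To make the dependency concrete, here is a hedged sketch (not actual kernel code) of the LDXR/STXR retry loop that arm64 locking and atomics reduce to when LSE atomics are not in use; with the location mapped non-cacheable, whether the STXR can ever succeed depends entirely on that external global monitor existing:

/* Illustrative only: a typical exclusive-access retry loop. */
static inline void atomic_inc_exclusive(long *p)
{
    long tmp;
    int status;

    asm volatile(
        "1: ldxr  %0, [%2]\n\t"           /* load-exclusive */
        "   add   %0, %0, #1\n\t"
        "   stxr  %w1, %0, [%2]\n\t"      /* store-exclusive, status in %w1 */
        "   cbnz  %w1, 1b"                /* retry if the exclusive failed */
        : "=&r" (tmp), "=&r" (status)
        : "r" (p)
        : "memory");
}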

As it happens, though, the worst problem is none of that, it's this:

...in order to measure performance of optimized code, independent of cache access.

If the code is intended to run in deployment with caches disabled, then logically it can't be intended to run under Linux, therefore the effort spent in hacking up Linux would be better spent on benchmarking in a more realistic execution environment so that results are actually representative. On the other hand, if it is intended to run with caches enabled (under Linux or any other OS), then benchmarking with caches disabled will give meaningless results and be a waste of time. "Optimising" for e.g. an instruction-fetch-bandwidth bottleneck which won't exist in practice is not going to lead you in the right direction.

How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?

Intel's Software Developer's Manual, Volume 3A, Section 11.5.3, provides an algorithm for globally disabling the caches:

11.5.3 Preventing Caching

To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills, perform the following steps:

  1. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag to 0.)
  2. Flush all caches using the WBINVD instruction.
  3. Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the uncached memory
    type (see the discussion of the TYPE field and the E flag in Section 11.11.2.1,
    “IA32_MTRR_DEF_TYPE MSR”).

The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency. If the caches are
not flushed, cache hits on reads will still occur and data will be read from valid cache lines.

The intent of the three separate steps listed above addresses three distinct requirements: (i) discontinue new data
replacing existing data in the cache (ii) ensure data already in the cache are evicted to memory, (iii) ensure subsequent memory references observe UC memory type semantics. Different processor implementation of caching
control hardware may allow some variation of software implementation of these three requirements. See note below.

NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching behaviour as indicated
in Table 11-5, but setting the CD flag alone may not be sufficient across all processor families to
force the effective memory type for all physical memory to be UC nor does it force strict memory
ordering, due to hardware implementation variations across different processor families. To force
the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either
program the MTRRs for all physical memory to be UC memory type or disable all MTRRs.

For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been
executed, the cache lines containing the code between the end of the WBINVD instruction and
before the MTRRs have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second WBINVD instruction must be executed after the
MTRRs have been disabled.

That's a long quote, but it boils down to this code:

;Step 1 - Enter no-fill mode
mov eax, cr0
or eax, 1<<30 ; Set bit CD
and eax, ~(1<<29) ; Clear bit NW
mov cr0, eax

;Step 2 - Invalidate all the caches
wbinvd

;All memory accesses now go to/from memory, but UC memory ordering may still not be enforced.

;For Atom processors we are done: UC semantics are enforced automatically.

;Step 3 - Disable all MTRRs and set the default memory type to UC
xor eax, eax
xor edx, edx
mov ecx, IA32_MTRR_DEF_TYPE ;MSR number is 2FFH
wrmsr

;P4 only, remove this code from the L1I
wbinvd

Most of this is not executable from user mode.
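If you were to attempt it anyway, a rough sketch of the same sequence inside an x86 Linux kernel module might look like the following; the helpers are standard kernel APIs, but the function itself is only an illustration (interrupt/preemption handling is omitted, it only affects the CPU it runs on, and doing this on a live SMP system invites the same trouble discussed in the ARM answer above):

#include <linux/smp.h>            /* on_each_cpu */
#include <asm/special_insns.h>    /* read_cr0, write_cr0, wbinvd */
#include <asm/msr.h>              /* wrmsr */
#include <asm/msr-index.h>        /* MSR_MTRRdefType */
#include <asm/processor-flags.h>  /* X86_CR0_CD, X86_CR0_NW */

/* Illustrative sketch: runs the Intel sequence on the calling CPU. */
static void disable_caches_this_cpu(void *unused)
{
    unsigned long cr0 = read_cr0();

    write_cr0((cr0 | X86_CR0_CD) & ~X86_CR0_NW);  /* Step 1: no-fill mode */
    wbinvd();                                     /* Step 2: flush and invalidate */
    wrmsr(MSR_MTRRdefType, 0, 0);                 /* Step 3: MTRRs off, default type UC */
    wbinvd();                                     /* second flush, per the P4 note */
}

/* Could be run on every CPU with on_each_cpu(disable_caches_this_cpu, NULL, 1). */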


AMD's manual, Volume 2, provides a similar algorithm in Section 7.6.2:

7.6.2 Cache Control Mechanisms
The AMD64 architecture provides a number of mechanisms for controlling the cacheability of memory. These are described in the following sections.

Cache Disable. Bit 30 of the CR0 register is the cache-disable bit, CR0.CD. Caching is enabled
when CR0.CD is cleared to 0, and caching is disabled when CR0.CD is set to 1. When caching is
disabled, reads and writes access main memory.

Software can disable the cache while the cache still holds valid data (or instructions). If a read or write
hits the L1 data cache or the L2 cache when CR0.CD=1, the processor does the following:

  1. Writes the cache line back if it is in the modified or owned state.
  2. Invalidates the cache line.
  3. Performs a non-cacheable main-memory access to read or write the data.

If an instruction fetch hits the L1 instruction cache when CR0.CD=1, some processor models may read
the cached instructions rather than access main memory. When CR0.CD=1, the exact behavior of L2
and L3 caches is model-dependent, and may vary for different types of memory accesses.

The processor also responds to cache probes when CR0.CD=1. Probes that hit the cache cause the
processor to perform Step 1. Step 2 (cache-line invalidation) is performed only if the probe is
performed on behalf of a memory write or an exclusive read.

Writethrough Disable. Bit 29 of the CR0 register is the not writethrough disable bit, CR0.NW. In
early x86 processors, CR0.NW is used to control cache writethrough behavior, and the combination of
CR0.NW and CR0.CD determines the cache operating mode.

[...]

In implementations of the AMD64 architecture, CR0.NW is not used to qualify the cache operating
mode established by CR0.CD.

This translates to this code (very similar to the Intel one):

;Step 1 - Disable the caches
mov eax, cr0
or eax, 1<<30
mov cr0, eax

;For some models we need to invalidate the L1I
wbinvd

;Step 2 - Disable the MTRRs and set the default memory type to UC
xor eax, eax
xor edx, edx
mov ecx, MTRRdefType ;MSR number is 2FFH
wrmsr

Caches can also be selectively disabled at:

  • Page level, with the attribute bits PCD (Page Cache Disable) and PWT (Page Write-Through) [only for the Pentium Pro and Pentium II, which lack the PAT].

    When both are clear the MTRR of relevance is used; if PCD is set the access is treated as uncacheable.
  • Page level, with the PAT (Page Attribute Table) mechanism.

    By filling the IA32_PAT with caching types and using the bits PAT, PCD, PWT as a 3-bit index it's possible to select one of the six caching types (UC-, UC, WC, WT, WP, WB); a small sketch of the index computation is shown below.
  • Using the MTRRs (fixed or variable).

    By setting the caching type to UC or UC- for specific physical areas.

Of these options only the page attributes can be exposed to user mode programs (see for example this).
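As a small illustration of the PAT indexing mentioned in the list above (the table contents follow the power-on default described in the SDM, but firmware or the OS normally reprograms IA32_PAT, so treat them as an assumption for the example):

#include <stdio.h>

/* The 3-bit index PAT<<2 | PCD<<1 | PWT selects one of the eight
 * IA32_PAT entries; each entry holds one of the caching types. */
static unsigned pat_index(unsigned pat, unsigned pcd, unsigned pwt)
{
    return (pat << 2) | (pcd << 1) | pwt;   /* 0..7 */
}

int main(void)
{
    /* Power-on default layout; usually reprogrammed by firmware/OS. */
    const char *entry[8] = { "WB", "WT", "UC-", "UC", "WB", "WT", "UC-", "UC" };
    unsigned idx = pat_index(0, 1, 1);       /* PAT=0, PCD=1, PWT=1 */

    printf("PAT=0 PCD=1 PWT=1 -> entry %u = %s\n", idx, entry[idx]);
    return 0;
}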

Disabling cache in a kernel module on a Raspberry Pi 3?

If it's just that caching shouldn't be forced here, but is allowed, you could use non-temporal hints to achieve this:

6.3.8 Non-temporal load and store pair

A new concept in ARMv8 is the non-temporal load and store. These are the LDNP and STNP instructions that perform a read or write of a pair of register values. They also give a hint to the memory system that caching is not useful for this data. The hint does not prohibit memory system activity such as caching of the address, preload, or gathering. However, it indicates that caching is unlikely to increase performance. A typical use case might be streaming data, but take note that effective use of these instructions requires an approach specific to the
microarchitecture.

From The Programmer’s Guide for ARMv8-A.

The syntax of these two instructions is:

LDNP <Wt1>, <Wt2>, [<Xn|SP>{, #<imm>}]
LDNP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}]

These instructions operate on register pairs, but if only a single register's worth of data is needed, WZR or XZR can simply be used as the second register. As the description says, the data may still end up in the cache. However, this method can be used directly from user mode.
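A minimal user-mode wrapper around LDNP might look like this (purely illustrative; as noted above, the hint does not guarantee that the data bypasses the caches):

#include <stdint.h>

/* Non-temporal 16-byte load of a pair of 64-bit values. */
static inline void load_pair_nontemporal(const void *addr,
                                         uint64_t *lo, uint64_t *hi)
{
    uint64_t a, b;

    asm volatile("ldnp %0, %1, [%2]"
                 : "=&r" (a), "=&r" (b)
                 : "r" (addr)
                 : "memory");
    *lo = a;
    *hi = b;
}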


From a kernel module there would also be the option of marking the corresponding memory area as non-cacheable. This is normally used for memory-mapped I/O devices, where current values should always be fetched directly from memory, and it is configured in the MMU translation tables. This seems to me to be the best solution for your case, but it requires some knowledge of the MMU and the paging mechanism (and therefore I will not describe it further here).

Finally there would be the possibility to deactivate the entire cache, but there is already a post about this and the consequences here: Disable CPU caches (L1/L2) on ARMv8-A Linux.

cpu hotplug - is there a system call to disable a cpu in linux?

There is no syscall for disabling a CPU in Linux. The article you found describes the only method, but you can rewrite the shell script in C like this:

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void set_cpu_online(int cpu, int online)
{
    int fd;
    ssize_t ret;
    char path[256];

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", cpu);

    fd = open(path, O_RDWR);
    assert(fd >= 0);

    /* Write "1" to bring the CPU online, "0" to take it offline. */
    ret = write(fd, online ? "1" : "0", 1);
    assert(ret == 1);

    close(fd);
}
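For example, set_cpu_online(2, 0) has the same effect as echo 0 > /sys/devices/system/cpu/cpu2/online from the shell, and calling it with online set to 1 brings the CPU back.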

ARM Bootloader: Disable MMU and Caches

The last question is: why is the D-cache disabled but the I-cache enabled? Is it to speed up instruction fetching?

The MMU has settings to determine which memory regions are cacheable or not. If you do not have the MMU on but you do have the data cache on (where that is even possible), then you cannot safely talk to peripherals. If you read the UART status register, for example, that read goes through the cache just like any other data access; whatever that status was stays in the cache for subsequent reads until the line is evicted, at which point you get one more shot at the actual register.

Say you have code that polls the UART status register waiting for a character in the RX buffer. If the first read shows no character, that status goes into the cache and you remain in the loop forever: you never talk to the status register again, you simply keep getting the cached copy. If there was a character, that status also gets cached; you read the RX register, perhaps do something, and when you come back, if the status has not been evicted from the data cache, you get the stale status showing a character is available. The RX buffer read may or may not also be cached, so you may get a stale value from the cache, a stale value or whatever the peripheral returns when there is no new data, or possibly a new value; but what you do not get in these situations is proper access to the peripheral.

When the MMU is on, you use the MMU to mark the address space used by that peripheral as non-(data)-cacheable, and you do not have this problem. With the MMU off, you need the data cache off on ARM systems.
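To make that concrete, here is the kind of polling loop being described, as a deliberately simplified sketch; the register names, addresses and bit positions are made up, so substitute whatever your SoC actually documents:

#include <stdint.h>

#define UART_BASE 0x09000000u          /* hypothetical peripheral base */
#define UART_FR   (UART_BASE + 0x18u)  /* hypothetical flag register */
#define UART_DR   (UART_BASE + 0x00u)  /* hypothetical data register */
#define RXFE      (1u << 4)            /* hypothetical "RX FIFO empty" bit */

static inline uint32_t mmio_read32(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}

char uart_getc(void)
{
    /* With the D-cache on and the MMU off, the first read of UART_FR
     * can be cached, and this loop then spins on the stale copy. */
    while (mmio_read32(UART_FR) & RXFE)
        ;
    return (char)mmio_read32(UART_DR);
}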

Leaving the I-cache on is okay because instruction fetches only read instructions. For a bare-metal application that is fine; it helps, for example, if you are using a flash with a potential for read disturb (SPI or I2C flashes). The problem is that this application is a bootloader, so you must take some extra care. Say your bootloader has some code at address 0x8000 that it runs through at least once; then you use it as a bootloader, with the bootloader itself sitting at, say, 0x10000000, allowing you to load a new program at 0x8000. That load uses data accesses, so it does not go through the instruction cache. There is therefore a potential that the instruction cache still holds some or all of the code from the last time you executed in the 0x8000 area, and when you branch to the bootloaded code at 0x8000 you will get either the old program from the cache or a nasty mixture of old and new program, depending on which parts are cached. So if your bootloader allows the I-cache to be on, you need to invalidate the I-cache before branching to the bootloaded code.
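A minimal sketch of that invalidation, assuming an AArch64 bare-metal bootloader with the D-cache off (on other ARM architectures the exact cache-maintenance instructions differ):

/* Invalidate the whole instruction cache before branching to freshly
 * loaded code, so stale lines from the previous image cannot run. */
static inline void invalidate_icache_all(void)
{
    asm volatile("ic iallu\n\t"   /* invalidate all I-cache to PoU */
                 "dsb sy\n\t"     /* wait for the invalidation to complete */
                 "isb"            /* resynchronize the pipeline */
                 ::: "memory");
}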

Lastly, if you or anyone using this bootloader wants to use JTAG, you have the same problem but worse: data cycles that do not go through the I-cache are used to write the new program to RAM, and when you tell the JTAG debugger to run the new program you will get 1) only the new program, 2) a mixture of the new program and old program fragments from the cache, or 3) the old program from the cache.

So the D-cache is bad without an MMU because of the things that are not RAM: peripherals and so on. The I-cache is a use-at-your-own-risk kind of thing, which you can mitigate except for the times that JTAG is used for debugging.

If you have concerns about, or have confirmed, read disturb in your (external) flash, then I recommend turning on the I-cache, using a tight loop to copy your application to RAM, branching to the RAM copy and running there, then turning off the I-cache (or leaving it on at your own risk) and not touching the flash again, certainly not with heavy read accesses to small areas. A tight UART polling loop, like you might have for a command-line parser, is a really good place to get hit by read disturb.

How to put data in L2 cache with A72 Core?

How to do the same thing but in L2 Cache area ?

Your hardware doesn't support that.

In general, the ARMv8 architecture doesn't make any guarantees about the contents of caches and does not provide any means to explicitly manipulate or query them - it only makes guarantees and provides tools for dealing with coherency.

Specifically, from section D4.4.1 "General behavior of the caches" of the spec:

[...] the architecture cannot guarantee whether:

• A memory location present in the cache remains in the cache.
• A memory location not present in the cache is brought into the cache.

Instead, the following principles apply to the behavior of caches:

• The architecture has a concept of an entry locked down in the cache.
How lockdown is achieved is IMPLEMENTATION DEFINED, and lockdown might
not be supported by:

— A particular implementation.
— Some memory attributes.

• An unlocked entry in a cache might not remain in that cache. The
architecture does not guarantee that an unlocked cache entry remains in
the cache or remains incoherent with the rest of memory. Software must
not assume that an unlocked item that remains in the cache remains dirty.

• A locked entry in a cache is guaranteed to remain in that cache. The
architecture does not guarantee that a locked cache entry remains
incoherent with the rest of memory, that is, it might not remain dirty.

[...]

• Any memory location is not guaranteed to remain incoherent with the rest of memory.

So basically you want cache lockdown. Consulting the manual of your CPU though:

• The Cortex-A72 processor does not support TLB or cache lockdown.

So you can't put something in cache on purpose. Now, you might be able to tell whether something has been cached by trying to observe side effects. The two common side effects of caches are latency and coherency. So you could try and time access times or modify the contents of DRAM and check whether you see that change in your cached mapping... but that's still a terrible idea.

For one, both of these are destructive operations, meaning they will change the property you're measuring, by measuring it. And for another, just because you observe them once does not mean you can rely on that happening.
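If you do want to experiment, a rough user-space sketch of the timing idea might look like this; treat the output as anecdote rather than evidence, since the measurement itself perturbs what it measures:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    static volatile uint64_t data[1024];
    uint64_t t0, t1, t2;

    t0 = now_ns();
    (void)data[512];               /* first touch: likely a miss */
    t1 = now_ns();
    (void)data[512];               /* immediate re-read: likely a hit */
    t2 = now_ns();

    printf("first access: %llu ns, second: %llu ns\n",
           (unsigned long long)(t1 - t0), (unsigned long long)(t2 - t1));
    return 0;
}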

Bottom line: you cannot guarantee that something is held in any particular cache by the time you use it.


