How to Check Whether the Processor Cache Has Been Flushed Recently

Is there a way to check whether the processor cache has been flushed recently?

Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.

This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.

Here is the output:


took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks

You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.

If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.

Source code:

#include <stdio.h>
#include <stdint.h>

inline void
clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

inline uint64_t
rdtsc()
{
    unsigned long a, d;
    asm volatile ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((uint64_t)d << 32);
}

volatile int i;

inline void
test()
{
    uint64_t start, end;
    volatile int j;

    start = rdtsc();
    j = i;
    end = rdtsc();
    printf("took %lu ticks\n", end - start);
}

int
main(int ac, char **av)
{
    test();
    test();
    printf("flush: ");
    clflush(&i);
    test();
    test();
    return 0;
}

How to flush the CPU cache for a region of address space in Linux?

Check this page for list of available flushing methods in linux kernel: https://www.kernel.org/doc/Documentation/cachetlb.txt

Cache and TLB Flushing Under Linux. David S. Miller

There are set of range flushing functions

2) flush_cache_range(vma, start, end);
   change_range_of_page_tables(mm, start, end);
   flush_tlb_range(vma, start, end);

3) void flush_cache_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)

Here we are flushing a specific range of (user) virtual
addresses from the cache.  After running, there will be no
entries in the cache for 'vma->vm_mm' for virtual addresses in
the range 'start' to 'end-1'.

You can also check implementation of the function - http://lxr.free-electrons.com/ident?a=sh;i=flush_cache_range

For example, in arm - http://lxr.free-electrons.com/source/arch/arm/mm/flush.c?a=sh&v=3.13#L67

 67 void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 68 {
 69         if (cache_is_vivt()) {
 70                 vivt_flush_cache_range(vma, start, end);
 71                 return;
 72         }
 73 
 74         if (cache_is_vipt_aliasing()) {
 75                 asm(    "mcr    p15, 0, %0, c7, c14, 0\n"
 76                 "       mcr     p15, 0, %0, c7, c10, 4"
 77                     :
 78                     : "r" (0)
 79                     : "cc");
 80         }
 81 
 82         if (vma->vm_flags & VM_EXEC)
 83                 __flush_icache_all();
 84 }

Are CPU caches flushed to memory during I/O?

On x86 and x64 series, all caches are coherent and you'll get 42 on T2 even if the two threads do not share the same cache unit.

Your thought experiment can be further reduced to 2 cases: the two threads share the same cache unit (multi-core) or do not share (multi-cpu).

When they share a cache unit, both T1 and T2 will use the same cache, therefore they will both see 42 without any synchronization to memory.

In case the caches are not shared (i.e., multi-cpu), the ISA requires that the cache units be synchronized, and that will be transparent to software. Both threads will see 42 at the same address. This synchronization introduces some overhead though, therefore nowadays multi-core design is preferred (beside the reason cache is expensive).

In Java, how do I flush the CPU memory cache so it retrieves the latest from the RAM memory?

You generally cannot. Except for performance, the existence of the cache is transparent to applications. This applies to multiprocessor systems as well, as long as the cache system is "cache-coherent", which is practically all platforms in use today.

Thus, your bug lies elsewhere.

Things like volatile and synchronized blocks don't affect the cache coherency per se, but rather register optimizations, use of atomic instructions (including, say, flushing the store buffer, which is not the same as the cache!), and so on. These are the things you should be looking at (well, given the lack of detail in your description, it's hard to say, but as a first guess..), rather than trying to flush the cache.

Is processor cache flushed during context switch in multicore?

Whenever there is any sort of context switch the OS will save its state such that it can be restarted again on any core (assuming it has not been tied to a specific core using a processor affinity function).

Saving a state that for some reason or other is incomplete defeats the entire purpose of threading or processing, therefore the caches are flushed as part of the switch.

Volatile has nothing whatsover to do with context switching. It merely tells the compiler that the memory location is subject to change without notice and therefore must be accessed each time the source code dictates it i e the usual compiler behaviour of optimizing accesses to memory locations does not apply to the volatile-declared location.

AArch64, multi level cache flush, order of level flushing

The ARMv8 Reference Manual says under D4.4.7:

The points to which a cache maintenance instruction can be defined differ depending on whether the instruction operates by VA or by set/way.
For instructions operating by set/way, the point is defined to be to the next level of caching. [...]

So the correct order should be L1, then L2.