Programmatically Counting Cache Faults

It looks like perf_event_open is exactly what I was searching for.

It lets you do interesting things like initializing, enabling, and disabling performance counters and then fetching their values through a uniform and intuitive API (it gives you a special file descriptor which hosts a struct containing the previously requested information).

It is a Linux-only solution, and the available functionality varies depending on the kernel version, so be careful :)
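As a sketch of how this fits together (Linux-only, assuming a reasonably recent kernel and that counting last-level cache read misses for the current process is what you want), the following program sets up one counter, runs the code of interest, and reads the result; error handling is kept minimal:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

static long perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
    // Thin wrapper: glibc does not provide one for this syscall.
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
    perf_event_attr attr{};
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    // Last-level cache, read accesses, misses.
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;        // start disabled, enable explicitly below
    attr.exclude_kernel = 1;  // count user-space events only
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
    if (fd == -1) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... the code you want to measure goes here ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("LL cache read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}

Depending on your distribution you may need to lower /proc/sys/kernel/perf_event_paranoid before unprivileged processes are allowed to open counters.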

What would be the exact code to get the count of last-level cache misses on the Intel Kaby Lake architecture?

You can use perf as Cody suggested to measure the events from outside the code, but I suspect from your code sample that you need fine-grained, programmatic access to the performance counters.

To do that, you need to enable user-mode reading of the counters, and also have a way to program them. Since those are restricted operations, you need at least some help from the OS kernel to do that. Rolling your own solution is going to be pretty difficult, but luckily there are several existing solutions for Ubuntu 16.04:

  • Andi Kleen's jevents library, which among other things lets you read PMU events from user space. I haven't personally used this part of pmu-tools, but the stuff I have used has been high quality. It seems to use the existing perf_events syscalls for counter programming and so doesn't need a kernel module.
  • The libpfc library is a from-scratch implementation of a kernel module and userland code that allows userland reading of the performance counters. I've used this and it works well. You install the kernel module which allows you to program the PMU, and then use the API exposed by libpfc to read the counters from userspace (the calls boil down to rdpmc instructions). It is the most accurate and precise way to read the counters, and it includes "overhead subtraction" functionality which can give you the true PMU counts for the measured region by subtracting out the events caused by the PMU read code itself. You need to pin to a single core for the counts to make sense, and you will get bogus results if your process is interrupted.
  • Intel's open-sourced Processor Counter Monitor library. I haven't tried this on Linux, but I used its predecessor library, the very similarly named¹ Performance Counter Monitor, on Windows, and it worked. On Windows it needs a kernel driver, but on Linux it seems you can either use a driver or have it go through perf_events.
  • Use the likwid library's Marker API functionality. Likwid has been around for a while and seems well supported. I have used likwid in the past, but only to measure whole processes in a manner similar to perf stat, not with the marker API. To use the marker API you still need to run your process as a child of the likwid measurement process, but you can programmatically read the counter values within your process, which is what you need (as I understand it). I'm not sure how likwid sets up and reads the counters when the marker API is used.

So you've got a lot of options! I think all of them could work, but I can personally vouch for libpfc since I've used it myself for the same purpose on Ubuntu 16.04. The project is actively developed and probably the most accurate (least overhead) of the above. So I'd probably start with that one.

All of the solutions above should be able to work for Kaby Lake, since the functionality of each successive "Performance Monitoring Architecture" seems to generally be a superset of the prior one, and the API is generally preserved. In the case of libpfc, however, the author has restricted it to only support Haswell's architecture (PMA v3), but you just need to change one line of code locally to fix that.


¹ Indeed, they are both commonly called by their acronym, PCM, and I suspect that the new project is simply the officially open-sourced continuation of the old PCM project (which was also available in source form, but without a mechanism for community contribution).

Identifying Major Page Fault cause

You may like to consider Valgrind, described on its home page as:

Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.

Specifically Valgrind contains a tool called Massif, for which the following (paraphrased) overview is given in the manual:

Massif is a heap profiler. It measures how much heap memory your program uses. [..]

Heap profiling can help you reduce the amount of memory your program uses. On modern machines with virtual memory, this provides the following benefits:

  • It can speed up your program -- a smaller program will interact better with your machine's caches and avoid paging.

  • If your program uses lots of memory, it will reduce the chance that it exhausts your machine's swap space.

Programmatically get accurate CPU cache hierarchy information on Linux

A full picture of the cache hierarchy can be found programmatically by opening files in /sys (sysfs).

Each "thread" or "logical processor" is represented by a sub-directory in /sys/devices/system/cpu/. Within that directory you'll find a cache directory. For example, cache information for the first logical processor can be found here:

$ ls /sys/devices/system/cpu/cpu0/cache/
index0
index1
index2
index3
power
uevent

Each cache entity associated with that logical processor is represented by an index[0-9]* directory. The number after index does not represent the level. The same cache entity may be listed multiple times under different logical processors. Within these directories you can find all the properties of the cache entity (level, sets, line size, etc).

$ ls /sys/devices/system/cpu/cpu0/cache/index0
coherency_line_size
level
number_of_sets
physical_line_partition
power
shared_cpu_list
shared_cpu_map
size
type
uevent
ways_of_associativity

Full documentation can be found here.

Most importantly, to get the output you want, you'll need to inspect shared_cpu_list:

$ cat /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list
0,28

This will show you what logical processors share this cache entity. By inspecting all entities (/sys/devices/system/cpu/cpu*/cache/index*/), and eliminating duplicates using shared_cpu_list, you can programmatically access all the data you require.
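As an example, here is a short C++17 sketch of that enumerate-and-deduplicate approach (the paths are as shown above; the exact set of attribute files may vary with your kernel version):

#include <filesystem>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <tuple>

namespace fs = std::filesystem;

// Reads the first line of a small sysfs attribute file.
static std::string read_line(const fs::path &p) {
    std::ifstream f(p);
    std::string s;
    std::getline(f, s);
    return s;
}

int main() {
    // The same cache entity is listed under every logical CPU that shares it,
    // so deduplicate on (level, type, shared_cpu_list).
    std::set<std::tuple<std::string, std::string, std::string>> seen;

    for (const auto &cpu : fs::directory_iterator("/sys/devices/system/cpu")) {
        fs::path cache = cpu.path() / "cache";
        if (!fs::is_directory(cache)) continue;  // skips entries like cpufreq, power, ...
        for (const auto &idx : fs::directory_iterator(cache)) {
            if (idx.path().filename().string().rfind("index", 0) != 0) continue;
            std::string level  = read_line(idx.path() / "level");
            std::string type   = read_line(idx.path() / "type");
            std::string shared = read_line(idx.path() / "shared_cpu_list");
            if (!seen.insert({level, type, shared}).second) continue;
            std::cout << "L" << level << " " << type
                      << " size=" << read_line(idx.path() / "size")
                      << " shared_cpu_list=" << shared << "\n";
        }
    }
    return 0;
}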

Note that your hypervisor isn't required to pass along accurate information. This will only show you the cache hierarchy as the guest kernel sees it.

What is a cache-friendly code?

Preliminaries

On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have less than a few dozen registers. At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. To bridge this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.

(The analogy is cache memory is to system memory, as system memory is to hard disk storage. Hard disk storage is super cheap but very slow).

Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.

Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or higher). This will only get worse over time. The increase in processor frequency is currently no longer relevant to increase performance. The problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of die on caches and up to 99% for storing/moving data!

There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:

  • Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.
  • If you are into videos, I strongly recommend having a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).
  • Slides about memory optimization by Christer Ericson (director of technology @ Sony)
  • LWN.net's article "What every programmer should know about memory"

Main concepts for cache-friendly code

A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance to optimize caching:

  1. Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.
  2. Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.

Use appropriate c++ containers

A simple example of cache-friendly versus cache-unfriendly is c++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.

A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to @Mohammad Ali Baydoun for the link!).
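To make the difference concrete, here is a small sketch that sums the same number of elements stored in a std::vector and in a std::list; the absolute numbers depend on your machine and compiler flags, but the contiguous traversal is typically several times faster:

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Sums the container and prints how long the traversal took.
template <typename Container>
void sum_and_time(const Container &c, const char *name) {
    auto t0 = std::chrono::steady_clock::now();
    long long sum = std::accumulate(c.begin(), c.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::cout << name << ": sum=" << sum << " in " << ms << " ms\n";
}

int main() {
    const int n = 10000000;
    std::vector<int> v(n, 1); // contiguous storage: sequential, prefetcher-friendly
    std::list<int>   l(n, 1); // one heap node per element: pointer chasing
    sum_and_time(v, "std::vector");
    sum_and_time(l, "std::list");
    return 0;
}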

Don't neglect the cache in data structure and algorithm design

Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).
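As an illustration of cache blocking, here is a sketch of a tiled matrix multiplication; the tile size of 64 is an assumption you would tune to your cache sizes:

#include <algorithm>
#include <vector>

// C += A * B for row-major n x n matrices, processed in BLOCK x BLOCK tiles
// so that the working set of each tile stays resident in cache.
void matmul_blocked(const std::vector<double> &A, const std::vector<double> &B,
                    std::vector<double> &C, int n) {
    const int BLOCK = 64; // tile size: tune so the tiles of A, B and C fit in cache
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
                    for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
                        double a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

int main() {
    const int n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    matmul_blocked(A, B, C, n);
    return (C[0] == n) ? 0 : 1; // each entry of C should equal n
}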

Know and exploit the implicit structure of data

Another simple example, which many people in the field sometimes forget, is column-major (ex. fortran, matlab) vs. row-major (ex. c, c++) ordering for storing two-dimensional arrays. For example, consider the following matrix:

1 2
3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.

When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).

For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):

Exploiting the ordering (e.g. changing column index first in c++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. changing row index first in c++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses

In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.
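Here is a self-contained sketch of the same idea on a larger row-major matrix; the row-wise loop (column index changing fastest) is the cache-friendly variant in c/c++, and on typical hardware it is several times faster than the column-wise loop:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const int N = 4096;
    std::vector<int> M(static_cast<std::size_t>(N) * N, 1); // row-major N x N matrix

    auto t0 = std::chrono::steady_clock::now();
    long long rowSum = 0;
    for (int i = 0; i < N; ++i)       // row index outer
        for (int j = 0; j < N; ++j)   // column index changes fastest: sequential memory
            rowSum += M[static_cast<std::size_t>(i) * N + j];
    auto t1 = std::chrono::steady_clock::now();

    long long colSum = 0;
    for (int j = 0; j < N; ++j)       // column index outer
        for (int i = 0; i < N; ++i)   // jumps N ints per access: a new cache line each time
            colSum += M[static_cast<std::size_t>(i) * N + j];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "row-wise:    " << std::chrono::duration_cast<ms>(t1 - t0).count()
              << " ms (sum=" << rowSum << ")\n";
    std::cout << "column-wise: " << std::chrono::duration_cast<ms>(t2 - t1).count()
              << " ms (sum=" << colSum << ")\n";
    return 0;
}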

Avoid unpredictable branches

Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.

This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
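As a small illustration (not the benchmark from the linked question), the following sketch sums values behind a data-dependent branch and then branchlessly; with random data the branchy loop tends to be much slower, although an optimizing compiler may itself replace the branch with a conditional move or vectorize both loops:

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> data(1 << 24);
    for (int &x : data) x = std::rand() % 256; // unpredictable values in [0, 255]

    auto t0 = std::chrono::steady_clock::now();
    long long branchy = 0;
    for (int x : data)
        if (x >= 128)                 // taken ~50% of the time, in no predictable pattern
            branchy += x;
    auto t1 = std::chrono::steady_clock::now();

    long long branchless = 0;
    for (int x : data)
        branchless += (x >= 128) * x; // same result, no control-flow dependence on the data
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "branchy:    " << std::chrono::duration_cast<ms>(t1 - t0).count()
              << " ms (sum=" << branchy << ")\n";
    std::cout << "branchless: " << std::chrono::duration_cast<ms>(t2 - t1).count()
              << " ms (sum=" << branchless << ")\n";
    return 0;
}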

Avoid virtual functions

In the context of c++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during lookup, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?

Common problems

A common problem in modern multiprocessor architectures is called false sharing. This occurs when different processors each work on their own data, but that data happens to reside in the same cache line. Every write by one processor invalidates the line in the other processors' caches, causing the cache line -- which also contains data another processor needs -- to be transferred back and forth again and again. Effectively, different threads make each other wait by inducing cache misses in this situation.
See also (thanks to @Matt for the link): How and when to align to cache line size?
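A minimal sketch of false sharing (compile with -pthread): two threads hammer counters that most likely sit in the same cache line, then counters padded onto separate lines; the padded version is usually markedly faster. Whether the first pair really shares a line depends on stack layout, so treat this purely as an illustration:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Each Padded object occupies its own 64-byte cache line (64 is a common line size).
struct Padded {
    alignas(64) std::atomic<long> value{0};
};

int main() {
    const long iters = 50000000;

    // Case 1: two counters that most likely share one cache line.
    std::atomic<long> a{0}, b{0};
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i) a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    auto mid = std::chrono::steady_clock::now();

    // Case 2: the same work on counters padded onto separate cache lines.
    Padded pa, pb;
    std::thread t3([&] { for (long i = 0; i < iters; ++i) pa.value.fetch_add(1, std::memory_order_relaxed); });
    std::thread t4([&] { for (long i = 0; i < iters; ++i) pb.value.fetch_add(1, std::memory_order_relaxed); });
    t3.join(); t4.join();
    auto end = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "shared line:    " << std::chrono::duration_cast<ms>(mid - t0).count() << " ms\n";
    std::cout << "separate lines: " << std::chrono::duration_cast<ms>(end - mid).count() << " ms\n";
    return 0;
}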

An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not currently resident in physical memory) which require disk access.

C Program to determine Levels & Size of Cache

The time it takes to measure your time (that is, the time just to call the clock() function) is many many (many many many....) times greater than the time it takes to perform arr[(i*16)&lengthMod]++. This extremely low signal-to-noise ratio (among other likely pitfalls) makes your plan unworkable. A large part of the problem is that you're trying to measure a single iteration of the loop; the sample code you linked is attempting to measure a full set of iterations (read the clock before starting the loop; read it again after emerging from the loop; do not use printf() inside the loop).

If your loop is large enough you might be able to overcome the signal-to-noise ratio problem.

As to "what element is being incremented": arr is the address of a buffer of 1024 * 1024 ints; arr[(i * 16) & lengthMod]++; causes (i * 16) & lengthMod to generate an offset from that address; that offset (scaled by sizeof(int) and added to arr) is the address of the int that gets incremented. You're performing a shift (i * 16 will turn into i << 4), a logical and, an addition, then either a read/add/write or a single increment, depending on your CPU.

Edit:
As described, your code suffers from a poor SNR (signal to noise ratio) due to the relative speeds of memory access (cache or no cache) and calling functions just to measure the time. To get the timings you're currently getting, I assume you modified the code to look something like:

#include <stdio.h>
#include <time.h>

int main() {
    int steps = 64 * 1024 * 1024;
    int arr[1024 * 1024];
    int lengthMod = (1024 * 1024) - 1;
    int i;
    double timeTaken;
    clock_t start;

    start = clock();
    for (i = 0; i < steps; i++) {
        arr[(i * 16) & lengthMod]++;
    }
    timeTaken = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("Time for %d: %.12f \n", i, timeTaken);

    return 0;
}

This moves the measurement outside the loop so you're not measuring a single access (which would really be impossible) but rather you're measuring steps accesses.

You're free to increase steps as needed and this will have a direct impact on your timings. Since the times you're receiving are too close together, and in some cases even inverted (your time oscillates between sizes, which is not likely caused by cache), you might try changing the value of steps to 256 * 1024 * 1024 or even larger.

NOTE: You can make steps as large as you can fit into a signed int (which should be large enough), since the logical and ensures that you wrap around in your buffer.
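For completeness, here is a sketch of the overall approach once the measurement is fixed: time the same number of strided accesses for increasing buffer sizes and look for jumps in the time per access, which roughly mark the L1/L2/L3 capacities. The sizes, the stride of 16 ints (64 bytes) and the step count are illustrative assumptions:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const long long steps = 64LL * 1024 * 1024;     // accesses per buffer size
    for (int sizeKB = 4; sizeKB <= 64 * 1024; sizeKB *= 2) {
        std::vector<int> arr(sizeKB * 1024 / sizeof(int), 0);
        const std::size_t lengthMod = arr.size() - 1;   // sizes are powers of two

        auto t0 = std::chrono::steady_clock::now();
        for (long long i = 0; i < steps; ++i)
            arr[(i * 16) & lengthMod]++;                // stride of 64 bytes
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double>(t1 - t0).count() * 1e9 / steps;
        std::printf("%7d KB: %.3f ns/access\n", sizeKB, ns);
    }
    return 0;
}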

Clear Cache in Android Application programmatically

If you are looking to delete the cache of your own application, then simply delete your cache directory and it's all done!

public static void deleteCache(Context context) {
    try {
        File dir = context.getCacheDir();
        deleteDir(dir);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

// Recursively deletes a directory (or a single file) and returns true on success.
public static boolean deleteDir(File dir) {
    if (dir != null && dir.isDirectory()) {
        String[] children = dir.list();
        for (int i = 0; i < children.length; i++) {
            boolean success = deleteDir(new File(dir, children[i]));
            if (!success) {
                return false;
            }
        }
        return dir.delete();
    } else if (dir != null && dir.isFile()) {
        return dir.delete();
    } else {
        return false;
    }
}

