Command to Measure TLB Misses on Linux

Command to measure TLB misses on Linux

You can use perf to do this, provided your CPU supports it.

Use perf list to get some idea of the counters available. When I took this list and grepped for TLB (on my Sandy Bridge machine) I got:

rob@tartarus:~$ perf list | grep -i tlb
dTLB-loads            [Hardware cache event]
dTLB-load-misses      [Hardware cache event]
dTLB-stores           [Hardware cache event]
dTLB-store-misses     [Hardware cache event]
dTLB-prefetches       [Hardware cache event]
dTLB-prefetch-misses  [Hardware cache event]
iTLB-loads            [Hardware cache event]
iTLB-load-misses      [Hardware cache event]

You can then use any of these counters with: perf record -e <event0>,<event1>,...

And then just use perf report to look at the results.
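
For example, to sample both data and instruction TLB misses in a program and then inspect where they occur (a minimal sketch; ./yourapp is a placeholder for your workload, and both events must appear in your perf list output):

perf record -e dTLB-load-misses,iTLB-load-misses ./yourapp
perf report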

Simplest tool to measure C program cache hit/miss and CPU time in Linux?

Use perf:

perf stat ./yourapp

See the kernel wiki perf tutorial for details. This uses the hardware performance counters of your CPU, so the overhead is very small.

Example from the wiki:

perf stat -B dd if=/dev/zero of=/dev/null count=1000000

Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

           5,099  cache-misses            #      0.005 M/sec  (scaled from 66.58%)
         235,384  cache-references        #      0.246 M/sec  (scaled from 66.56%)
       9,281,660  branch-misses           #      3.858 %      (scaled from 33.50%)
     240,609,766  branches                #    251.559 M/sec  (scaled from 33.66%)
   1,403,561,257  instructions            #      0.679 IPC    (scaled from 50.23%)
   2,066,201,729  cycles                  #   2160.227 M/sec  (scaled from 66.67%)
             217  page-faults             #      0.000 M/sec
               3  CPU-migrations          #      0.000 M/sec
              83  context-switches        #      0.000 M/sec
      956.474238  task-clock-msecs        #      0.999 CPUs

     0.957617512  seconds time elapsed

There is no need to load a kernel module manually; on a modern Debian system (with the linux-base package) it should just work. With the perf record -a / perf report combo you can also do full-system profiling. Any application or library that has debugging symbols will show up with details in the report.
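
A minimal sketch of that full-system workflow (the 10-second window is an arbitrary choice; -g adds call graphs):

sudo perf record -a -g -- sleep 10
sudo perf report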

For visualization, flame graphs seem to work well. (Update 2020: the hotspot UI has flame graphs integrated.)
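
A sketch of producing one with Brendan Gregg's FlameGraph scripts (assuming you have cloned https://github.com/brendangregg/FlameGraph and put the scripts on your PATH):

perf record -g ./yourapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg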

How does Linux perf calculate the cache-references and cache-misses events

The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:

  523,288,816  cache-references       (architectural event: LLC Reference)
  205,331,370  cache-misses           (architectural event: LLC Misses)
  237,794,728  L1-dcache-load-misses  L1D.REPLACEMENT
3,495,080,007  L1-dcache-loads        MEM_INST_RETIRED.ALL_LOADS
2,039,344,725  L1-dcache-stores       MEM_INST_RETIRED.ALL_STORES
  531,452,853  L1-icache-load-misses  ICACHE_64B.IFTAG_MISS
   77,062,627  LLC-loads              OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
   27,462,249  LLC-load-misses        OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
   15,039,473  LLC-stores             OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
    3,829,429  LLC-store-misses       OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)

All of these events are documented in the Intel manual, Volume 3. For more information on how to map perf events to native events, see Hardware cache events and perf and How does perf use the offcore events?
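
To collect the same set of events for your own workload (a sketch; ./yourapp is a placeholder, and each event name must be supported by perf on your CPU):

perf stat -e cache-references,cache-misses,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses ./yourapp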

But how does perf calculate the cache-misses event? From my understanding, if cache-misses counts the number of memory accesses that cannot be served by the CPU cache, then shouldn't it be equal to LLC-load-misses + LLC-store-misses? Clearly in my case, cache-misses is much higher than the last-level-cache miss numbers.

LLC-load-misses and LLC-store-misses count only cacheable data read requests and RFO requests, respectively, that miss in the L3 cache. LLC-load-misses also includes reads for page walking. Both exclude hardware and software prefetching. (On Haswell, by contrast, some types of prefetch requests are counted.)

cache-misses also includes prefetch requests and code fetch requests that miss in the L3 cache. All of these events only count core-originating requests. They include requests from uops irrespective of whether they end up retiring, and irrespective of the source of the response. It's unclear to me how a prefetch promoted to a demand request is counted.

Overall, I think cache-misses is always larger than LLC-load-misses + LLC-store-misses and cache-references is always larger than LLC-loads + LLC-stores.

The same confusion applies to cache-references: it is much lower than L1-dcache-loads and much higher than LLC-loads + LLC-stores.

It's only guaranteed that cache-references is larger than cache-misses, because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-references, because core-originated loads usually occur only when you have load instructions, and because of the cache locality exhibited by many programs. But it's not necessarily always the case, because of hardware prefetches.

The L1-* and LLC-* events are easy to understand; as far as I can tell, they are read from the hardware counters in the CPU.

No, it's a trap. They are not easy to understand.

AMD: performance counter for cycles on TLB miss

It seems to me you're looking for events similar to Intel's *.WALK_DURATION or *.WALK_ACTIVE on AMD Zen processors. There are no such events with the same exact meaning, but there are similar events.

The closest events are the IBS performance data fields IbsTlbRefillLat and IbsItlbRefillLat, which measure the number of cycles it takes to fill an L1 DTLB or L1 ITLB miss, respectively, when the sampled uop or instruction fetch misses the TLB. Note that in perf record, IbsTlbRefillLat can be captured with the ibs_op PMU and IbsItlbRefillLat with the ibs_fetch PMU.
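
A sketch of driving those two PMUs from perf on a Zen machine (./yourapp is a placeholder; the refill-latency fields themselves have to be decoded from the raw IBS sample data):

perf record -e ibs_op// -o ibs_op.data -- ./yourapp
perf record -e ibs_fetch// -o ibs_fetch.data -- ./yourapp
perf report -i ibs_op.data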

The event Core::X86::Pmc::Core::LsTwDcFills is also useful. It counts the number of L1 data cache fills for page table walks that miss in the L1, broken down by data source (local L2, L3 on the same die, L3 on another die, DRAM or IO on the same die, DRAM or IO on another die). Walks fulfilled from farther sources are more expensive and will probably have a larger impact on performance. This event doesn't count walks that hit in the L1 data cache, although there are other events that count L2 TLB misses. Also, this event only counts walks for L2 DTLB misses, not ITLB misses.

In current versions of the upstream kernel, LsTwDcFills is not listed by perf list, so perf doesn't know the event by name. You'll have to specify the event code using the syntax cpu/event=0x5B,umask=0x0/. This event represents any page table walk for a data load or store for which a MAB is allocated (meaning that the walker missed in the L1D). You can filter the count by response by specifying an appropriate umask value as defined in the manual. For example, the event cpu/event=0x5B,umask=0x48/ represents a walk whose response came from local or remote main memory.
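
Put together (a sketch; ./yourapp is a placeholder, and the umask values are the two quoted above from the manual):

perf stat -e cpu/event=0x5B,umask=0x0/ -e cpu/event=0x5B,umask=0x48/ ./yourapp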

One good way to use all of these monitoring facilities as part of your overall microarchitectural performance analysis methodology is to first monitor LsTwDcFills. If it exceeds some threshold relative to the total number of memory accesses (excluding instruction fetches), capture IbsTlbRefillLat for sampled uops to locate where in your code these expensive walks are occurring. Similarly, for instruction fetch walks, use the event Core::X86::Pmc::Core::BpL1TlbMissL2Hit to count total walks, and if the count is too large relative to total fetches, use IbsItlbRefillLat to locate where in your code the most expensive walks are occurring. A shell-level sketch of this two-step methodology follows.
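
Here, the event codes are the ones quoted above, ./yourapp is a placeholder, the threshold judgment is yours, and instructions serves only as a rough denominator (the manual's memory-access events could be used instead):

# Step 1: count data-side table walks that missed the L1D (LsTwDcFills)
perf stat -e cpu/event=0x5B,umask=0x0/ -e instructions ./yourapp

# Step 2: if the walk count is high, sample uops with IBS to find the hot spots
perf record -e ibs_op// -- ./yourapp
perf report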


