Difference between "cpu/mem-loads/pp" and "cpu/mem-loads/"

What is the difference between cpu/mem-loads/pp and cpu/mem-loads/?

The p modifier stands for precise level. When sampling, it indicates how much skid you tolerate: how far the reported instruction may be from the instruction that actually generated the sample. pp means that SAMPLE_IP is requested to have 0 skid. In other words, when you sample memory accesses, you want to know exactly which instruction generated the access.

See man perf list:

p - precise level
....
       The p modifier can be used for specifying how precise the instruction address should be. The p modifier can be specified multiple times:

           0 - SAMPLE_IP can have arbitrary skid
           1 - SAMPLE_IP must have constant skid
           2 - SAMPLE_IP requested to have 0 skid
           3 - SAMPLE_IP must have 0 skid

       For Intel systems precise event sampling is implemented with PEBS which supports up to precise-level 2.

       On AMD systems it is implemented using IBS (up to precise-level 2). The precise modifier works with event types 0x76 (cpu-cycles, CPU clocks not halted) and 0xC1 (micro-ops
       retired). Both events map to IBS execution sampling (IBS op) with the IBS Op Counter Control bit (IbsOpCntCtl) set respectively (see AMD64 Architecture Programmer’s Manual Volume
       2: System Programming, 13.3 Instruction-Based Sampling). Examples to use IBS:

           perf record -a -e cpu-cycles:p ...    # use ibs op counting cycles
           perf record -a -e r076:p ...          # same as -e cpu-cycles:p
           perf record -a -e r0C1:p ...          # use ibs op counting micro-ops
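As a rough illustration of how the number of p characters maps to the levels quoted above, here is a small Python sketch. The names precise_level and PRECISE_LEVELS are hypothetical (they are not part of perf), and the parsing is deliberately simplified: it only handles hardware-event specifiers such as cpu/mem-loads/pp or cycles:upp, not tracepoints.

```python
# Hypothetical helper, not part of perf. It only mimics how the number of
# lowercase 'p' modifiers selects one of the precise levels from the man page.
PRECISE_LEVELS = {
    0: "SAMPLE_IP can have arbitrary skid",
    1: "SAMPLE_IP must have constant skid",
    2: "SAMPLE_IP requested to have 0 skid",
    3: "SAMPLE_IP must have 0 skid",
}


def precise_level(event_spec):
    """Count the 'p' modifiers of a hardware-event specifier.

    Simplification: modifiers are assumed to follow the last '/' in the
    PMU syntax (cpu/mem-loads/pp) or the last ':' otherwise (cycles:upp).
    """
    if "/" in event_spec:
        modifiers = event_spec.rsplit("/", 1)[1].lstrip(":")
    elif ":" in event_spec:
        modifiers = event_spec.rsplit(":", 1)[1]
    else:
        modifiers = ""
    return modifiers.count("p")


for spec in ("cpu/mem-loads/", "cpu-cycles:p", "cpu/mem-loads/pp"):
    level = precise_level(spec)
    print(spec, "->", level, "-", PRECISE_LEVELS[level])
```

With this sketch, cpu/mem-loads/ maps to level 0 and cpu/mem-loads/pp to level 2, which is the difference the question asks about.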

What does this sentence mean in the context of perf tool: Supports address when precise (Precise event)?

"Precise" events use PEBS instead of the traditional approach of firing an interrupt when the counter overflows. With PEBS the hardware writes a sample into a buffer to be collected later, so the sample can be attributed to the right instruction without pipeline / retirement effects delaying it (e.g. waiting until the currently-last instruction retires, I think to ensure forward progress, which causes a "skid").

The PEBS buffer also gives it a place to put additional data, like an address associated with the event that triggered recording a sample.

https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR#processor-event-based-sampling-pebs

Related questions with discussion or details of PEBS and how perf uses it for event:pp:

Good resources on how to program PEBS (Precise event based sampling) counters?
What is the difference between "cpu/mem-loads/pp" and "cpu/mem-loads/"?
Which perf events can use PEBS?
Perf shows L1-dcache-load-misses in a block with no memory access

What is the difference between PM_DATA_ALL* and PM_DATA* events on Power8?

After a few more hours of searching, I found another source directly from IBM describing the events as:

PM_DATA_ALL_FROM_LMEM
The processor's data cache was reloaded from the local chip's Memory due to either demand loads or data prefetch

and

PM_DATA_FROM_LMEM
The processor's data cache was reloaded from the local chip's Memory due to a demand load

So the difference is prefetch loads, which are counted by the first event but not by the second.

The PAPI and perf tools simply include the wrong description. These events were contributed directly to oprofile by IBM, but probably with some mistakes/inaccuracies. Browsing through the PAPI/libpfm source, I see that the correct description is in the .pme_short_desc field, but the .pme_long_desc fields are identical for both events. And papi_native_avail reports only the long one:
Thanks ... Very fu**ing useful!

Thanks for your patience. Summarizing things like this helped me a lot, and I hope it will help somebody struggling with similar issues.

what is `__GI_memset`? why does it cost so much CPU resource?

what is __GI_memset?

It's an internal alias for memset.

why does it cost so much CPU resource?

Because you call it a lot, or because you give it a lot of memory to set to some value.

Judging by your next most expensive symbol cv::icvCvt_BGR2RGB_8u_C3R, you are doing some kind of image processing, and possibly are allocating cleared images.

One common mistake is to allocate a cleared image and immediately set it to something else (thus wasting the time spent clearing it). But there is not enough info here to deduce whether you are doing that.

perf mem error event 'cpu/mem-stores/P' not supported

Profiling store memory accesses (Precise Store) is available on Sandy Bridge and later. So it's not supported on your CPU. However, load profiling is supported as the output of the tool indicates.

By default, both loads and stores are profiled. But because Precise Store is not supported on your CPU, the tool emits an error. So you can profile loads only by passing the -t load switch.

Logging all memory accesses of any executable/process in Linux

It is simply impossible to have both the fastest possible run of Spec and all memory accesses (or cache misses) traced in the same run (using in-system tracers). Do one run for timing and another (longer, slower) run, or even a recompiled binary, for memory access tracing.

You may start with a short and simple program (not the ref inputs of recent SpecCPU, or the billions of memory accesses in your big programs) and use the Linux perf tool (perf_events) to find an acceptable ratio of recorded memory requests to all memory requests. There is the perf mem tool, or you may try some PEBS-enabled events of the memory subsystem. PEBS is enabled by adding the :p or :pp suffix to the perf event specifier, as in perf record -e event:pp, where event is one of the PEBS events. Also try pmu-tools ocperf.py for easier Intel event name encoding and to find PEBS-enabled events.

Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory performance tests. Check the worst case of memory recording overhead at the left part of the Arithmetic Intensity scale of the [Roofline model](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/). Typical tests from this part are listed below (STREAM is BLAS1, while GUPS and memlat are almost SpMV); many real tasks are usually not so far left on the scale:

STREAM test (linear access to memory),
RandomAccess (GUPS) test
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).

Do you want to trace every load/store instruction, or do you only want to record requests that missed all (some) caches and were sent to the main RAM of the PC (to L3)?

Why do you want no overhead and all memory accesses recorded? It is simply impossible, because every memory access needs several bytes of tracing data (the memory address, sometimes the instruction address) to be recorded to the same memory. So enabling memory tracing (for more than 10% of memory accesses) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticeable, but its effect (overhead) is smaller.
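To get a feel for the numbers, the trade-off can be sketched with a back-of-the-envelope calculation in Python. Everything here is an assumption made for illustration: the function name trace_bytes_per_second is hypothetical, the access rate of 10^9 per second is made up, and 16 bytes per record assumes an 8-byte data address plus an 8-byte instruction address (real PEBS records are larger).

```python
def trace_bytes_per_second(accesses_per_sec, sample_every, bytes_per_record):
    """Estimate the extra memory traffic generated by the trace itself.

    accesses_per_sec: memory accesses the program issues per second (assumed)
    sample_every:     record one access out of this many (10 means 10%)
    bytes_per_record: size of one trace record in bytes (assumed)
    """
    if sample_every <= 0:
        raise ValueError("sample_every must be positive")
    recorded_per_sec = accesses_per_sec // sample_every
    return recorded_per_sec * bytes_per_record


ACCESSES_PER_SEC = 1_000_000_000  # illustrative figure, not a measurement
RECORD_BYTES = 16                 # assumed: data address + instruction address

for sample_every in (100, 10, 2, 1):  # 1%, 10%, 50%, 100% of accesses
    extra = trace_bytes_per_second(ACCESSES_PER_SEC, sample_every, RECORD_BYTES)
    print(f"1 in {sample_every:>3}: about {extra / 1e9:.2f} GB/s of trace writes")
```

Under these assumptions, recording every access would add about 16 GB/s of writes competing with the program's own memory traffic, while recording 1 in 100 adds about 0.16 GB/s.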

Your CPU, the E5-2620 v4, is a 14nm Broadwell-EP, so it may also have an early variant of Intel PT:

https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing
https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt
https://github.com/01org/processor-trace
and especially Andi Kleen's blog on PT: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb"

PT support in hardware: Broadwell (5th generation Core, Xeon v4) More overhead. No fine grained timing.

PS: Researchers who studied memory accesses of SpecCPU worked with memory access dumps/traces, and the dumps were generated slowly:

http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses recorded for offline analysis; no timing was recorded from the tracing runs
http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all loads/stores instrumented by writing into an additional huge tracing buffer, with periodic (rare) online aggregation. Such instrumentation makes the run 2x slower or worse, especially for a core limited by memory bandwidth / latency.
http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation: program code was modified and instrumented to write memory access metadata into a buffer. Such instrumentation makes the run 2x slower or worse, especially for a core limited by memory bandwidth / latency. The paper lists and explains the instrumentation overhead and caveats:

Instrumentation Overhead: Instrumentation involves
injecting extra code dynamically or statically into the
target application. The additional code causes an
application to spend extra time in executing the original
application ... Additionally, for multi-threaded
applications, instrumentation can modify the ordering of
instructions executed between different threads of the
application. As a result, IDS with multi-threaded
applications comes at the lack of some fidelity

Lack of Speculation: Instrumentation only observes
instructions executed on the correct path of execution. As
a result, IDS may not be able to support wrong-path ...

User-level Traffic Only: Current binary instrumentation
tools only support user-level instrumentation. Thus,
applications that are kernel intensive are unsuitable for
user-level IDS.

Good resources on how to program PEBS (Precise event based sampling) counters?

Please don't mix tracing and timing measurements in a single run.

It is simply impossible to have both the fastest run of Spec and all memory accesses traced. Do one run for timing and another (longer, slower) run for memory access tracing.

In https://github.com/pyrovski/powertools the frequency of collected events is controlled by the reset_val argument of pebs_init:

https://github.com/pyrovski/powertools/blob/0f66c5f3939a9b7b88ec73f140f1a0892cfba235/msr_pebs.c#L72

void
pebs_init(int nRecords, uint64_t *counter, uint64_t *reset_val ){
    // 1. Set up the precise event buffering utilities.
    //  a.  Place values in the
    //      i.   precise event buffer base,
    //      ii.  precise event index
    //      iii. precise event absolute maximum,
    //      iv.  precise event interrupt threshold,
    //      v.   and precise event counter reset fields
    //      of the DS buffer management area.
    //
    // 2.  Enable PEBS.  Set the Enable PEBS on PMC0 flag 
    //  (bit 0) in IA32_PEBS_ENABLE_MSR.
    //
    // 3.  Set up the IA32_PMC0 performance counter and 
    //  IA32_PERFEVTSEL0 for an event listed in Table 
    //  18-10.

    // IA32_DS_AREA points to 0x58 bytes of memory.  
    // (11 entries * 8 bytes each = 88 bytes.)

    // Each PEBS record is 0xB0 byes long.
...
    pds_area->pebs_counter0_reset       = reset_val[0];
    pds_area->pebs_counter1_reset       = reset_val[1];
    pds_area->pebs_counter2_reset       = reset_val[2];
    pds_area->pebs_counter3_reset       = reset_val[3];
...

    write_msr(0, PMC0, reset_val[0]);
    write_msr(1, PMC1, reset_val[1]);
    write_msr(2, PMC2, reset_val[2]);
    write_msr(3, PMC3, reset_val[3]);

This project is a library for accessing PEBS, and no usage examples are included in the project (as far as I found, there is only one disabled test in other projects by tpatki).

Check the Intel SDM Vol. 3B (this is the only good resource for PEBS programming) for the meaning of the fields and for PEBS configuration and output:
https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-734.html

18.15.7 Processor Event-Based Sampling

PEBS permits the saving of precise architectural information associated with one or more performance events in the precise event records buffer, which is part of the DS save area (see Section 17.4.9, “BTS and DS Save Area”).
To use this mechanism, a counter is configured to overflow after it has counted a preset number of events. After the counter overflows, the processor copies the current state of the general-purpose and EFLAGS registers and instruction pointer into a record in the precise event records buffer. The processor then resets the count in the performance counter and restarts the counter. When the precise event records buffer is nearly full, an interrupt is generated, allowing the precise event records to be saved. A circular buffer is not supported for precise event
records.
... After the PEBS-enabled counter has overflowed, PEBS
record is recorded

(So the reset value is probably negative: -1000 to get every 1000th event, -10 to get every 10th event. The counter increments and a PEBS record is written at counter overflow.)
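The arithmetic behind a negative reset value can be sketched in Python. This is only an illustration: pebs_reset_value is a hypothetical helper, and the 48-bit counter width is an assumption (the real width depends on the CPU), so treat the printed value as an example rather than something to write into an MSR unchecked.

```python
def pebs_reset_value(period, counter_width=48):
    """Return the raw counter value that overflows after `period` events.

    The counter counts upward, so it is preloaded with -period, expressed
    as an unsigned two's-complement number of `counter_width` bits.
    counter_width=48 is an assumption, not a value taken from the source.
    """
    if period <= 0:
        raise ValueError("period must be a positive number of events")
    mask = (1 << counter_width) - 1
    return (-period) & mask


def events_until_overflow(raw_value, counter_width=48):
    """Inverse check: how many increments until the counter wraps to 0."""
    return (1 << counter_width) - raw_value


raw = pebs_reset_value(1000)
print(hex(raw), events_until_overflow(raw))
```

A smaller period (for example 10 instead of 1000) yields a raw value closer to the wrap-around point, so samples are taken more often.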

See also https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-656.html, 18.4.4 Processor Event Based Sampling (PEBS), "Table 18-10": only L1/L2/DTLB misses have PEBS events on Intel Core. (Find the PEBS section for your CPU and search for memory events. PEBS-capable events are really rare.)

So, to have more events recorded, you probably want to set the reset values in this function to a smaller absolute value, like -50 or -10. With PEBS this may work. Also try perf record -e cycles:upp -c 10: don't ask to profile the kernel at such a high frequency, only user space with :u; ask for precise sampling with :pp; and ask for a -10 counter reset with -c 10. perf has all the PEBS mechanics implemented, both for the MSRs and for buffer parsing.

Other good resources for the PMU (hardware performance monitoring unit) are also from Intel: the PMU Programming Guides. They have short and compact descriptions of both the usual PMU and PEBS. There is a public "Nehalem Core PMU" guide, most of which is still useful for newer CPUs: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf (There are also uncore PMU guides: E5-2600 Uncore PMU Guide, 2012 https://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf)

External pdf about PEBS: https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf#page=23 PMCs: Setting Up for PEBS - from "Black Hat USA 2015 - These are Not Your Grand Daddy's CPU Performance Counters"

You may start with a short and simple program (not the ref inputs of recent SpecCPU) and use the Linux perf tool (perf_events) to find an acceptable ratio of recorded memory requests to all memory requests. PEBS is used with perf by adding the :p or :pp suffix to the event specifier, as in perf record -e event:pp. Also try pmu-tools ocperf.py for easier Intel event name encoding.

Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory tests like the ones below. They show the worst case of memory recording overhead, at the left part of the Arithmetic Intensity scale of the Roofline model (STREAM is BLAS1, while GUPS and memlat are almost SpMV); real tasks are usually not so far left on the scale:

STREAM test (linear access to memory),
RandomAccess (GUPS) test
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).

Do you want to trace every load/store instruction, or do you only want to record requests that missed all (some) caches and were sent to the main RAM of the PC (to L3)?

Why do you want no overhead and all memory accesses recorded? It is simply impossible, because every memory access needs several bytes of tracing data to be recorded to memory. So enabling memory tracing (for more than 10% of memory accesses) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticeable, but its effect (overhead) is smaller.

imread returns None, violating assertion !_src.empty() in function 'cvtColor' error

This error happened because the image didn't load properly, so the problem is in the previous line, cv2.imread. My suggestions are:

check that the image exists at the path you give
check that the count variable has a valid number
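The two checks above can be sketched in Python as follows. The helper name load_bgr_image and the frame_0.png file name in the usage note are hypothetical; cv2.imread and cv2.cvtColor are the OpenCV calls from the error message, and cv2 is imported lazily so that the path check works even where OpenCV is not installed.

```python
import os


def load_bgr_image(path):
    """Load an image with OpenCV, failing loudly instead of returning None.

    Hypothetical helper: it checks the path first, then checks the result
    of cv2.imread, which returns None when the file cannot be decoded.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"image not found: {path}")
    import cv2  # imported here so the path check does not require OpenCV
    image = cv2.imread(path)
    if image is None:
        raise ValueError(f"cv2.imread could not decode: {path}")
    return image
```

A call such as cv2.cvtColor(load_bgr_image("frame_0.png"), cv2.COLOR_BGR2GRAY) then fails at the load step with a clear message, rather than inside cvtColor with the !_src.empty() assertion.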
