Perf Startup Overhead: Why Does a Simple Static Executable Which Performs mov + sys_exit Have So Many Stalled Cycles (and Instructions)?

Big difference in overhead caused by instructions in straight-line code

One possible reason is that these instructions happen to be the ones the instruction pointer points at when a sample is taken. A typical x86 CPU can retire up to 4 instructions per cycle, but when it does so and a sample is taken, the program counter will point to just one of those four instructions, not all of them.

Here is an example (see below): a simple loop with a bunch of nop instructions. Note how the clockticks distribute over this profile, with exactly three instructions in the gaps between the lines that collect samples. This may be similar to the effect you are seeing.

Alternatively, it could be that mov rcx,QWORD PTR [rdi+0x210] and mov eax,DWORD PTR [rdi+0x1f4] often miss the cache, with the cycles spent waiting being attributed to the next instruction (the well-known sampling "skid" effect).


│ Disassembly of section .text:

│ 00000000004004ed :
│ push %rbp
│ mov %rsp,%rbp
│ movl $0x0,-0x4(%rbp)
│ ↓ jmp 25
14.59 │ d: nop
│ nop
│ nop
0.03 │ nop
14.58 │ nop
│ nop
│ nop
0.08 │ nop
13.89 │ nop
│ nop
0.01 │ nop
0.08 │ nop
13.99 │ nop
│ nop
0.01 │ nop
0.05 │ nop
13.92 │ nop
│ nop
0.01 │ nop
0.07 │ nop
14.44 │ addl $0x1,-0x4(%rbp)
0.33 │25: cmpl $0x3fffffff,-0x4(%rbp)
13.90 │ ↑ jbe d
│ pop %rbp
│ ← retq
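A standard way to reduce this skid, if your CPU and kernel support it, is to request a precise (PEBS-based) sample with the :pp event modifier, e.g.:

$ perf record -e cycles:pp ./my_program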

Why does it take so many instructions to run an empty program?

It's hardly fair to claim that it "does literally nothing". Yes, at the app level you chose to make the whole thing a giant no-op for your microbenchmark; that's fine. But no, down beneath the covers at the system level, it's hardly "nothing". You asked Linux to fork off a brand-new execution environment, initialize it, and connect it to its environment. You called very few glibc functions, but dynamic linking is non-trivial, and after a million instructions your process was ready to demand-fault in printf() and friends, and to efficiently bring in libs you might have linked against or dlopen()'ed.
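For contrast, here is a sketch of the kind of minimal static binary the question title describes (file name and build steps are illustrative; assumes x86-64 Linux), which skips the dynamic linker and glibc startup entirely:

$ cat > exit.s <<'EOF'
.globl _start
_start:
    mov $60, %eax       # __NR_exit on x86-64
    xor %edi, %edi      # exit status 0
    syscall
EOF
$ as -o exit.o exit.s && ld -o exit exit.o
$ perf stat -e instructions:u ./exit

instructions:u for something like this should be a tiny handful, which makes the million-instruction dynamic-linking startup easy to see by comparison.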

This is not the sort of microbenchmark that implementors are likely to optimize for. What would be of interest is identifying "expensive" aspects of fork/exec that in some use cases are never used, and so might be #ifdef'd out (or have their execution short-circuited) in very specific situations. Lazy evaluation of resolv.conf is one example: that overhead is never paid by a process that never interacts with IP servers.

Perf Stat vs Perf Record

First of all, your test case of using sleep and page-faults is not ideal. There should be no page-fault events during the sleep duration, so you can't really expect anything interesting. For the sake of easier reasoning, I suggest using the ref-cycles (hardware) event and a busy workload such as awk 'BEGIN { while(1){} }'.

Question 1: It is my understanding that perf stat gets a "summary" of
counts but when used with the -I option gets the counts at the
specified millisecond interval. With this option does it sum up the
counts over the interval or get the average over the interval, or
something else entirely? I assume it is summed up.

Yes. The values are just summed up. You can confirm that by testing:

$ perf stat -e ref-cycles -I 1000 timeout 10s awk 'BEGIN { while(1){} }'
#           time             counts unit events
     1.000105072      2,563,666,664      ref-cycles
     2.000267991      2,577,462,550      ref-cycles
     3.000415395      2,577,211,936      ref-cycles
     4.000543311      2,577,240,458      ref-cycles
     5.000702131      2,577,525,002      ref-cycles
     6.000857663      2,577,156,088      ref-cycles

[ ... snip ... ]
[ Note that it may not be as nicely consistent on all systems due to dynamic frequency scaling ]

$ perf stat -e ref-cycles -I 3000 timeout 10s awk 'BEGIN { while(1){} }'
#           time             counts unit events
     3.000107921      7,736,108,718      ref-cycles
     6.000265186      7,732,065,900      ref-cycles
     9.000372029      7,728,302,192      ref-cycles

Question 2: Why doesn't perf stat -e <event1> -I 1000 sleep 5 give
about the same counts as if I summed up the counts over each second
for the following command perf record -e <event1> -F 1000 sleep 5?

perf stat -I takes milliseconds, whereas perf record -F takes a frequency in Hz (1/s), so the command corresponding to perf stat -I 1000 is perf record -F 1. In fact, with our more stable event/workload, this looks better:

$ perf stat -e ref-cycles -I 1000 timeout 10s awk 'BEGIN { while(1){} }'
#           time             counts unit events
     1.000089518      2,578,694,534      ref-cycles
     2.000203872      2,579,866,250      ref-cycles
     3.000294300      2,579,857,852      ref-cycles
     4.000390273      2,579,964,842      ref-cycles
     5.000488375      2,577,955,536      ref-cycles
     6.000587028      2,577,176,316      ref-cycles
     7.000688250      2,577,334,786      ref-cycles
     8.000785388      2,577,581,500      ref-cycles
     9.000876466      2,577,511,326      ref-cycles
    10.000977965      2,577,344,692      ref-cycles
    10.001195845            466,674      ref-cycles

$ perf record -e ref-cycles -F 1 timeout 10s awk 'BEGIN { while(1){} }'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (17 samples) ]

$ perf script -F time,period
3369070.273722: 1
3369070.273755: 1
3369070.273911: 3757
3369070.273916: 3015133
3369070.274486: 1
3369070.274556: 1
3369070.274657: 1778
3369070.274662: 2196921
3369070.275523: 47192985748
3369072.663696: 2578692405
3369073.663547: 2579122382
3369074.663609: 2580015300
3369075.664085: 2579873741
3369076.664433: 2578638211
3369077.664379: 2578378119
3369078.664175: 2578166440
3369079.663896: 2579238122

So you see, the results eventually become stable for perf record -F as well. Unfortunately, the documentation of perf record is very poor. You can learn what the settings -c and -F mean from the documentation of the underlying system call, man perf_event_open:

sample_period, sample_freq
    A "sampling" event is one that generates an overflow notification
    every N events, where N is given by sample_period. A sampling event
    has sample_period > 0. When an overflow occurs, requested data is
    recorded in the mmap buffer. The sample_type field controls what
    data is recorded on each overflow.

    sample_freq can be used if you wish to use frequency rather than
    period. In this case, you set the freq flag. The kernel will adjust
    the sampling period to try and achieve the desired rate. The rate
    of adjustment is a timer tick.

So while perf stat uses an internal timer to read the value of the counter every -I milliseconds, perf record sets an event overflow threshold of -c events, i.e. it takes a sample every N events (e.g. every N page-faults or cycles). With -F, it tries to regulate this overflow value to achieve the desired frequency: it tries different values and tunes them up/down accordingly. This eventually works for counters with a stable rate, but will give erratic results for dynamic events.
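As a minimal sketch of that distinction (a hypothetical stand-alone C program, not how perf itself is structured; error handling trimmed), this is roughly the attr setup behind -F vs. -c:

/* freq_vs_period.c - sketch; compile with gcc freq_vs_period.c */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;

    attr.freq = 1;            /* like perf record -F: sample_freq is a rate in Hz */
    attr.sample_freq = 1000;  /* kernel keeps retuning the period toward 1000/s */
    /* With attr.freq = 0 and attr.sample_period = N you instead get a fixed
     * overflow every N events, like perf record -c N. */

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }
    /* Samples are delivered through an mmap'ed ring buffer; see the man page. */
    close(fd);
    return 0;
}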

Why do dependency-killing instructions consume reservation station slots?

They don't, but AFAIK there is no unfused-domain front-end counter. If you don't have branch mispredicts that cause uops to be discarded from the RS after issue / before exec, it doesn't matter where in the pipeline you count, so there is a workaround: count at the back end instead.

To count RS uops, use uops_executed.thread, which counts uops that have successfully(?) executed. I haven't checked whether replays of eagerly-dispatched uops count in uops_executed on every attempted dispatch, or only in uops_dispatched_port.port_[0..7].

See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of using perf to sort out eliminated vs. non-eliminated, and front-end fused domain vs. back-end unfused domain.
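Concretely, the workaround amounts to comparing the two domains, e.g. (uops_issued.any / uops_executed.thread are the Skylake-family names; exact names vary by microarchitecture):

$ perf stat -e uops_issued.any,uops_executed.thread,instructions,cycles ./a.out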

I just measure the number of microoperations getting into the RS with the event uops_issued.any

That event counts fused-domain uops issued into the ROB. It counts 1 for micro-fused uops like add eax, [rdi], or a mov al, [rsi] that merges a load into the low byte of RAX (even though those count as 2 uops_executed), and it counts 1 for eliminated uops like mov reg,reg and xor same,same (0 uops_executed).

perf list does misleadingly describe it like this (on Skylake), so the confusion is understandable.

uops_issued.any
[Uops that Resource Allocation Table (RAT) issues to Reservation
Station (RS)]


I always thought that instructions for killing dependencies, e.g. xor reg,reg, do not have to be executed and are ready for retirement as soon as the Renamer moves them to the Re-order Buffer.

Yes, that's what I think, too, that they enter the ROB marked as already executed, and don't touch the RS.

Only Sandybridge-family does this (including Skylake/Ice Lake); other microarchitectures (like Zen, AFAIK) do need a back-end uop to actually write the zero. See What is the best way to set a register to zero in x86 assembly: xor, mov or and?

AMD does do mov-elimination for vector moves (since Bulldozer) and GP-integer moves since Zen, so those are presumably handled like Intel xor-zeroing or mov.

One guess at the mechanism on Sandybridge is that xor-zeroing (of GP-integer or XMM/YMM registers) renames onto an internal zero register. http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ tested this and found that xor-zeroing instructions don't consume an extra PRF entry for writing the destination register.
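If you want to see the elimination yourself, a hypothetical test is a tight loop of xor-zeroing instructions: counted with the uops_issued.any / uops_executed.thread pair from above, the xor uops should show up at issue but not at execution on Sandybridge-family (the loop-overhead uops still execute, of course):

/* xor_test.c - sketch; gcc -O1 xor_test.c -o xor_test */
#include <stdint.h>

int main(void) {
    for (uint64_t i = 0; i < 1000000000ULL; i++) {
        /* xor-zeroing: handled at rename, no back-end uop on SnB-family */
        asm volatile("xor %%eax, %%eax" ::: "eax");
    }
    return 0;
}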

Benchmarking - How to count number of instructions sent to CPU to find consumed MIPS

perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran, how many core clock cycles it took, and how much CPU time it used, and it will calculate average instructions per core clock cycle for you, e.g.

3,496,129,612      instructions:u            #    2.61  insn per cycle

It calculates IPC for you; this is usually more interesting than instructions per second. (uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though.) You can manually calculate MIPS from instructions and task-clock. For most other events, perf prints a comment with a per-second rate.
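For example (made-up numbers, just to show the arithmetic): 3,496,129,612 instructions over 1,200 ms of task-clock is 3,496,129,612 / 1.2 s ≈ 2.91e9 instructions per second, i.e. about 2,913 MIPS.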

(If you don't use --all-user, you can use perf stat -e task-clock:u,instructions:u,... to have those specific events count in user-space only, while other events count all the time, including inside interrupt handlers and system calls.)

But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time if you do actually want total or average MIPS across cores, and counting sleep or not.


For an example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?

How can I get real-time information at run-time?

Do you mean from within the program, to profile only part of it? There's a perf API where you can do perf_event_open(2) yourself, or you can use a different library for direct access to the HW perf counters.
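A minimal sketch of self-instrumentation with that API, close to the example in man perf_event_open (the measured region here is an arbitrary placeholder loop):

/* count_region.c - sketch; counts user-space instructions for one region */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped; enable around the region */
    attr.exclude_kernel = 1;  /* like instructions:u */

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t sum = 0;                  /* region of interest */
    for (int i = 0; i < 1000000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions in region: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}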

perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.

Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), to see how program behaviour changes in real time with whatever time window you want (down to 10ms intervals).

sudo perf top is system-wide, slightly like Unix top

There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html, he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning into a static executable I can run under perf stat.


And is it possible to find the type of instruction set (add, compare, in, jump, etc)?

Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all.
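For example, with the generic event aliases perf exposes on most CPUs:

$ perf stat -e branches,branch-misses,instructions ./my_program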

For Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events. (Update: plain perf now knows the names of most uarch-specific counters so you don't need ocperf.py anymore.)

perf stat -e task-clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program

It's not designed to tell you what instructions are running; you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)
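e.g., using the Skylake-family event name quoted above:

$ perf stat -e arith.divider_active,cycles ./my_program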

Related: linux perf: how to interpret and find hotspots for using perf to identify hotspots, especially top-down profiling, where you have perf sample the call-stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)

Related:

  • How do I determine the number of x86 machine instructions executed in a C program?
  • How to characterize a workload by obtaining the instruction type breakdown?
  • How do I monitor the amount of SIMD instruction usage

For exact dynamic instruction counts, you might use a dynamic binary instrumentation tool like Intel PIN, if you're on x86: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

perf stat counts for the instructions:u hardware event should also be more or less exact, and in practice are very repeatable across runs of the same program doing the same work.

On recent Intel CPUs, there's HW support for recording which way conditional / indirect branches went, so you can reconstruct exactly which instructions ran, in which order (assuming no self-modifying code, and that you can still read any JIT buffers): Intel PT.
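With a PT-capable CPU and a reasonably recent perf, that looks something like this (the decode step needs the XED library installed):

$ perf record -e intel_pt//u ./my_program
$ perf script --insn-trace --xed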


Sorry I don't know what the equivalents are on AMD CPUs.


