Using Perf Probe to Monitor Performance Stats During a Particular Function

I think that the instructions you are following have not yet been merged into the mainline Linux kernel. As a consequence, perf is telling you that the events are not supported: perf doesn't know about the "toggle" mechanism mentioned on that page.

I can see two workarounds:

  1. If you have access to the source code you want to profile you can use the perf_event_open system call directly from your source code to start and stop counting on function entry and exit.
  2. Clone the jolsa repository (git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/jolsa/perf), switch to the core_toggle branch (git checkout remotes/origin/perf/core_toggle), and then compile and run the kernel with this support.

Regarding 2, I am not familiar at all with kernel versions and development, and I think that this solution may be quite complex to use and maintain. Maybe you should ask on the perf users mailing list whether there are any plans to integrate the toggle feature into the mainline kernel.
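To illustrate workaround 1, here is a minimal sketch of self-monitoring with perf_event_open (the cycle-counting event and the begin/end helper names are my choices, not part of any toggle patch); call the pair at the entry and exit of the function you want to profile:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* glibc has no perf_event_open() wrapper, so call the syscall directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static int counter_fd = -1;

void counting_begin(void)                   /* call at function entry */
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES; /* or any other event */
    attr.disabled = 1;                      /* created stopped ...        */
    attr.exclude_kernel = 1;
    counter_fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
    ioctl(counter_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(counter_fd, PERF_EVENT_IOC_ENABLE, 0);       /* ... started here */
}

void counting_end(void)                     /* call at function exit */
{
    uint64_t count = 0;
    ioctl(counter_fd, PERF_EVENT_IOC_DISABLE, 0);
    read(counter_fd, &count, sizeof(count));
    printf("cycles in function: %llu\n", (unsigned long long)count);
    close(counter_fd);
}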

Monitoring performance counters during execution of a specific function

I could only find the implementation of the toggle events feature in the perf/core_toggle repo, which is maintained by the developer of the feature. You can probably compile that code and play with the feature yourself. You can find examples of how to use it here. However, I don't think it has been accepted into the mainline Linux kernel for any version yet.

If you want to measure the number of occurrences of one or more events, there are alternatives that are easy to use but require adding a few lines of code to your codebase. You can use the perf interface programmatically, or use third-party tools that offer such APIs, such as PAPI and LIKWID.
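For example, with PAPI's high-level region API (available since PAPI 5.7; the region name is arbitrary, events are selected through the PAPI_EVENTS environment variable, and you link with -lpapi):

#include <papi.h>

void compute(void)
{
    /* ... the code you want to measure ... */
}

int main(void)
{
    /* Run as e.g.: PAPI_EVENTS="PAPI_TOT_INS,PAPI_TOT_CYC" ./a.out */
    PAPI_hl_region_begin("compute");
    compute();
    PAPI_hl_region_end("compute");
    /* Counts are written to a papi_hl_output directory on exit. */
    return 0;
}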

Perf probe event for C variable assignment

Yes, this is possible with hardware breakpoint events. perf record supports these if you know the address:

a hardware breakpoint event in the form of \mem:addr[/len][:access], where addr is the address in memory you want to break on. access is the memory access type (read, write, execute); it can be passed as \mem:addr[:[r][w][x]]. len is the range, the number of bytes from the specified addr which the breakpoint will cover. If you want to profile read-write accesses at 0x1000, just set mem:0x1000:rw. If you want to profile write accesses in [0x1000~1008), just set mem:0x1000/8:w.
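For a global variable you can look up the address with nm and feed it straight to perf record; a sketch with a hypothetical binary and symbol name (the address will differ on your system):

$ nm ./a.out | grep ' my_global'
0000000000601040 B my_global
$ sudo perf record -e mem:0x601040/8:w ./a.out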

It may be difficult to get the memory address beforehand (for a global variable nm works, as above; for stack or heap variables the address isn't known until run time). You can also use perf_event_open inside your program, but then you need to parse the perf sample records yourself.
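If you only need a count of accesses rather than samples, the perf_event_open route avoids record parsing entirely. A minimal sketch for x86-64 Linux, with a hypothetical variable name (the address is taken with & at run time, so this also works for locals):

#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile double watched;            /* hypothetical variable to watch */

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_BREAKPOINT;
    attr.bp_type = HW_BREAKPOINT_W;        /* count writes only */
    attr.bp_addr = (uintptr_t)&watched;    /* address taken at run time */
    attr.bp_len = HW_BREAKPOINT_LEN_8;     /* sizeof(double) */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

    for (int i = 0; i < 1000; i++)
        watched = i * 0.5;                 /* each assignment is one write */

    uint64_t count = 0;
    read(fd, &count, sizeof(count));       /* counter starts enabled by default */
    printf("writes to watched: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}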

Measure the time to reach the main function using perf?

First, you have to consider that perf doesn't really measure time - it records events. You can do some profiling, look at call stacks, and derive some information about initialization, but in order to measure a specific duration we need to record a beginning and an end timestamp.

For the time to reach the main function, we can use:

1) A dynamic tracepoint on main:

$ sudo perf probe -x ./gctor main
Added new event:
  probe_gctor:main     (on main in ./gctor)

You can now use it in all perf tools, such as:

perf record -e probe_gctor:main -aR sleep 1

This does require pretty high privileges; I'll just use root in the example.

2) A sensible point for the "start" of your binary.

I suggest the tracepoint syscalls:sys_exit_execve. This fires basically right after perf record starts executing your binary. It works in my version (5.3.7) - if it doesn't work for you, you may need to tinker around. You could of course just use -e cycles, but then you get spammed later on with events you don't want.

Putting it together:

sudo perf record -e probe_gctor:main -e syscalls:sys_exit_execve ./gctor
^ this is what perf probe told you earlier

And then look at it with perf script --header

# time of first sample : 77582.919313
# time of last sample : 77585.150377
# sample duration : 2231.064 ms
[....]
# ========
#
gctor 238828 [007] 77582.919313: syscalls:sys_exit_execve: 0x0
gctor 238828 [001] 77585.150377: probe_gctor:main: (5600ea33414d)

You can either compute the difference of the two timestamps yourself (77585.150377 - 77582.919313 ≈ 2.231 s, matching the reported sample duration) or use the sample duration directly if these are really the only two samples in your trace.

For completeness: Here's a way to do it with gdb:

gdb ./gctor -ex 'b main' -ex 'python import time' -ex 'python ts=time.time()' -ex 'run' -ex 'python print(time.time()-ts)'

This is much less accurate, has about 100 ms overhead on my system, but it doesn't require higher privileges. You could of course improve on this by building your own runner with fork/ptrace/exec in C - see the sketch below.
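A rough sketch of such a runner for x86-64 Linux. It assumes a non-PIE binary, so that the address of main reported by nm ./gctor is also the runtime address; MAIN_ADDR below is a placeholder you would fill in from nm:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MAIN_ADDR 0x40114dUL  /* placeholder: take the address of main from nm */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s ./gctor\n", argv[0]); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);          /* child stops right after exec */
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);               /* wait for the exec stop */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Plant an INT3 (0xcc) on the first byte of main. */
    long orig = ptrace(PTRACE_PEEKTEXT, pid, (void *)MAIN_ADDR, NULL);
    ptrace(PTRACE_POKETEXT, pid, (void *)MAIN_ADDR, (orig & ~0xffUL) | 0xcc);

    ptrace(PTRACE_CONT, pid, NULL, NULL);
    waitpid(pid, &status, 0);               /* child hit the breakpoint */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("exec -> main: %.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    /* Undo the breakpoint, back up rip over the INT3, and let it run on. */
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    regs.rip -= 1;
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);
    ptrace(PTRACE_POKETEXT, pid, (void *)MAIN_ADDR, orig);
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}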

Getting running time (or other stats) for C Program using perf or otherwise

There are lots of performance analysis tools for C and C++ programs (providing information about running time, memory consumption, and so on), some of which are:

  1. Valgrind

  2. Google Perf Tools
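For quick numbers, perf stat ./prog alone covers running time plus basic counter stats. Typical invocations for the two tools above (the gperftools library path is an assumption and varies by distro; analyze its output with pprof):

valgrind --tool=callgrind ./prog                              # call-graph time profile
valgrind --tool=massif ./prog                                 # heap memory profile
CPUPROFILE=prof.out LD_PRELOAD=/usr/lib/libprofiler.so ./prog # gperftools CPU profiler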

Hope this is what you are looking for!

Understanding Linux perf FP counters and computation of FLOPS in a C++ program

The normal way for C++ compilers to do FP math on x86-64 is with scalar versions of SSE instructions, e.g. addsd xmm0, [rdi] (https://www.felixcloutier.com/x86/addsd). Only legacy 32-bit builds default to using the x87 FPU for scalar math.

If your compiler didn't manage to auto-vectorize anything (e.g. you didn't use g++ -O3 -march=native), and the only math you do is with double not float, then all the math operations will be done with scalar-double instructions.

Each such instruction will be counted by the fp_arith_inst_retired.double, .scalar, and .scalar_double events. They overlap, being basically sub-filters of the same event. (An FMA counts as two, even though it's still only one instruction, so these are FLOP counts, not uops or instructions.)

So you have 4,493,140,957 FLOPs over 65.86 seconds.

4493140957 / 65.86 / 1e9 ~= 0.0682 GFLOP/s, i.e. very low.

If you had any counts for 128b_packed_double, you'd multiply those by 2. As noted in the perf list description, "each count represents 2 computation operations, one for each element", because a 128-bit vector holds two 64-bit double elements. So each count for this event is 2 FLOPs. Similarly for the others: follow the scale factors described in the perf list output, e.g. times 8 for 256b_packed_single.

So you do need to separate the SIMD events by type and width, but you could just look at .scalar without separating single and double.
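Putting the scale factors together (all events from the fp_arith_inst_retired.* family as printed by perf list; the 512b events exist only on AVX-512-capable CPUs):

FLOPs = scalar
      + 2 × 128b_packed_double + 4 × 128b_packed_single
      + 4 × 256b_packed_double + 8 × 256b_packed_single
      + 8 × 512b_packed_double + 16 × 512b_packed_single

and then GFLOP/s = FLOPs / seconds / 1e9, as in the calculation above.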

See also FLOP measurement, one of the duplicates of FLOPS in Python using a Haswell CPU (Intel Core Processor (Haswell, no TSX)), which was linked on your previous question.


(36.37%) is how much of the total time that event was programmed on a HW counter. You used more events than there are counters, so perf multiplexed them for you, swapping every so often and extrapolating based on that statistical sampling to estimate the total over the run time. See Perf tool stat output: multiplex and scaling of "cycles".

You could get exact counts for the non-zero non-redundant events by leaving out the ones that are zero for a given build.


