How to count number of executed instructions of a process id including all future child threads
The combination of perf record -s and perf report -T should give you the information you need.
To demonstrate, take the following example code using threads with well-defined instruction counts:
#include <cstdint>
#include <thread>

void work(int64_t count) {
    for (int64_t i = 0; i < count; i++);
}

int main() {
    std::thread first(work, 100000000ll);
    std::thread second(work, 400000000ll);
    std::thread third(work, 800000000ll);
    first.join();
    second.join();
    third.join();
}
(Compile without optimization, e.g. g++ -O0 -pthread, so the compiler does not remove the empty loops!)
Now, use perf record as a prefix command. It will follow all spawned processes and threads.
$ perf record -s -e instructions -c 1000000000 ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data (5 samples) ]
To display the statistics nicely:
$ perf report -T
[... snip ...]
# PID TID instructions:u
270682 270683 500003888
270682 270684 2000001866
270682 270685 4000002177
The parameters for perf record are a little bit tricky. -s writes separate records with fairly precise numbers; they do not depend on the instruction samples (generated every 1000000000 instructions). However, perf report, even with -T, fails when it does not find a single sample. So you need to set an instruction sample count -c (or frequency) that triggers at least once. Any sample will do; you do not need a sample per thread.
Alternatively, you could look at the raw records from perf.data. Then you can actually tell perf record to not collect any samples.
$ perf record -s -e instructions -n ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data ]
But you need to filter out the relevant records and there might be additional records you need to sum up.
$ perf script -D | grep PERF_RECORD_READ | grep -v " 0$"
# Annotation by me PID TID
213962455637481 0x760 [0x40]: PERF_RECORD_READ: 270887 270888 instructions:u 500003881
213963194850657 0x890 [0x40]: PERF_RECORD_READ: 270887 270889 instructions:u 2000001874
213964190418415 0x9c0 [0x40]: PERF_RECORD_READ: 270887 270890 instructions:u 4000002175
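The summing step mentioned above can be sketched in a few lines of C++. This is only a minimal sketch: the function name sum_read_records is made up for illustration, and it assumes the text layout shown in the perf script -D output above (marker, PID, TID, event name, value), which may differ between perf versions.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Sum the counter values of all PERF_RECORD_READ lines for one event in the
// text output of `perf script -D`. Assumed line layout (taken from the output
// shown above, not from any formal specification):
//   <time> <offset> [<size>]: PERF_RECORD_READ: <pid> <tid> <event> <value>
std::uint64_t sum_read_records(const std::string& dump, const std::string& event) {
    static const std::string marker = "PERF_RECORD_READ:";
    std::istringstream lines(dump);
    std::string line;
    std::uint64_t total = 0;
    while (std::getline(lines, line)) {
        const auto pos = line.find(marker);
        if (pos == std::string::npos) continue;  // not a READ record
        std::istringstream fields(line.substr(pos + marker.size()));
        long pid = 0, tid = 0;
        std::string name;
        std::uint64_t value = 0;
        // Skip lines that do not parse or that belong to another event.
        if ((fields >> pid >> tid >> name >> value) && name == event)
            total += value;
    }
    return total;
}
```

Fed the three records above with the event name instructions:u, this adds up to 6500007930 instructions for the whole process. Records with a value of 0 contribute nothing, so the grep -v filter is not needed when summing.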
How to count number of executed instructions of a process id including child processes
You can get the PID of one of your processes (the parent) and deduce the others using pgrep.
pgrep has a neat feature, --ns, which gets you all the processes running in the same namespaces as a given PID (use --nslist to restrict the match to particular namespaces, such as pid).
With that, you can collect all the related processes, convert their PIDs to comma-separated values, and feed them to perf:
$ perf stat -p $(pgrep --ns <pid> | paste -s -d ",") -e instructions,cycles,task-clock docker exec -it c7457f74536b curl 127.0.0.1:30005/workload/cpu
pgrep --ns will get you the PIDs and paste -s -d "," will join them into one comma-separated line.
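If the PID list is built inside a program rather than in the shell, the same joining step is easy to reproduce. A minimal sketch, assuming the PIDs are already available as integers; join_pids is a hypothetical helper name, not part of perf or pgrep:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Join PIDs into the comma-separated form passed to `perf stat -p`,
// mirroring what `paste -s -d ","` does to pgrep's one-PID-per-line output.
std::string join_pids(const std::vector<int>& pids) {
    std::string out;
    for (std::size_t i = 0; i < pids.size(); i++) {
        if (i > 0) out += ',';  // separator between entries, none at the ends
        out += std::to_string(pids[i]);
    }
    return out;
}
```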
How do I sample all threads and record their thread id with perf?
As it turns out, perf record already records threads and their IDs. What got me confused is that the thread ID of the main thread is equal to the process ID. I also must have been doing something wrong when doing the -F -tid test, because indeed the column with the thread ID disappears.
Benchmarking - How to count number of instructions sent to CPU to find consumed MIPS
perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran and how many core clock cycles it took. It also reports how much CPU time it used, and will calculate average instructions per core clock cycle for you, e.g.
3,496,129,612 instructions:u # 2.61 insn per cycle
It calculates IPC for you; this is usually more interesting than instructions per second. uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though. You can manually calculate MIPS from instructions and task-clock. For most other events perf prints a comment with a per-second rate.
(If you don't use --all-user, you can use perf stat -e task-clock:u,instructions:u, ... to have those specific events count in user-space only, while other events can count always, including inside interrupt handlers and system calls.)
But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time if you do actually want total or average MIPS across cores, and on whether to count sleep time or not.
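The manual MIPS calculation is plain arithmetic. A minimal sketch, assuming the time is given in milliseconds (the unit perf stat uses for task-clock); the helper name mips is hypothetical:

```cpp
#include <cstdint>

// Millions of instructions per second from an instruction count and a time
// in milliseconds. Pass task-clock to get MIPS per unit of CPU time, or the
// elapsed wall-clock time to get MIPS over the whole run (sleep included).
double mips(std::uint64_t instructions, double milliseconds) {
    if (milliseconds <= 0.0) return 0.0;  // avoid dividing by zero
    // instructions / (ms / 1000) / 1e6 simplifies to instructions / (ms * 1000)
    return static_cast<double>(instructions) / (milliseconds * 1000.0);
}
```

For example, the 3,496,129,612 instructions from the output above together with a made-up task-clock of 500 msec would give about 6992 MIPS. The 500 msec figure is an assumption for illustration only, not a measured value.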
For an example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?
How can I get real-time information at run-time
Do you mean from within the program, to profile only part of it? There's a perf API for that: the perf_event_open system call lets a program set up and read counters itself. Or use a different library for direct access to the HW perf counters.
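Counting instructions for only part of a program with perf_event_open could look roughly like the following. This is a sketch under assumptions, not a definitive implementation: count_user_instructions is a hypothetical helper, it measures only the calling thread (threads started inside the measured region are not included), and it returns -1 when the counter is unavailable, for example because of the perf_event_paranoid setting or a virtual machine without a PMU.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdint>
#include <cstring>

// Run `body` and return the number of user-space instructions the calling
// thread retired while it ran, or -1 if the counter could not be opened or read.
template <typename Body>
long long count_user_instructions(Body&& body) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;  // user space only, like instructions:u
    attr.exclude_hv = 1;

    // glibc provides no wrapper, so the call goes through syscall().
    // pid = 0 and cpu = -1: measure the calling thread on any CPU.
    const int fd = static_cast<int>(
        syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
    if (fd < 0) return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    body();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t count = 0;
    const bool ok =
        read(fd, &count, sizeof(count)) == static_cast<ssize_t>(sizeof(count));
    close(fd);
    return ok ? static_cast<long long>(count) : -1;
}
```

A call such as count_user_instructions([] { hot_loop(); }), where hot_loop stands for whatever code you want to measure, then returns the count for just that region. The count includes a small, fairly constant overhead from the ioctl calls themselves.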
perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.
Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), to see how program behaviour changes in real time with whatever time window you want (down to 10ms intervals).
sudo perf top is system-wide, slightly like Unix top.
There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html, he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning into a static executable I can run under perf stat.
And is it possible to find the type of instruction set (add, compare, in, jump, etc)?
Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all.
For Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events. (Update: plain perf now knows the names of most uarch-specific counters so you don't need ocperf.py anymore.)
perf stat -e task-clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program
It's not designed to tell you what instructions are running; you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)
Related: linux perf: how to interpret and find hotspots, for using perf to identify hotspots. Especially with top-down profiling, you have perf sample the call-stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)
Related:
- How do I determine the number of x86 machine instructions executed in a C program?
- How to characterize a workload by obtaining the instruction type breakdown?
- How do I monitor the amount of SIMD instruction usage
For exact dynamic instruction counts, you might use an instrumentation tool like Intel PIN, if you're on x86. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
perf stat counts for the instructions:u hardware event should also be more or less exact, and are in practice very repeatable across runs of the same program doing the same work.
On recent Intel CPUs, there's HW support for recording which way conditional / indirect branches went, so you can reconstruct exactly which instructions ran in which order, assuming no self-modifying code and that you can still read any JIT buffers. This is Intel PT (Processor Trace).
Sorry, I don't know what the equivalents are on AMD CPUs.
How to wait for a number of threads to complete?
You put all threads in an array, start them all, and then have a loop:
for (int i = 0; i < threads.length; i++)
    threads[i].join();
Each join will block until the respective thread has completed. Threads may complete in a different order than the one in which you join them, but that's not a problem: when the loop exits, all threads are completed.
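The loop above uses an array with a length field, as in Java. The same pattern in C++, matching the std::thread example at the top of this page, might look like this. It is a minimal sketch: run_and_join and the trivial worker body are made up for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Start `n` worker threads, wait for all of them, and return how many finished.
int run_and_join(int n) {
    if (n < 0) n = 0;
    std::atomic<int> finished{0};
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; i++)
        threads.emplace_back([&finished] { finished.fetch_add(1); });
    // join() blocks until the respective thread has completed. The threads may
    // finish in any order; once the loop exits, all of them are done.
    for (std::thread& t : threads)
        t.join();
    return finished.load();
}
```

With GCC or Clang on Linux this typically needs the -pthread flag when compiling and linking.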
Profile a process via its child and kill the child afterwards
You should send a SIGINT instead of a SIGKILL in order to allow perf to shut down cleanly and produce a valid output file. The synchronization between the perf child process and the main process will still be imperfect, so if the main process doesn't take significant time, as in your example, it is easily possible that no output file is generated at all. This also affects the accuracy of collected data. With the setup of using perf as a child process rather than vice-versa, you cannot really improve it.