Read and Parse perf.data

perf.data to text or CSV

Suppose you are using perf record -e <event-name> ... to sample events (say, every 1 ms), and now want to read the perf.data file and organize it into human-readable data. The tool for this, if you are not already aware of it, is perf report: it reads the perf.data file and generates a concise execution profile. The link below should help you -

Sample analysis with perf report

You can tailor the perf report output to your requirements. In particular, perf report -F lets you choose which columns are printed, and -t lets you set a field separator, which together give CSV-style output.
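For example, something like this prints comma-separated columns (the exact field names available depend on your perf version; overhead, comm and dso are common ones):

perf report --stdio -F overhead,comm,dso -t ,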

In addition, perf stat has a built-in mechanism to emit its counts in CSV format via the -x option.
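For example (the event names, output file name and workload here are just placeholders):

perf stat -x, -e cache-misses,instructions -o stats.csv -- ./your_program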

Edit #1:

(I am using Linux-Kernel 4.14.3 for evaluation.)

Since you want the number of events per sample taken, there are a couple of things to note. To count the number of events per sample, you need to know the sampling period: the number of events after which the performance counter overflows and the kernel records a sample. So essentially, in your case,

sampling period = number of events per sample

Now there are two cases for this sampling period: either you specify it yourself, or you leave it to the kernel.

If, while doing a perf record, you specify the sampling period yourself, something like this:

perf record -e <some_event_name> -c 1000 ...

Here -c 1000 means that the sampling period is 1000. In this case you deliberately force the kernel to count 1000 events per sample, because the sampling period is fixed by you.

On the other hand, if you do not specify the sampling period, the kernel will try to record samples at a default target frequency of 1000 samples/sec. This means that the kernel will automatically adjust the sampling period, if need be, to maintain that frequency of 1000 samples/sec.
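You can also set this target frequency explicitly if you want (the event name here is just an example):

perf record -e cache-misses -F 1000 ...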

In such a case, to determine the actual sampling period of each sample, you need to inspect the perf.data file. Specifically, dump its contents using the command:

perf script -D

The output will look something like this:

0 0 0x108 [0x38]: PERF_RECORD_FORK(1:1):(0:0)

0x140 [0x30]: event: 3
.
. ... raw event: size 48 bytes
. 0000: 03 00 00 00 00 00 30 00 01 00 00 00 01 00 00 00 ......0.........
. 0010: 73 79 73 74 65 6d 64 00 00 00 00 00 00 00 00 00 systemd.........
. 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

0 0 0x140 [0x30]: PERF_RECORD_COMM: systemd:1/1

0x170 [0x38]: event: 7
.
. ... raw event: size 56 bytes
. 0000: 07 00 00 00 00 00 38 00 02 00 00 00 00 00 00 00 ......8.........
. 0010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
. 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
. 0030: 00 00 00 00 00 00 00 00 ........

You can see different types of records, like PERF_RECORD_FORK, PERF_RECORD_COMM and PERF_RECORD_MMAP. The records you need to look out for are the ones marked PERF_RECORD_SAMPLE, like this:

14 173826354106096 0x10d40 [0x28]: PERF_RECORD_SAMPLE(IP, 0x1): 28179/28179: 0xffffffffa500d3b5 period: 3000 addr: 0
... thread: perf:28179
...... dso: [kernel.kallsyms]
perf 28179 [014] 173826.354106: cache-misses: ffffffffa500d3b5 [unknown] ([kernel.kallsyms])

As you can see, in this case the period is 3000, i.e. the number of events collected between the previous sample and this one is 3000 (in other words, 3000 events per sample). Note that, as mentioned above, this period may be re-tuned by the kernel from sample to sample, so you need to collect all of the PERF_RECORD_SAMPLE records from the perf.data file.
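A minimal way to pull all the periods out, assuming the "period: N" text appears in every PERF_RECORD_SAMPLE line exactly as shown above:

perf script -D | awk '/PERF_RECORD_SAMPLE/ { for (i = 1; i <= NF; i++) if ($i == "period:") print $(i + 1) }'

Summing that column then gives you the total number of events recorded.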

How to analyze perf.data from perf sched record

perf sched record is a special variant of perf record, and I think it is more correct to use the perf sched subcommands to analyze the resulting perf.data file.

There is a man page for perf sched: http://man7.org/linux/man-pages/man1/perf-sched.1.html

    'perf sched record <command>' to record the scheduling events
    of an arbitrary workload.

    'perf sched latency' to report the per task scheduling latencies
    and other scheduling properties of the workload.

    'perf sched script' to see a detailed trace of the workload that
    was recorded...

    'perf sched replay' to simulate the workload that was recorded
    via perf sched record. (this is done by starting up mockup threads
    that mimic the workload based on the events in the trace. These
    threads can then replay the timings (CPU runtime and sleep patterns)
    of the workload as it occurred when it was recorded - and can repeat
    it a number of times, measuring its performance.)

    'perf sched map' to print a textual context-switching outline of
    workload captured via perf sched record. Columns stand for
    individual CPUs, and the two-letter shortcuts stand for tasks that
    are running on a CPU. A '*' denotes the CPU that had the event, and
    a dot signals an idle CPU.

There is also a 2009 mailing-list announcement which describes the perf sched functionality: https://lwn.net/Articles/353295/ "[Announce] 'perf sched': Utility to capture, measure and analyze scheduler latencies and behavior". Note that the recommended consumer of perf sched record results is perf sched latency, not perf report:

... experimental version of a utility
that tries to meet this ambitious goal: the new 'perf sched' family of
tools that uses performance events to objectively characterise arbitrary
workloads from a scheduling and latency point of view.

'perf sched' has five sub-commands currently:

  perf sched record            # low-overhead recording of arbitrary workloads
  perf sched latency           # output per task latency metrics
  perf sched map               # show summary/map of context-switching
  perf sched trace             # output finegrained trace
  perf sched replay            # replay a captured workload using simulated threads

Desktop users would generally use 'perf sched record' to capture a trace
(which creates perf.data), and 'perf sched latency' tool to check
latencies (which analyzes the trace in perf.data). The other tools too
use an already recorded perf.data.
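For example, a typical session might look like this (sleep 10 stands in for your real workload):

perf sched record -- sleep 10
perf sched latency
perf sched map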

How do I read and parse a text file with numbers, fast (in C)?

Some suggestions:

a) Consider converting (or pre-processing) the file into a binary format, with the aim of minimising the file size and drastically reducing the cost of parsing. I don't know the ranges of your values, but various techniques (e.g. using one bit to tell whether the number is small or large, and storing the number as either a 7-bit or a 31-bit integer) could halve the file IO (and double the speed of reading the file from disk) and slash parsing costs down to almost nothing. Note: for maximum effect you'd modify whatever software created the file in the first place. A rough sketch of this is below.
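A sketch of the flag-bit encoding, assuming all values fit in 31 bits (write_packed and read_packed are hypothetical names, and EOF handling is omitted for brevity):

#include <stdint.h>
#include <stdio.h>

/* Small values (< 128) take 1 byte with the flag bit clear; larger
 * values take 4 bytes with the flag bit of the first byte set. */
static void write_packed(FILE *f, uint32_t v)
{
    if (v < 0x80) {
        fputc((int)v, f);
    } else {
        fputc((int)(0x80 | ((v >> 24) & 0x7F)), f);
        fputc((int)((v >> 16) & 0xFF), f);
        fputc((int)((v >> 8) & 0xFF), f);
        fputc((int)(v & 0xFF), f);
    }
}

static uint32_t read_packed(FILE *f)
{
    int c = fgetc(f);                    /* EOF check omitted for brevity */
    uint32_t v;
    if (!(c & 0x80))
        return (uint32_t)c;              /* 1-byte, 7-bit value */
    v = (uint32_t)(c & 0x7F) << 24;      /* 4-byte, 31-bit value */
    v |= (uint32_t)fgetc(f) << 16;
    v |= (uint32_t)fgetc(f) << 8;
    v |= (uint32_t)fgetc(f);
    return v;
}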

b) Reading the entire file into memory before you parse it is a mistake. It doubles the amount of RAM required (and the cost of allocating/freeing) and has disadvantages for CPU caches. Instead read a small amount of the file (e.g. 16 KiB) and process it, then read the next piece and process it, and so on; so that you're constantly reusing the same small buffer memory.
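A sketch of that pattern (parse_chunk is a hypothetical callback standing in for your parser; note that a number can straddle a chunk boundary, which your parser has to handle):

#include <stdio.h>

#define CHUNK_SIZE (16 * 1024)

void parse_chunk(const char *buf, size_t len);   /* your parser */

static void read_in_chunks(const char *path)
{
    static char buf[CHUNK_SIZE];
    FILE *f = fopen(path, "rb");
    size_t n;
    if (!f)
        return;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        parse_chunk(buf, n);             /* reuse the same small buffer */
    fclose(f);
}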

c) Use parallelism for file IO. It shouldn't be hard to read the next piece of the file while you're processing the previous piece of the file (either by using 2 threads or by using asynchronous IO).
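A double-buffering sketch with 2 threads, reusing CHUNK_SIZE and parse_chunk from the previous sketch (compile with -pthread; error handling omitted):

#include <pthread.h>
#include <stdio.h>

struct read_job { FILE *f; char *buf; size_t len; };

static void *read_next(void *arg)
{
    struct read_job *j = arg;
    j->len = fread(j->buf, 1, CHUNK_SIZE, j->f);
    return NULL;
}

static void overlapped_read(FILE *f)
{
    static char bufs[2][CHUNK_SIZE];
    char *cur = bufs[0], *next = bufs[1];
    size_t len = fread(cur, 1, CHUNK_SIZE, f);

    while (len > 0) {
        pthread_t t;
        struct read_job j = { f, next, 0 };
        pthread_create(&t, NULL, read_next, &j); /* read the next piece... */
        parse_chunk(cur, len);                   /* ...while parsing this one */
        pthread_join(&t, NULL);
        len = j.len;
        char *tmp = cur; cur = next; next = tmp; /* swap buffers */
    }
}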

d) Pre-allocate memory for the "neighbour" structures and remove most/all malloc() calls from your loop. The best possible case is to use a statically allocated array as a pool - e.g. Neighbor myPool[MAX_NEIGHBORS]; where malloc() can be replaced with &myPool[nextEntry++];. This reduces/removes the overhead of malloc() while also improving cache locality for the data itself.
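In code, the pool idea looks something like this (the Neighbor layout and MAX_NEIGHBORS value are hypothetical; a real version should bounds-check nextEntry):

#include <stddef.h>

#define MAX_NEIGHBORS 10000000

typedef struct Neighbor {
    int id;
    struct Neighbor *next;               /* hypothetical layout */
} Neighbor;

static Neighbor myPool[MAX_NEIGHBORS];   /* one big static pool */
static size_t nextEntry = 0;

static Neighbor *alloc_neighbor(void)
{
    return &myPool[nextEntry++];         /* replaces malloc(sizeof(Neighbor)) */
}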

e) Use parallelism for storing values. For example, you could have multiple threads where the first thread handles all the cases where root_level % NUM_THREADS == 0, the second thread handles all cases where root_level % NUM_THREADS == 1, etc.
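A sketch of that sharding scheme, where each thread only stores the records it owns so no locking is needed (record_t, records, num_records and store() are hypothetical stand-ins for your data structures):

#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4

typedef struct { int root_level; /* ... */ } record_t;
extern record_t records[];
extern size_t num_records;
void store(const record_t *r);

static void *store_worker(void *arg)
{
    long tid = (long)arg;                /* thread index 0..NUM_THREADS-1 */
    for (size_t i = 0; i < num_records; i++)
        if (records[i].root_level % NUM_THREADS == tid)
            store(&records[i]);
    return NULL;
}

You would launch NUM_THREADS of these with pthread_create(&t[i], NULL, store_worker, (void *)(long)i) and join them all at the end.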

With all of the above (assuming a modern 4-core CPU), I think you can get the total time (for reading and storing) down to less than 15 seconds.


