Intel-PT does not record any packets when KVM-QEMU is on
Since I now have a clear idea of why Intel-PT does not work with QEMU-KVM on, I will post an answer.
As I mentioned in the question, the main reason this does not work is that bit 14 of the IA32_VMX_MISC MSR (MSR_IA32_VMX_MISC) is 0 on my processor. As per the Intel documentation, this bit must be 1 for Intel-PT to be usable in VMX root operation (between VMXON and VMXOFF).
The main problem is that when this bit is 0, a VMXON instruction clears the TraceEn field of the IA32_RTIT_CTL MSR. This field controls the tracing operation: once it is reset, no trace data is written to the buffer. The reset is performed at the hardware level.
To use Intel-PT together with VMX, you need at least a Skylake processor. I was using a Broadwell system, which, as it looks now, will not work.
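The bit check described above can be sketched in Python. This is a minimal sketch, not a definitive tool: the function names are hypothetical, the MSR address 0x485 for IA32_VMX_MISC is taken from the Intel SDM, and reading `/dev/cpu/0/msr` assumes a Linux host with the `msr` kernel module loaded and root privileges.

```python
import os
import struct

IA32_VMX_MISC = 0x485   # MSR address (Intel SDM)
PT_IN_VMX_BIT = 14      # bit discussed in the answer above


def pt_allowed_in_vmx(msr_value: int) -> bool:
    """Return True if bit 14 of an IA32_VMX_MISC value is set."""
    return bool((msr_value >> PT_IN_VMX_BIT) & 1)


def read_msr(address: int, cpu: int = 0) -> int:
    """Read a 64-bit MSR through the Linux msr driver (hypothetical helper)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    try:
        value = read_msr(IA32_VMX_MISC)
        print("Intel PT usable in VMX operation:", pt_allowed_in_vmx(value))
    except OSError as err:
        print("Could not read the MSR (need root and the msr module?):", err)
```

If the result is False, the behaviour described above applies: VMXON clears TraceEn and no packets are recorded while KVM is active.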
What is the sampling rate for intel_pt event i.e., perf record -e intel_pt//?
Yes, the intel_pt mode of `perf record` is different: it is not the sampling (statistical) profiling done with software (cpu-clock) or hardware (cycles) events. Sampling records about 4000 samples of the current instruction pointer (EIP) per second and gives you a basic, inexact view of code execution. intel_pt is a hardware-based tracing technique that generates a lot of data about every control flow instruction (in the default perf intel_pt mode), allowing the full control flow to be reconstructed, but it has a bigger overhead. So the "frequency" of Intel PT is the same as the number of calls, branches and returns executed per second by the program code (hundreds of millions).
With sampling on hardware events, `perf record` asks the hardware PMU to count some event, such as CPU cycles, and to generate an overflow interrupt after, for example, 2 million such events. On each interrupt the perf_events subsystem in the kernel records the current OS timestamp, the pid/tid of the current thread and the EIP instruction pointer into a ring buffer, and resets the PMU counter to a new value. The perf subsystem limits the maximum interrupt frequency by autotuning that value, and the `-F` option can be used to change the desired frequency of interrupts. When the ring buffer (around several megabytes in size) fills up, the `perf` user-space tool dumps its contents into the `perf.data` file, and you can view the raw data with `perf script` or `perf script -D`, or build histograms with `perf report` (which sorts EIPs by how often an interrupt landed on that instruction address, which is proportional to the time taken by that code). This mode produces around 4 thousand events per second of thread execution (`perf report --header | grep sample_freq`), at 48 bytes per sample, or 192 kilobytes per second. The overhead is low, but the sampling is not exact.
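The data-rate figure above is simple arithmetic, shown here as a short Python sketch. The function name is hypothetical; the two inputs (4000 samples per second, 48 bytes per sample) are the numbers quoted in the text.

```python
def sampling_bytes_per_second(samples_per_second: int, bytes_per_sample: int) -> int:
    """Data volume written by sampling mode, per second of thread execution."""
    return samples_per_second * bytes_per_sample


rate = sampling_bytes_per_second(4000, 48)
print(f"{rate} bytes/s = {rate // 1000} kB/s")
```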
The perf wiki has a separate page for Intel Processor Trace (intel_pt): https://perf.wiki.kernel.org/index.php/Perf_tools_support_for_Intel%C2%AE_Processor_Trace
Control flow tracing is different from other kinds of performance analysis and debugging. It provides fine-grained information on branches taken in a program, but that means there can be a vast amount of trace data. Such an enormous amount of trace data creates a number of challenges, but it raises the central question: how to reduce the amount of trace data that needs to be captured. That inverts the way performance analysis is normally done. Instead of taking a test case and creating a trace of it, you need first to create a test case that is suitable for tracing.
So, intel_pt is a tracing (logging) module integrated into the CPU hardware, and when armed it will generate "hundreds of megabytes of trace data per CPU per second", depending on the settings used. With some settings it may even generate trace data (the packet log) faster than it can be written to disk or even to RAM ("overflow packets"). According to the article at https://lwn.net/Articles/648154/, perf_events (kernel mode) in intel_pt mode just saves the full packet log into a separate (bigger?) ring buffer, and the perf tool (user space) periodically saves data from that ring buffer into a file for offline filtering, parsing and decoding. (The period at which the aux or ring mmap is saved to the file is not the same thing as the overflow interrupt frequency option `-F`.) A PT decoder is then used to reconstruct perf-compatible samples from the PT packet log. The log data volume is huge, and the overhead is 1%, 5%, 10% or more, depending on the branch frequency of the executed code.
The documentation for intel_pt is the manpage `man perf-intel-pt`, and the long text stored inside the Linux kernel source code at https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-intel-pt.txt
Intel PT is first supported in Intel Core M and 5th generation Intel Core
processors that are based on the Intel micro-architecture code name Broadwell.
Trace data is collected by 'perf record' and stored within the perf.data file. ... Trace data must be 'decoded' which involves walking the object code and matching the trace data packets. ... Decoding is done on-the-fly. The decoder outputs samples in the same format as
samples output by perf hardware events, for example as though the "instructions"
or "branches" events had been recorded. Presently 3 tools support this:
'perf script', 'perf report' and 'perf inject'. ... The main distinguishing feature of Intel PT is that the decoder can determine
the exact flow of software execution. Intel PT can be used to understand why
and how did software get to a certain point, or behave a certain way. ... A limitation of Intel PT is that it produces huge amounts of trace data
(hundreds of megabytes per second per core) which takes a long time to decode
By default `perf record -e intel_pt//` is the same as `-e intel_pt/tsc=1,noretcomp=0/`. The "config terms" section of the manpage `man perf-intel-pt` lists the default settings:
tsc
Always supported.
Produces TSC timestamp packets to provide timing information. In some cases it is possible to decode without timing information, for example a per-thread context that does not overlap executable memory maps.
noretcomp
Always supported. Disables "return compression" so a TIP
packet is produced when a function returns. Causes more packets to be
produced but might make decoding more reliable.
pt
Specifies pass-through which enables the branch config term.
branch
Enable branch tracing. Branch tracing is enabled by default.
To represent software control flow, "branches" samples are produced.
By default a branch sample is synthesized for every single branch.
As it says, intel_pt in default mode is used to produce a control flow log: it asks the hardware to generate log packets for every control flow instruction (call, branch, return) and adds timestamps so the PT log can be synchronized with the auxiliary perf samples (such as exec or mmap, which are needed to find the actual code loaded into memory). It tries not to generate too much data, for example a [single bit is used per conditional branch (tnt)](https://conference.hitb.org/hitbsecconf2017ams/materials/D1T1 - Richard Johnson - Harnessing Intel Processor Trace on Windows for Vulnerability Discovery.pdf#page=12) and several bytes per indirect branch, but many programs execute hundreds of millions of branches per second.
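To get a feel for the resulting data volume, here is a rough back-of-the-envelope estimate in Python. All inputs are assumptions chosen for illustration (200 million conditional branches and 20 million indirect branches per second, 3 bytes per indirect branch), and the function name is hypothetical. It counts one bit per conditional branch and ignores packet headers, timing packets and PSB packets, so it is a lower bound rather than a measurement.

```python
def pt_trace_bytes_per_second(cond_branches_per_s: int,
                              indirect_branches_per_s: int,
                              bytes_per_indirect: int = 3) -> float:
    """Lower-bound estimate of Intel PT trace volume (illustrative only)."""
    tnt_bytes = cond_branches_per_s / 8            # one TNT bit per conditional branch
    tip_bytes = indirect_branches_per_s * bytes_per_indirect  # assumed TIP payload size
    return tnt_bytes + tip_bytes


# Assumed workload: 200M conditional and 20M indirect branches per second.
estimate = pt_trace_bytes_per_second(200_000_000, 20_000_000)
print(f"about {estimate / 1e6:.0f} MB/s per core")
```

Even with these modest assumptions the estimate is several hundred times larger than the roughly 192 kB/s of sampling mode.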
Some useful and short slides on perf + intel_pt:
- Andi Kleen, 2015 https://halobates.de/pt-tracing-summit15.pdf (PT modes current: Full trace mode, Snapshot mode; Upcoming: Sampling mode, Core dump, System crash mode)
- Andi Kleen's posts on PT: https://halobates.de/blog/p/category/pt
- Suchakrapani Datt Sharma, POLYTECHNIQUE MONTREAL, 2015 https://hsdm.dorsal.polymtl.ca/system/files/10Dec2015_0.pdf (trace packets overview - PSB (Packet Stream Boundary), TNT (Taken Not-Taken), TIP (Target IP) at branches, non-default CYC Packets : Cycle counter data for IPC, MTC (Mini Timestamp Counter), ...)
- Jack Henschel, 2017 about design and use-cases https://blog.cubieserver.de/publications/Henschel_Intel-PT_2017.pdf
- Alexander Shishkin, Intel, 2013, "Efficient and Large Scale Program Flow Tracing in Linux" https://events.static.linuxfound.org/sites/events/files/slides/lcna13_kleen.pdf ("What is it good for? •Profiling / performance measurement •Functional debugging •Code coverage analysis")
- About generic difference between sampling and (software) tracing: https://danluu.com/perf-tracing/
Update: While the Intel PT trace log holds the full trace (there are packets for every branch/call/return), `perf report` converts the PT log into a sample set like the one in a classic perf.data, and that sample set has a sampling rate. This is configured with the `--itrace` option of `perf report` (iNNTT, where NN is the amount and TT is the unit: i/t/ms/us/ns), as described in the perf-report man page:
--itrace
Options for decoding instruction tracing data. The options are:
i synthesize instructions events
g synthesize a call chain (use with i or x)
The default is all events i.e. the same as --itrace=ibxwpe,
In addition, the period (default 100000, ...)
for instructions events can be specified in units of:
i instructions
t ticks
ms milliseconds
us microseconds
ns nanoseconds (default)
So it seems that by default `perf report` converts the full trace log into instruction samples with a period of 100000; with the default unit of nanoseconds, that is one synthesized sample per 100 microseconds of trace. The period can be made smaller to get a higher sampling rate, but processing time will increase.
The perf-intel-pt manpage gives more examples of `--itrace` usage:
Because samples are synthesized after-the-fact, the sampling period
can be selected for reporting. e.g. sample every microsecond
sudo perf report pt_ls --itrace=i1usge
See the sections below for more information about the --itrace
option.
Beware the smaller the period, the more samples that are produced,
and the longer it takes to process them.
Also note that the coarseness of Intel PT timing information will
start to distort the statistical value of the sampling as the
sampling period becomes smaller.
To see every possible IPC value, "instructions" events can be used
e.g. --itrace=i0ns
--itrace=i10us
sets the period to 10us i.e. one instruction sample is synthesized
for each 10 microseconds of trace. Alternatives to "us" are "ms"
(milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
(instructions).
For Intel PT, the default period is 100us.
Setting it to a zero period means "as often as possible".
In the case of Intel PT that is the same as a period of 1 and a unit
of instructions (i.e. --itrace=i1i).
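The effect of the period on the amount of synthesized data can be illustrated with a small Python sketch. The helper below is hypothetical and handles only the time-based units (ms, us, ns); it assumes roughly one instruction sample per period, so it gives an approximate count, not what the decoder will emit exactly.

```python
# Nanoseconds per time unit accepted by --itrace (time-based units only).
UNIT_NS = {"ms": 1_000_000, "us": 1_000, "ns": 1}


def approx_synthesized_samples(trace_duration_ns: int, period: int, unit: str = "ns") -> int:
    """Approximate number of instruction samples for a time-based itrace period."""
    period_ns = period * UNIT_NS[unit]
    if period_ns <= 0:
        raise ValueError("a zero period means 'as often as possible'; not modelled here")
    return trace_duration_ns // period_ns


one_second = 1_000_000_000
print(approx_synthesized_samples(one_second, 100, "us"))  # default period, 100us
print(approx_synthesized_samples(one_second, 1, "us"))    # as in --itrace=i1usge
```

Going from the default 100us to 1us multiplies the number of samples to process by 100, which is why the manpage warns about processing time.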
http://halobates.de/blog/p/410 has some additional examples of complex conversions:
perf script --ns --itrace=cr
Record program execution and display function call graph.
perf script by default “samples” the data (only dumps a sample every
100us). This can be configured using the --itrace option (see
reference below)
perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64
Show every assembly instruction executed with disassembler.
perf report --itrace=g32l64i100us --branch-history
Print hot paths every 100us as call graph histograms
perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg
Generate flame graph from execution, sampled every 100us
Intel VT-x: How NMI is delivered to guest OS
This is a partial answer to the question. I can describe what the processor does when an NMI occurs, but I don't know what KVM does.
If the NMI Exiting control is 0 and an NMI arrives while in VMX-non-root mode, the NMI is delivered to the guest via the guest's IDT.
If the NMI Exiting control is 1, an NMI causes a VM exit. [Intel SDM, volume 3, section 24.6.1, table 24-5]
Probably KVM sets this control to 1. In this case, the processor does not automatically process the NMI. It is up to KVM how to handle it when the VM exit occurs. It may deliver the NMI to the host through the host IDT or it may inject it into the guest.
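The processor-side behaviour described above reduces to a single decision, sketched here in Python. The function name and the returned labels are hypothetical; what the VMM does after the VM exit (deliver to the host or inject into the guest) is software policy and is deliberately not modelled.

```python
def nmi_outcome_in_guest(nmi_exiting: bool) -> str:
    """What the CPU does with an NMI that arrives in VMX non-root mode."""
    if nmi_exiting:
        # NMI Exiting = 1: the NMI causes a VM exit; the VMM decides what happens next.
        return "vm_exit"
    # NMI Exiting = 0: the NMI is delivered through the guest's IDT.
    return "guest_idt"


for control in (False, True):
    print(f"NMI Exiting={int(control)} -> {nmi_outcome_in_guest(control)}")
```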
libvirtd with qemu: ryzen cpu emulation on intel host?
Ryzen is in the same family as EPYC, so you want CPU model name "EPYC" / "EPYC-IBPB" - see also https://www.qemu.org/docs/master/system/target-i386.html#recommendations-for-kvm-cpu-model-configuration-on-x86-hosts
That said, if the VM is running on a host with Intel CPUs, you will not be able to pick an EPYC CPU model for it, because that model cannot run on an Intel host due to mismatched features.
How and when host CPU state is saved in the VMCS host-state area?
The CPU never saves the host state.
The VMM (aka the hypervisor) controls when to execute `vmlaunch`/`vmresume` and can thus set the host state area accordingly before their execution.
When a VM-entry fails due to an invalid VMCS, execution falls through to the next instruction after `vmlaunch`/`vmresume`.
When the VM-entry fails due to an invalid guest state, execution resumes from the RIP set in the host state area (just as if a VM-exit had occurred).
If the CPU were to set the host state area, the two cases would be identical.
This is also why the CPU checks the host state area before entering VMX non-root mode (i.e. launching a VM).