LTTng/Perf: Difference between events used for exiting (sched_process_exit) and freeing (sched_process_free) a process
I found some time to edit my answer and make it clearer. If anything is still unclear, please tell me and we can discuss it. Let's dive into how a task ends: there are two system calls, exit_group() and exit(), and both of them end up in do_exit(), which does the following things.
- set PF_EXITING, which means the task is being deleted
- remove the task descriptor from the timer with del_timer_sync()
- call exit_mm(), exit_sem(), __exit_fs() and others to release the structures of that task
- call perf_event_exit_task(tsk)
- decrease the ref count
- set exit_code to the value passed to _exit()/exit_group(), or to an error code
- call exit_notify()
  - update the relationships with the parent and the children
  - check exit_signal and send SIGCHLD
  - if the task is not traced and exit_signal is -1, set exit_state to EXIT_DEAD and call release_task() to recycle the remaining memory and decrease the ref count
  - otherwise (for example, if the task is traced), set exit_state to EXIT_ZOMBIE
  - set the task flag to PF_DEAD
- call schedule()
We need the zombie state because the parent may still need information kept in the child's task descriptor (for example, its exit status), so we cannot delete everything right away. The parent task has to use something like wait() to check whether the child is dead. After wait(), it is time for the zombie to be released completely by release_task(), which does the following:
- decrease the owner's task count
- if the task is traced, delete it from the ptrace_children list
- call __exit_signal() to delete all pending signals and release the signal_struct descriptor, and exit_itimers() to delete all the timers
- call __exit_sighand() to delete the signal handlers
- call __unhash_process()
  - decrement nr_threads
  - call detach_pid() to delete the task descriptor from PIDTYPE_PID and PIDTYPE_TGID
  - call REMOVE_LINKS to delete the task from the list
- call sched_exit() to adjust the parent's time slice
- call put_task_struct() to decrease the counter and release the memory and the task descriptor
- call delayed_put_task_struct()
So we know that the sched_process_exit event is generated in do_exit(), but at that point we cannot be sure whether the process has been released yet (release_task(), which triggers sched_process_free, may or may not have been called). That is why we need both perf event points.
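The gap between the two events can be observed from user space. The following is a minimal Python sketch, assuming a Linux system with /proc mounted; the helper names read_state and observe_zombie are made up for this illustration. The child exits (so do_exit() runs), the parent sees it in state Z until it calls waitpid(), and only the reaping leads to release_task().

```python
import os
import time


def read_state(pid):
    """Return the one-letter state field from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def observe_zombie(timeout=2.0):
    """Fork a child that exits at once; return (state before wait, exit code)."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: the exit path runs, the task becomes a zombie

    # Parent: poll until the child shows up as a zombie (or we give up).
    deadline = time.monotonic() + timeout
    state = read_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = read_state(pid)

    # Reaping the child is what allows the kernel to release the task.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(observe_zombie())
```

Between the moment the child shows state Z and the waitpid() call, the exit has already happened but the task descriptor has not been freed yet, which is the window separating the two events.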
Which perf events can use PEBS?
There is a hack to support cycles:p on SandyBridge, which has no PEBS for CPU_CLK_UNHALTED.*. The hack is implemented in the kernel part of perf, in intel_pebs_aliases_snb(). When the user requests -e cycles, which is PERF_COUNT_HW_CPU_CYCLES (translates to CPU_CLK_UNHALTED.CORE), with a nonzero precise modifier, this function changes the hardware event to UOPS_RETIRED.ALL with PEBS:
    [PERF_COUNT_HW_CPU_CYCLES] = 0x003c,

    static void intel_pebs_aliases_snb(struct perf_event *event)
    {
        if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) {
            /*
             * Use an alternative encoding for CPU_CLK_UNHALTED.THREAD_P
             * (0x003c) so that we can use it with PEBS.
             *
             * The regular CPU_CLK_UNHALTED.THREAD_P event (0x003c) isn't
             * PEBS capable. However we can use UOPS_RETIRED.ALL
             * (0x01c2), which is a PEBS capable event, to get the same
             * count.
             *
             * UOPS_RETIRED.ALL counts the number of cycles that retires
             * CNTMASK micro-ops. By setting CNTMASK to a value (16)
             * larger than the maximum number of micro-ops that can be
             * retired per cycle (4) and then inverting the condition, we
             * count all cycles that retire 16 or less micro-ops, which
             * is every cycle.
             *
             * Thereby we gain a PEBS capable cycle counter.
             */
            u64 alt_config = X86_CONFIG(.event=0xc2, .umask=0x01, .inv=1, .cmask=16);

            alt_config |= (event->hw.config & ~X86_RAW_EVENT_MASK);
            event->hw.config = alt_config;
        }
    }
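To make the encoding concrete, here is a small Python sketch of the bit arithmetic behind that alias. It is a model for illustration, not kernel code. It assumes the usual Intel event-select layout (event in bits 0-7, umask in bits 8-15, edge in bit 18, inv in bit 23, cmask in bits 24-31) and assumes that X86_RAW_EVENT_MASK covers exactly those fields. The function names are made up.

```python
def x86_config(event, umask=0, edge=0, inv=0, cmask=0):
    """Pack the event-select fields into one raw config value (assumed layout)."""
    return (
        (event & 0xFF)
        | ((umask & 0xFF) << 8)
        | ((edge & 1) << 18)
        | ((inv & 1) << 23)
        | ((cmask & 0xFF) << 24)
    )


# Assumption: the raw event mask is the union of all fields packed above.
RAW_EVENT_MASK = x86_config(0xFF, 0xFF, 1, 1, 0xFF)


def pebs_alias_snb(config):
    """Model of the alias: swap plain cycles (0x003c) for the PEBS-capable form."""
    if config & RAW_EVENT_MASK == 0x003C:
        alt = x86_config(event=0xC2, umask=0x01, inv=1, cmask=16)
        # Keep every bit outside the raw event fields (e.g. USR/OS flags).
        return alt | (config & ~RAW_EVENT_MASK)
    return config


if __name__ == "__main__":
    print(hex(pebs_alias_snb(0x003C)))  # 0x108001c2
```

Only the fields inside the mask are replaced, so privilege-level and other flag bits of the original request survive the substitution.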
The intel_pebs_aliases_snb hack is registered in __init int intel_pmu_init(void) for case INTEL_FAM6_SANDYBRIDGE: / case INTEL_FAM6_SANDYBRIDGE_X: as
    x86_pmu.event_constraints = intel_snb_event_constraints;
    x86_pmu.pebs_constraints = intel_snb_pebs_event_constraints;
    x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
pebs_aliases is called from intel_pmu_hw_config() when precise_ip is set to non-zero:
    static int intel_pmu_hw_config(struct perf_event *event)
    {
        ...
        if (event->attr.precise_ip) {
            ...
            if (x86_pmu.pebs_aliases)
                x86_pmu.pebs_aliases(event);
        }
The hack was implemented in 2012; see the lkml threads "[PATCH] perf, x86: Make cycles:p working on SNB" and "[tip:perf/core] perf/x86: Implement cycles:p for SNB/IVB", commit cccb9ba9e4ee0d750265f53de9258df69655c40b, http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=cccb9ba9e4ee0d750265f53de9258df69655c40b:
perf/x86: Implement cycles:p for SNB/IVB
Now that there's finally a chip with working PEBS (IvyBridge), we can
enable the hardware and implement cycles:p for SNB/IVB.
I think there is no full list of such "precise" conversion hacks other than the Linux source code itself: in arch/x86/events/intel/core.c, grep for static void intel_pebs_aliases (usually cycles:p / CPU_CLK_UNHALTED 0x003c is what gets aliased) and check intel_pmu_init for the actual model and the exact x86_pmu.pebs_aliases variant selected:
- intel_pebs_aliases_core2: INST_RETIRED.ANY_P (0x00c0) with CNTMASK=16 instead of cycles:p
- intel_pebs_aliases_snb: UOPS_RETIRED.ALL (0x01c2) with CNTMASK=16 instead of cycles:p
- intel_pebs_aliases_precdist, for the highest values of precise_ip: INST_RETIRED.PREC_DIST (0x01c0) instead of cycles:ppp on SKL, IVB, HSW, BDW
What is the meaning of Perf events: dTLB-loads and dTLB-stores?
When virtual memory is enabled, the virtual address of every single memory access needs to be looked up in the TLB to obtain the corresponding physical address and determine access permissions and privileges (or raise an exception in case of an invalid mapping). The dTLB-loads and dTLB-stores events represent a TLB lookup for a data memory load or store access, respectively. This is the perf definition of these events, but the exact meaning depends on the microarchitecture.
On Westmere, Skylake, Kaby Lake, Coffee Lake, Cannon Lake (and probably Ice Lake), dTLB-loads and dTLB-stores are mapped to MEM_INST_RETIRED.ALL_LOADS and MEM_INST_RETIRED.ALL_STORES, respectively. On Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Goldmont, Goldmont Plus, they are mapped to MEM_UOP_RETIRED.ALL_LOADS and MEM_UOP_RETIRED.ALL_STORES, respectively. On Core2, Nehalem, Bonnell, Saltwell, they are mapped to L1D_CACHE_LD.MESI and L1D_CACHE_ST.MESI, respectively. (Note that on Bonnell and Saltwell, the official names of the events are L1D_CACHE.LD and L1D_CACHE.ST, and the event codes used by perf are only documented in the Intel manual Volume 3 and not in other Intel sources on performance events.) The dTLB-loads and dTLB-stores events are not supported on Silvermont and Airmont.
On all current AMD processors, dTLB-loads is mapped to LsDcAccesses and dTLB-stores is not supported. However, LsDcAccesses counts TLB lookups for both loads and stores. On processors from other vendors, dTLB-loads and dTLB-stores are not supported.
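The mapping described above can be written down as a small lookup table. This Python sketch only restates what the text says; the table name, the function name and the use of None for "not supported" are choices made for this illustration, and Ice Lake is left out because the text only says "probably".

```python
# (dTLB-loads, dTLB-stores) native events per microarchitecture, as listed above.
# None means the perf event is not supported there.
_GROUPS = {
    ("Westmere", "Skylake", "Kaby Lake", "Coffee Lake", "Cannon Lake"):
        ("MEM_INST_RETIRED.ALL_LOADS", "MEM_INST_RETIRED.ALL_STORES"),
    ("Sandy Bridge", "Ivy Bridge", "Haswell", "Broadwell",
     "Goldmont", "Goldmont Plus"):
        ("MEM_UOP_RETIRED.ALL_LOADS", "MEM_UOP_RETIRED.ALL_STORES"),
    ("Core2", "Nehalem", "Bonnell", "Saltwell"):
        ("L1D_CACHE_LD.MESI", "L1D_CACHE_ST.MESI"),
    ("Silvermont", "Airmont"): (None, None),
    ("AMD",): ("LsDcAccesses", None),
}

# Flatten the groups into one dictionary keyed by microarchitecture name.
DTLB_EVENT_MAP = {
    uarch: events for names, events in _GROUPS.items() for uarch in names
}


def dtlb_native_events(uarch):
    """Return the (loads, stores) native event names for a microarchitecture."""
    return DTLB_EVENT_MAP[uarch]


if __name__ == "__main__":
    for name in ("Skylake", "Haswell", "AMD"):
        print(name, dtlb_native_events(name))
```

Two machines that land in different rows are counting different native events under the same perf name.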
See Hardware cache events and perf for how to map perf core events to native events.
The dTLB-loads and dTLB-stores event counts for the same program can differ between microarchitectures, not only because the microarchitectures differ but also because the meaning of the events is itself different. Therefore, even if the microarchitectural behavior of the program turned out to be the same on both microarchitectures, the event counts can still be different. A brief description of the native events on all Intel microarchitectures can be found here and a more detailed description on some of the microarchitectures can be found here.
Related: how to interpret perf iTLB-loads,iTLB-load-misses.
Weird Backtrace in Perf
TL;DR: perf's backtracing may stop at some function if there is no frame pointer saved on the stack (fp method) or there are no CFI tables (dwarf method). Recompile the libraries with -fno-omit-frame-pointer or with -g, or get debuginfo. With release binaries and libraries, perf will often stop the backtrace early, with no chance of reaching the top functions main(), _start or clone()/start_thread().
The perf profiling tool in Linux is a statistical sampling profiler (without binary instrumentation): it programs a software timer, an event source, or the hardware performance monitoring unit (PMU) to generate periodic interrupts. In your example, -c 10000 -e mem_load_uops_retired.l3_miss:uppp is used to select the hardware PMU of x86_64 in a kind of PEBS mode (https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR) to generate an interrupt after every 10000 mem_load_uops_retired events (with the l3_miss mask). The generated interrupt is handled by the Linux kernel (perf_events subsystem, kernel/events and arch/x86/events). In this handler the PMU is reset (reprogrammed) to generate the next interrupt after 10000 more events, and a sample is generated. The sample data is saved into the perf.data file by the perf record command, and every wakeup of the tool can save thousands of samples; the samples can be read with perf script or perf script -D.
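The period-based sampling described above can be modelled in a few lines. This is a toy Python simulation, not how the kernel is written. The event stream, the (ip, counted) pair format and the function name are invented for this sketch. A counter is armed with the period, counts down on every matching event, and records the current instruction pointer each time it overflows.

```python
def sample_every(events, period):
    """Return the instruction pointers sampled from a stream of events.

    events: iterable of (ip, counted) pairs, where counted says whether
            the event matches the one the counter is programmed for.
    period: number of matching events between two samples (like -c).
    """
    samples = []
    remaining = period  # the counter is armed with the full period
    for ip, counted in events:
        if not counted:
            continue
        remaining -= 1
        if remaining == 0:
            samples.append(ip)   # "interrupt": record where we are
            remaining = period   # re-arm the counter for the next sample
    return samples


if __name__ == "__main__":
    stream = [(ip, True) for ip in range(25)]
    print(sample_every(stream, 10))  # [9, 19]
```

With a period of 10000, as in the question, only one in every 10000 matching events produces a sample, so the profile is a statistical picture rather than a full trace.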
The perf_events interrupt handler, somewhere near __perf_event_overflow in kernel/events/core.c, has full access to the registers of the current function and has some time to retrieve additional data such as the current time, pid, etc. Part of this process is collecting the call stack (https://en.wikipedia.org/wiki/Call_stack). But on x86_64 with -fomit-frame-pointer (often enabled for many system libraries of Debian/Ubuntu/others) there is no default place in registers or on the function's stack where frame pointers are stored:
https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/Optimize-Options.html#index-fomit_002dframe_002dpointer-692
-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.
Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86 targets has been changed to -fomit-frame-pointer. The default can be reverted to -fno-omit-frame-pointer by configuring GCC with the --enable-frame-pointer configure option.
With frame pointers saved on the function stack, backtracing/unwinding is easy. But for some functions modern gcc (and other compilers) may not generate a frame pointer. So backtracing code like that in the perf_events handler will either stop the backtrace at such a function or need another method of recovering the caller's frame. The option -g method (--call-graph) of perf record selects the method to be used. It is documented in man perf-record (http://man7.org/linux/man-pages/man1/perf-record.1.html):
--call-graph
Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is "fp".
Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI - Call Frame Information) or "lbr" (Hardware Last Branch Record facility) as the method to collect the information used to show the call graphs.
In some systems, where binaries are build with gcc --fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the libunwind or libdw library) should be used instead. Using the "lbr" method doesn't require any compiler options. It will produce call graphs from the hardware LBR registers. The main limitation is that it is only available on new Intel platforms, such as Haswell. It can only get user call chain. It doesn't work with branch stack sampling at the same time.
When "dwarf" recording is used, perf also records (user) stack dump when sampled. Default size of the stack dump is 8192 (bytes). User can change the size by passing the size after comma like "--call-graph dwarf,4096".
So the dwarf method reuses CFI tables to find stack frame sizes and locate the caller's stack frame. I'm not sure whether CFI tables are stripped from release libraries by default, but debuginfo will probably have them. LBR will not help because it is a rather short hardware buffer. The split processing of the dwarf method (the kernel handler saves part of the stack and the perf user-space tool parses it with libdw+libunwind) may lose some parts of the call stack, so also try increasing the size of the dwarf stack dumps with --call-graph dwarf,10240 or --call-graph dwarf,81920 etc.
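Why the fp method stops early can be seen in a simplified model of frame-pointer unwinding. The Python sketch below is an illustration under assumptions, not the kernel's perf_callchain_user(): the stack is a dictionary from address to 8-byte value, each frame record holds the caller's saved frame pointer at fp and the return address at fp + 8, and all addresses are invented.

```python
def unwind_fp(memory, fp, max_depth=64):
    """Follow saved frame pointers and collect return addresses."""
    chain = []
    while len(chain) < max_depth:
        saved_fp = memory.get(fp)
        ret = memory.get(fp + 8)
        if saved_fp is None or ret is None:
            break  # fp does not point at a readable frame record
        chain.append(ret)
        if saved_fp <= fp:
            break  # the stack grows down, a caller's frame must be higher
        fp = saved_fp
    return chain


# main -> f -> g, every function keeps a frame pointer.
INTACT = {
    0x1000: 0x1040, 0x1008: 0x401200,   # g's frame: return into f
    0x1040: 0x1080, 0x1048: 0x401100,   # f's frame: return into main
    0x1080: 0x0,    0x1088: 0x7F0000,   # main's frame: return into libc
}

# Same stack, but f used the frame-pointer register as a general-purpose
# register, so g saved a value that is not a frame address.
BROKEN = {**INTACT, 0x1000: 0xDEADBEEF}

if __name__ == "__main__":
    print([hex(a) for a in unwind_fp(INTACT, 0x1000)])
    print([hex(a) for a in unwind_fp(BROKEN, 0x1000)])
```

In the second case the walk yields only the innermost return address and never reaches main, which is the kind of truncated backtrace the question shows.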
Backtracing is implemented in the arch-dependent part of perf_events: arch/x86/events/core.c:perf_callchain_user(), called from kernel/events/callchain.c:get_perf_callchain() <- perf_callchain <- perf_prepare_sample <- __perf_event_output <- *(event->overflow_handler) <- READ_ONCE(event->overflow_handler)(event, data, regs); of __perf_event_overflow.
Gregg warned about incomplete call stacks in perf: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html
Incomplete stacks usually mean -fomit-frame-pointer was used – a compiler optimization that makes little positive difference in the real world, but breaks stack profilers. Always compile with -fno-omit-frame-pointer. More recent perf has a -g dwarf option, to use the alternate libunwind/dwarf method for retrieving stacks.
I also wrote about backtraces in perf, with some additional links: How does linux's perf utility understand stack traces?