Why Does Perf Show That Sleep Takes All Cores

Why does perf show that sleep takes all cores?

According to the man page of the perf stat subcommand, the -a option enables full-system profiling:
http://man7.org/linux/man-pages/man1/perf-stat.1.html

   -a, --all-cpus
           system-wide collection from all CPUs (default if no target is
           specified)

In this "system-wide" mode perf stat (and perf record too) will count events on (or, for record, profile) all CPUs in the system. When used without a command argument, perf will run until interrupted by Ctrl-C. With a command argument, perf will count/profile until the command finishes. Typical usage is

perf stat -a sleep 10     # Count events on every CPU for 10 seconds
perf record -a sleep 10   # Profile every CPU with the cycles event for 10 seconds into perf.data

To get stats for a single command, use single-process profiling (without the -a option):

perf stat python3 test.py

For profiling (perf record) you may run without the -a option; or you may use -a and later do some manual filtering in perf report, focusing only on the pids/tids/dsos of your application. (This can be very useful if the command being profiled makes interprocess requests to other daemons that do a lot of the CPU work.)

The --per-core, -A, -C <cpulist>, and --per-socket options are only for the system-wide -a mode. Try --per-thread together with the -p pid attach-to-process option.
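For example, attaching to an already-running process and breaking the counts down per thread might look like this (a sketch; the pid 12345 is a placeholder for your process):

```shell
# Attach to a running process (pid 12345 is hypothetical) and report
# counter values per thread instead of aggregated; the trailing
# "sleep 5" just bounds the measurement to 5 seconds.
perf stat --per-thread -p 12345 sleep 5
```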

Profiling sleep times with perf

There is an error message from your second perf command from https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times - perf inject -s:

$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2

failed to write feature 2 doesn't look too user-friendly...

... but it was added to perf as part of making it more self-descriptive: http://lwn.net/Articles/460520/ "perf: make perf.data more self-descriptive (v5)" by Stephane Eranian, 22 Sep 2011:

+static int do_write_feat(int fd, struct perf_header *h, int type,  ....
+ pr_debug("failed to write feature %d\n", type);

All features are listed here http://lxr.free-electrons.com/source/tools/perf/util/header.h#L13

 15         HEADER_TRACING_DATA     = 1,
 16         HEADER_BUILD_ID,

So, it sounds like perf inject was not able to write information about build ids (the error comes from the function write_build_id() in util/header.c), if I'm not mistaken. There are two cases which can lead to this error: an unsuccessful call to perf_session__read_build_ids(), or a failure writing the buildid table in dsos__write_buildid_table (not our case, because there is no "failed to write buildid table" error message; check write_build_id).

You may check whether you have all buildids needed for the session. It may also be useful to clear your buildid cache (rm -rf ~/.debug), and to check that you have an up-to-date vmlinux with debugging info, or kallsyms enabled in your kernel.
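A sketch of how one might inspect that, using perf's own buildid tooling (file names as in the session above):

```shell
# List the build ids recorded in the raw data file, so they can be
# compared against the binaries actually installed on the system.
perf buildid-list -i ~/perf.data.raw

# Clear the local buildid cache if it has gone stale.
rm -rf ~/.debug
```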

UPDATE: in the comments Pavel says that his perf record wrote no sched:sched_stat_sleep events at all to perf.data:

sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out

As he explains in his answer, his default Debian kernel has the CONFIG_SCHEDSTATS option disabled by a vendor patch. Red Hat did the same thing to the option in release kernels since 3.11, and this is explained in Red Hat Bug 1013225 (Josh Boyer 2013-10-28, comment 4):

We switched to enabling that only on debug builds a while ago. It seems that was turned off entirely with the final 3.11.0 build and has remained off since. Internal testing shows the option has a non-trivial performance impact for context switches.

We can turn this on in debug kernels again, but I'm not sure it's worthwhile.

Josh Poimboeuf 2013-11-04 in comment 8 says that the performance impact is detectable:

In my tests I did a lot of context switches under various CPU loads. I saw a ~5-10% drop in average context switch speed when CONFIG_SCHEDSTATS was enabled. ...The performance hit only seemed to happen on post-CFS kernels (>= 2.6.23). The previous O(1) scheduler didn't seem to have this issue.

Fedora disabled CONFIG_SCHEDSTATS in non-debug kernels on 12 July 2013: "[kernel] Disable LATENCYTOP/SCHEDSTATS in non-debug builds." by Dave Jones. The first kernel with the option disabled was 3.11.0-0.rc0.git6.4.

In order to use any perf software tracepoint event with a name like sched:sched_stat_* (sched:sched_stat_wait, sched:sched_stat_sleep, sched:sched_stat_iowait), we must recompile the kernel with the CONFIG_SCHEDSTATS option enabled, replacing the default Debian, Red Hat, or Fedora kernels which lack this option.
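Before recompiling, it may be worth checking whether the running kernel already has the option. A sketch (config file locations vary by distribution):

```shell
# Debian/Fedora style: the build config is shipped under /boot.
grep CONFIG_SCHEDSTATS "/boot/config-$(uname -r)"

# Alternatively, if the kernel was built with CONFIG_IKCONFIG_PROC:
zgrep CONFIG_SCHEDSTATS /proc/config.gz
```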

Thank you, Pavel Davydov.

perf reports multiple sched:sched_stat_sleep event for a single sleep

This is a bug in the kernel, which also affects other events (the message from chengjian (D) shows that it affects sched:sched_stat_sleep). Apparently, the event is being counted as many times as there are CPUs (nrcpus). According to this diff, the fix is included in kernel version 4.18.17, which ensures that the event is only counted for the CPU on which it occurred.

Why does sleeping increase the execution time of an independent piece of code that is executed before/afterwards?

I am quite sure that this is caused by CPU throttling. When the OS scheduler detects that there is little work to do, the CPU frequency is lowered to save power.

When you do a lot of sleeps, you are telling the scheduler that you are not in such a hurry, and the CPU can take it easy.

You can see that this is the case by running a CPU intensive task in another window, with low priority. For example, in Linux you can run:

$ nice bash -c 'while true ; do : ; done'

And at the same time, in another window run your program:

$ cargo run --release
loop_time 1.00ms => 3.13ms
loop_time 2.00ms => 3.17ms
loop_time 3.00ms => 3.19ms
loop_time 4.00ms => 3.13ms
loop_time 5.00ms => 3.16ms
loop_time 6.00ms => 3.22ms
loop_time 7.00ms => 3.14ms
loop_time 8.00ms => 3.15ms
loop_time 9.00ms => 3.13ms
loop_time 10.00ms => 3.18ms
loop_time 11.00ms => 3.14ms
loop_time 12.00ms => 3.17ms
loop_time 13.00ms => 3.15ms
...

Avoiding this depends on your OS. For example, in Linux, you can fiddle with the /sys/devices/system/cpu/* options. I think that UPower provides some functions to manage it from a non-root application. It would be nice if there were a crate that managed this cross-platform, but I don't know of any.
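For instance, with the Linux cpufreq sysfs interface, pinning the CPUs to their highest frequency might look like this (a sketch; which governors are available depends on the cpufreq driver, and writing requires root):

```shell
# Show the current frequency governor of cpu0.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Ask every CPU to stay at its highest frequency (needs root).
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
```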

An easy but hacky way to fix this, if you don't mind the wasted power, is just to run an idle thread with a busy loop.

std::thread::spawn(|| {
    use thread_priority::*; // external crate thread-priority
    let thread_id = thread_native_id();
    set_thread_priority(
        thread_id,
        ThreadPriority::Min,
        ThreadSchedulePolicy::Normal(NormalThreadSchedulePolicy::Idle),
    )
    .unwrap();
    loop {}
});

Naturally, if you just want to avoid throttling in this piece of code, you can do a busy wait:

//if loop_time > elapsed {
//    std::thread::sleep(loop_time - elapsed);
//}
// Busy-wait the remaining time, to avoid CPU throttling
while loop_time > before.elapsed() {
    // you may want to try both with and without yield
    std::thread::yield_now();
}

Run time and reported cycle counts in Linux perf


I have run a single thread matrix multiplication

But then you counted system-wide across all 4 cores for the time when it was running. That's what perf -a does, and why you got Performance counter stats for 'system wide': in your output.

So the kernel's task-clock "event" was counting time across all 4 cores whenever your process was active:

68,701.90 msec task-clock # 4.000 CPUs utilized

That's 68.7 CPU-seconds, almost exactly 4x 17.17 s, which makes sense.



I also don't know why 0.462 GHz is written in front of cpu-cycles.

That's the average cycles/time across all 4 cores during the time your code was active. The clock doesn't tick for CPUs that are asleep, so it seems the load average was 4 * 462/800 = 2.31 CPUs not asleep during the time your code was keeping 1 core busy.
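As a quick sanity check on that arithmetic (plain awk, nothing perf-specific):

```shell
# Average of 462 MHz across 4 cores, against the 800 MHz an awake core
# runs at, gives the estimated number of non-sleeping CPUs.
awk 'BEGIN { printf "%.2f\n", 4 * 462 / 800 }'
```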

So you were benchmarking while your system was running another 1.3 threads on average. (Including CPU time waking up for interrupts)

I'm assuming your measurements were not limited to user-space-only (excluding ticks spent in kernel mode); that's another possible effect.


If you hadn't used -a, a lower frequency than the HW was actually running at can still happen if you count only user-space cycles, not cycles spent in the kernel (e.g. perf stat --all-user, or with your kernel.perf_event_paranoid setting high enough to not let you profile HW events in the kernel, which effectively applies :u to HW events but without printing them as cycles:u).

If you change your kernel.perf_event_paranoid sysctl setting to 0, or run perf as root, you should see exactly 800 MHz as cycles / task-clock, since that's the speed the CPU was running when your threads were running.

Spending a lot of cycles not in user-space would indicate that your code is spending a lot of time in the kernel, probably handling page faults if you're using a big array.
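A sketch of how one could check that split, using standard perf events (./a.out stands in for whatever binary is being measured):

```shell
# Count page faults plus user vs kernel cycles for one run;
# a large cycles:k share points at time spent in the kernel,
# e.g. handling page faults for a big array.
perf stat -e page-faults,cycles:u,cycles:k ./a.out
```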


