Linux Perf Record: Difference Between Count (-C) and Frequency (-F) Options

Count (-c) and frequency (-F) are the two fundamental switches that tune the rate of sampling when using perf record (which does sampling internally).

Count

When you run perf record -c <number>, you are specifying the sample period: a sample will be recorded on every <number>th occurrence of the event, i.e. whenever the performance counter that keeps track of the number of events overflows.

I am guessing you are obtaining the number of events with the help of perf report. Note that perf report will never report the actual number of events, only an approximation. The number of events will keep changing as you tweak the sample period. perf report only reads the perf.data file that perf record generates and, based on the size of that file, estimates the number of samples recorded (since it knows the size of a sample in memory). The number of events recorded is then obtained as:

Number of events = Fixed Sample Period * Number of samples collected

where Fixed Sample Period is what you specified with perf record -c.

Frequency

Frequency is the other way around to express the sampling period: you specify the average rate of samples per second, which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples will be generated when the hardware/PMU counter corresponding to the event overflows. This means that the kernel dynamically adjusts the sampling period so that the sampling process adheres to the requested frequency.

The sample period is thus updated dynamically while recording.

The higher the sampling frequency, the higher the number of samples collected (almost proportionally).

The variation in the sampling period can be seen by running:

sudo perf report -D -i perf.data | fgrep RECORD_SAMPLE

While the sampling period keeps varying, the total number of events keeps changing along with it. Once the sampling period remains fixed, the total number of events is given by the formula shown above. The total number of events is approximate in both cases.

What is the default behavior of perf record?

The default event is cycles, as can be seen by running perf script after perf record. There, you can also see that the default sampling behavior is time-based, since the number of cycles is not constant. The default frequency is 4000 Hz, which can be seen in the source code and checked by comparing the file size or number of samples to a recording where -F 4000 was specified.

The perf wiki says that the rate is 1000 Hz, but this is not true anymore for kernels newer than 3.4.

Perf Stat vs Perf Record

First of all, your test case of using sleep and page-faults is not an ideal one. There should be no page fault events during the sleep duration, so you can't really expect anything interesting. For the sake of easier reasoning I suggest using the ref-cycles (hardware) event and a busy workload such as awk 'BEGIN { while(1){} }'.

Question 1: It is my understanding that perf stat gets a "summary" of
counts but when used with the -I option gets the counts at the
specified millisecond interval. With this option does it sum up the
counts over the interval or get the average over the interval, or
something else entirely? I assume it is summed up.

Yes. The values are just summed up. You can confirm that by testing:

$ perf stat -e ref-cycles -I 1000 timeout 10s awk 'BEGIN { while(1){} }'
#           time             counts unit events
     1.000105072      2,563,666,664      ref-cycles
     2.000267991      2,577,462,550      ref-cycles
     3.000415395      2,577,211,936      ref-cycles
     4.000543311      2,577,240,458      ref-cycles
     5.000702131      2,577,525,002      ref-cycles
     6.000857663      2,577,156,088      ref-cycles

[ ... snip ... ]
[ Note that it may not be as nicely consistent on all systems due to dynamic frequency scaling ]

$ perf stat -e ref-cycles -I 3000 timeout 10s awk 'BEGIN { while(1){} }'
#           time             counts unit events
     3.000107921      7,736,108,718      ref-cycles
     6.000265186      7,732,065,900      ref-cycles
     9.000372029      7,728,302,192      ref-cycles

Question 2: Why doesn't perf stat -e <event1> -I 1000 sleep 5 give
about the same counts as if I summed up the counts over each second
for the following command perf record -e <event1> -F 1000 sleep 5?

perf stat -I is in milliseconds, whereas perf record -F is in Hz (1/s), so the command corresponding to perf stat -I 1000 is perf record -F 1. In fact, with our more stable event/workload, this looks better:

$ perf stat -e ref-cycles -I 1000 timeout 10s awk 'BEGIN { while(1){} }'
#           time             counts unit events
     1.000089518      2,578,694,534      ref-cycles
     2.000203872      2,579,866,250      ref-cycles
     3.000294300      2,579,857,852      ref-cycles
     4.000390273      2,579,964,842      ref-cycles
     5.000488375      2,577,955,536      ref-cycles
     6.000587028      2,577,176,316      ref-cycles
     7.000688250      2,577,334,786      ref-cycles
     8.000785388      2,577,581,500      ref-cycles
     9.000876466      2,577,511,326      ref-cycles
    10.000977965      2,577,344,692      ref-cycles
    10.001195845            466,674      ref-cycles

$ perf record -e ref-cycles -F 1 timeout 10s awk 'BEGIN { while(1){} }'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (17 samples) ]

$ perf script -F time,period
3369070.273722: 1
3369070.273755: 1
3369070.273911: 3757
3369070.273916: 3015133
3369070.274486: 1
3369070.274556: 1
3369070.274657: 1778
3369070.274662: 2196921
3369070.275523: 47192985748
3369072.663696: 2578692405
3369073.663547: 2579122382
3369074.663609: 2580015300
3369075.664085: 2579873741
3369076.664433: 2578638211
3369077.664379: 2578378119
3369078.664175: 2578166440
3369079.663896: 2579238122

So you see, eventually the results are stable also for perf record -F. Unfortunately, the documentation of perf record is very poor. You can learn what the settings -c and -F mean by looking at the documentation of the underlying system call, man perf_event_open:

sample_period, sample_freq
A "sampling" event is one that generates an overflow notification every N events, where N is given by sample_period. A sampling event has sample_period > 0. When an overflow occurs, requested data is recorded in the mmap buffer. The sample_type field controls what data is recorded on each overflow.

sample_freq can be used if you wish to use frequency rather than period. In this case, you set the freq flag. The kernel will adjust the sampling period to try and achieve the desired rate. The rate of adjustment is a timer tick.

So while perf stat uses an internal timer to read the value of the counter every -I milliseconds, perf record sets an event overflow threshold so that a sample is taken every -c events (e.g. every N page-faults or cycles). With -F, it tries to regulate this overflow value to achieve the desired frequency: it tries different values and tunes the period up or down accordingly. This eventually works for counters with a stable rate, but gives erratic results for dynamic events.

Frequency-based sampling of multiple threads with perf record

Disclaimer

I am not an expert on this topic, but I found this question very interesting, so I tried to come up with an answer. Take this answer with a grain of salt. Corrections are welcome -- and maybe Cunningham's law will get us better answers.

What cycles maps to

According to the perf wiki, on Intel, perf uses the UNHALTED_CORE_CYCLES event.

From the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4, 18.2.1 Architectural Performance Monitoring Version 1

Configuration facilities and counters are not shared between logical processors sharing a processor core.

For AMD, the perf wiki states that the CPU_CLK_UNHALTED hardware event is used. I couldn't find this event in the current Preliminary Processor Programming Reference (PPR) for AMD Family 19h Model 01h, Revision B1 Processors Volume 2 of 2, but I found this in section 2.1.17.1:

There are six core performance event counters per thread, six performance events counters per L3 complex and four Data Fabric performance events counters

I would conclude that these processors support tracking the cycles event per logical core, and I would assume it to be similar on ARM and other architectures (otherwise, I think the performance counters would be a lot less useful).

What perf does

Now perf has different sampling modes:

The perf tool can be used to count events on a per-thread, per-process, per-cpu or system-wide basis. In per-thread mode, the counter only monitors the execution of a designated thread. When the thread is scheduled out, monitoring stops. When a thread is migrated from one processor to another, counters are saved on the current processor and are restored on the new one.

and

By default, perf record operates in per-thread mode, with inherit mode enabled.

From these sources, I would expect the following behavior from perf:

  • When a thread starts executing on a core, the performance counter is reset
  • As the thread runs, whenever the counter overflows, a sample is taken
  • If the thread stops executing, the monitoring stops

Your questions

So, I would conclude that

Is there a global "cycles" counter that samples whatever threads are running at the time when the overflow occurs? Or does each CPU have its own "cycles" counter that samples the thread that it is currently running, and if yes, does "each CPU" mean logical or physical cores?

Each logical core has its own counter.

Or is it a counter per thread?

It is a hardware counter on the CPU core, but perf allows you to use it as if it were per-thread -- if a thread of a different program gets scheduled, this should not have any impact on you. By default, perf does not annotate thread information to the samples stored in perf.data. According to the man page, you can use -s or --stat to store this information. Then, perf report will allow you to analyze individual threads of your application.

Are only cycles counted that were spent running the program?

Yes, unless specified otherwise.

Your output

tid  timestamp        event counter
5881 187296.210979: 15736902 cycles:
5881 187296.215945: 15664720 cycles:
5881 187296.221356: 15586918 cycles:
5881 187296.227022: 1 cycles:
5881 187296.227032: 1 cycles:
5881 187296.227037: 62 cycles:
5881 187296.227043: 6902 cycles:
5881 187296.227048: 822728 cycles:
5881 187296.231842: 90947120 cycles:

What did you execute to get this output? Maybe I'm misinterpreting, but I would guess that the following happened:

The points here are partly invalidated by the experiment below

  • You recorded with a specified target-frequency. That means perf tries to optimize the current overflow value of the hardware counter such that you get as many cycles overflows per second as you specified.
  • For the first three timestamps, threads of your program were executed on the CPU, this triggered high cycles counts. perf took samples approximately every 0.005s.
  • Then, it looks like your threads were not executed for that many CPU cycles per second anymore. Maybe it was waiting for IO operations most of its time?* Thus, the next sample was taken after 0.006s and the cycles count dropped to one. perf noticed that the actual sampling frequency had dropped, so it decremented the overflow threshold to keep the sampling rate stable.
  • Then, maybe the IO operation was finished and your threads started running for more cpu cycles per second again. This caused lots of cycles events, but with the lower overflow threshold, perf now took a sample after fewer events (after 0.00001s and 0.000005s for the next 3 samples). Perf incremented the overflow threshold back up during this period.
  • For the last sample, it seems to have arrived back at around 0.005s distance between samples

Experiment

I think the following might give more insights. Let's create a small, easy to understand workload:

int main() {
    volatile unsigned int i = 0;
    while (1) {
        i++;
    }
}

gcc compiles the loop to four instructions: memory load, increment, memory store, jump. This utilizes one logical core, according to htop, just as you'd expect. I can simulate that the program stops executing (as if it were waiting for IO or not scheduled) by simply suspending it with ctrl+Z in the shell.

Now, we run

perf record -F 10 -p $(pidof my_workload)

Let it run for a moment. Then, use ctrl+Z to suspend execution. Wait for a moment and then use fg to resume execution. After a few seconds, end the program.

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0,021 MB perf.data (65 samples) ]

In my case, perf record wrote 65 samples. We can use perf script to inspect the sample data written (full output, because I worry I might accidentally remove something important; this was on an Intel i5-6200U, Ubuntu 20.04, kernel 5.4.0-72-generic):

my_workload 831622 344935.025844: 1 cycles: ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
my_workload 831622 344935.025847: 1 cycles: ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
my_workload 831622 344935.025849: 2865 cycles: ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
my_workload 831622 344935.025851: 16863383 cycles: ffffffffa12016f2 nmi_restore+0x25 ([kernel.kallsyms])
my_workload 831622 344935.031948: 101431200645 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344935.693910: 269342015 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344935.794233: 268586235 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344935.893934: 269806654 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344935.994321: 269410272 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344936.094938: 271951524 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344936.195815: 269543174 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344936.295978: 269967653 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344936.397041: 266160499 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344936.497782: 265215251 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344936.596074: 269863048 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344936.696280: 269857624 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344936.796730: 269274440 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344936.897487: 269115742 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344936.997988: 266867300 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344937.097088: 269734778 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344937.196955: 270487956 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344937.297281: 270136625 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344937.397516: 269741183 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344943.438671: 173595673 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344943.512800: 251467821 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344943.604016: 274913168 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344943.703440: 276448269 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344943.803753: 275059394 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344943.903362: 276318281 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344944.005543: 266874454 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344944.105663: 268220344 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344944.205213: 269369912 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344944.305541: 267387036 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344944.405660: 266906130 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344944.506126: 266194513 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344944.604879: 268882693 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344944.705588: 266555089 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344944.804896: 268419478 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344944.905269: 267700541 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344945.004885: 267365839 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344945.103970: 269655126 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344945.203823: 269330033 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344945.304258: 267423569 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344945.403472: 269773962 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344945.504194: 275795565 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344945.606329: 271013012 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344945.703866: 277537908 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344945.803821: 277559915 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344945.904299: 277242574 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344946.005167: 272430392 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344946.104424: 275291909 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344946.204402: 275331204 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344946.304334: 273818645 cycles: 558f3623317b main+0x12 (/tmp/my_workload)
my_workload 831622 344946.403674: 275723986 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344946.456065: 1 cycles: ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
my_workload 831622 344946.456069: 1 cycles: ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
my_workload 831622 344946.456071: 2822 cycles: ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
my_workload 831622 344946.456072: 17944993 cycles: ffffffffa0673596 native_write_msr+0x6 ([kernel.kallsyms])
my_workload 831622 344946.462714: 107477037825 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344946.804126: 281880047 cycles: 558f3623317e main+0x15 (/tmp/my_workload)
my_workload 831622 344946.907508: 274093449 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344947.007473: 270795847 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344947.106277: 275006801 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344947.205757: 274972888 cycles: 558f36233178 main+0xf (/tmp/my_workload)
my_workload 831622 344947.305405: 274436774 cycles: 558f3623317b main+0x12 (/tmp/my_workload)

I think we can see two main things in this output

  • The sample at 344937.397516 seems to be the last sample before I suspended the program and 344943.438671 seems to be the first sample after it resumed. We have a slightly lower cycles count here. Apart from that, it looks just like the other lines.
  • However, your pattern can be found directly after starting -- this is expected I'd say -- and at timestamp 344946.456065. With the annotation native_write_msr I think what we observe here is perf doing internal work. There was this question regarding what native_write_msr does. According to the comment of Peter to that question, this is the kernel programming hardware performance counters. It's still a bit strange. Maybe, after writing out the current buffer to perf.data, perf behaves just as if it was just started?

* As Jérôme pointed out in the comments, there can be many reasons why fewer cycles events happened. I'd guess your program was not executed because it was sleeping or waiting for IO or memory access. It's also possible that your program simply wasn't scheduled to run by the OS during this time.

If you're not measuring a specific workload but your general system, it may also happen that the CPU reduces its clock rate or goes into a sleep state because it has no work to do. I assumed that you probably did something like perf record ./my_program with my_program being a CPU-intensive workload, so I think it was unlikely that the CPU decided to sleep.

Getting accurate time measurement with `perf-stat`

There are a variety of reasons you can see variation when you repeatedly benchmark what appears to be the same code. I have covered some of the reasons in another answer and it would be worthwhile to keep those in mind.

However, based on experience and playing the probabilities, we can eliminate many of those up front. What's left are the most likely causes of your relatively large deviations for short programs from a cold start:

  1. CPU power saving and frequency scaling features.
  2. Actual runtime behavior differences, i.e., different code executed in the runtime library, VM, OS or other supporting infrastructure each time you run your program.
  3. Some caching effect, or code or data alignment effect that varies from run to run.

You can probably separate these three effects with a plain perf stat without overriding the event list, like:

$ perf stat true

Performance counter stats for 'true':

          0.258367      task-clock (msec)         #    0.427 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                41      page-faults               #    0.159 M/sec
           664,570      cycles                    #    2.572 GHz
           486,817      instructions              #    0.73  insn per cycle
            92,503      branches                  #  358.029 M/sec
             3,978      branch-misses             #    4.30% of all branches

       0.000605076 seconds time elapsed

Look first at the 2.572 GHz line. This shows the effective CPU frequency, calculated by dividing the true number of CPU cycles by the task-clock value (CPU time spent by the program). If this varies from run to run, the wall-clock time performance deviation is partly or completely explained by this change, and the most likely cause is (1) above, i.e., CPU frequency scaling, including both scaling below nominal frequency (power saving) and above it (turbo boost or similar features).

The details of disabling frequency scaling depend on the hardware, but a common approach that works on most modern Linux distributions is cpupower -c all frequency-set -g performance to inhibit below-nominal scaling.

Disabling turbo boost is more complicated and may depend on the hardware platform and even the specific CPU, but for recent x86 some options include:

  • Writing 1 to /sys/devices/system/cpu/intel_pstate/no_turbo (Intel only)
  • Doing a wrmsr -p${core} 0x1a0 0x4000850089 for each ${core} in your system (although one on each socket is probably enough on some/most/all chips?). (Intel only)
  • Adjust the /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq value to set a maximum frequency.
  • Use the userspace governor and /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed to set a fixed frequency.

Another option is to simply run your test repeatedly, and hope that the CPU quickly reaches a steady state. perf stat has built-in support for that with the --repeat=N option:

   -r, --repeat=<n>
repeat command and print average + stddev (max: 100). 0 means forever.

Let's say you observe that the frequency is always the same (within 1% or so), or you have fixed the frequency issues but some variance remains.

Next, check the instructions line. This is a rough indicator of how much total work your program is doing. If it varies in the same direction and similar relative variance to your runtime variance, you have a problem of type (2): some runs are doing more work than others. Without knowing what your program is, it would be hard to say more, but you can use tools like strace, perf record + perf annotate to track that down.

If instructions doesn't vary, and frequency is fixed, but runtime varies, you have a problem of type (3) or "other". You'll want to look at more performance counters to see which correlate with the slower runs: are you having more cache misses? More context switches? More branch mispredictions? The list goes on. Once you find out what is slowing you down, you can try to isolate the code that is causing it. You can also go the other direction: using traditional profiling to determine what part of the code slows down on the slow runs.

Good luck!

Performance monitoring with perf

ANSWER #1

Yes, mostly. perf report does show you a summary of the trace collected. Samples collected by perf record are saved into a binary file called, by default, perf.data. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted by functions with the most samples first. However, you can also do much more detailed profiling using this report.

ANSWER #2

You should ideally use perf script -D to get a trace of all data. The timestamp is in microseconds, although in kernels newer than the one you specify, with the help of a command line switch (--ns), you can display the time in nanoseconds as well.

It is quite hard to tell this without looking at what kind of "deltas" you are getting. Remember that the period of collecting samples is usually tuned. There are two ways of specifying the rate at which to collect samples:

You can use perf record -c <count> to specify the period at which to collect samples. This means that for every <count> occurrences of the event that you are measuring, the counter will overflow and you will record a sample. You can then modify the sampling period and test various values.

The other way around to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples will be generated when the hardware/PMU counter corresponding to the event overflows. This means that the kernel will dynamically adjust the sampling period, and you will get samples at different, somewhat random moments.

ANSWER #3

Why not? Ideally you should get the number of event samples collected if you do a perf report and just do a deeper analysis. Also, when you do a perf record and finish recording samples, you get a notification on the command line about the number of samples collected for the event you measured.


