Why Is cpu-cycles Much Less Than the CPU's Current Frequency?

CPU clock cycle misunderstanding

Pipelining

Pipelining doesn't involve cycles shorter than a single clock cycle. Here's how pipelining works:


We have a complicated task to do. We take that task and break it down into a number of stages, each of which is relatively simple to carry out. We study the amount of work in each stage to make sure each stage takes about the same amount of time as any other.

With a processor, we do roughly the same thing--but in this case, it's not "install these fourteen bolts", it's things like fetching and decoding instructions, reading operands, executing (often a couple of stages here), and writing back results.

Like the automotive production line, we provide each stage of the pipeline with a specialized set of tools for doing exactly (and only) what is needed at that stage. When we finish doing one stage of processing on a car/instruction, it moves along to the next stage, and this stage gets the next car/instruction to process.

In an ideal situation, once the pipeline is full, every stage is busy at the same time, each working on a different car/instruction, so a finished result comes off the end of the line at every step.

It took Ford about 12 hours to build one Model N (the predecessor to the Model T). Thanks primarily to pipelining the production line, it took only about 2.5 hours to build a Model T. More importantly, even though a Model T took 2.5 hours start to finish, that time was broken down into no fewer than 84 discrete steps, so when everything ran smoothly the production line as a whole could turn out another car (about) every two minutes.

That didn't always happen though. If one stage ran short of parts, the stages after it had to wait. If the pause lasted very long, it would back things up so the preceding stages had to wait too.

The same can happen in a processor pipeline. For example, when a branch happens, the processor may have to wait a while before the next instruction can be fetched. If an instruction needs an operand from memory, that can lead to a pause (a "pipeline bubble") as well.

To prevent pauses in his pipeline, Henry Ford hired people to study the stages, figure out how many of each kind of part would need to be on hand for each stage, and so on. I don't know for sure, but I think it's a fair guess that there were probably a few people designated to watch the supply of parts at different stations, and send somebody running to let a warehouse manager know if (for whatever reason) the supply of parts for a particular stage looked like it was running short so they'd need more soon.

Processors do a little of the same thing--they have things like branch predictors and prefetchers that attempt to figure out ahead of time what will be needed by the stream of instructions being executed, and try to ensure that everything is on hand when it's needed (with caches, for example, to temporarily store things that seem likely to be needed soon).

So, like the Model T, it takes some relatively long amount of time for each instruction to execute start to finish, but we get another product finished at much shorter intervals--ideally once a clock (but see my other answer--modern designs often execute more than one instruction per clock).
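
To make the latency-versus-throughput arithmetic concrete, here is a minimal sketch; the Model T figures come from the paragraphs above, while the 5-stage, 4 GHz CPU pipeline is purely an illustrative assumption:

def pipeline_throughput(total_latency, n_stages):
    # Assume a perfectly balanced pipeline: each stage takes total_latency / n_stages,
    # and once the pipeline is full, one finished item comes out per stage-time.
    return total_latency / n_stages

# Ford's line: ~2.5 hours (150 minutes) end to end, split into 84 steps.
print(pipeline_throughput(150, 84))      # ~1.8 minutes between finished cars

# A hypothetical 5-stage CPU pipeline at 4 GHz: each stage is one 0.25 ns clock,
# so instruction latency is ~1.25 ns, but throughput is one instruction per clock.
print(pipeline_throughput(5 * 0.25, 5))  # 0.25 ns between completed instructions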

Should CPU time always be identical between executions of the same code?

Your understanding is completely wrong. Real-world computers running modern OSes on modern CPUs are not simple, theoretical abstractions. There are all kinds of factors that can affect how much CPU time code requires to execute.

Consider memory bandwidth. On a typical modern machine, all the tasks running on the machine's cores are competing for access to the system memory. If the code runs at the same time as code on another core that is using lots of memory bandwidth, accesses to RAM may take more clock cycles.

Many other resources are shared as well, such as caches. Say the code is frequently interrupted to let other code run on the core. That will mean that the code will frequently find the cache cold and take lots of cache misses. That will also result in the code taking more clock cycles.

Let's talk about page faults as well. The code itself may or may not already be in memory when it starts running. Even if the code is in memory, you may or may not take soft page faults (to update the operating system's tracking of what memory is being actively used), depending on when that page last took a soft page fault or how long ago it was loaded into RAM.

And your basic hello world program is doing I/O to the terminal. The time that takes can depend on what else is interacting with the terminal at the time.
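
You can see this variability directly with nothing more than the standard library; a minimal sketch (the workload itself is arbitrary):

import time

def workload():
    # Arbitrary, deterministic busy work.
    return sum(i * i for i in range(200000))

for run in range(5):
    start = time.process_time()   # CPU time charged to this process
    workload()
    print("run {}: {:.4f} CPU-seconds".format(run, time.process_time() - start))
# The work is identical every time, yet the printed CPU times typically differ
# from run to run because of the shared-resource effects described above.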

Is the cycle count itself reliable for program timing?

So you're proposing measuring CPU cycles, instead of seconds? Sounds somewhat reasonable.

For some microbenchmarks that's good, and mostly factors out the variations due to CPU frequency changes. (And delays due to interrupts if you count only user-space cycles, if you're microbenching a loop that doesn't make system calls. Only the secondary effects of interrupts are then visible, i.e. serializing the pipeline and perhaps evicting some of your data from cache / TLB.)

But memory (and maybe the L3 cache) stays at a constant speed while the CPU frequency changes, so the relative cost of a cache miss changes: the same response time in nanoseconds is fewer core clock cycles, so out-of-order exec can hide more of it more easily. And available memory bandwidth is higher relative to what a core can use, so HW prefetch has an easier time keeping up.

e.g. at 4.3GHz, a load that missed in L2 cache but hits in L3 on Skylake-server might have a total latency of about 79 core clock cycles. (https://www.7-cpu.com/cpu/Skylake_X.html - i7-7820X (Skylake X), 8 cores).

At 800MHz idle clock speed, an L2 cache miss is still 14 cycles (because it runs at core speed). But if another core is keeping the L3 cache (and the uncore in general) at high clock speed, the off-core part of that round-trip request will take many fewer core clock cycles.

e.g. we can make a back-of-the-envelope calculation by assuming that all the extra time for an L3 hit vs. an L2 hit is spent in the uncore, not the core, and takes a fixed number of nanoseconds. Since we have that time in cycles of a 4.3GHz clock, the math works out as 14 + (79-14)*8/43 cycles for an L3 hit at 800MHz = 26 cycles, down from 79.

This rough calculation actually matches up with the 7-cpu.com numbers for the same CPU with a core at 3.6GHz: L3 Cache Latency = 68 cycles. 14 + (79-14)*36/43 = 68.4.
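
Written out as code, the same back-of-the-envelope model looks like this (all numbers are the ones quoted above; the split into a frequency-scaled L2 part and a fixed-time uncore part is the stated assumption):

L2_CYCLES = 14          # L2 hit latency in core cycles (counts at core speed)
L3_CYCLES_AT_REF = 79   # L3 hit latency in core cycles, measured at the reference clock
F_REF_GHZ = 4.3         # clock at which the 79-cycle figure was measured

def l3_hit_cycles(core_ghz):
    # Assume everything beyond the L2 portion is a fixed number of nanoseconds in the uncore.
    uncore_ns = (L3_CYCLES_AT_REF - L2_CYCLES) / F_REF_GHZ
    # Convert that fixed time back into cycles of the (slower or faster) core clock.
    return L2_CYCLES + uncore_ns * core_ghz

print(l3_hit_cycles(0.8))   # ~26 cycles at the 800 MHz idle clock, down from 79
print(l3_hit_cycles(3.6))   # ~68.4 cycles, matching the 7-cpu.com figure at 3.6 GHz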

Note that I picked a "server" part because different cores can run at different clock speeds. That's not the case in "client" CPUs like i7-6700k. Uncore (L3, interconnect, etc.) may still be able to vary independently of the cores, e.g. staying high for the GPU. Also, server parts have higher latency outside the core. (e.g. 4GHz Skylake i7-6700k with turbo disabled has L3 latency of only 42 core clock cycles, not 68 or 79.)

See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? for why/how L3 and memory latency affect max possible single-core memory bandwidth.


Of course, if you control the CPU frequency by allowing some warm-up, or for tasks that run for more than a trivial amount of time, this isn't a big deal.

(Although do note that Skylake will sometimes lower the clock speed when very memory-bound, which unfortunately hurts bandwidth even more, at the default energy_performance_preference = balance_power, but "balance_performance" or "performance" can avoid that. Slowing down CPU Frequency by imposing memory stress)
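
If you want to check (or change) that setting on Linux, the intel_pstate driver exposes it in sysfs; a minimal sketch, assuming the usual path (not every kernel/driver combination provides it):

import os

# Assumed sysfs path for the intel_pstate driver; absent on other cpufreq drivers.
epp_path = "/sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference"

if os.path.exists(epp_path):
    with open(epp_path) as f:
        print("current EPP:", f.read().strip())
    # Writing a new value (e.g. "balance_performance") normally requires root:
    # with open(epp_path, "w") as f:
    #     f.write("balance_performance")
else:
    print("energy_performance_preference is not exposed on this system")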

Do note that counting only cycles won't remove the cost of context switches (extra cache misses after switching back to this thread, and having to drain the ROB), or of competition from other cores for memory bandwidth.

e.g. another thread running on the other logical core of the same physical core will often seriously reduce IPC. Overall throughput usually goes up some, depending on the task, but individual per-thread throughput goes down.

Skylake has a perf event for tracking hyperthreading competition: cpu_clk_thread_unhalted.one_thread_active - IIRC that event count increments at something like 24MHz when your task is running and has the core all to itself. So if you see less than that, you know you had some competition and spent some time with the ROB partitioned and trading front-end cycles with another thread.


So there are a bunch of effects, and it's up to you to decide whether it's useful. Sorting results by core clock cycles sounds reasonable, but you should probably include CPU-seconds (task-clock) and average-frequency in the results to help people spot outliers / glitches.
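
The average-frequency column is just arithmetic on numbers you already have; a minimal sketch with made-up inputs:

# Average frequency over a run = core cycles / CPU-seconds (task-clock).
# A value well below the nominal clock means the run spent time throttled or idle-clocked.
cycles = 3200000000      # made-up cpu-cycles count for one benchmark run
task_clock_s = 1.0       # made-up CPU-seconds (task-clock) for the same run

avg_ghz = cycles / task_clock_s / 1e9
print("average frequency: {:.2f} GHz".format(avg_ghz))   # 3.20 GHz here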

PAPI: what does Clock reference cycles mean?

Some modern CPUs, including Intel's and AMD's, are throttled.

This means that their clocks are not fixed but vary depending on the active power management - even if the CPU's nominal (advertised) frequency is X GHz, more often than not it is not actually running at that frequency.

For a couple of real examples of these technologies, see Intel Turbo Boost / AMD Turbo Core and Intel Enhanced SpeedStep / AMD Cool'n'Quiet.

Since the core clock can slow down or speed up, comparing two different measures makes no sense.

If snippet A runs in 100 core clocks and snippet B in 200 core clocks, B is slower in general (it takes twice as many cycles), but that does not necessarily mean B took more time than A, since the units are different.
That's where the reference clock comes into play - it is uniform.

If snippet A runs in 100 ref clocks and snippet B runs in 200 ref clocks then B really took more time than A.

Converting reference-clock ticks into time (e.g. seconds) is not that easy: each processor uses a different reference frequency, even among processors with the same brand name.
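
Once you have looked up (or measured) the reference frequency for your particular part, the conversion itself is simple; a minimal sketch where the 2.4 GHz reference frequency and the tick counts are purely illustrative assumptions:

REF_HZ = 2400000000     # assumed reference/base frequency for this hypothetical part

def ref_cycles_to_seconds(ref_cycles, ref_hz=REF_HZ):
    # Reference cycles tick at a fixed rate, so time is just ticks / frequency.
    return ref_cycles / float(ref_hz)

def average_core_hz(core_cycles, ref_cycles, ref_hz=REF_HZ):
    # The core-cycles / ref-cycles ratio over the same interval gives the
    # average core clock, since both counters cover the same wall-clock time.
    return ref_hz * core_cycles / float(ref_cycles)

print(ref_cycles_to_seconds(480000000))               # 0.2 s at the assumed 2.4 GHz
print(average_core_hz(960000000, 480000000))          # ~4.8 GHz average (turbo) here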

What is a clock cycle and clock speed?

The speed of a computer processor, or CPU, is determined by the clock cycle, which is the amount of time between two pulses of an oscillator. Generally speaking, the higher the number of pulses per second, the faster the computer processor will be able to process information. The clock speed is measured in Hz, typically either megahertz (MHz) or gigahertz (GHz). For example, a 4GHz processor performs 4,000,000,000 clock cycles per second.

Computer processors can execute one or more instructions per clock cycle, depending on the type of processor. Early and slower computer processors can only execute one instruction per clock cycle, but faster, more advanced processors can execute multiple instructions per clock cycle, processing data more efficiently.

On the other hand, clock speed (sometimes called the clock rate) refers to the number of pulses per second generated by an oscillator that sets the tempo for the processor. Clock speed is usually measured in MHz (megahertz, or millions of pulses per second) or GHz (gigahertz, or billions of pulses per second). Today's personal computers run at clock speeds of several gigahertz. The clock speed is determined by a quartz-crystal circuit, similar to those used in radio communications equipment.
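
As a quick sanity check of those definitions (pure arithmetic - the clock cycle is the reciprocal of the clock speed):

clock_hz = 4000000000           # a 4 GHz processor: 4,000,000,000 pulses per second
cycle_time_ns = 1e9 / clock_hz  # clock cycle = time between pulses = 1 / clock speed
print(cycle_time_ns)            # 0.25 ns per clock cycle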

CPU-usage or CPU-cycles of a process/pid with respect to the maximum CPU frequency

Thanks to BlackJack's comment

What about implementing it in C as shared library and using it via ctypes in Python?

A library call introduces less overhead. A subprocess call starts a whole external process and communicates the result as string over pipes every time you need the value. The shared library is loaded once into the current process and the result is passed in memory.

I implemented it as a shared library. The source code of the library, cpucycles.c, is heavily based on the example in perf_event_open's man page:

#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

long long
cpu_cycles(unsigned int microseconds,
           pid_t pid,
           int cpu,
           int exclude_user,
           int exclude_kernel,
           int exclude_hv,
           int exclude_idle)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    /* Configure a hardware cpu-cycles counter for the given pid/cpu. */
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_user = exclude_user;
    pe.exclude_kernel = exclude_kernel;
    pe.exclude_hv = exclude_hv;
    pe.exclude_idle = exclude_idle;

    fd = perf_event_open(&pe, pid, cpu, -1, 0);
    if (fd == -1) {
        return -1;
    }

    /* Count cycles for the requested interval, then read back the result. */
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    usleep(microseconds);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    close(fd);
    return count;
}

This code is compiled into a shared library via the following two commands:

$ gcc -c -fPIC cpucycles.c -o cpucycles.o
$ gcc -shared -Wl,-soname,libcpucycles.so.1 -o libcpucycles.so.1.0.1 cpucycles.o

Finally, the library can be used by Python in cpucycles.py:

import ctypes
import os

cdll = ctypes.cdll.LoadLibrary(os.path.join(os.path.dirname(__file__),
                                            "libcpucycles.so.1.0.1"))
cdll.cpu_cycles.argtypes = (ctypes.c_uint, ctypes.c_int, ctypes.c_int,
                            ctypes.c_int, ctypes.c_int, ctypes.c_int,
                            ctypes.c_int)
cdll.cpu_cycles.restype = ctypes.c_longlong


def cpu_cycles(duration=1.0, pid=0, cpu=-1,
               exclude_user=False, exclude_kernel=False,
               exclude_hv=True, exclude_idle=True):
    """
    See man page of perf_event_open for all the parameters.

    :param duration: duration of counting cpu_cycles [seconds]
    :type duration: float
    :returns: cpu-cycle count of pid
    :rtype: int
    """
    count = cdll.cpu_cycles(int(duration * 1000000), pid, cpu,
                            exclude_user, exclude_kernel,
                            exclude_hv, exclude_idle)
    if count < 0:
        raise OSError("cpu_cycles(pid={}, duration={}) from {} exited with code {}.".format(
            pid, duration, cdll._name, count))

    return count
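
A possible way to use it, tying back to the original question: dividing the returned count by the sampling duration gives the average frequency the process actually ran at over that interval. The pid below is hypothetical, and counting another process's events may require a permissive perf_event_paranoid setting:

from cpucycles import cpu_cycles

duration = 1.0
count = cpu_cycles(duration=duration, pid=1234)   # hypothetical pid to monitor
print("{} cycles over {} s ~ {:.2f} GHz average".format(
    count, duration, count / duration / 1e9))
# A value far below the nominal clock simply means the process was idle, blocked,
# or running at a reduced clock for much of the interval - which is exactly why
# cpu-cycles can be much less than the CPU's current frequency.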

Measuring CPU clock speed

This doesn't measure clock speed at all; it measures how many loop iterations can be done per second. There's no rule that says one iteration will run per clock cycle. It may be the case, and you may have actually found it to be the case - certainly with optimized code and a reasonable CPU, a useless loop shouldn't run much slower than that. It could run at half speed, though: some processors are not able to retire more than one taken branch every two cycles. And on esoteric targets, all bets are off.

So no, this doesn't measure clock cycles, except accidentally. In general it's extremely hard to get an empirical clock speed (you can ask your OS what it thinks the maximum clock speed and current clock speed are, see below), because

  1. If you measure how much wall clock time a loop takes, you must know (at least approximately) the number of cycles per iteration. That's a hard enough problem in assembly, requiring fairly detailed knowledge of the expected microarchitectures (maybe a long chain of dependent instructions that can each only reasonably take 1 cycle, like add eax, 1 - a chain long enough that differences in the test/branch throughput become small enough to ignore), so obviously anything you do there is not portable and will have assumptions built into it that may become false (there is actually another answer on SO that does this and assumes addps has a latency of 3, which it no longer does on Skylake and didn't have on old AMDs). In C? Give up now. The compiler might as well be rolling a random code generator, and relying on it to be reasonable is like doing the same with a bear. Guessing the number of cycles per iteration of code you neither control nor even know is just folly. If it's just on your own machine you can check the code, but then you could just check the clock speed manually too, so there's little point.

  2. If you measure the number of clock cycles elapsed in a given amount of wall clock time... this is tricky too, because rdtsc doesn't measure clock cycles (not anymore), and nothing else gets any closer. You can measure something, but with frequency scaling and turbo, it generally won't be actual clock cycles. You can get actual clock cycles from a performance counter, but you can't do that from user mode. And obviously any way you try to do this is not portable, because you can't portably ask for the number of elapsed clock cycles.

So if you're doing this for actual information and not just to mess around, you should probably just ask the OS. For Windows, query WMI for CurrentClockSpeed or MaxClockSpeed, whichever one you want. On Linux there's stuff in /proc/cpuinfo. Still not portable, but then, no solution is.
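
For Linux, a minimal sketch of just asking the OS (the "cpu MHz" field in /proc/cpuinfo is standard on x86; the cpufreq sysfs path is only present when a cpufreq driver is loaded):

def cpu_mhz_from_proc():
    # /proc/cpuinfo reports a "cpu MHz" line per logical CPU on x86.
    with open("/proc/cpuinfo") as f:
        return [float(line.split(":")[1]) for line in f if line.startswith("cpu MHz")]

def cpu_khz_from_sysfs(cpu=0):
    # Assumed cpufreq sysfs layout; only available when a cpufreq driver is active.
    path = "/sys/devices/system/cpu/cpu{}/cpufreq/scaling_cur_freq".format(cpu)
    with open(path) as f:
        return int(f.read())

print(cpu_mhz_from_proc()[:4])   # current MHz of the first few logical CPUs
print(cpu_khz_from_sysfs())      # current frequency of cpu0, in kHz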

As for

how do I know how many cores are involved in the process?

Of course your thread may migrate between cores, but since you only have one thread, it's on only one core at any time.
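
If you want to see which cores the (single-threaded) process is allowed to migrate between, a minimal Linux-only sketch:

import os

# The affinity mask lists the logical CPUs this process may be scheduled on;
# a single-threaded process still runs on only one of them at any instant.
allowed = os.sched_getaffinity(0)          # 0 = the calling process
print("may run on {} logical CPUs: {}".format(len(allowed), sorted(allowed)))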


