Arm Performance Counters Vs Linux Clock_Gettime

ARM performance counters vs linux clock_gettime

I found the solution. I upgraded the platform from Linux kernel 3.3.0 to 3.5, and the value is now similar to that of the performance counters. Apparently, in 3.3.0 the frequency of the clock counter is assumed to be higher than it really is (around 400 MHz) instead of half of the CPU frequency. This was probably a porting error in the old version.

Profiling tools for Linux and performance monitoring counters for ARM

The ARMv7-A profile, which is targeted at hosting a rich OS, has performance counters similar to Intel's. It looks like you have looked into the ARMv7-M profile, which is targeted at microcontroller environments. Most recent ARM cores that run Linux, such as the Cortex-A9, belong to the ARMv7-A profile.

Perf already supports performance counters on the ARM architecture, as does OProfile.

ARM also provides a polished Eclipse-based environment called DS-5 Streamline, with lots of extra features to help you analyze performance issues.

program execution time in ARM Cortex-A8 processor

a) PMU regs may be used by the perf subsystem of the Linux kernel (accessed through the perf userspace tool).

b) CCNT is the Cortex-A9 CPU cycle counter. It counts CPU cycles, or cycles/64 if you enable the divider. So 7 MHz with the divider would correspond to an average CPU clock of around 450 MHz (7 MHz x 64 = 448 MHz). This is separate from the 24 MHz system clock.

c) maybe your process got scheduled out. This is a low level cycle counter for the whole CPU, not just your process. It will keep running when in the kernel or in another process. On the other hand if your process migrates to another CPU you will then access that CPU's cycle counter (which might not even have the same divider setting). If you want a consistent count you should be pinning your process to one CPU.

d) similar answer to (c), you may be seeing the effect of process scheduling and migration.
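One way to sidestep the scheduling effects in (c) and (d) is to go through the perf subsystem from (a) instead of reading CCNT directly: a perf event opened for a thread is only counted while that thread is running. The sketch below assumes a Linux kernel with perf events and PMU support enabled, and sufficient permissions (see /proc/sys/kernel/perf_event_paranoid). The helper names open_thread_cycle_counter, count_thread_cycles and sample_work are illustrative and not part of any API.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a CPU-cycle counter for the calling thread via the perf subsystem.
 * Returns a file descriptor, or -1 if perf events are unavailable
 * (no PMU support, insufficient permissions, sandboxing, ...). */
int open_thread_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* start stopped; enabled explicitly later */
    attr.exclude_kernel = 1;  /* count user-space cycles only */
    attr.exclude_hv = 1;

    /* pid = 0: this thread; cpu = -1: follow it across CPUs. */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Count the cycles the calling thread spends inside fn().
 * Returns 0 and stores the count in *cycles, or -1 on any failure. */
int count_thread_cycles(void (*fn)(void), uint64_t *cycles)
{
    int fd = open_thread_cycle_counter();
    int ok;

    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    ok = read(fd, cycles, sizeof(*cycles)) == (ssize_t)sizeof(*cycles);
    close(fd);
    return ok ? 0 : -1;
}

/* Placeholder workload for trying the helper out. */
void sample_work(void)
{
    volatile unsigned i;

    for (i = 0; i < 100000u; i++) {
    }
}
```

A call such as count_thread_cycles(sample_work, &cycles) either yields a per-thread cycle count or fails cleanly with -1, in which case falling back to the raw counter plus CPU pinning is still an option.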

How to measure program execution time in ARM Cortex-A8 processor?

Accessing the performance counters isn't difficult, but you have to enable them from kernel-mode. By default the counters are disabled.

In a nutshell you have to execute the following two lines inside the kernel. Either as a loadable module or just adding the two lines somewhere in the board-init will do:

  /* enable user-mode access to the performance counter*/
  asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));

  /* disable counter overflow interrupts (just in case)*/
  asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));

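Once you have done this, user-mode code is allowed to access the counters. The cycle counter increments on every cycle once it has been enabled (the init_perfcounters function further down takes care of that). Overflows of the register go unnoticed and don't cause any problems (except that they might mess up your measurements).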

Now you want to access the cycle-counter from the user-mode:

We start with a function that reads the register:

static inline unsigned int get_cyclecount (void)
{
  unsigned int value;
  // Read CCNT Register
  asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));  
  return value;
}

And you most likely want to reset and set the divider as well:

#include <stdint.h>   /* int32_t */

static inline void init_perfcounters (int32_t do_reset, int32_t enable_divider)
{
  // in general enable all counters (including cycle counter)
  int32_t value = 1;

  // perform reset:
  if (do_reset)
  {
    value |= 2;     // reset all event counters to zero.
    value |= 4;     // reset cycle counter to zero.
  }

  if (enable_divider)
    value |= 8;     // enable "by 64" divider for CCNT.

  value |= 16;      // enable export of events to external monitoring (PMCR.X).

  // program the performance-counter control-register:
  asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));  

  // enable all counters:  
  asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));  

  // clear overflows:
  asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}

do_reset will set the cycle-counter to zero. Easy as that.

enable_divider will enable the 1/64 cycle divider. Without this flag set you'll be measuring each cycle. With it enabled the counter is incremented once every 64 cycles. This is useful if you want to measure long times that would otherwise cause the counter to overflow.
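To get a feeling for when the divider is needed, it helps to work out how long the 32-bit counter runs before it wraps. This is a small sketch; the function names ccnt_wrap_seconds and ccnt_delta are hypothetical, and the CPU clock you pass in is an assumption about your board.

```c
#include <stdint.h>

/* Seconds until the 32-bit cycle counter wraps around.
 * cpu_hz is the CPU clock; with the divider enabled the counter
 * ticks once per 64 CPU cycles. */
double ccnt_wrap_seconds(double cpu_hz, int divider_enabled)
{
    double ticks_per_second = divider_enabled ? cpu_hz / 64.0 : cpu_hz;

    return 4294967296.0 / ticks_per_second;   /* 2^32 ticks per wrap */
}

/* Elapsed ticks between two readings. Unsigned arithmetic is modulo
 * 2^32, so the result is still correct if the counter wrapped at most
 * once between the two readings. */
uint32_t ccnt_delta(uint32_t start, uint32_t end)
{
    return end - start;
}
```

For an assumed 450 MHz CPU clock this gives roughly 9.5 seconds without the divider and roughly 10 minutes with it. Anything that runs longer than one wrap period cannot be measured reliably from two readings alone.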

How to use it:

  // init counters:
  init_perfcounters (1, 0); 

  // measure the counting overhead:
  unsigned int overhead = get_cyclecount();
  overhead = get_cyclecount() - overhead;    

  unsigned int t = get_cyclecount();

  // do some stuff here..
  call_my_function();

  t = get_cyclecount() - t;

  printf ("function took exactly %u cycles (including function call)\n", t - overhead);

This should work on all Cortex-A8 CPUs.

Oh - and some notes:

Using these counters you'll measure the exact time between the two calls to get_cyclecount(), including everything spent in other processes or in the kernel. With this direct register access there is no way to restrict the measurement to your process or a single thread.

Also calling get_cyclecount() isn't free. It will compile to a single asm-instruction, but moves from the co-processor will stall the entire ARM pipeline. The overhead is quite high and can skew your measurement. Fortunately the overhead is also fixed, so you can measure it and subtract it from your timings.

In my example I did that for every measurement. Don't do this in practice. An interrupt will sooner or later occur between the two calls and skew your measurements even further. I suggest that you measure the overhead a couple of times on an idle system, ignore all outliers and use a fixed constant instead.
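One simple way to "ignore all outliers" is to take the median of a handful of overhead samples. This is a sketch only: the name median_overhead is made up, and taking the median is one choice among several (the minimum would be another reasonable one).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t values. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;

    return (x > y) - (x < y);
}

/* Median of n overhead samples (n must be > 0). Sorts the array in
 * place. Samples inflated by an interrupt end up at the top of the
 * sorted array and do not influence the result. */
uint32_t median_overhead(uint32_t *samples, size_t n)
{
    qsort(samples, n, sizeof(samples[0]), cmp_u32);
    return samples[n / 2];
}
```

Each sample would be the difference between two back-to-back get_cyclecount() calls, collected on an idle system. The resulting median can then be hard-coded as the fixed constant.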
