Reliability of Linux Kernel Add_Timer at Resolution of One Jiffy

Reliability of Linux kernel add_timer at resolution of one jiffy?

Many thanks for all the comments and answers; they all pointed to things that must be taken into account - but given I'm somewhat of a forever noob, I still needed to do some more reading, before gaining some understanding (I hope a correct one). Also, I couldn't really find anything specific for periodically "ticking" functions - so I'll post a more verbose answer here.

In brief - for a reliable periodic Linux kernel function at a resolution of a jiffy, do not use add_timer (<linux/timer.h>), as it may "drop" an entire period; use high-resolution timers (<linux/hrtimer.h>) instead. In more detail:

Is it possible that I get a "wrong" timestamp - ...?

@CL.: The timestamp in the log is the time when that string was printed to the log.

So, maybe it's possible - but it turns out, that's not the problem here:

Is this expected behavior from add_timer at this resolution (that a period can occasionally be missed)?

I guess, it turns out - yes:

If so, is there a way to "force" add_timer to fire the function at each 4ms slot, as specified by a jiffy on this platform?

... and (I guess again), it turns out - no.

Now, the reasons for this are somewhat subtle - and I hope if I didn't get them right, someone will correct me. First of all, the first misconception that I had, was that "a clock is just a clock" (in the sense of: even if it is implemented as computer code) - but that is not quite correct. The kernel basically has to "queue" an "event" somewhere, each time something like add_timer is used; and this request may come from anything really: from any (and all) sort(s) of driver(s), or even possibly userspace.

The problem is that this "queuing" has a cost: in addition to handling (the equivalent of) traversing, inserting and removing items in an array, the kernel also has to handle timer delays spanning several orders of magnitude (from, say, milliseconds to maybe tens of seconds). Some drivers (apparently, those for network protocols) queue a lot of timer events that are usually cancelled before running, while other users may require completely different behavior (in my case, with a periodic function, you expect that the event will usually not be cancelled, and you queue the events one by one). On top of that, the kernel needs to handle this for uniprocessor vs. SMP vs. multiprocessor platforms. Thus, there is a cost-benefit tradeoff involved in implementing timer handling in the kernel.

It turns out, the architecture around jiffies/add_timer is designed to handle the most common devices - and for them, precision at a resolution of a jiffy is not an issue; but this also means that one cannot expect a reliable timer at resolution of a single jiffy with this method. This is also compounded by the fact that the kernel handles these "event queues" by treating them (somewhat) like interrupt service requests (IRQ); and that there are several levels of priority in IRQ handling in the kernel, where higher priority routine can pre-empt a lower priority one (that is: interrupt and suspend a lower priority routine, even if it is being executed at the time - and allow the higher priority routine to go about its business). Or, as previously noted:

@granquet: timers run in soft irq context, which means they have the highest priority and they preempt everything running/runnable on the CPU ... but hardware interrupts which are not disabled when servicing a soft irq. So you might (most probable explanation) get an Hardware interrupt here and there that preempts your timer ... and thus you get an interrupt that is not serviced at the right time.

@CL.: It is indeed possible that your timer function gets called at a later jiffy than what expires was set to. Possible reasons are scheduling delays, other drivers that disable interrupts for too long (graphics and WLAN drivers are usual culprits), or some crappy BIOS executing SMI code.

I now think so, too - I think this could be an illustration of what happens:

jiffies changes to, say, 10000 (== 40000 ms @250 Hz)
Let's say the timer function, (queued by add_timer) is about to start running - but hasn't started running yet
Let's say here, the network card generates (for whatever reason) a hardware interrupt
The hardware interrupt, having a higher priority, triggers the kernel to pre-empt (stop and suspend) the timer function (possibly started by now, and just few instructions in);
That means the kernel now has to reschedule the timer function, to run at a later point - and since one only works with integer operations in the kernel, and time resolution for this kind of event is in jiffies - the best it can do is reschedule it for jiffies+1 (10001 == 40004 ms @250 Hz)
Now the kernel switches the context to the IRQ service routine of the network card driver, and it goes about its business
Let's say the IRQ service routine completes in 200 μs - that means now we're (in "absolute" terms) at 40000.2 ms - however, we are also still at 10000 jiffies
If the kernel now switched the context back to the timer function, it would have completed - without me necessarily noticing the delay;
... however, that will not happen, because the timer function is scheduled for the next jiffy!
So the kernel goes about its business (possibly sleeping) for the next approx 3.8 ms
jiffies changes to 10001 (== 40004 ms @250 Hz)
(the previously rescheduled) timer function runs - and this time completes without interruption
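The arithmetic behind the sequence above can be written out as a small userspace sketch. This is not kernel code, and it only models the hypothesis described in the list (a callback that is interrupted gets deferred to the next jiffy boundary); the function names next_tick_us and deferred_lateness_us are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: time (in microseconds) of the first tick boundary
 * strictly after now_us, for a tick period of tick_us. */
static uint64_t next_tick_us(uint64_t now_us, uint64_t tick_us)
{
    assert(tick_us > 0);
    return (now_us / tick_us + 1) * tick_us;
}

/* Hypothetical helper: how late a callback that was due at due_us ends up
 * running, ASSUMING that an interruption of irq_us causes it to be deferred
 * to the next tick boundary (the scenario sketched in the list above). */
static uint64_t deferred_lateness_us(uint64_t due_us, uint64_t irq_us,
                                     uint64_t tick_us)
{
    uint64_t resumed_us = due_us + irq_us;  /* moment the interrupt handler is done */
    return next_tick_us(resumed_us, tick_us) - due_us;
}
```

With a 4000 us tick (HZ=250), a callback due at 40000000 us and a 200 us interruption, the model waits 3800 us after the interrupt handler returns and runs the callback a full 4000 us late, which would look exactly like a dropped period in the log.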

I haven't really done a detailed analysis to see if the sequence of events is exactly as described above; but I'm quite persuaded that it is something close - in other words, a resolution problem - especially since the high-resolution timer approach seems to not show this behavior. It would be great indeed, to obtain a scheduler log, and know exactly what happened to cause a pre-empt - but I doubt the roundtrip to userspace, which I attempted in OP edit, in response to @granquet's comment, is the right thing to do.

In any case, going back to this:

Note that I'm not looking for a period resolution below what corresponds to a jiffy (in this case, 4ms); nor am I looking to decrease the delta variance when the code works properly. So as I see it, I don't have "high resolution timer" demands, nor "hard real-time" demands ...

... here was a bad mistake I made - as the analysis above shows, I did have "high resolution" demands! And had I realized that earlier, I may have found relevant reading sooner. Anyways, some relevant docs - even if they don't discuss specifically periodic functions - for me, were:

LDD3: 5.3. Semaphores and Mutexes - (in describing a driver with different demands from here): "no accesses will be made from interrupt handlers or other asynchronous contexts. There are no particular latency (response time) requirements; application programmers understand that I/O requests are not usually satisfied immediately"
Documentation/timers/hrtimers.txt - "The timers.c code is very "tightly coded" around jiffies and 32-bitness assumptions, and has been honed and micro-optimized for a relatively narrow use case (jiffies in a relatively narrow HZ range) for many years - and thus even small extensions to it easily break the wheel concept"
T. Gleixner, D. Niehaus Hrtimers and Beyond: Transforming the Linux Time Subsystems (pdf) - (most detailed, see also diagrams inside) "The Cascading Timer Wheel (CTW), which was implemented in 1997, replaced the original time ordered double linked list to resolve the scalability problem of the linked list's O(N) insertion time... The current approach to timer management in Linux does a good job of satisfying an extremely wide range of requirements, but it cannot provide the quality of service required in some cases precisely because it must satisfy such a wide range of requirements... The timeout related timers are kept in the existing timer wheel and a new subsystem optimized for (high resolution) timer requirements hrtimers was implemented. hrtimers are entirely based on human time (units: nanoseconds)... They are kept in a time sorted, per-CPU list, implemented as a red-black tree."
The high-resolution timer API [LWN.net] - "At its core, the hrtimer mechanism remains the same. Rather than using the "timer wheel" data structure, hrtimers live on a time-sorted linked list, with the next timer to expire being at the head of the list. A separate red/black tree is also used to enable the insertion and removal of timer events without scanning through the list. But while the core remains the same, just about everything else has changed, at least superficially."
Software interrupts and realtime [LWN.net] - "The softirq mechanism is meant to handle processing that is almost — but not quite — as important as the handling of hardware interrupts. Softirqs run at a high priority (though with an interesting exception, described below), but with hardware interrupts enabled. They thus will normally preempt any work except the response to a "real" hardware interrupt... Starting with the 3.0 realtime patch set, though, that capability went away... In response, in 3.6.1-rt1, the handling of softirqs has changed again."
High- (but not too high-) resolution timeouts [LWN.net] - "poll() and epoll_wait() take an integer number of milliseconds; select() takes a struct timeval with microsecond resolution, and ppoll() and pselect() take a struct timespec with nanosecond resolution. They are all the same, though, in that they convert this timeout value to jiffies, with a maximum resolution between one and ten milliseconds. A programmer might program a pselect() call with a 10 nanosecond timeout, but the call may not return until 10 milliseconds later, even in the absence of contention for the CPU. ... It's a useful feature, but it comes at the cost of some significant API changes."

One thing clear from the quotes, is that high-resolution timing facilities are still under active development (with API changes) in the kernel - and I was afraid, that maybe I'd have to install a special "real-time patch" kernel. Thankfully, high-resolution timers are seemingly available (and working) in my 2.6.38-16 SMP kernel without any special changes. Below is the listing of the modified testjiffies.c kernel module, which now uses high-resolution timers, but otherwise keeps the same period as determined by jiffies. For testing, I made it loop for 200 times (instead of 10 in the OP); and running the rerun.sh script for some 20-30 times, this is the worst result I got:

The time sequence is now obviously unreadable, but the histogram can still tell us this: taking 0.00435-0.004 (= 0.004-0.00365) = 350 μs for the max deviation, it represents only 100*(350/4000) = 8.75% of the expected period; which I certainly don't have a problem with. Additionally, I never got a drop (or correspondingly, an entire 2*period = 8 ms delay), or a 0 ms delay - the captures I got, are otherwise of the quality shown on the first image in OP. Now, of course I could run a longer test and see more precisely how reliable it is - but this is all the reliability I'd expect/need to see for this simple case; contrast that to the OP, where I'd get a drop in just 10 loops, with the probability of tossing a coin - every second or third run of the rerun.sh script, I'd get a drop - even in context of low OS resource usage!
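For anyone repeating the measurement with a different HZ, the percentage above is just the largest deviation divided by the nominal period. A tiny userspace sketch (the helper name deviation_percent is my own, nothing from the kernel):

```c
#include <assert.h>

/* Hypothetical helper: a timing deviation expressed as a percentage of the
 * nominal period; both arguments are in the same unit (microseconds here). */
static double deviation_percent(double deviation_us, double period_us)
{
    assert(period_us > 0.0);
    return 100.0 * deviation_us / period_us;
}
```

Plugging in the numbers from the histogram, deviation_percent(350.0, 4000.0) gives 8.75, while a fully dropped period (a 4000 us deviation) would show up as 100.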

Finally, note that the source below should have the problem, spotted by @CL.: "Your module is buggy: you must ensure that the timer is not pending before the module is unloaded", fixed (in the context of hrtimer). This seemingly answers my bonus question, as it obviates the need for either of the "MUSTHAVE" sleeps in the rerun.sh script. However, note that as 200 loops @ 4 ms take 0.8 s - the sleep between insmod and rmmod is needed if we want a full 200 tick capture (otherwise, on my machine, I get only some 7 ticks captured).

Well, hope I got this right now (at least most of it) - if not, corrections are welcome :)

testjiffy(-hr).c

#include <linux/module.h>   /* Needed by all modules */
#include <linux/kernel.h>   /* Needed for KERN_INFO */
#include <linux/init.h>     /* Needed for the macros */
#include <linux/jiffies.h>
#include <linux/time.h>
#define MAXRUNS 200

#include <linux/hrtimer.h>

static volatile int runcount = 0;

//~ static struct timer_list my_timer;
static unsigned long period_ms;
static unsigned long period_ns;
static ktime_t ktime_period_ns;
static struct hrtimer my_hrtimer;

//~ static void testjiffy_timer_function(unsigned long data)
static enum hrtimer_restart testjiffy_timer_function(struct hrtimer *timer)
{
  int tdelay = 100;
  unsigned long tjnow;
  ktime_t kt_now;
  int ret_overrun;

  runcount++;
  if (runcount == 5) {
    while (tdelay > 0) { tdelay--; } // small delay
  }

  printk(KERN_INFO
    " %s: runcount %d \n",
    __func__, runcount);

  if (runcount < MAXRUNS) {
    tjnow = jiffies;
    kt_now = hrtimer_cb_get_time(&my_hrtimer);
    ret_overrun = hrtimer_forward(&my_hrtimer, kt_now, ktime_period_ns);
    printk(KERN_INFO
      " testjiffy jiffies %lu ; ret: %d ; ktnsec: %lld \n",
      tjnow, ret_overrun, ktime_to_ns(kt_now));
    return HRTIMER_RESTART;
  }
  else return HRTIMER_NORESTART;
}

static int __init testjiffy_init(void)
{
  struct timespec tp_hr_res;
  period_ms = 1000/HZ;
  hrtimer_get_res(CLOCK_MONOTONIC, &tp_hr_res);
  printk(KERN_INFO
    "Init testjiffy: %d ; HZ: %d ; 1/HZ (ms): %ld ; hrres: %lld.%.9ld\n",
               runcount,      HZ,        period_ms, (long long)tp_hr_res.tv_sec, tp_hr_res.tv_nsec );

  hrtimer_init(&my_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
  my_hrtimer.function = &testjiffy_timer_function;
  period_ns = period_ms*( (unsigned long)1E6L );
  ktime_period_ns = ktime_set(0,period_ns);
  hrtimer_start(&my_hrtimer, ktime_period_ns, HRTIMER_MODE_REL);

  return 0;
}

static void __exit testjiffy_exit(void)
{
  int ret_cancel = 0;
  while( hrtimer_callback_running(&my_hrtimer) ) {
    ret_cancel++;
  }
  if (ret_cancel != 0) {
    printk(KERN_INFO " testjiffy Waited for hrtimer callback to finish (%d)\n", ret_cancel);
  }
  if (hrtimer_active(&my_hrtimer) != 0) {
    ret_cancel = hrtimer_cancel(&my_hrtimer);
    printk(KERN_INFO " testjiffy active hrtimer cancelled: %d (%d)\n", ret_cancel, runcount);
  }
  if (hrtimer_is_queued(&my_hrtimer) != 0) {
    ret_cancel = hrtimer_cancel(&my_hrtimer);
    printk(KERN_INFO " testjiffy queued hrtimer cancelled: %d (%d)\n", ret_cancel, runcount);
  }
  printk(KERN_INFO "Exit testjiffy\n");
}

module_init(testjiffy_init);
module_exit(testjiffy_exit);

MODULE_LICENSE("GPL");

high resolution timing in kernel?

Assuming the kernel you are running has hi-res timer support turned on (it is a build-time config option) and that you have proper timer hardware that can raise an interrupt at that granularity, you can use the in-kernel hrtimer API to register a timer that meets your requirement.

Here is the hrtimer documentation: http://www.mjmwired.net/kernel/Documentation/timers/hrtimers.txt

Bear in mind, though, that to truly get uninterrupted responses on such a scale you most probably also need to apply and configure the Linux RT (aka PREEMPT_RT) patches.

You can read more here: http://elinux.org/Real_Time

Where is jiffies computed in the Linux kernel?

Look at do_timer(). It was moved to kernel/time/timekeeping.c at some point in the past few years.

jiffies is not incremented directly. On a 32-bit platform it refers to the low-order 32 bits of jiffies_64, which is the variable that do_timer() increments:

/* 
 * The 64-bit jiffies value is not atomic - you MUST NOT read it 
 * without sampling the sequence number in xtime_lock. 
 * jiffies is defined in the linker script... 
 */ 
void do_timer(unsigned long ticks) 
{ 
 jiffies_64 += ticks; 
 update_wall_time(); 
 calc_global_load(ticks); 
}

In 3.2 it is http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.2#L1192

jiffies gets the value from jiffies_64 here in the machine specific file:

http://lxr.free-electrons.com/source/arch/arm/kernel/vmlinux.lds.S?v=3.2

#ifndef __ARMEB__
jiffies = jiffies_64;
#else
jiffies = jiffies_64 + 4;
#endif
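What those linker-script lines achieve can be imitated in userspace: a 32-bit read at the right byte offset inside a 64-bit counter yields its low-order word, at offset 0 on little-endian and offset 4 on big-endian (the __ARMEB__ case). The sketch below is only a model of that aliasing; low_word_offset and jiffies_view are hypothetical names, and the real kernel does this with a symbol alias, not with memcpy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of the low-order 32 bits inside a 64-bit value on the machine
 * running this code: 0 on little-endian, 4 on big-endian. */
static size_t low_word_offset(void)
{
    const uint64_t probe = 1;
    unsigned char bytes[sizeof probe];

    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 1 ? 0 : 4;
}

/* Hypothetical model of the "jiffies" alias: read 32 bits from the memory of
 * the 64-bit counter, at the offset where its low-order word lives. */
static uint32_t jiffies_view(const uint64_t *counter_64)
{
    uint32_t low;

    assert(counter_64 != NULL);
    memcpy(&low, (const unsigned char *)counter_64 + low_word_offset(),
           sizeof low);
    return low;
}
```

Incrementing the 64-bit counter is then automatically visible through the 32-bit view, and the view wraps to 0 when the low word overflows while the 64-bit value keeps counting.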

Subtraction of two unsigned long variables result in zero, even though each of their values are distinct

As noted in the comments, j_1 == j_0. This is understandable, as jiffies is incremented only once per timer interrupt. How often that happens is set by CONFIG_HZ, e.g. on my VM:

grep 'CONFIG_HZ=' /boot/config-$(uname -r)
CONFIG_HZ=250

250Hz = one timer interrupt every 4 ms. This granularity is way too coarse to measure the impact of a printk (and a single addition).
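To see why the subtraction comes out as zero, here is a rough userspace model that maps a nanosecond timestamp to a jiffy count by integer division. The name model_ns_to_jiffies and the constant are my own, and the kernel's actual conversion helpers are more involved; this only illustrates the granularity.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000ULL

/* Hypothetical model: number of whole ticks elapsed at time ns for a tick
 * rate of hz. Everything inside one tick maps to the same jiffy value. */
static uint64_t model_ns_to_jiffies(uint64_t ns, unsigned int hz)
{
    assert(hz > 0 && hz <= MODEL_NSEC_PER_SEC);
    return ns / (MODEL_NSEC_PER_SEC / hz);
}
```

At HZ=250, two samples taken 50 us apart (say, around a printk) usually land in the same 4 ms tick, so their jiffy values are equal and the difference is 0; only samples a full tick apart differ by 1.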

For sub-jiffy time measurements, you can use ftrace, do_gettimeofday or perf. This has been asked before. See e.g. this question's answers.

hrtimer repeating task in the Linux kernel

If you look in kernel/sched.c around line 170 in the function sched_rt_period_timer, you will see an example usage. The essential lines are

now = hrtimer_cb_get_time(timer);                               
overrun = hrtimer_forward(timer, now, rt_b->rt_period);

Here now gets the timer's current time as a ktime_t, and rt_b->rt_period is another ktime_t specifying the period by which to advance the timer. The expiration time of the hrtimer is incremented by the period repeatedly until it is greater than the current time. If it took more than one addition of the period to get the expiration time past the current time, the return value will be greater than 1 (indicating more overruns). It can be zero, if the timer's expiry didn't get advanced at all.

Reference: http://lwn.net/Articles/167897/

The API it uses is from a different version of the kernel so some of the arguments have changed. The basic idea is still the same.

Bypass softlockup_threshold

You could try to yield the CPU for a while. If that's acceptable, then you should look into schedule(). There's a great article here about Sleeping in the Kernel.
