When to Use Linux Kernel Add_Timer Vs Queue_Delayed_Work

When to use linux kernel add_timer vs queue_delayed_work

As I stated in my question, queue_delayed_work just uses add_timer internally, so in that respect the two are equivalent.

When to use kernel threads vs workqueues in the linux kernel

As you said, it depends on the task at hand:

Work queues defer work into a kernel thread - your work will always run in process
context. They are schedulable and can therefore sleep.

Normally, there is no debate between work queues and softirqs/tasklets: if the deferred work needs to sleep, work queues are used; otherwise softirqs or tasklets are used. Tasklets are also more suitable for interrupt handling (they are given certain assurances, such as: a tasklet is never run later than the next tick, it is always serialized with regard to itself, etc.).

Kernel timers are good when you know exactly when you want something to happen and do not want to interrupt or block a process in the meantime. They run outside process context and are asynchronous with regard to other code, so they can be a source of race conditions if you're not careful.

Hope this helps.

CPU Handling with Delayed Work

As you can see, queue_delayed_work sets the cpu argument to WORK_CPU_UNBOUND. This value is defined as NR_CPUS, so it is larger than any valid CPU number. It is passed to __queue_delayed_work which, if the delay is nonzero, uses a timer (via the add_timer function) to fire the callback function delayed_work_timer_fn after the specified time (this callback is set when the delayed work is initialized). All this callback does is call __queue_work, still passing WORK_CPU_UNBOUND as the cpu argument. So the whole "magic" happens there.

__queue_work checks whether the cpu argument is set to WORK_CPU_UNBOUND and, if so, chooses the current processor:

if (req_cpu == WORK_CPU_UNBOUND)
    cpu = raw_smp_processor_id();

So the work will be executed on the processor that handles the timer set up earlier. I didn't study the timer code, but IIRC from the LDD3 book, a timer is handled by the CPU it was registered on (unless that CPU is disabled in the meantime, of course, in which case the timer is moved to another CPU). That book is old, though, so this may not be true any more.

There is another hint in the kernel code that supports what I wrote: the comment on the queue_work function says, "We queue the work to the CPU on which it was submitted, but if the CPU dies it can be processed by another CPU". This function also uses WORK_CPU_UNBOUND as the cpu argument.
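The CPU choice described above can be modelled in plain user-space C. This is only a sketch, not kernel code: MODEL_NR_CPUS, MODEL_WORK_CPU_UNBOUND and model_pick_cpu are hypothetical names invented for illustration, and the current CPU is passed in as a parameter instead of being read with raw_smp_processor_id().

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel definitions; the values are illustrative. */
#define MODEL_NR_CPUS 4
#define MODEL_WORK_CPU_UNBOUND MODEL_NR_CPUS /* never a valid CPU index */

/*
 * Models only the CPU choice: an unbound request resolves to the CPU
 * that is executing the call, a bound request is kept as-is.
 */
int model_pick_cpu(int req_cpu, int current_cpu)
{
    assert(current_cpu >= 0 && current_cpu < MODEL_NR_CPUS);

    if (req_cpu == MODEL_WORK_CPU_UNBOUND)
        return current_cpu; /* kernel: cpu = raw_smp_processor_id(); */
    return req_cpu;
}
```

Under this model, an unbound request made while the timer callback runs on CPU 2 resolves to CPU 2, which matches the observation that the work ends up on the CPU that handled the timer.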

Timer migration details

As stated before, if a processor goes down, it can no longer handle IRQs, so it won't be able to handle the timers registered on it. Because of that, the kernel migrates all pending timers to another CPU when a CPU goes offline. This is done by the migrate_timers() function, which is run by timer_cpu_notify, which in turn is a callback registered as a cpu_notifier.

migrate_timers is run when the CPU state changes to CPU_DEAD or CPU_DEAD_FROZEN. This state is set inside the _cpu_down function by calling:

cpu_notify_nofail(CPU_DEAD | mod, hcpu);

It is called after __cpu_die(cpu), which ensures that the CPU being disabled is no longer running, so we can be sure this code runs on some other CPU. migrate_timers reassigns all timers to the CPU it's running on.

So where is the decision made about which CPU should take over the timers? One could say that it's made by the scheduler:

If you call cpu_down on a different CPU than the one you want to disable, then the calling CPU is the one that will take over.
If you call cpu_down on the CPU that is going to be disabled, it will schedule itself out in __cpu_die and the rest of the code will then be rescheduled on some other CPU.
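The effect of the migration can be illustrated with a small user-space model. This is a sketch under simplifying assumptions, not the kernel's implementation: each CPU's pending timers are reduced to a counter, and MODEL_TIMER_CPUS and model_migrate_timers are hypothetical names.

```c
#include <assert.h>

#define MODEL_TIMER_CPUS 4 /* hypothetical CPU count for the model */

/*
 * Moves every pending timer of the dead CPU to the CPU that runs the
 * migration, mirroring the rule that the timers are reassigned to
 * whichever CPU executes the CPU_DEAD notification.
 */
void model_migrate_timers(int pending[MODEL_TIMER_CPUS], int dead_cpu, int this_cpu)
{
    /* The migration never runs on the CPU that has just died. */
    assert(dead_cpu != this_cpu);

    pending[this_cpu] += pending[dead_cpu];
    pending[dead_cpu] = 0;
}
```

In this model the taking-over CPU is simply the this_cpu argument, which corresponds to the point above: no explicit choice is made, the timers land on whichever CPU ends up running the rest of the cpu_down path.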

Is do_timer() supposed to be called on only one core in SMP systems?

Let me answer my own question after googling and reading code.

do_timer() is supposed to be called on the CPU whose ID is stored in the tick_do_timer_cpu variable.

kernel/time/tick-common.c

/*
 * tick_do_timer_cpu is a timer core internal variable which holds the CPU NR
 * which is responsible for calling do_timer(), i.e. the timekeeping stuff. This
 * variable has two functions:
 *
 * 1) Prevent a thundering herd issue of a gazillion of CPUs trying to grab the
 *    timekeeping lock all at once. Only the CPU which is assigned to do the
 *    update is handling it.
 *
 * 2) Hand off the duty in the NOHZ idle case by setting the value to
 *    TICK_DO_TIMER_NONE, i.e. a non existing CPU. So the next cpu which looks
 *    at it will take over and keep the time keeping alive. The handover
 *    procedure also covers cpu hotplug.
 */

tick_do_timer_cpu is checked against the current CPU ID in tick_periodic() or in tick_sched_do_timer(). If the current CPU matches, do_timer() is called; otherwise it is not.

static void tick_periodic(int cpu)
{
        if (tick_do_timer_cpu == cpu) {
                write_seqlock(&jiffies_lock);

                /* Keep track of the next tick event */
                tick_next_period = ktime_add(tick_next_period, tick_period);

                do_timer(1);
                write_sequnlock(&jiffies_lock);
                update_wall_time();
        }

        update_process_times(user_mode(get_irq_regs()));
        profile_tick(CPU_PROFILING);
}

This way jiffies management is done on one core in SMP systems.
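A user-space model may make the effect easier to see. It is a simplified sketch, not kernel code: all names prefixed with model_ are hypothetical, locking is omitted, and the tick on each CPU is simulated by a plain loop.

```c
#include <assert.h>

#define MODEL_TICK_CPUS 4 /* hypothetical CPU count for the model */

int model_tick_do_timer_cpu = 0;                   /* CPU that owns timekeeping */
unsigned long model_jiffies = 0;                   /* global tick counter */
unsigned long model_local_ticks[MODEL_TICK_CPUS];  /* per-CPU tick work */

/* Every CPU takes the tick, but only the designated CPU advances jiffies. */
void model_tick_periodic(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_TICK_CPUS);

    if (model_tick_do_timer_cpu == cpu)
        model_jiffies++;          /* stands in for do_timer(1) */

    model_local_ticks[cpu]++;     /* stands in for update_process_times() */
}

/* Simulates the given number of tick periods across all CPUs. */
void model_run_ticks(int rounds)
{
    for (int r = 0; r < rounds; r++)
        for (int cpu = 0; cpu < MODEL_TICK_CPUS; cpu++)
            model_tick_periodic(cpu);
}
```

After ten simulated periods every CPU has done its local tick work ten times, while the global counter has advanced by ten rather than forty, because only the designated CPU touches it.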

How to modify kernel timer_list timeout

The kernel has no mechanism for detecting changes to variables. Instead, you should perform the corresponding actions before or after your code changes the variable.

When you add a sysctl entry, you also set a handler for it (ctl_table->proc_handler). This handler defines the actions that are executed when the entry is read or written. The standard proc_do* functions only get or set the value of the variable, so you should define your own handler. Something like this:

int my_handler(struct ctl_table *table, int write,
               void __user *buffer, size_t *lenp, loff_t *ppos)
{
    /* Call the standard helper to read or write the integer value. */
    int res = proc_dointvec(table, write, buffer, lenp, ppos);

    if (write && !res) {
        /* Additional actions on successful write. */
    }
    return res;
}

Modification of the timer's timeout can be performed using mod_timer function.
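To show how the two pieces fit together, here is a user-space model of re-arming a timer after a successful write. It is only a sketch: struct model_timer, model_mod_timer and model_on_interval_write are hypothetical stand-ins, and in a real handler the corresponding call would be mod_timer() with an expiry computed from jiffies.

```c
#include <assert.h>

/* Hypothetical user-space stand-in for struct timer_list. */
struct model_timer {
    unsigned long expires; /* absolute expiry, in ticks */
    int pending;           /* nonzero while the timer is armed */
};

/*
 * Follows the return convention of mod_timer(): 0 if the timer was
 * inactive, 1 if it was already pending. Either way the timer ends up
 * armed with the new expiry.
 */
int model_mod_timer(struct model_timer *t, unsigned long expires)
{
    int was_pending = t->pending;

    t->expires = expires;
    t->pending = 1;
    return was_pending;
}

/* What the "additional actions on successful write" branch could do. */
int model_on_interval_write(struct model_timer *t, unsigned long now,
                            unsigned long new_interval)
{
    assert(t != 0);
    return model_mod_timer(t, now + new_interval);
}
```

The point of the design is that the write handler, not the timer, notices the new value: the timeout only changes because the handler explicitly re-arms the timer with the updated interval.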
