What Is the Concept of vruntime in CFS

What is the concept of vruntime in CFS

vruntime is per-thread; it is a member of the sched_entity structure, which is embedded within the task_struct.

Essentially, vruntime is a measure of the "runtime" of the thread - the amount of time it has spent on the processor. The whole point of CFS is to be fair to all tasks; hence, the algorithm boils down to a simple rule: among the tasks on a given runqueue, the task with the lowest vruntime is the task that most deserves to run, so select it as 'next'. (The actual implementation uses a red-black tree for efficiency.)

Taking into account various factors - like priority, nice value, cgroups, etc. - the calculation of vruntime is not as straightforward as a simple increment. I'd suggest reading the relevant section in "Professional Linux Kernel Architecture" by Mauerer (Wrox Press) - it's explained in great detail.

Please see below for a quick attempt at summarizing some of this.

Other resource:
Documentation/scheduler/sched-design-CFS.txt

Quick summary - vruntime calculation:
(based on the book)

  • Most of the work is done in kernel/sched_fair.c:__update_curr()

  • Called on timer tick

  • Updates the physical and virtual time 'current' has just spent on the processor

  • For tasks that run at default priority, i.e., nice value 0, the physical and virtual time spent is identical

  • Not so for tasks at other priority (nice) levels; thus the calculation of vruntime is affected by the priority of current using a load weight factor

    delta_exec = (unsigned long)(now - curr->exec_start);
    // ...
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);
    curr->vruntime += delta_exec_weighted;

Neglecting some rounding and overflow checking, what calc_delta_fair does is to
compute the value given by the following formula:

delta_exec_weighted = delta_exec * (NICE_0_LOAD / curr->load.weight)

The thing is, more important tasks (those with a lower nice value) have larger
weights; thus, by the above equation, the vruntime accounted to them will be smaller
(and they are therefore enqueued further to the left in the rbtree!).

Completely Fair Scheduler (CFS): vruntime of long running processes

The kernel documentation for CFS glosses over the answer to your question, but does mention it briefly:

In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.

So, vruntime is actually normalized. But the documentation does not go into detail.

How is it actually done?

Normalization happens by means of a min_vruntime value. This min_vruntime value is recorded in the CFS runqueue (struct cfs_rq). The min_vruntime value is the smallest vruntime of all tasks in the rbtree. The value is also used to track all the work done by the cfs_rq.

You can observe an example of normalization being performed in CFS' enqueue_entity() code:

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        /*
         * Update the normalized vruntime before updating min_vruntime
         * through calling update_curr().
         */
        if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
                se->vruntime += cfs_rq->min_vruntime;

        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);
        ...
}

You can also observe in update_curr() how vruntime and min_vruntime are kept updated:

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        ...
        curr->exec_start = now;
        ...
        curr->sum_exec_runtime += delta_exec;
        ...
        curr->vruntime += calc_delta_fair(delta_exec, curr);
        update_min_vruntime(cfs_rq);
        ...
        account_cfs_rq_runtime(cfs_rq, delta_exec);
}

The actual update to min_vruntime happens in the aptly named update_min_vruntime() function:

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
        u64 vruntime = cfs_rq->min_vruntime;

        if (cfs_rq->curr)
                vruntime = cfs_rq->curr->vruntime;

        if (cfs_rq->rb_leftmost) {
                struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                                   struct sched_entity,
                                                   run_node);

                if (!cfs_rq->curr)
                        vruntime = se->vruntime;
                else
                        vruntime = min_vruntime(vruntime, se->vruntime);
        }

        /* ensure we never gain time by being placed backwards. */
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
        ...
}

By ensuring that min_vruntime is properly updated, it follows that normalization based on min_vruntime stays consistent. (You can see more examples of where normalization based on min_vruntime occurs by grepping for "normalize" or "min_vruntime" in fair.c.)

So in simple terms, all CFS tasks' vruntime values are normalized based on the current min_vruntime, which ensures that in your example, the newer task's vruntime will rapidly approach equilibrium with the older task's vruntime. (We know this because the documentation states that min_vruntime is monotonically increasing.)

Linux CFS (Completely Fair Scheduler) latency

Firstly the virtual runtime of a task

  • in theory, is when the task would start its next time slice of
    execution on a theoretically perfect multi-threaded CPU.
  • in practice, is its actual runtime normalized to the total number of running tasks.

1. How does the scheduler schedule all of the tasks within the
sysctl_sched_latency time?

It maintains a time-ordered red-black tree, where all the runnable tasks are
sorted by their virtual runtime. Nodes on the left have run for the shortest amount of time.
CFS picks the leftmost task and runs it until the task yields the CPU or the scheduler ticks;
the CPU time it spent running is then added to its virtual runtime.
When it is no longer the leftmost node, the new task with the shortest virtual runtime is run
and the old task is preempted.

2. When a process wakes up what actually is done in the place_entity function?

Short version:

When a process wakes up the place_entity function either leaves the
task's virtual runtime as it was or increases it.

Long version:

When a process wakes up the place_entity function does the following things

  1. Initialise a temporary virtual runtime to the CFS runqueue's smallest virtual runtime (min_vruntime).

  2. As sleeps shorter than a single latency don't count,
    initialise a threshold variable to sysctl_sched_latency.
    If the GENTLE_FAIR_SLEEPERS feature is enabled,
    halve the value of this variable.
    Decrement the previously initialised temporary virtual runtime by this threshold value.

  3. Ensure that the temporary virtual runtime is at least equal to the task's virtual runtime, by setting it to the maximum of itself and the task's virtual runtime.

  4. Set the task's virtual runtime to the temporary value.

3. When a process wakes up why is the vruntime adjusted by subtracting from sched_latency?

The virtual runtime is decremented because sleeps shorter than a single latency don't count.
E.g., the task shouldn't have its position in the red-black tree changed if it has
only slept for a single scheduler latency.

4. Can't that lead to processes in the run queue with large differences in the vruntime value?

I believe that the logic described in Step 3 of Question 2 prevents, or at least minimises, that.

References

sched Linux Kernel Source

sched_fair.c Linux Kernel Source

Notes on the CFS Scheduler Design

CFS scheduler: change vruntime of task to slow it down

You don't need to do this in the kernel (your final goal is not to cheat CFS).
If I understand correctly, you need to make a particular process run with lower CPU usage.

  1. You can start the particular process with a lower scheduling priority:

    nice -n 10 your_process arguments

    You can type `man nice` to get more information.

  2. If you can change the code of the particular process, you can call nice() in the code. You can type man 2 nice to get more information.

  3. You can use the cgroup cpu controller to achieve it.

    mount -t cgroup -o cpu cgroup /cgroup

    Then create a group under the mount point, lower its cpu.shares value, and move the process into it.

I suggest the first one.

Finally, if you insist on doing it in the kernel, you can try using __dequeue_entity() to dequeue the entity, change its vruntime, and then use __enqueue_entity() to insert it back. This is just a hint and may be wrong - nobody likes such a hack, and it is not my recommended answer.

Which kernel function manages balancing the RB-tree of the linux CFS?

While Ahmed is right about update_curr, I was still confused by the fact that the tree could be unbalanced between the moment vruntime is updated and the eventual call to resched_curr. As a matter of fact, it turns out that the currently scheduled task is not part of the tree.

It is removed from the tree in set_next_entity(cfs_rq_of(se), se), which is called when pick_next_task_fair is invoked. So the tree is always balanced, and the only functions performing the balancing are enqueue_entity and dequeue_entity.


