What Is the Concept of vruntime in CFS

What is the concept of vruntime in CFS

vruntime is per-thread; it is a member of the sched_entity structure, which is embedded within the task_struct.

Essentially, vruntime is a measure of the "runtime" of the thread - the amount of time it has spent on the processor. The whole point of CFS is to be fair to all tasks; hence, the algorithm boils down to a simple rule: among the tasks on a given runqueue, the task with the lowest vruntime is the task that most deserves to run, so select it as 'next'. (The actual implementation uses a red-black tree for efficiency.)

Taking into account various factors - like priority, nice value, cgroups, etc. - the calculation of vruntime is not as straightforward as a simple increment. I'd suggest reading the relevant section in "Professional Linux Kernel Architecture" by Mauerer (Wrox Press) - it's explained in great detail.

Please see below for a quick attempt at summarizing some of this.

Other resource:
Documentation/scheduler/sched-design-CFS.txt

Quick summary - vruntime calculation:
(based on the book)

  • Most of the work is done in kernel/sched_fair.c:__update_curr()

  • Called on timer tick

  • Updates the physical and virtual time 'current' has just spent on the processor

  • For tasks that run at default priority, i.e., nice value 0, the physical and virtual time spent is identical

  • Not so for tasks at other priority (nice) levels; thus the calculation of vruntime is affected by the priority of current using a load weight factor

    delta_exec = (unsigned long)(now - curr->exec_start);
    // ...
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);
    curr->vruntime += delta_exec_weighted;

Neglecting some rounding and overflow checking, what calc_delta_fair does is to
compute the value given by the following formula:

delta_exec_weighted = delta_exec * (NICE_0_LOAD / curr->load.weight)

The thing is, more important tasks (those with a lower nice value) have larger
weights; thus, by the above equation, the vruntime accounted to them will be smaller
(and they are therefore enqueued further to the left in the rbtree!).

Completely Fair Scheduler (CFS): vruntime of long running processes

The kernel documentation for CFS glosses over the answer to your question, but does mention it briefly:

In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.

So, vruntime is actually normalized. But the documentation does not go into detail.

How is it actually done?

Normalization happens by means of a min_vruntime value. This min_vruntime value is recorded in the CFS runqueue (struct cfs_rq). The min_vruntime value is the smallest vruntime of all tasks in the rbtree. The value is also used to track all the work done by the cfs_rq.

You can observe an example of normalization being performed in CFS' enqueue_entity() code:

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        /*
         * Update the normalized vruntime before updating min_vruntime
         * through calling update_curr().
         */
        if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
                se->vruntime += cfs_rq->min_vruntime;

        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);
        ...
}

You can also observe in update_curr() how vruntime and min_vruntime are kept updated:

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        ...
        curr->exec_start = now;
        ...
        curr->sum_exec_runtime += delta_exec;
        ...
        curr->vruntime += calc_delta_fair(delta_exec, curr);
        update_min_vruntime(cfs_rq);
        ...
        account_cfs_rq_runtime(cfs_rq, delta_exec);
}

The actual update to min_vruntime happens in the aptly named update_min_vruntime() function:

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
        u64 vruntime = cfs_rq->min_vruntime;

        if (cfs_rq->curr)
                vruntime = cfs_rq->curr->vruntime;

        if (cfs_rq->rb_leftmost) {
                struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                                   struct sched_entity,
                                                   run_node);

                if (!cfs_rq->curr)
                        vruntime = se->vruntime;
                else
                        vruntime = min_vruntime(vruntime, se->vruntime);
        }

        /* ensure we never gain time by being placed backwards. */
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
        ...
}

By ensuring that min_vruntime is properly updated, it follows that normalization based on min_vruntime stays consistent. (You can see more examples of where normalization based on min_vruntime occurs by grepping for "normalize" or "min_vruntime" in fair.c.)

So in simple terms, all CFS tasks' vruntime values are normalized based on the current min_vruntime, which ensures that in your example, the newer task's vruntime will rapidly approach equilibrium with the older task's vruntime. (We know this because the documentation states that min_vruntime is monotonically increasing.)

Linux CFS (Completely Fair Scheduler) latency

Firstly the virtual runtime of a task

  • in theory, is when the task would start its next time slice of
    execution on a theoretically perfect multi-threaded CPU.
  • in practice, is its actual runtime normalized to the total number of running tasks.

1. How does the scheduler schedule all of the tasks within the
sysctl_sched_latency time?

It maintains a time-ordered red-black tree, where all the runnable tasks are
sorted by their virtual runtime. Nodes on the left have run for the shortest amount of time.
CFS picks the leftmost task and runs it until the task yields the CPU or the scheduler ticks;
the CPU time it spent running is then added to its virtual runtime.
When it is no longer the leftmost node, the new task with the shortest virtual runtime is run
and the old task is preempted.

2. When a process wakes up what actually is done in the place_entity function?

Short version:

When a process wakes up the place_entity function either leaves the
task's virtual runtime as it was or increases it.

Long version:

When a process wakes up the place_entity function does the following things

  1. Initialise a temporary virtual runtime to the CFS runqueue's smallest virtual runtime (min_vruntime).

  2. As sleeps shorter than a single latency don't count,
    initialise a threshold variable to sysctl_sched_latency.
    If the GENTLE_FAIR_SLEEPERS feature is enabled,
    halve the value of this variable.
    Decrement the previously initialised temporary virtual runtime by this threshold value.

  3. Ensure that the temporary virtual runtime is at least equal to the task's virtual runtime, by setting it to the maximum of itself and the task's virtual runtime.

  4. Set the task's virtual runtime to the temporary value.

3. When a process wakes up why is the vruntime adjusted by subtracting from sched_latency?

The virtual runtime is decremented because sleeps shorter than a single latency don't count.
E.g., the task shouldn't have its position in the red-black tree changed if it has
only slept for a single scheduler latency.

4. Can't that lead to processes in the run queue with large differences in the vruntime value?

I believe that the logic described in Step 3 of Question 2 prevents, or at least minimises, that.

References

sched Linux Kernel Source

sched_fair.c Linux Kernel Source

Notes on the CFS Scheduler Design

CFS scheduler: change vruntime of task to slow it down

You don't need to do this in the kernel (your final goal is not to cheat CFS).
If I understand correctly, you need to make a particular process run with lower CPU usage.

  1. You can start the particular process with a lower scheduling priority:

    nice -n 10 your_process arguments

    You can type `man nice` to get more information.

  2. If you can change the code of the particular process, you can call nice() in the code. You can type man 2 nice to get more information.

  3. You can use the cgroup cpu controller to achieve it.

    mount -t cgroup -o cpu cgroup /cgroup

    Then create a group under the mount point, lower its cpu.shares value, and move the process into it.

I suggest the first one.

Finally, if you insist on doing it in the kernel, you can try using __dequeue_entity() to dequeue the entity, change its vruntime, and then use __enqueue_entity() to insert it back. This is just a hint and may be wrong - nobody likes such a hack, and it is not my recommended answer.

Which kernel function manages balancing the RB-tree of the linux CFS?

While Ahmed is right about update_curr, I was still confused by the fact that the tree could be unbalanced between the moment vruntime is updated and the eventual call to resched_curr. As a matter of fact, it turns out that the currently scheduled task is not part of the tree.

It is removed from the tree in set_next_entity(cfs_rq_of(se), se), which is called when pick_next_task_fair is invoked. So the tree is always balanced, and the only functions performing the balancing are enqueue_entity and dequeue_entity.


