Writing a Syscall to Count Context Switches of a Process

If your syscall should only report statistics, you can use context switch counting code that is already in the kernel.

The wait3 and getrusage syscalls already report context switch counts in these struct rusage fields:

struct rusage {
 ...
    long   ru_nvcsw;         /* voluntary context switches */
    long   ru_nivcsw;        /* involuntary context switches */
};

You can try it by running:

$ /usr/bin/time -v /bin/ls -R
....
    Voluntary context switches: 1669
    Involuntary context switches: 207

where "/bin/ls -R" can be replaced with any program.
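The same two fields can also be read from inside a program. Here is a minimal sketch using Python's standard resource module, which wraps getrusage on Unix-like systems; the helper name context_switches is only illustrative:

```python
import resource


def context_switches(who=resource.RUSAGE_SELF):
    """Return (voluntary, involuntary) context switch counts from getrusage."""
    usage = resource.getrusage(who)
    return usage.ru_nvcsw, usage.ru_nivcsw


voluntary, involuntary = context_switches()
print("Voluntary context switches:", voluntary)
print("Involuntary context switches:", involuntary)
```

Passing resource.RUSAGE_CHILDREN instead reports the totals of child processes that have already been waited for, which is roughly what /usr/bin/time shows for the command it ran.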

By searching for "struct rusage" in the kernel sources, you can find accumulate_thread_rusage in kernel/sys.c, which updates the rusage struct. It reads the fields t->nvcsw and t->nivcsw from struct task_struct *t:

static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r)
{
      r->ru_nvcsw += t->nvcsw;    // <<=== here
      r->ru_nivcsw += t->nivcsw;
      r->ru_minflt += t->min_flt;
      r->ru_majflt += t->maj_flt;

Next, search the kernel folder for nvcsw and nivcsw to find how the kernel updates them.

asmlinkage void __sched schedule(void):

4124     if (likely(prev != next)) {         // <= if we are switching between different tasks
4125            sched_info_switch(prev, next);
4126            perf_event_task_sched_out(prev, next);
4127
4128            rq->nr_switches++;          
4129            rq->curr = next;
4130            ++*switch_count;     // <= increment nvcsw or nivcsw via pointer
4131
4132            context_switch(rq, prev, next); /* unlocks the rq */

The switch_count pointer is set at line 4091 or line 4111 of the same file, so it points at either the nvcsw or the nivcsw field of the task being switched out.
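To make the pointer trick easier to follow, here is a toy model in Python. It is not kernel code. It assumes the selection works roughly like this: the pointer defaults to the involuntary counter and is switched to the voluntary one only when the outgoing task is blocking and was not preempted. The exact conditions differ between kernel versions, and Task, TASK_RUNNING and the preempted flag are simplified stand-ins:

```python
from dataclasses import dataclass

TASK_RUNNING = 0  # assumption: any non-zero state means the task is going to sleep


@dataclass
class Task:
    state: int = TASK_RUNNING
    nvcsw: int = 0   # voluntary context switches
    nivcsw: int = 0  # involuntary context switches


def schedule(prev: Task, nxt: Task, preempted: bool) -> None:
    """Toy version of the counting logic only; no real switching happens."""
    counter = "nivcsw"  # default target, like switch_count = &prev->nivcsw
    if prev.state != TASK_RUNNING and not preempted:
        counter = "nvcsw"  # the task gave up the CPU itself
    if prev is not nxt:  # only count when switching between different tasks
        setattr(prev, counter, getattr(prev, counter) + 1)  # ++*switch_count


a, b = Task(), Task()
schedule(a, b, preempted=True)   # a was preempted: involuntary
a.state = 1                      # a is about to block
schedule(a, b, preempted=False)  # a blocks: voluntary
print(a.nvcsw, a.nivcsw)
```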

PS: Link from perreal is great: http://oreilly.com/catalog/linuxkernel/chapter/ch10.html (search context_swtch)

System call and context switch

You need to understand that a thread/process context has multiple parts. One part is directly associated with execution and is held in the CPU and in certain system tables in memory that the CPU uses (e.g. page tables). The other part is needed by the OS for bookkeeping (think of the various IDs, handles, special OS-specific permissions, network connections and such).

A full context switch would involve swapping both of these: the old current thread/process goes away for a while and the new current thread/process comes in for a while. That's the essence of thread/process scheduling.

Now, system calls are very different w.r.t. each other.

Consider something simple, for example, the system call for requesting the current date and time. The CPU switches from user to kernel mode, preserving the user-mode register values, executes some kernel code to get the necessary data, stores it either in memory or in registers that the caller can access, restores the user-mode register values and returns. There's not much of a context switch here, only what's needed for the transition between the user and kernel modes.

Consider now a system call that involves blocking of the caller until some event or availability of data. Manipulating mutexes and reading files would be examples of such system calls. In this case the kernel is forced to save the full context of the caller, mark it as blocked so the scheduler can't run it until that event or data arrives, and load the context of another ready thread/process, so it can run.
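You can observe the difference between the two kinds of calls from user space. The sketch below counts how many voluntary context switches the process accumulates while repeating a call that returns immediately (os.getppid) and a call that blocks (time.sleep). The helper name and the repeat count are just illustrative, and the exact numbers depend on the OS and the load:

```python
import os
import resource
import time


def voluntary_switches_during(fn, repeat=50):
    """Run fn() repeat times and return the growth of ru_nvcsw for this process."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    for _ in range(repeat):
        fn()
    return resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw - before


quick = voluntary_switches_during(os.getppid)                 # enters the kernel, returns at once
blocking = voluntary_switches_during(lambda: time.sleep(0.001))  # caller is put to sleep

print("non-blocking calls:", quick, "voluntary switches")
print("blocking calls:    ", blocking, "voluntary switches")
```

On a typical Linux system I would expect the second number to be close to the repeat count and the first to be near zero, but treat that as something to verify rather than a guarantee.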

That's how system calls are related to context switches.

Kernel executing in the context of a user or a process means that whenever the kernel does work on behalf of a certain process or user it has to take into consideration that user's/process's context, e.g. the current process/thread/user ID, the current directory, locale, access permissions for various resources (e.g. files), all that stuff, that can be different between different processes/threads/users.

If processes have individual address spaces, the address space is also part of the process context. So, when the kernel needs to access memory of a process (to read/write file data or network packets), it has to have access to the process' address space, IOW, it has to be in its context (it doesn't mean, however, that the kernel has to load the full context just to access memory in a specific address space).

Is that helpful?

How to measure the cost of context switching more precisely

One fix first (10e9 is 10^10, not 10^9):

#define BILLION 1e9 //not 10e9

Otherwise the code is OK. read() does not return 0 if there's no data in the pipe; it blocks. That's why the ping-pong you're doing effectively measures the cost of context switches (plus IO overhead).

read() returns 0 for the read end of a pipe only when all OS-counted references (created via dup* functions or by forking in conjunction with fd inheritance) to the corresponding write end are closed.

You can measure the approximate IO overhead of the pipe separately by adapting the code to use just one pipe on a >=2 core system (so there's almost no context switch per IO call) and making one process a permanent reader and the other a permanent writer (https://pastebin.com/cGDWFdgQ). I'm getting about 2*0.55µs of IO overhead and about 5.5µs for the whole thing, so about 4.4µs per context switch.
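For readers who don't have the original C program at hand, here is a rough Python sketch of the same two-pipe ping-pong. It is not the code from the question; the function name and round count are made up, and the Python interpreter adds overhead of its own, so the absolute numbers will be much higher than the figures above. It does show the blocking read() and the end-of-file behaviour described here:

```python
import os
import time


def pingpong_round_trip_ns(rounds=2000):
    """Average nanoseconds per parent -> child -> parent round trip."""
    p2c_r, p2c_w = os.pipe()  # parent writes, child reads
    c2p_r, c2p_w = os.pipe()  # child writes, parent reads

    pid = os.fork()
    if pid == 0:
        # Child: echo each byte back until the parent closes its write end.
        try:
            os.close(p2c_w)  # otherwise the child itself keeps the write end open
            os.close(c2p_r)
            while True:
                data = os.read(p2c_r, 1)  # blocks while the pipe is empty
                if not data:              # b"" only when all write ends are closed
                    break
                os.write(c2p_w, data)
        finally:
            os._exit(0)

    # Parent: send a byte, then wait for the echo.
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter_ns()
    for _ in range(rounds):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter_ns() - start

    os.close(p2c_w)  # lets the child's read() return b"" so it can exit
    os.waitpid(pid, 0)
    os.close(c2p_r)
    return elapsed / rounds


if __name__ == "__main__":
    print("average round trip: %.0f ns" % pingpong_round_trip_ns())
```

Each round trip contains two blocking reads, so when both processes share one core it includes at least two context switches plus the pipe IO.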

How to measure the context switching overhead of a very large program?

Are you sure most of those 200 threads are actually waiting to run at the same time, not waiting for data from a system call? I guess you can tell from perf stat that context-switches are actually pretty high, but part of the question is whether they're high for the threads doing the critical work.

The cost of a context-switch is reflected in cache misses once a thread is running again. (And stopping OoO exec from finding as much ILP right at the interrupt boundary). This cost is more significant than the cost of the kernel code that saves/restores registers. So even if there was a way to measure how much time the CPUs spent in kernel context-switch code (possible with perf record sampling profiler as long as your perf_event_paranoid setting allows recording kernel addresses), that wouldn't be an accurate reflection of the true cost.

Even making a system call has a similar (but lower and more frequent) performance cost from serializing OoO exec, as well as disturbing caches (and TLB). There's a useful characterization of this on real modern CPUs (from 2010) in a paper by Livio Soares & Michael Stumm, especially the graph on the first page of IPC (instructions per cycle) dropping after a system call returns, and taking time to recover: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. (Conference presentation: https://www.usenix.org/conference/osdi10/flexsc-flexible-system-call-scheduling-exception-less-system-calls)

You might estimate context-switch cost by running the program on a system with enough cores not to need to context-switch much at all (e.g. a big many-core Xeon or Epyc), vs. on fewer cores but with the same CPUs / caches / inter-core latency and so on. So, run it on the same system with taskset --cpu-list 0-8 ./program to limit how many cores it can use.

Look at the total user-space CPU-seconds used: the amount by which it is higher is the extra CPU time needed because of slowdowns from context switches. The wall-clock time will of course be higher when the same work has to compete for fewer cores, but perf stat includes a "task-clock" output which tells you the total time in CPU-milliseconds that threads of your process spent on CPUs. That would be constant for the same amount of work, with perfect scaling to more threads, and/or to the same threads competing for more / fewer cores.

But that would tell you about context-switch overhead on that big system, which has bigger caches and higher latency between cores than a small desktop.
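If perf isn't available, a similar comparison can be approximated with getrusage. This is only a sketch: run_and_measure is a hypothetical helper, it reads the rusage of waited-for children before and after running a command, and the one-liner it runs in the demo is a stand-in workload:

```python
import resource
import subprocess
import sys


def run_and_measure(cmd):
    """Run cmd to completion and return the CPU time and context switches it used."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "user_s": after.ru_utime - before.ru_utime,   # user-space CPU seconds
        "sys_s": after.ru_stime - before.ru_stime,    # kernel CPU seconds
        "voluntary": after.ru_nvcsw - before.ru_nvcsw,
        "involuntary": after.ru_nivcsw - before.ru_nivcsw,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with your own command line.
    print(run_and_measure([sys.executable, "-c", "sum(range(10**6))"]))
```

To compare the two situations, call it once with the plain command and once with the command prefixed by ["taskset", "--cpu-list", "0-8"], then compare the user_s values.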

Getting a number of context switches for a process / thread

Well, let's examine the case. A Linux-type O/S keeps these details systematically, and one may use the comfort of Python both for inspecting the state and for easily designing a monitoring system that can report any excessive circumstances ( the former suits just-out-of-curiosity cases, the latter is handy for any re-work / re-use in systematic work ):

A "Monitor" example for both { voluntary | involuntary }-Ctx Switching :

The Python here serves both an educational role and the ease and comfort of further extending the scope of functionality:

Having assigned signal.signal( signal.SIGALRM, SIG_ALRM_handler_A ) and set the timing, the system gets ready to report both voluntary and involuntary ( enforced ) Context-Switches.

To demonstrate this, a "FAT"-blocking piece of computing was used, which, for historical reasons, resorts to non-GIL Numpy/C/FORTRAN code and thus gets disturbed only by involuntary Context-Switches, as shown in the comments of the handler below:

( len(str([np.math.factorial(2**f) for f in range(20)][-1])) )

By using principally any other PID-number, this trivial monitoring mechanics can serve whatever other purposes:

########################################################################
### SIGALRM_handler_          
###

import psutil, resource, os, time

SIG_ALRM_last_ctx_switch_VOLUNTARY = -1
SIG_ALRM_last_ctx_switch_FORCED    = -1

def SIG_ALRM_handler_A( aSigNUM, aFrame ):                              # SIG_ALRM fired evenly even during [ np.math.factorial( 2**f ) for f in range( 20 ) ] C-based processing =======================================
    # onEntry_ROTATE_SigHandlers() -- MAY set another sub-sampled SIG_ALRM_handler_B() ... { last: 0, 0: handler_A, 1: handler_B, 2: handler_C }
    #
    # onEntry_SEQ of calls of regular, hierarchically timed MONITORS ( just the SNAPSHOT-DATA ACQUISITION Code-SPRINTs, handle later due to possible TimeDOMAIN overlaps )
    # 
    #
    # print( time.ctime() )
    # print( formatExtMemoryUsed( getExtMemoryUsed() ) )
    # print( 60 * "=", psutil.Process( os.getpid() ).num_ctx_switches(), "~~~", aProcess.cpu_percent( interval = 0 ) )
    #                                        ???                        # WHY CPU 0.0%
    aProcess         =   psutil.Process( os.getpid() )
    aProcessCpuPCT   =         aProcess.cpu_percent( interval = 0 )     # EVENLY-TIME-STEPPED
    aCtxSwitchNUMs   =         aProcess.num_ctx_switches()              # THIS PROCESS ( may inspect other per-incident later ... on anomaly )

    aVolCtxSwitchCNT = aCtxSwitchNUMs.voluntary
    aForcedSwitchCNT = aCtxSwitchNUMs.involuntary

    global SIG_ALRM_last_ctx_switch_VOLUNTARY
    global SIG_ALRM_last_ctx_switch_FORCED

    if (     SIG_ALRM_last_ctx_switch_VOLUNTARY != -1 ):                # .INIT VALUE ALREADY REPLACED BY A PREVIOUS TICK
        #----------
        # .ON_TICK: must process delta(s)
        if ( SIG_ALRM_last_ctx_switch_VOLUNTARY == aVolCtxSwitchCNT ):
            #
            # AN INDIRECT INDICATION OF A LONG-RUNNING WORKLOAD OUTSIDE GIL-STEPPING ( regex / C-lib / FORTRAN / numpy-block et al )
            #                                                                                 |||||              vvv
            # SIG_:  Wed Oct 19 12:24:32 2016 ------------------------------ pctxsw(voluntary=48714, involuntary=315)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:24:37 2016 ------------------------------ pctxsw(voluntary=48714, involuntary=323)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:24:42 2016 ------------------------------ pctxsw(voluntary=48714, involuntary=331)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:24:47 2016 ------------------------------ pctxsw(voluntary=48714, involuntary=338)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:24:52 2016 ------------------------------ pctxsw(voluntary=48714, involuntary=346)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:24:57 2016 ------------------------------ pctxsw(voluntary=48714, involuntary=353)  ~~~  0.0
            # ...                                                                             |||||              ^^^
            # ...0000000000000000000000000000000000000000000000000000000000000000] ( long run of trailing zeros in the printed output, truncated )
            # >>>                                                                             |||||              |||
            #                                                                                 vvvvv              |||
            # SIG_:  Wed Oct 19 12:26:17 2016 ------------------------------ pctxsw(voluntary=49983, involuntary=502)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:26:22 2016 ------------------------------ pctxsw(voluntary=49984, involuntary=502)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:26:27 2016 ------------------------------ pctxsw(voluntary=49985, involuntary=502)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:26:32 2016 ------------------------------ pctxsw(voluntary=49986, involuntary=502)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:26:37 2016 ------------------------------ pctxsw(voluntary=49987, involuntary=502)  ~~~  0.0
            # SIG_:  Wed Oct 19 12:26:42 2016 ------------------------------ pctxsw(voluntary=49988, involuntary=502)  ~~~  0.0

            #rint(   "SIG_ALRM_handler_A(): A SUSPECT CPU-LOAD:: ", time.ctime(), 10 * "-",  aProcess.num_ctx_switches(), "{0: > 8.2f} CPU_CORE_LOAD [%]".format( aProcessCpuPCT ), " INSPECT processes ... ev. add a StateFull-self-Introspection" )
            print(   "SIG_ALRM_handler_A(): A SUSPECT CPU-LOAD:: ", time.ctime(), 10 * "-",  aProcess.num_ctx_switches(), "{0:_>60s}".format( str( aProcess.threads() ) ), " INSPECT processes ... ev. add a StateFull-self-Introspection" )
            #rint(   "SIG_ALRM_handler_A(): A SUSPECT CPU-LOAD:: ", str( resource.getrusage( resource.RUSAGE_SELF ) )[22:] )
    else:
        #----------
        # .ON_INIT: may report .INIT()
        #rint(   "SIG_ALRM_handler_A(): A SUSPECT CPU-LOAD:: ", time.ctime(), ...
        print(   "SIG_ALRM_handler_A(): activated            ", time.ctime(), 30 * "-",  aProcess.num_ctx_switches() )

    ##########
    # FINALLY:

    SIG_ALRM_last_ctx_switch_VOLUNTARY = aVolCtxSwitchCNT               # .STO ACTUALs
    SIG_ALRM_last_ctx_switch_FORCED    = aForcedSwitchCNT               # .STO ACTUALs

    #rint(   "SIG_: ", time.ctime(), 30 * "-",  aProcess.num_ctx_switches(), " ~~~ ", aProcess.cpu_percent( interval = 0 ), " % -?- ", aProcess.threads() )

#____________________________________________________________________
# SIG_ALRM_handler_A( aSigNUM, aFrame ):                      DEFINED
#####################################################################

##########
# FINALLY:
# 
# > signal.signal(    signal.SIGALRM, SIG_ALRM_handler_A )          # .ASSOC { SIGALRM: thisHandler }
# > signal.setitimer( signal.ITIMER_REAL, 10, 5 )                   # .SET   @5 [sec] interval, after first run, starting after 10[sec] initial-delay
# > signal.setitimer( signal.ITIMER_REAL,  0, 5 )                   # .UNSET
# > SIG_ALRM_last_ctx_switch_VOLUNTARY = -1                         # .RESET .INIT() the global { signalling | state }-variable
# > len(str([np.math.factorial(2**f) for f in range(20)][-1]))      # .RUN   A "FAT"-BLOCKING CHUNK OF A regex/numpy/C/FORTRAN-calculus

Also the Thread-level CtxSwitch details

While this was not elaborated to a similar depth, the same as above applies to:

>>> psutil.Process( 18263 ).cpu_percent()                           0.0
>>> psutil.Process( 18263 ).ppid()                                  18054

>>> psutil.Process( 18054 ).cpu_percent()                           0.0
=== ( 18054 ).threads(): [ 17679, 17680, 17681, 18054, 18265, 18266, 18267, ]
                                                                                                ==4 -------------vvv-------------------=4--------------vvvv-------------------=4--------------vvv
>>> [ psutil.Process( p ).num_ctx_switches() for p in ( 18259, 18260, 18261 ) ] [pctxsw(voluntary=4, involuntary=267), pctxsw(voluntary=4, involuntary=1909), pctxsw(voluntary=4, involuntary=444)]
>>> [ psutil.Process( p ).num_ctx_switches() for p in ( 18259, 18260, 18261 ) ] [pctxsw(voluntary=4, involuntary=273), pctxsw(voluntary=4, involuntary=1915), pctxsw(voluntary=4, involuntary=445)]
>>> [ psutil.Process( p ).num_ctx_switches() for p in ( 18259, 18260, 18261 ) ] [pctxsw(voluntary=4, involuntary=275), pctxsw(voluntary=4, involuntary=1917), pctxsw(voluntary=4, involuntary=445)]
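Without psutil, the per-thread counters can be read on Linux straight from /proc. The sketch below assumes the usual /proc/<pid>/task/<tid>/status layout with its voluntary_ctxt_switches and nonvoluntary_ctxt_switches lines; the function names are illustrative, and on systems without that layout it simply returns an empty dict:

```python
import os
import re
from pathlib import Path

# Matches the two counter lines of a /proc status file.
_CTXT_RE = re.compile(r"^(voluntary|nonvoluntary)_ctxt_switches:\s*(\d+)$", re.MULTILINE)


def parse_ctxt_switches(status_text):
    """Extract {'voluntary': n, 'nonvoluntary': m} from the text of a status file."""
    return {name: int(value) for name, value in _CTXT_RE.findall(status_text)}


def per_thread_ctxt_switches(pid):
    """Map each thread ID of pid to its context switch counters (Linux only)."""
    task_dir = Path("/proc/%s/task" % pid)
    result = {}
    if not task_dir.is_dir():
        return result  # no /proc task directory on this system or for this pid
    for entry in task_dir.iterdir():
        try:
            result[int(entry.name)] = parse_ctxt_switches((entry / "status").read_text())
        except (OSError, ValueError):
            continue  # the thread exited while we were reading
    return result


if __name__ == "__main__":
    for tid, counts in sorted(per_thread_ctxt_switches(os.getpid()).items()):
        print(tid, counts)
```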
