Forcing a Context Switch from Userland on Linux

Forcing a context switch from userland on Linux?

You shouldn't be trying to control context switches, as the kernel probably knows much better than you when to do it. Especially since a context switch is considered a costly operation (due to cache invalidation, mostly).

But, if you really need to force a scheduler cycle, you can use the sched_yield() system call.
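For example, here is a minimal sketch of yielding from C. Note that sched_yield() is only a hint: if no other task of equal or higher priority is runnable, the caller simply keeps going.

    /* Minimal sketch: politely ask the kernel to reschedule. */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hint to the scheduler that this thread is willing to give up the CPU.
         * If nothing else is runnable, we just keep running. */
        if (sched_yield() != 0)
            perror("sched_yield");
        puts("back on the CPU");
        return 0;
    }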

Can preemptive multitasking of native code be implemented in user space on Linux?

You cannot reliably change contexts inside signal handlers. (If you did that from some signal handler, it would usually work in practice, but not always; hence it is undefined behavior.)

You could set some volatile sig_atomic_t flag (read about sig_atomic_t) in a signal handler (see signal(7), signal-safety(7), sigreturn(2) ...) and check that flag regularly (e.g. at least once every few milliseconds) in your code, for example before most calls, or inside your event loop if you have one, etc... So it becomes cooperative user-land scheduling.
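A minimal sketch of that pattern follows; the yield_to_scheduler() hook is hypothetical, standing in for whatever your user-land dispatcher does, and the handler itself only sets the flag:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t need_resched = 0;

    static void on_tick(int sig)
    {
        (void)sig;
        need_resched = 1;            /* async-signal-safe: only set the flag */
    }

    static void yield_to_scheduler(void)
    {
        /* hypothetical hook: switch to another user-land "thread" here */
    }

    static void maybe_yield(void)
    {
        if (need_resched) {
            need_resched = 0;
            yield_to_scheduler();
        }
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_tick;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGALRM, &sa, NULL);
        alarm(1);                    /* one tick; a real dispatcher would use setitimer/timer_create */

        for (;;) {
            /* ... do a bounded chunk of work ... */
            maybe_yield();           /* check the flag at safe points */
        }
    }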

This is easier if you can change the code, e.g. when you design a compiler which emits C code (a common practice), or if you hack your C compiler to emit such tests: you change your code generator to occasionally emit such a test in the generated code.

You may want to forbid blocking system calls and replace them with non-blocking variants or wrappers. See also poll(2), fcntl(2) with F_SETFL and O_NONBLOCK, etc...
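A sketch of the usual replacement pattern, assuming the descriptor is already open (a green-thread runtime would yield to its dispatcher instead of calling poll(2) directly):

    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <unistd.h>

    /* Switch an already-open descriptor to non-blocking mode. */
    static int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags == -1)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Read without ever blocking in read(2): wait for readiness with poll(2). */
    static ssize_t read_with_poll(int fd, void *buf, size_t len)
    {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        for (;;) {
            ssize_t n = read(fd, buf, len);
            if (n >= 0)
                return n;
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;
            poll(&p, 1, -1);         /* a cooperative scheduler would yield here instead */
        }
    }

    int main(void)
    {
        char buf[256];
        set_nonblocking(STDIN_FILENO);
        ssize_t n = read_with_poll(STDIN_FILENO, buf, sizeof buf);
        return n < 0;
    }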

You may want the code generator to avoid large call stacks, e.g. like GCC's -fsplit-stack instrumentation option does (read about split stacks in GCC).

And if you generate (or write some) assembler, you can use such tricks. AFAIK the Go compiler uses something similar for its goroutines. Study your platform's ABI.

However, kernel-initiated preemptive scheduling is preferable (and on Linux it will still happen between processes or kernel tasks, see clone(2)).

PS. If garbage collection techniques using similar tricks interest you, look into MPS and Cheney on the MTA (e.g. into Chicken Scheme).

What exactly makes an Erlang process, green thread, or coroutine lighter than a kernel thread? What about context switching is heavy?

Context switching on preemptive, monolithic, multitasking operating systems involves one of two paths: either an implicit yield to the scheduler via some system service call (sleep, mutex acquire, waiting for an event, blocking I/O), or an interrupt followed by the scheduler deciding to swap running tasks.

When a task is swapped by the scheduler, a few heavyweight things happen:

  • All the action happens as part of the operating system kernel, running in a high level of privilege. Every action is checked (or it should be) to ensure that decisions made by the scheduler don't grant a task any additional privileges (think local root exploit)
  • User-mode process address spaces are swapped. This results in the memory manager dorking around with the page table layouts and loading a new directory base into a control register.
  • This also means that data kept in the CPU cache could be shot down and purged. This would suck if your task had just accessed a bunch of frequently used stuff, then context switched and "lost" it (on next access it would [probably] have to be fetched from main memory again)
  • Depending on how you trap into the kernel, you then need to trap OUT of the kernel. If you make a system call, for example, the CPU will go through a very precise set of steps to transition to code running in the kernel, and then on exit, unwind those steps. These steps are more complicated than making a function call to another module in your program, so they take more time.

A green thread task is pretty straightforward, as I understand it. A user-mode dispatcher directs a coroutine to run until the coroutine yields. A few differences in above:

  • None of the dispatching of coroutines happens in kernel mode; indeed, dispatching green threads generally does not need to involve any operating system services, let alone any blocking operating system services. So all of this can happen without any context switches or any user/kernel transition.
  • Green threads aren't preemptively scheduled, or even preempted at all, by the green thread manager; they are scheduled cooperatively. This is good and bad, but with well-written routines, generally good. Each task does precisely what it needs to do and then traps back to the dispatcher, but without any context-swap overhead.
  • Green threads share their address space (as far as I know). No swapping of the address space happens on context switches. Stacks are (as far as I know) swapped, but those stacks are managed by the dispatcher, and swapping a stack is a simple write into a register. Swapping a stack is also a nonprivileged operation.

In short, a context switch in user mode involves a few library calls and writing to the stack pointer register. A context switch in kernel mode involves interrupts, user/kernel transition, and system level behavior like address space state changes and cache flushes.
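Here is a sketch of such a user-mode switch using the (obsolescent but illustrative) <ucontext.h> API; nothing here involves a kernel scheduling decision, only saving and restoring registers and switching to a dispatcher-managed stack:

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, green_ctx;

    static void green_thread(void)
    {
        puts("green thread: running");
        swapcontext(&green_ctx, &main_ctx);   /* cooperative yield back to main */
        puts("green thread: resumed");
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);      /* stack owned by the "dispatcher" */

        getcontext(&green_ctx);
        green_ctx.uc_stack.ss_sp   = stack;
        green_ctx.uc_stack.ss_size = 64 * 1024;
        green_ctx.uc_link = &main_ctx;        /* where to continue when the function returns */
        makecontext(&green_ctx, green_thread, 0);

        puts("main: switching to green thread");
        swapcontext(&main_ctx, &green_ctx);
        puts("main: switching back");
        swapcontext(&main_ctx, &green_ctx);
        puts("main: done");
        free(stack);
        return 0;
    }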

Non-preemptive Pthreads?

You can use Pth (a.k.a. GNU Portable Threads), a non-preemptive thread library. Configuring it with --enable-pthread will create a plug-in replacement for pthreads. I just built and tested this on my Mac and it works fine for a simple pthreads program.

From the README:

Pth is a very portable POSIX/ANSI-C based library for Unix platforms
which provides non-preemptive priority-based scheduling for multiple
threads of execution (aka `multithreading') inside event-driven
applications. All threads run in the same address space of the server
application, but each thread has its own individual program-counter,
run-time stack, signal mask and errno variable.

The thread scheduling itself is done in a cooperative way, i.e., the
threads are managed by a priority- and event-based non-preemptive
scheduler. The intention is, that this way one can achieve better
portability and run-time performance than with preemptive scheduling.
The event facility allows threads to wait until various types of
events occur, including pending I/O on filedescriptors, asynchronous
signals, elapsed timers, pending I/O on message ports, thread and
process termination, and even customized callback functions.

Additionally Pth provides an optional emulation API for POSIX.1c
threads (`Pthreads') which can be used for backward compatibility to
existing multithreaded applications.
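For what it's worth, here is a minimal sketch against the native Pth API (not the pthread emulation), assuming Pth is installed and the program is linked with -lpth; both workers run in one process and only switch when they yield or block inside a Pth call:

    #include <pth.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        const char *name = arg;
        for (int i = 0; i < 3; i++) {
            printf("%s: step %d\n", name, i);
            pth_yield(NULL);         /* cooperative hand-off to the Pth scheduler */
        }
        return NULL;
    }

    int main(void)
    {
        pth_init();
        pth_t a = pth_spawn(PTH_ATTR_DEFAULT, worker, "a");
        pth_t b = pth_spawn(PTH_ATTR_DEFAULT, worker, "b");
        pth_join(a, NULL);
        pth_join(b, NULL);
        pth_kill();
        return 0;
    }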

How are threads/processes parked and woken in Linux, prior to futex?

Before futex and the current pthreads implementation for Linux, NPTL (which requires kernel 2.6 or newer), there were two other threading libraries with a POSIX thread API for Linux: LinuxThreads and NGPT (which was based on GNU Pth). LinuxThreads was the only widely used libpthread for years (and it can still be used in some strange and unmaintained micro-libc to work on 2.4 kernels; other micro-libc variants may have their own builtin implementation of a pthread-like API on top of futex+clone). And GNU Pth is not a kernel thread library; it is a single process with user-level "thread" switching.

You should know that there are several threading models, distinguished by whether the kernel knows about some or all of the user threads (this determines how many CPU cores can be used by adding threads to the program, what the cost of having a thread is, and how many threads may be started). Models are named M:N, where M is the number of userspace threads and N is the number of threads schedulable by the OS kernel:

  • "1:1" ''kernel-level threading'' - every userspace thread is schedulable by OS kernel. This is implemented in Linuxthreads, NPTL and many modern OS.
  • "N:1" ''user-level threading'' - userspace threads are planned by the userspace, they all are invisible to the kernel, it only schedules one process (and it may use only 1 CPU core). Gnu Pth (GNU Portable Threads) is example of it, and there are many other implementations for some computer architectures.
  • "M:N" ''hybrid threading'' - there are some entities visible and schedulable by OS kernel, but there may be more user-space threads in them. And sometimes user-space threads will migrate between kernel-visible threads.

With the 1:1 model there are many classic sleep mechanisms/APIs in Unix like select/poll, signals, and other variants of IPC APIs. As I remember, LinuxThreads used separate processes for every thread (with fully shared memory) and there was a special manager "thread" (process) to emulate some POSIX thread features. Wikipedia says that SIGUSR1/SIGUSR2 were used in LinuxThreads for some internal communication between threads; IBM says the same: "The synchronization of primitives is achieved by means of signals. For example, threads block until awoken by signals." Check also the project FAQ http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html#H.4 "With LinuxThreads, I can no longer use the signals SIGUSR1 and SIGUSR2 in my programs! Why?"

LinuxThreads needs two signals for its internal operation. One is used to suspend and restart threads blocked on mutex, condition or semaphore operations. The other is used for thread cancellation.
On ``old'' kernels (2.0 and early 2.1 kernels), there are only 32 signals available and the kernel reserves all of them but two: SIGUSR1 and SIGUSR2. So, LinuxThreads has no choice but use those two signals.
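To illustrate that pre-futex "park on a signal, wake with a signal" idea, here is a sketch between a parent and a child process (not the actual LinuxThreads internals): the waiter blocks SIGUSR1 before parking so an early wake-up cannot be lost, then parks in sigsuspend(2); the waker simply sends SIGUSR1.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void on_wake(int sig) { (void)sig; }   /* handler exists only so SIGUSR1 isn't fatal */

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_wake;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);

        sigset_t block, old;
        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        sigprocmask(SIG_BLOCK, &block, &old);     /* block first: an early wake-up stays pending */

        pid_t child = fork();
        if (child == 0) {                         /* waiter: park until SIGUSR1 arrives */
            sigset_t wait_mask = old;
            sigdelset(&wait_mask, SIGUSR1);
            puts("child: parking");
            sigsuspend(&wait_mask);               /* atomically unblock SIGUSR1 and sleep */
            puts("child: woken");
            _exit(0);
        }

        kill(child, SIGUSR1);                     /* waker: resume the parked process */
        waitpid(child, NULL, 0);
        return 0;
    }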

With "N:1" model thread may call some blocking syscall and block everything (some libraries may convert some blocking syscalls into async, or use some SIGALRM or SIGVTALRM magic); or it may call some (very) special internal threading function which will do user-space thread switching by rewriting machine state register (like switch_to in linux kernel, save IP/SP and other regs, restore IP/SP and regs of other thread). So, kernel does not wake any user thread directly from userland, it just schedules whole process; and user space scheduler implement thread synchronization logic (or just calls sched_yield or select when there is no threads to work).

With the M:N model things are very complicated... I don't know much about NGPT... There is one paragraph about NGPT in POSIX Threads and the Linux Kernel, Dave McCracken, OLS2002,330, page 5:

There is a new pthread library under development called NGPT. This library is
based on the GNU Pth library, which is an M:1 library. NGPT extends Pth by
using multiple Linux tasks, thus creating an M:N library. It attempts to
preserve Pth’s pthread compatibility while also using multiple Linux tasks for
concurrency, but this effort is hampered by the underlying differences in the
Linux threading model. The NGPT library at present uses non-blocking wrappers
around blocking system calls to avoid blocking in the kernel.

Some papers and posts: POSIX Threads and the Linux Kernel, Dave McCracken, OLS2002,330, LWN post about NPTL 0.1

The futex system call is used extensively in all synchronization
primitives and other places which need some kind of
synchronization. The futex mechanism is generic enough to support
the standard POSIX synchronization mechanisms with very little
effort. ... Futexes also allow the implementation of inter-process
synchronization primitives, a sorely missed feature in the old
LinuxThreads implementation (Hi jbj!).

NPTL design pdf:

5.5 Synchronization Primitives
The implementation of the synchronization primitives such as mutexes, read-write
locks, conditional variables, semaphores, and barriers requires some form of kernel
support. Busy waiting is not an option since threads can have different priorities (beside wasting CPU cycles). The same argument rules out the exclusive use of sched yield. Signals were the only viable solution for the old implementation. Threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability caused by spurious wakeups and derogation of the quality of the signal handling in the application.
Fortunately some new functionality was added to the kernel to implement all kinds
of synchronization primitives: futexes [Futex]. The underlying principle is simple but
powerful enough to be adaptable to all kinds of uses. Callers can block in the kernel
and be woken either explicitly, as a result of an interrupt, or after a timeout.
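To make the futex building block concrete, here is a sketch of the wait/wake pattern the primitives are built on (there is no glibc wrapper for futex(2), so the raw syscall is used; FUTEX_WAIT only sleeps if the word still holds the expected value, which is what makes the wake-up impossible to lose):

    #include <linux/futex.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static atomic_int flag = 0;

    static void *waiter(void *arg)
    {
        (void)arg;
        while (atomic_load(&flag) == 0) {
            /* Sleeps only if flag is still 0; otherwise the syscall fails with EAGAIN
             * and we re-check, so a wake-up between the load and the call is not lost. */
            futex(&flag, FUTEX_WAIT, 0);
        }
        puts("waiter: woken");
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);
        sleep(1);                       /* demo only: let the waiter block in the kernel */
        atomic_store(&flag, 1);         /* publish the new value... */
        futex(&flag, FUTEX_WAKE, 1);    /* ...then wake one blocked waiter */
        pthread_join(t, NULL);
        return 0;
    }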

How does Linux handle a call to schedule() from within an IRQ?

The short answer is that it doesn't. On normal Linux systems, ISR context is considered an atomic context, meaning that you should not yield control to the scheduler at any point. If some code does call schedule() from an interrupt context, you will most likely get a "BUG: scheduling while atomic" print.

However, it is possible to re-schedule the current process following the end of the interrupt handling, and that is what the timer interrupt occasionally does in order to divide the CPU resource among processes. Also, some patched Linux kernels delegate the work of ISRs to kernel threads, and in that case those threaded ISRs might sleep.

What is the downside of updating the ARM TTBR (Translation Table Base Register)?

Updating the TTBR (translation table base register)Note1 with the MMU enabled has many perils. There are interrupts, page faults, the TLB (MMU cache) and both L1 and L2 caches to be considered. On different systems, the caches may be PIPT or VIVT (physically or virtually tagged), and the L1 and L2 caches may or may not exist at all.

People seem overly concerned about the MMU and TLB for efficiency. They are always dwarfed by the primary L1/L2 caches in performance considerations. It is a smaller impact to update the MMU tables and perform TLB flushes than it is to have unneeded evictions from the L1/L2 code and data caches. At a minimum a TLB entry is worth 4KB, or over 100 cache lines (the cost to repopulate). In some cases, a TLB entry may cover 1MB.

Some data/code in the L1/L2 belonging to user space may need to be evicted on context switches. However, for very frequent small workloads, a user context switch may keep code and data in the L1/L2. For example, consider a media player doing large CPU-intensive decoding and a cron task checking that no new email is on a server: the switch to and back from the 'cron' task may leave code in the L2 cache for the video decoding to use.

What is the downside of updating ARM TTBR?

Unless the from/to tables are identical, you have to keep the system view of memory consistent for the duration of the update.Note2 This will naturally cause IRQ latency and implementation complexity, as you need to sync up many sub-systems. Also, the Linux MM (memory management) code is architecture agnostic. It handles a great variety of MMU sub-systems. The goal is never to optimize locally (at the architecture level) but to optimize globally at the generic layers.

Note1: The TTBR is a pointer to a physical 16k-aligned memory region that is the first level of the ARM MMU. Each entry covers 1MB (on 32-bit systems) and may point to another table, often called L2.

Note2: You might do this in a boot loader or in places where you are migrating system-level code between memory devices. I.e., updating the TTBR with identical tables is of no consequence by itself. It is when the tables differ that weird things will happen.


