Internals of a Linux System Call

A crash course in kernel mode in one Stack Overflow answer

Good questions! (Interview questions?)

What happens (in detail) when a
thread makes a system call by raising
interrupt 80?

The int $0x80 instruction is loosely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel saves many of the registers, though it need not preserve the registers that a program would not expect an ordinary function call to preserve.

What work does Linux do to the
thread's stack and other state?

Typically an OS will save the registers that the ABI promises not to change during procedure calls. The user stack is left untouched: the kernel runs on a separate per-thread kernel stack rather than on the per-thread user stack. Naturally some state will change, otherwise there would be no reason to make the system call.

What changes are done to the
processor to put it into kernel mode?

This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a function-call operation. It causes the switch to kernel mode under controlled conditions. Typically, the CPU changes a protection bit in the processor status word (PSW), saves the old PSW and program counter (PC), starts executing at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.

After running the interrupt handler,
how is control restored back to the
calling process?

Typically there is some sort of "return from interrupt" or "return from trap" instruction that acts a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required explicit code to perform the return. Some CISC processors, like x86, have (never really used) instructions that execute dozens of operations for capability adjustments, documented in pages of architecture-manual pseudo-code.

What if the system call can't be
completed quickly: e.g. a read from
disk. How does the interrupt handler
relinquish control so that the
processor can do other stuff while
data is being loaded and how does it
then obtain control again?

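The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while. When the device signals completion with a hardware interrupt, the kernel marks the sleeping thread runnable again and the scheduler eventually switches back to it.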

What happens to a process when it performs a system call in Linux?

I would shift the point of view a little. What "changes" is the core that the process is running on. This shift matters because the process per se is just an abstraction: it is just an entry on a linked list somewhere in kernel space. The actor, the active executor, is the CPU core.

So the core is executing this process's code, and that code instructs the core to execute a system call. The core then changes into kernel mode and starts executing kernel code on behalf of that process.

There is really no interrupt in this path, or at least not anymore. Yes, we used to use a software interrupt (int 0x80). Since the Pentium II, 32-bit x86 has a faster path into the kernel through the SYSENTER/SYSEXIT instructions, and 64-bit x86 uses SYSCALL/SYSRET. Note that we are talking strictly about x86 here.

Interrupts still happen when something other than the CPU cores, say a network card, needs the cores to handle some data.

While the core is executing in kernel space, it is still counted as running "as the process", as shown in this example:

$ /bin/time dd if=/dev/zero of=/tmp/zero
^C1903866+0 records in
1903866+0 records out
974779392 bytes (975 MB, 930 MiB) copied, 2.27104 s, 429 MB/s
Command terminated by signal 2
0.23user 2.03system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 2264maxresident)k
160inputs+1903872outputs (2major+92minor)pagefaults 0swaps

This just copies zeros into a file in /tmp. As it's only zeros, most of the time is spent doing I/O inside the kernel. You can see this on the "0.23user 2.03system 0:02.27elapsed" line: 2.03 seconds were spent in the kernel on behalf of this process.

How does a syscall actually happen on linux?

Assuming we're talking about x86:

The ID of the system call is deposited into the EAX register.
Any arguments required by the system call are deposited into the registers dictated by the calling convention. With int 0x80 on 32-bit Linux the first argument goes in EBX, followed by ECX, EDX, ESI, EDI, and EBP. Arguments are not passed on the user stack.
The INT 0x80 software interrupt is invoked.
The Linux kernel services the system call identified by the ID in the EAX register and deposits the result in EAX.
The calling code makes use of any results.

I may be a bit rusty at this, it's been a few years...

How does a system call work

In short, here's how a system call works:

First, the user application program sets up the arguments for the system call.

After the arguments are all set up, the program executes the "system call" instruction.

This instruction causes an exception: an event that causes the processor to jump to a new address and start executing the code there.

The instructions at the new address save your user program's state, figure out which system call you want, call the function in the kernel that implements that system call, restore your user program's state, and return control to the user program.

[Figure: a user application invoking the open() system call]

Note that the system call interface, which serves as the link to the system calls made available by the operating system, invokes the intended system call in the OS kernel and returns the status of the system call along with any return values. The caller needs to know nothing about how the system call is implemented or what it does during execution.

Another example: [Figure: a C program invoking the printf() library call, which calls the write() system call]

For a more detailed explanation, read Section 1.5.1 in Chapter 1 and Section 2.3 in Chapter 2 of Operating System Concepts.

Linux System Call Flow Sequence

Note: I work predominantly with ARM machines, so some of these things might be ARM specific. Also, I'm going to try to simplify it as much as I can. Feel free to correct anything that might be wrong or oversimplified.

Let's say the thread makes a system call. I am a bit unclear on the workings after this. The interrupt will generate a call. One of my questions is who will answer this call?

Usually, the processor will start executing at some predetermined location in kernel mode. The kernel will save the current process state and look at the userspace registers to determine which system call was requested and dispatch that to the correct system call handler.

So the Kernel will lookup the Interrupt Vector Table and get the routine which needs to be executed. My next question is which stack will be used in the execution of the Interrupt? Will it be the Kernel Thread's Stack or the User level Thread's Stack? (I am assuming that it will be the Kernel Thread's Stack.)

I'm pretty sure it will switch to a kernel stack. There would be some pretty severe security problems with information leaks if they used the userspace stack.

Coming back to the flow of the program, let's say the operation is opening a file using fopen. The subsequent question I have is how will the jump from the ISR to the system call take place? Or is our ISR mapped to a system call?

fopen() is actually a libc function and not a system call itself. It may (and in most cases will) call the open() syscall in its implementation though.

So, the process (roughly) is:

Userspace calls fopen()
fopen performs a system call to open()
This triggers some sort of exception or interrupt. In response, the processor switches into a more privileged mode and starts executing at some preset location in the kernel.
Kernel determines what kind of interrupt or exception it is and handles it appropriately. In our case, it will be a system call.
Kernel determines which system call is being requested by reading the userspace registers, extracts any arguments, and passes them to the appropriate handler.
Handler runs.
Kernel puts any return code into userspace registers.
Kernel transfers execution back to where the exception occurred.

Also, looking at the broader picture, when the Kernel Thread is being executed I am assuming that the "OS region" of RAM will be used to house the pages which are executing the System Call.

Pages don't execute anything :) On 32-bit x86 Linux with the default 3 GB/1 GB split, any address at or above 0xC0000000 belongs to the kernel.

Again looking at it from a different angle (hope you're still with me), finally I am assuming that the corresponding Kernel Thread is being handled by the CPU Scheduler, wherein a context switch would have happened from the User Level Thread to the corresponding Kernel Level Thread when the fopen System Call was being answered.

With a preemptive kernel, threads effectively aren't discriminated against. As I understand it, a new thread isn't created for the purpose of servicing a system call: the call just runs in the same thread that requested it, except in kernel mode.

That means a thread that is in kernel mode servicing a system call can be scheduled out just the same as any other thread. This is where the term 'process context' (also called 'user context') comes from in kernel development. It means the code is executing in kernel mode on a usermode thread.

It's a little difficult to explain this so I hope I got it right.

In a system call are hardware and software context saved?

As already mentioned, each system call has a wrapper function. Each wrapper triggers interrupt 128 (int 0x80), and the CPU automatically saves the registers eip, esp, cs, ss, and eflags on the kernel stack.
The handler then invokes the SAVE_ALL macro, which pushes the rest of the registers onto the stack. Once the system call has been served, those values are popped to restore the previous state. Finally an iret instruction is executed and the CPU pops the five registers it saved earlier.

How are parameters passed to Linux system call ? Via register or stack?

(This answer is for 32-bit x86 Linux to match your question; things are slightly different for 64-bit x86 and other architectures.)

The parameters are passed from userspace in registers as Love says.

When userspace invokes a system call with int $0x80, the kernel syscall entry code gets control. This is written in assembly language and can be seen here, for instance. One of the things this code does is to take the parameters from the registers and push them onto the stack, and then call the appropriate kernel sys_XXX() function (which is written in C). So those functions do indeed expect their arguments on the stack.

It wouldn't work as well to try to pass parameters from userspace to the kernel on the stack. When the system call is made, the CPU switches to a separate kernel stack, so the parameters would have to be copied from the userspace stack to the kernel stack, and this is somewhat complicated. And it would have to be done even for very simple system calls that just take a few numeric arguments and wouldn't otherwise need to access userspace memory at all (think about close() for instance).
