Does There Exist Kernel Stack for Each Process

Why keep a kernel stack for each process in linux?

What's the point in keeping a different kernel stack for each process in linux?

It simplifies pre-emption of processes in the kernel space.

Why not keep just one stack for the kernel to work with?

It would be a night mare to implement pre-emption without seperates stacks.

Separate kernel stacks are not really mandated. Each architecture is free to do whatever it wants. If there was no per-emption during a system call, then a single kernel stack might make sense.

However, *nix has processes and each process can make a system call. However, Linux allows one task to be pre-empted during a write(), etc and another task to schedule. The kernel stack is a snapshot of the context of kernel work that is being performed for each process.

Also, the per-process kernel stacks come with little overhead. A thread_info or some mechanism to get the process information from assembler is needed. This is at least a page allocation. By placing the kernel mode stack in the same location, a simple mask can get the thread_info from assembler. So, we already need the per-process variable and allocation. Why not use it as a stack to store kernel context and allow preemption during system calls?

The efficiency of preemption can be demonstrated by mentioned write above. If the write() is to disk or network, it will take time to complete. A 5k to 8k buffer written to disk or network will take many CPU cycles to complete (if synchronous) and the user process will block until it is finished. This transfer in the driver can be done with DMA. Here, a hardware element will complete the task of transferring the buffer to the device. In the mean time, a lower priority process can have the CPU and be allowed to make system calls when the kernel keeps different stacks per process. These stacks are near zero cost as the kernel already needs to have book keeping information for process state and the two are both keep in an 4k or 8k page.

kernel stack for linux process

There is just one common kernel memory. In it each process has it's own task_struct + kernel stack (by default 8K).

In a context switch the old stack pointer is saved somewhere and the actual stack pointer is made to point to the top of the stack (or bottom depending on the hardware architecture) of the new process which is going to run.

kernel stack and user space stack

What's the difference between kernel stack and user stack ?

In short, nothing - apart from using a different location in memory (and hence a different value for the stack pointer register), and usually different memory access protections. I.e. when executing in user mode, kernel memory (part of which is the kernel stack) will not be accessible even if mapped. Vice versa, without explicitly being requested by the kernel code (in Linux, through functions like copy_from_user()), user memory (including the user stack) is not usually directly accessible.

Why is [ a separate ] kernel stack used ?

Separation of privileges and security. For one, user space programs can make their stack (pointer) anything they want, and there is usually no architectural requirement to even have a valid one. The kernel therefore cannot trust the user space stack pointer to be valid nor usable, and therefore will require one set under its own control. Different CPU architectures implement this in different ways; x86 CPUs automatically switch stack pointers when privilege mode switches occur, and the values to be used for different privilege levels are configurable - by privileged code (i.e. only the kernel).

If a local variable is declared in an ISR, where will it be stored ?

On the kernel stack. The kernel (Linux kernel, that is) does not hook ISRs directly to the x86 architecture's interrupt gates but instead delegates the interrupt dispatch to a common kernel interrupt entry/exit mechanism which saves pre-interrupt register state before calling the registered handler(s). The CPU itself when dispatching an interrupt might execute a privilege and/or stack switch, and this is used/set up by the kernel so that the common interrupt entry code can already rely on a kernel stack being present.

That said, interrupts that occur while executing kernel code will simply (continue to) use the kernel stack in place at that point. This can, if interrupt handlers have deeply nested call paths, lead to stack overflows (if a deep kernel call path is interrupted and the handler causes another deep path; in Linux, filesystem / software RAID code being interrupted by network code with iptables active is known to trigger such in untuned older kernels ... solution is to increase kernel stack sizes for such workloads).

Does each process have its own kernel stack ?

Not just each process - each thread has its own kernel stack (and, in fact, its own user stack as well). Remember the only difference between processes and threads (to Linux) is the fact that multiple threads can share an address space (forming a process).

How does the process coordinate between both these stacks ?

Not at all - it doesn't need to. Scheduling (how / when different threads are being run, how their state is saved and restored) is the operating system's task and processes don't need to concern themselves with this. As threads are created (and each process must have at least one thread), the kernel creates kernel stacks for them, while user space stacks are either explicitly created/provided by whichever mechanism is used to create a thread (functions like makecontext() or pthread_create() allow the caller to specify a memory region to be used for the "child" thread's stack), or inherited (by on-access memory cloning, usually called "copy on write" / COW, when creating a new process).

That said, the process can influence scheduling of its threads and/or influence the context (state, amongst that is the thread's stack pointer). There are multiple ways for this: UNIX signals, setcontext(), pthread_yield() / pthread_cancel(), ... - but this is digressing a bit from the original question.

Does Linux keep the kernel stack for zombie processes?

After a bit of digging, I think I have finally found it...

void free_task(struct task_struct *tsk)
{
    prop_local_destroy_single(&tsk->dirties);
    account_kernel_stack(tsk->stack, -1);
    free_thread_info(tsk->stack);
    rt_mutex_debug_task_free(tsk);
    ftrace_graph_exit_task(tsk);
    free_task_struct(tsk);
}

on parent verifying the zombie,

put_task_struct()->__put_task_struct()->free_task() does free the kernel stack.

So, the answer is Yes. Zombie processes do keep the kernel stack.

What is a kernel stack used for?

I have a disagreement with the second paragraph.

Process A is running and then is interrupted by the timer interrupt. The hardware saves its registers (onto its kernel stack) and enters the kernel (switching to kernel mode).

I am not aware of a system that saves all the registers on the kernel stack on an interrupt. Program Counter, Processor Status, and Stack Pointer (assuming the hardware does not have a separate Kernel Mode Stack Pointer). Normally, processors save the minimum necessary on the kernel stack after an interrupt. The interrupt handler will then save any additional registers it wants to use and restores them before exit. The processor's RETURN FROM INTERRUPT or EXCEPTION instruction then restores the registers automatically stored by the interrupt.

That description assumes no change in the process.

If the interrupt handle decides to change the process, it saves the current register state (the "process context" --most processors have a single instruction for this. In Intel land you might have to use multiple instructions) then executes another instruction to load the process context of the new process.

To answer your heading question "What is a kernel stack used for?", it is used whenever the processor is in Kernel mode. If the kernel did not have a stack protected from user access, the integrity of the system could be compromised. The kernel stack tends to be very small.

To answer you second question, "What exactly is the point of saving the registers to both the kernel stack and the process structure and why the need for both?"

They serve two different purpose. The saved registers on the kernel stack are used to get out of kernel mode. The context process block saves the entire register set in order to change processes.

I think your misunderstanding comes from the wording of your source that suggests all registers are stored on the stack when entering kernel mode, rather than just the minimum number of registers needed to make the kernel mode switch. The system will usually only save what it needs to get back to user mode (and may use that same information to return back to the original process in another context switch, depending upon the system). The change in process context saves all the registers.

Edits to answer additional questions:

If the interrupt handler needs to use register not saved by the CPU automatically by the interrupt, it pushes them on the kernel stack on entry and pops them off on exit. The interrupt handler has to explicitly save and restore any [general] registers it uses. The Process Context Block does not get touched for this.

The Process Context Block only gets altered as part an actual context switch.

Example:

Lets assume we have a processor with a program counter, stack pointer, processor status and 16 general registers (I know no such system really exists) and that the same SP is used for all modes.

Interrupt occurs.

The hardware pushes the PC, SP, and PS on to the stack, loads the SP with the address of the kernel mode stack and the PC from the interrupt handler (from the processor's dispatch table).

Interrupt handler gets called.

The writer of the handler decides he is going to us R0-R3. So the first lines of the handler have:

Push R0  ; on to the kernel mode stack
Push R1
Push R2
Push R3

The interrupt handler does whatever it wants to do.
Cleanup

The writer of the interrupt handler needs to do:

Pop R3
Pop R2
Pop R1
Pop R0 
REI      ; Whatever the system's return from interrupt or exception instruction is.

Hardware Takes over

Restores the PS, PC, and SP from the kernel mode stack, then resumes executing where it was before the interrupt.

I've made up my own processor for simplification. Some processors have lengthy instructions that are interruptable (e.g. block character moves). Such instructions often use registers to maintain their context. On such a system, the processor would have to automatically save any registers is uses to maintain context within the instruction.

An interrupt handler does not muck with the process context block unless it is changing processes.