Why Disabling One Local Interrupt or Preemption Can Make a Whole 4-CPU System Unresponsive

It's possible that the kernel wants to execute an operation on all CPUs, such as an RCU synchronize, or cache-related synchronization or whatever. Then you're hosed.
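As a rough illustration of the RCU case (a hedged sketch only, with hypothetical names; preempt_disable(), cpu_relax(), the kthread helpers and synchronize_rcu() are standard kernel APIs): a CPU that sits in a loop with preemption disabled never passes through a quiescent state, so any other CPU that calls synchronize_rcu() blocks until the hog stops.

    /* Hypothetical kernel-module fragment: imagine hog_fn running via
     * kthread_run() and monopolizing one CPU with preemption disabled,
     * stalling RCU grace periods system-wide. */
    #include <linux/kthread.h>
    #include <linux/preempt.h>
    #include <linux/rcupdate.h>

    static int hog_fn(void *data)
    {
        preempt_disable();              /* no context switch on this CPU */
        while (!kthread_should_stop())
            cpu_relax();                /* busy-wait, never yielding */
        preempt_enable();
        return 0;
    }

    /* Meanwhile, on any other CPU: */
    static void other_path(void)
    {
        synchronize_rcu();              /* waits for ALL CPUs to pass a
                                         * quiescent state -> blocks for as
                                         * long as the hog above keeps running */
    }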

SMP is not a license to carelessly hog a processor to yourself.

That kind of thing can be arranged. I mean you could have a CPU that is not online as far as the kernel is concerned, and which you use to run whatever you want.

Why do kprobes disable preemption and when is it safe to reenable it?

At least on x86, the implementation of Kprobes relies on the fact that preemption is disabled while the Kprobe handlers run.

When you place an ordinary (not Ftrace-based) Kprobe on an instruction, the first byte of that instruction is overwritten with 0xcc (int3, "software breakpoint"). If the kernel tries to execute that instruction, a trap occurs and kprobe_int3_handler() is called (see the implementation of do_int3()).

To call your Kprobe handlers, kprobe_int3_handler() finds which Kprobe hit, saves it in the per-CPU variable current_kprobe and calls your pre-handler. After that, it prepares everything to single-step over the original instruction. After the single-stepping, your post-handler is called and then some cleanup is performed. current_kprobe and some other per-CPU data are used to do all this. Preemption is only re-enabled after that.

Now, imagine the pre-handler has enabled preemption, was preempted right away and resumed on a different CPU. If the implementation of Kprobes tried to access current_kprobe or other per-cpu data, the kernel would likely crash (NULL pointer deref if there were no current_kprobe on that CPU at the moment) or worse.

Or, the preempted handler could resume on the same CPU but another Kprobe could hit there while it was sleeping - current_kprobe, etc. would be overwritten and disaster would be very likely.

Re-enabling preemption in Kprobe handlers could result in difficult-to-debug kernel crashes and other problems.

So, in short, this is because Kprobes are designed this way, at least on x86. I cannot say much about their implementation on other architectures.
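For reference, here is a minimal Kprobe registration sketch, closely following the pattern of the kernel's own samples/kprobes/ examples (the probed symbol is just an example; pick one that exists on your kernel). Both handlers run with preemption disabled, so they must not sleep or re-enable preemption.

    #include <linux/kprobes.h>
    #include <linux/module.h>

    /* Runs with preemption disabled; must not sleep. */
    static int handler_pre(struct kprobe *p, struct pt_regs *regs)
    {
        pr_info("kprobe pre-handler: probe hit at %p\n", p->addr);
        return 0;
    }

    /* Also runs with preemption disabled, after single-stepping. */
    static void handler_post(struct kprobe *p, struct pt_regs *regs,
                             unsigned long flags)
    {
        pr_info("kprobe post-handler: flags 0x%lx\n", flags);
    }

    static struct kprobe kp = {
        .symbol_name  = "do_sys_open",   /* example target symbol */
        .pre_handler  = handler_pre,
        .post_handler = handler_post,
    };

    static int __init kp_init(void)
    {
        return register_kprobe(&kp);
    }

    static void __exit kp_exit(void)
    {
        unregister_kprobe(&kp);
    }

    module_init(kp_init);
    module_exit(kp_exit);
    MODULE_LICENSE("GPL");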


Depending on what you are trying to accomplish, other kernel facilities might be helpful.

For instance, if you only need to run your code at the start of some functions, take a look at Ftrace. Your code would then run in the same conditions as the functions you hook it into.


All that being said, it was actually needed in one of my projects to use Kprobes so that the handlers were running in the same conditions w.r.t. preemption as the probed instructions. You can find the implementation here. However, it had to jump through hoops to achieve that without breaking anything. It has been working OK so far, but it is more complex than I would like and has portability issues too.

Why linux disables kernel preemption after the kernel code holds a spinlock?

The answer to your first question is the reasoning behind your second.

Spinlocks acquired by the kernel may be implemented by turning off preemption, because this ensures that the kernel will complete its critical section without another process interfering. The entire point is that another process will not be able to run until the kernel releases the lock.

There is no reason that it has to be implemented this way; it is just a simple way to implement it and prevents any process from spinning on a lock that the kernel holds. But this trick only works for the case in which the kernel has acquired the lock: user processes cannot turn off preemption, and if the kernel is spinning (i.e. it tries to acquire a spinlock but another process already holds it), it had better leave preemption on. Otherwise the system would hang, because the kernel would be waiting for a lock that will never be released: the process holding it cannot run to release it.

The kernel acquiring a spinlock is a special case. If a user level program acquires a spinlock, preemption will not be disabled.
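For reference, here is roughly how the kernel couples spinlock acquisition with preemption on an SMP build (a heavily simplified excerpt based on include/linux/spinlock.h and spinlock_api_smp.h; lockdep and debug hooks omitted). The matching spin_unlock() path calls preempt_enable() again.

    /* Simplified sketch of kernel-internal code, not a complete listing. */
    static inline void __raw_spin_lock(raw_spinlock_t *lock)
    {
        preempt_disable();          /* first: no task switch on this CPU */
        do_raw_spin_lock(lock);     /* then: spin until the lock is acquired */
    }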

Why disabling interrupts disables kernel preemption and how spin lock disables preemption

I am not a scheduler guru, but I would like to explain how I see it.
Here are several things.

  1. preempt_disable() doesn't disable IRQs. It just increments the thread_info->preempt_count variable (see the sketch after this list).
  2. Disabling interrupts also disables preemption, because the scheduler doesn't run after that - but only on a single-CPU machine. On SMP it isn't enough, because when you disable interrupts on one CPU the other CPUs still run asynchronously.
  3. The Big Lock (i.e., disabling interrupts on all CPUs) slows the system down dramatically, which is why it is no longer used. This is also the reason why preempt_disable() doesn't disable IRQs.
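For point 1, this is roughly what preempt_disable() expands to on a preemptible kernel (simplified from include/linux/preempt.h; the exact form varies by version and config):

    #define preempt_disable() \
    do { \
        preempt_count_inc();  /* bump the preempt counter */ \
        barrier();            /* compiler barrier: keep critical code below */ \
    } while (0)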

You can see for yourself what preempt_disable() does. Try this:
1. Get a spinlock.
2. Call schedule()

In dmesg you will see something like "BUG: scheduling while atomic". This happens when the scheduler detects that your process is in an atomic (non-preemptible) context but a reschedule was requested anyway.
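Here is a hedged sketch of that experiment as kernel-module code (the helper name is hypothetical; don't do this in real code):

    #include <linux/sched.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(demo_lock);

    static void scheduling_while_atomic_demo(void)
    {
        spin_lock(&demo_lock);   /* implies preempt_disable() -> atomic context */
        schedule();              /* scheduler notices the atomic context and
                                  * prints "BUG: scheduling while atomic" */
        spin_unlock(&demo_lock);
    }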

Good luck.

Confusion around spin_lock_irqsave: in what nested situation is interrupt state kept?

There are many Q&As about spinlocks, but it's still confusing to me. I think it's because the questions and answers assume different settings, or don't clearly state whether the setting is SMP or a preemptive kernel (and some old info is mixed in too).

I can only agree. Spinlocks, while simple in nature, are not a simple topic at all when included in the context of modern Linux kernels. I don't think you can get a good understanding of spinlocks just by reading random and case-specific Stack Overflow answers.

I would strongly suggest you to read Chapter 5: Concurrency and Race Conditions of the book Linux Device Drivers, which is freely available online. In particular, the "Spinlocks" section of Chapter 5 is very helpful to understand how spinlocks are useful in different situations.

(Q1) in SMP situation, is schedule() run on every processor concurrently? [...] I would appreciate it if someone could briefly explain it to me how processes move processor cores during scheduling too.

Yes, you can look at it that way if you like. Each CPU (i.e. every single processor core) has its own timer, and when a timer interrupt is raised on a given CPU, that CPU executes the timer interrupt handler registered by the kernel, which calls the scheduler, which re-schedules processes.

Each CPU in the system has its own runqueue, which holds tasks that are in a runnable state. Any task can be included in at most one runqueue and cannot run on multiple different CPUs at the same time.

The CPU affinity of a task is what determines which CPUs a task can be run on. The default "normal" affinity allows a task to run on any CPU (except in special configurations). Based on their affinity, tasks can be moved from one runqueue to another either by the scheduler or, on request, through the sched_setaffinity syscall (here's a related answer which explains how).
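For illustration, here is a small user-space sketch that pins the calling thread to CPU 0 via the sched_setaffinity(2) syscall (standard glibc API; error handling kept minimal):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);                          /* allow CPU 0 only */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("now restricted to run on CPU 0 only\n");
        return 0;
    }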

Suggested read: A complete guide to Linux process scheduling.

Suppose there is code which calls spin_lock_irqsave(), and interrupts were already disabled at the time of the call. Could this code be running in interrupt context? Probably not.

Why not? This is possible. The code could be running in interrupt context, but not called by a different interrupt. See the bottom of my answer.

Case 1: a previous interrupt routine had been preempted by this process (which is calling this spin_lock_irqsave). This is weird because an ISR cannot be preempted.

You're right, it's weird. More than weird though, this is impossible. On Linux, at all times, interrupts can either be enabled or disabled (there is no in-between). There isn't really a "priority" for interrupts (like there is for tasks), but we can classify them into two ranks:

  • Non-preemptible interrupts which necessarily need to run from start to finish with full control of the CPU. These interrupts put the system in the "disabled interrupts" state and no other interrupts can happen.
  • Preemptible interrupts which are re-entrant and allow other interrupts to happen. In case another interrupt happens when this interrupt is being serviced, you enter in a nested interrupt scenario, which is similar to the scenario of nested signal handlers for tasks.

In your case, since interrupts had previously been disabled, this means that if the code that disabled them was an interrupt, it was a non-preemptible one, and therefore it could not have been preempted. It could also have been a preemptible interrupt executing a critical section that needs interrupts to be disabled, but the scenario is still the same: you cannot be inside another interrupt.

(Q2) By the way, in preemptive kernel, can ISR be preempted by a process?

No. It's improper to say "preempted by a process". Processes do not really preempt anything, they are preempted by the kernel which takes control. That said, a preemptible interrupt could in theory be interrupted by another one that was for example registered by a process (I don't know an example case for this scenario unfortunately). I still wouldn't call this "preempted by a process" though, since the whole thing keeps happening in kernel space.

(Q3) [...] Do interrupts also have the current thread info?

Interrupt handlers live in a different world, they do not care about running tasks and do not need access to such information. You probably could get ahold of current or even current_thread_info if you really wanted, but I doubt that'd be of any help for anything. An interrupt is not associated with any task, there's no link between the interrupt and a certain task running. Another answer here for reference.

Case 2: a previous normal process had acquired the lock with spin_lock_irq (or irqsave). But this is also weird, because before locking, spin_lock_irq (or irqsave) disables preemption and interrupts for the task, telling the scheduler not to switch to another task on the next scheduler timer interrupt. So this case cannot be true.

Yes, you're right. That's not possible.


The spin_lock_irqsave() function exists to be used in circumstances in which you cannot know if interrupts have already been disabled or not, and therefore you cannot use spin_lock_irq() followed by spin_unlock_irq() because that second function would forcibly re-enable interrupts. By the way, this is also explained in Chapter 5 of Linux Device Drivers, which I linked above.

In the scenario you describe, you are calling spin_lock_irqsave() and interrupts have already been disabled by something else. This means that any of the parent caller functions that ended up calling the current function must have already disabled interrupts somehow.

The following scenarios are possible:

  1. The original interrupt disable was caused by an interrupt handler and you are now executing another piece of code as part of the same interrupt handler (i.e. the current function has been called either directly or indirectly by the interrupt handler itself). You can very well have a call to spin_lock_irqsave() in a function that is being called by an interrupt handler. Or even just a call to local_irq_save() (the kfree() function does this for example, and it can surely be called from interrupt context).

  2. The original interrupt disable was caused by normal kernel code and you are now executing another piece of code as part of the same normal kernel code (i.e. the current function has been called either directly or indirectly by some other kernel function after disabling interrupts). This is completely possible, and in fact it's the reason why the irqsave variant exists.
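In both scenarios the usage pattern is the same; here is a hedged sketch (the helper name and lock are hypothetical, the locking calls are the standard kernel API):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(my_lock);

    static void touch_shared_state(void)
    {
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);        /* disables local IRQs and
                                                    * remembers their old state */
        /* ... critical section, safe against other CPUs and local IRQs ... */
        spin_unlock_irqrestore(&my_lock, flags);   /* restores the previous
                                                    * IRQ state, whatever it was */
    }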

When should one use a spinlock instead of mutex?

The Theory

In theory, when a thread tries to lock a mutex and it does not succeed, because the mutex is already locked, it will go to sleep, immediately allowing another thread to run. It will continue to sleep until being woken up, which will be the case once the mutex is being unlocked by whatever thread was holding the lock before. When a thread tries to lock a spinlock and it does not succeed, it will continuously re-try locking it, until it finally succeeds; thus it will not allow another thread to take its place (however, the operating system will forcefully switch to another thread, once the CPU runtime quantum of the current thread has been exceeded, of course).

The Problem

The problem with mutexes is that putting threads to sleep and waking them up again are both rather expensive operations, they'll need quite a lot of CPU instructions and thus also take some time. If now the mutex was only locked for a very short amount of time, the time spent in putting a thread to sleep and waking it up again might exceed the time the thread has actually slept by far and it might even exceed the time the thread would have wasted by constantly polling on a spinlock. On the other hand, polling on a spinlock will constantly waste CPU time and if the lock is held for a longer amount of time, this will waste a lot more CPU time and it would have been much better if the thread was sleeping instead.

The Solution

Using spinlocks on a single-core/single-CPU system usually makes no sense, since as long as the spinlock polling is blocking the only available CPU core, no other thread can run, and since no other thread can run, the lock won't be unlocked either. IOW, a spinlock wastes only CPU time on those systems for no real benefit. If the thread was put to sleep instead, another thread could have run at once, possibly unlocking the lock and then allowing the first thread to continue processing, once it woke up again.

On multi-core/multi-CPU systems, with plenty of locks that are held for a very short amount of time only, the time wasted for constantly putting threads to sleep and waking them up again might decrease runtime performance noticeably. When using spinlocks instead, threads get the chance to take advantage of their full runtime quantum (always only blocking for a very short time period, but then immediately continuing their work), leading to much higher processing throughput.

The Practice

Since very often programmers cannot know in advance if mutexes or spinlocks will be better (e.g. because the number of CPU cores of the target architecture is unknown), nor can operating systems know if a certain piece of code has been optimized for single-core or multi-core environments, most systems don't strictly distinguish between mutexes and spinlocks. In fact, most modern operating systems have hybrid mutexes and hybrid spinlocks. What does that actually mean?

A hybrid mutex behaves like a spinlock at first on a multi-core system. If a thread cannot lock the mutex, it won't be put to sleep immediately, since the mutex might get unlocked pretty soon, so instead the mutex will first behave exactly like a spinlock. Only if the lock has still not been obtained after a certain amount of time (or retries or any other measuring factor) is the thread really put to sleep. If the same code runs on a system with only a single core, though, the mutex will not spin, since, as noted above, that would not be beneficial.
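A hedged user-space sketch of the hybrid-mutex idea, using POSIX threads (the spin limit is an arbitrary illustrative value; real implementations tune this adaptively):

    #include <pthread.h>

    #define SPIN_LIMIT 1000              /* arbitrary illustrative value */

    static void hybrid_lock(pthread_mutex_t *m)
    {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (pthread_mutex_trylock(m) == 0)
                return;                  /* got the lock while spinning */
        }
        pthread_mutex_lock(m);           /* give up spinning, sleep until woken */
    }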

A hybrid spinlock behaves like a normal spinlock at first, but to avoid wasting too much CPU time, it may have a back-off strategy. It will usually not put the thread to sleep (since you don't want that to happen when using a spinlock), but it may decide to stop the thread (either immediately or after a certain amount of time; this is called "yielding") and allow another thread to run, thus increasing chances that the spinlock is unlocked (you still have the costs of a thread switch but not the costs of putting a thread to sleep and waking it up again).
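And a hedged sketch of a hybrid spinlock with a yielding back-off, built on a C11 atomic flag (purely illustrative, not production code):

    #include <sched.h>
    #include <stdatomic.h>

    #define SPIN_BEFORE_YIELD 100        /* arbitrary illustrative threshold */

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void hybrid_spin_lock(void)
    {
        int spins = 0;

        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire)) {
            if (++spins >= SPIN_BEFORE_YIELD) {
                sched_yield();           /* let another thread run, then retry */
                spins = 0;
            }
        }
    }

    static void hybrid_spin_unlock(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }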

Summary

If in doubt, use mutexes; they are usually the better choice, and most modern systems will allow them to spin for a very short amount of time if this seems beneficial. Using spinlocks can sometimes improve performance, but only under certain conditions, and the fact that you are in doubt rather tells me that you are not currently working on a project where a spinlock might be beneficial. You might consider using your own "lock object" that can use either a spinlock or a mutex internally (e.g. this behavior could be configurable when creating such an object), initially use mutexes everywhere, and if you think that using a spinlock somewhere might really help, give it a try and compare the results (e.g. using a profiler). But be sure to test both cases, a single-core and a multi-core system, before you jump to conclusions (and possibly different operating systems, if your code will be cross-platform).

Update: A Warning for iOS

Actually not iOS specific, but iOS is the platform where most developers may face that problem: If your system has a thread scheduler that does not guarantee that any thread, no matter how low its priority may be, will eventually get a chance to run, then spinlocks can lead to permanent deadlocks. The iOS scheduler distinguishes different classes of threads, and threads of a lower class will only run if no thread in a higher class wants to run as well. There is no back-off strategy for this, so if you permanently have high class threads available, low class threads will never get any CPU time and thus never any chance to perform any work.

The problem appears as follows: Your code obtains a spinlock in a low prio class thread, and while it is in the middle of that lock, its time quantum is exceeded and the thread stops running. The only way this spinlock can be released again is if that low prio class thread gets CPU time again, but this is not guaranteed to happen. You may have a couple of high prio class threads that constantly want to run, and the task scheduler will always prioritize those. One of them may run across the spinlock and try to obtain it, which isn't possible of course, and the system will make it yield. The problem is: a thread that yielded is immediately available for running again! Since it has a higher prio than the thread holding the lock, the thread holding the lock has no chance to get CPU runtime. Either some other thread will get runtime, or the thread that just yielded.

Why does this problem not occur with mutexes? When the high prio thread cannot obtain the mutex, it won't yield; it may spin a bit but will eventually be sent to sleep. A sleeping thread is not available for running until it is woken up by an event, e.g. the mutex it has been waiting for being unlocked. Apple is aware of that problem and has deprecated OSSpinLock as a result. The new lock is called os_unfair_lock. This lock avoids the situation mentioned above, as it is aware of the different thread priority classes. If you are sure that using spinlocks is a good idea in your iOS project, use that one. Stay away from OSSpinLock! And under no circumstances implement your own spinlocks in iOS! If in doubt, use a mutex. macOS is not affected by this issue, as it has a different thread scheduler that won't allow any thread (even low prio threads) to "run dry" on CPU time; still, the same situation can arise there and will then lead to very poor performance, which is why OSSpinLock is deprecated on macOS as well.
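For completeness, a minimal sketch of the suggested replacement on Apple platforms (the standard <os/lock.h> API; the helper function is hypothetical):

    #include <os/lock.h>

    static os_unfair_lock demo_lock = OS_UNFAIR_LOCK_INIT;

    static void critical_section(void)
    {
        os_unfair_lock_lock(&demo_lock);    /* waits in the kernel instead of
                                             * spinning, avoiding the priority
                                             * inversion described above */
        /* ... shared state access ... */
        os_unfair_lock_unlock(&demo_lock);
    }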


