How to Handle Sigsegv Signal in Userspace Using Rust

Why can't capture SIGSEGV using signalfd?

See this and that answers for detailed explanations. Read carefully signal(7) and signal-safety(7). Remember also that the virtual address space of your process is common to, and shared between, all the threads of that process. See also proc(5) (and use pmap(1)) and try reading /proc/self/maps from inside your process to understand its actual virtual address space.

Grossly speaking, if you handle (an asynchronous) SIGSEGV (produced by the kernel after some exception fault) with signalfd(2), it is looking like you installed a "kernel" signal handler which magically "write"-s some bytes on some file descriptor (you almost could mimick signalfd by installing a signal handler writing on some pipe; but signalfd guarantees some "atomicity" that you won't have otherwise).

When you are back from that handling, the machine is in the same condition, so the SIGSEGV happens again.

If you want to handle SIGSEGV you need to use sigaction(2) or the obsolete signal(2) to install a handling routine (so you can't use signalfd for SIGSEGV), and then you should either

(more or less portably) avoid returning from your signal handler (e.g. by calling siglongjmp(3) from your signal handler installed with sigaction(2))
non-portably (in a processor and operating system specific way) change the machine context (given by the third argument (a pointer to some processor specific ucontext_t) to your handler installed by sigaction with SA_SIGINFO), e.g. by changing some registers, or change the address space (e.g. by calling mmap(2) from inside the handler).

The insight is that a SIGSEGV handler is entered with the program counter set to the faulting machine instruction. When you return from a SIGSEGV handler, the registers are in the state given to it (the pointer ucontext_t as the third argument of your sa_sigaction function passed to sigaction). If you don't change that state, the same machine instruction is re-executed, and since you didn't change anything the same fault happens and the same SIGSEGV signal is sent again by the kernel.

BTW, a nice example of a software handling cleverly and non-portably the SIGSEGV is the Ravenbrook MPS garbage collection library. Their write barrier (in GC parlance) is implemented by handling SIGSEGV. This is very clever (and non portable) code.

NB: in practice, if you just want to display backtrace information, you could do it from a SIGSEGV handler (e.g. by using GCC libbacktrace or backtrace(3) then _exit(2)-ing instead of returning from your SIGSEGV signal handler); it is not perfect and won't always work -e.g. if you corrupted the memory heap- because you will call non async-signal-safe functions, but in practice works well enough. Recent GCC is doing that (inside the compiler e.g. cc1plus and its plugins), and it helps a lot.

Why is a segmentation fault not recoverable?

When exactly does segmentation fault happen (=when is SIGSEGV sent)?

When you attempt to access memory you don’t have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The signal SIGSEGV is standardized but different OS might implement it differently. "Segmentation fault" is mainly a term used in *nix systems, Windows calls it "access violation".

Why is the process in undefined behavior state after that point?

Because one or several of the variables in the program didn’t behave as expected. Let’s say you have some array that is supposed to store a number of values, but you didn’t allocate enough room for all them. So only those you allocated room for get written correctly, and the rest written out of bounds of the array can hold any values. How exactly is the OS to know how critical those out of bounds values are for your application to function? It knows nothing of their purpose.

Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack overflows for example are such segmentation faults prone to overwrite adjacent variables, unless the error was caught by protection mechanisms.

If we look at the behavior of "bare metal" microcontroller systems without any OS and no virtual memory features, just raw physical memory - they will just silently do exactly as told - for example, overwriting unrelated variables and keep on going. Which in turn could cause disastrous behavior in case the application is mission-critical.

Why is it not recoverable?

Because the OS doesn’t know what your program is supposed to be doing.

Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.

Why does this solution avoid that unrecoverable state? Does it even?

That solution is just ignoring the error and keeps on going. It doesn’t fix the problem that caused it. It’s a very dirty patch and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.

We have to realize that a segmentation fault is a symptom of a bug, not the cause.

What is a bus error? Is it different from a segmentation fault?

Bus errors are rare nowadays on x86 and occur when your processor cannot even attempt the memory access requested, typically:

using a processor instruction with an address that does not satisfy its alignment requirements.

Segmentation faults occur when accessing memory which does not belong to your process. They are very common and are typically the result of:

using a pointer to something that was deallocated.
using an uninitialized hence bogus pointer.
using a null pointer.
overflowing a buffer.

PS: To be more precise, it is not manipulating the pointer itself that will cause issues. It's accessing the memory it points to (dereferencing).

How to Handle Sigsegv Signal in Userspace Using Rust

Why can't capture SIGSEGV using signalfd?

Why is a segmentation fault not recoverable?

What is a bus error? Is it different from a segmentation fault?

Related Topics

Leave a reply