Why can't I capture SIGSEGV using signalfd?
Read signal(7) and signal-safety(7) carefully. Remember also that the virtual address space of your process is shared by all of its threads. See also proc(5) (and use pmap(1)), and try reading /proc/self/maps from inside your process to understand its actual virtual address space.
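As a minimal sketch of reading /proc/self/maps from inside the process (assuming Linux with procfs mounted; the function name dump_self_maps is made up for this illustration):

```c
#include <stdio.h>
#include <string.h>

/* Copy /proc/self/maps to `out`, one mapping per line.
   Returns the number of mappings printed, or -1 if the file
   cannot be opened (e.g. not Linux, or /proc not mounted). */
int dump_self_maps(FILE *out)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    char line[512];
    int count = 0;
    while (fgets(line, sizeof line, maps) != NULL) {
        fputs(line, out);
        /* Very long lines arrive in several chunks; count each
           mapping once, when its terminating newline is seen. */
        if (strchr(line, '\n') != NULL)
            count++;
    }
    fclose(maps);
    return count;
}
```

Calling dump_self_maps(stdout) prints each mapped range with its permissions (r, w, x, p/s). An access to an address outside every listed range, or one that violates the listed permissions, is what produces a SIGSEGV.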
Roughly speaking, if you handle a synchronous SIGSEGV (produced by the kernel after a hardware fault) with signalfd(2), it is as if you had installed a "kernel" signal handler that magically write-s some bytes to a file descriptor. (You could almost mimic signalfd by installing a signal handler that writes to a pipe, but signalfd guarantees some "atomicity" that you would not have otherwise.)
When you are back from that handling, the machine is in the same condition, so the SIGSEGV happens again.
If you want to handle SIGSEGV, you need to use sigaction(2) or the obsolete signal(2) to install a handling routine (so you can't use signalfd for SIGSEGV), and then you should either:
- (more or less portably) avoid returning from your signal handler, e.g. by calling siglongjmp(3) from a handler installed with sigaction(2); or
- non-portably (in a processor- and operating-system-specific way) change the machine context, which is given by the third argument to a handler installed by sigaction with SA_SIGINFO (a pointer to a processor-specific ucontext_t), e.g. by changing some registers, or change the address space, e.g. by calling mmap(2) from inside the handler.
The insight is that a SIGSEGV handler is entered with the program counter set to the faulting machine instruction. When you return from a SIGSEGV handler, the registers are restored to the state it was given (the ucontext_t pointer passed as the third argument of your sa_sigaction function registered with sigaction). If you don't change that state, the same machine instruction is re-executed, and since nothing has changed, the same fault happens and the kernel sends the same SIGSEGV signal again.
BTW, a nice example of software that handles SIGSEGV cleverly and non-portably is the Ravenbrook MPS garbage collection library. Its write barrier (in GC parlance) is implemented by handling SIGSEGV. This is very clever (and non-portable) code.
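The re-execution behaviour can be observed with a toy write barrier. This is not how MPS does it, just a sketch. It assumes Linux (where a write to a read-only page raises SIGSEGV with si_addr set), and it calls mprotect(2) from the handler, which POSIX does not list as async-signal-safe. All names are invented for the example.

```c
#define _DEFAULT_SOURCE        /* for MAP_ANONYMOUS under strict -std= modes */
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t barrier_faults = 0;
static char *barrier_page = NULL;
static size_t barrier_size = 0;

static void barrier_on_segv(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    char *addr = (char *)info->si_addr;
    if (barrier_page != NULL &&
        addr >= barrier_page && addr < barrier_page + barrier_size) {
        barrier_faults++;
        /* Change the address space so the retried store can succeed. */
        if (mprotect(barrier_page, barrier_size,
                     PROT_READ | PROT_WRITE) == 0)
            return;            /* the faulting instruction is re-executed */
    }
    _exit(1);                  /* unrelated fault: returning would loop */
}

/* Returns the number of faults taken (expected: 1), or -1 on failure. */
int write_barrier_demo(void)
{
    barrier_size = (size_t)sysconf(_SC_PAGESIZE);
    barrier_page = mmap(NULL, barrier_size, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (barrier_page == MAP_FAILED) {
        barrier_page = NULL;
        return -1;
    }

    struct sigaction sa, old;
    sa.sa_sigaction = barrier_on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    barrier_faults = 0;
    ((volatile char *)barrier_page)[0] = 'x';  /* faults, is fixed, retried */
    ((volatile char *)barrier_page)[1] = 'y';  /* page is writable by now */

    int faults = (int)barrier_faults;
    int ok = barrier_page[0] == 'x' && barrier_page[1] == 'y';

    sigaction(SIGSEGV, &old, NULL);
    munmap(barrier_page, barrier_size);
    barrier_page = NULL;
    return ok ? faults : -1;
}
```

The first store faults once; because the handler made the page writable before returning, the re-executed store succeeds. Had the handler returned without changing anything, the same store would have faulted again.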
NB: in practice, if you just want to display backtrace information, you could do it from a SIGSEGV handler (e.g. by using GCC's libbacktrace or backtrace(3), then calling _exit(2) instead of returning from the handler). It is not perfect and won't always work (e.g. if you have corrupted the heap), because you will be calling functions that are not async-signal-safe, but in practice it works well enough. Recent GCC does this (inside the compiler proper, e.g. cc1plus, and its plugins), and it helps a lot.
Why is a segmentation fault not recoverable?
When exactly does a segmentation fault happen (i.e. when is SIGSEGV sent)?
When you attempt to access memory you don’t have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The signal SIGSEGV is standardized, but different operating systems may implement it differently. "Segmentation fault" is mainly a term used on *nix systems; Windows calls it an "access violation".
Why is the process in an undefined behavior state after that point?
Because one or several of the variables in the program didn’t behave as expected. Let’s say you have an array that is supposed to store a number of values, but you didn’t allocate enough room for all of them. Only those you allocated room for get written correctly, and the rest, written out of bounds of the array, can hold any values. How is the OS to know how critical those out-of-bounds values are for your application to function? It knows nothing of their purpose.
Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack overflows, for example, are bugs of this kind that tend to overwrite adjacent variables, unless the error is caught by protection mechanisms.
If we look at the behavior of "bare metal" microcontroller systems without any OS and no virtual memory features, just raw physical memory, they will silently do exactly as told, for example overwriting unrelated variables and carrying on. That in turn could cause disastrous behavior if the application is mission-critical.
Why is it not recoverable?
Because the OS doesn’t know what your program is supposed to be doing.
Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.
Why does this solution avoid that unrecoverable state? Does it even?
That solution just ignores the error and keeps going. It doesn’t fix the problem that caused it. It’s a very dirty patch, and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.
We have to realize that a segmentation fault is a symptom of a bug, not the cause.
What is a bus error? Is it different from a segmentation fault?
Bus errors are rare nowadays on x86 and occur when your processor cannot even attempt the memory access requested, typically:
- using a processor instruction with an address that does not satisfy its alignment requirements.
Segmentation faults occur when accessing memory which does not belong to your process. They are very common and are typically the result of:
- using a pointer to something that was deallocated.
- using an uninitialized hence bogus pointer.
- using a null pointer.
- overflowing a buffer.
PS: To be more precise, it is not manipulating the pointer itself that will cause issues. It's accessing the memory it points to (dereferencing).