Is There a Point to Trapping "Segfault"

Is there a point to trapping segfault?

You can't really hope to recover from a segfault. You can detect that it happened and dump out relevant application-specific state if possible, but you can't sensibly continue the process. This is because (among other reasons):

  • The thread which failed cannot be continued, so your only options are longjmp or terminating the thread. Neither is safe in most cases.
  • Either way, you may leave a mutex / lock in a locked state which causes other threads to wait forever
  • Even if that doesn't happen, you may leak resources
  • Even if you don't do either of those things, the thread which segfaulted may have left the internal state of the application inconsistent when it failed. An inconsistent internal state could cause data errors or further bad behaviour later on, causing more problems than simply quitting would

So in general, there is no point in trapping it and doing anything EXCEPT terminating the process in a fairly abrupt fashion. There's no point in attempting to write (important) data back to disc, or continue to do other useful work. There is some point in dumping out state to logs - which many applications do - and then quitting.

A possibly useful thing to do might be to exec() your own process, or have a watchdog process which restarts it in the case of a crash. (NB: exec does not always have well-defined behaviour if your process has more than one thread.)
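A minimal sketch of the watchdog idea (my own illustration, not from the answer above; run_application() is a hypothetical stand-in for the real work):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the real work of the application. */
static int run_application(void)
{
    /* ... actual work here ... */
    return 0;
}

int main(void)
{
    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return 1;
        if (pid == 0) {
            /* Child does the real work; if it segfaults, only the child dies. */
            _exit(run_application());
        }

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return 1;
        if (WIFEXITED(status))
            return WEXITSTATUS(status);      /* clean exit: stop restarting */
        if (WIFSIGNALED(status))
            fprintf(stderr, "child died from signal %d, restarting\n",
                    WTERMSIG(status));
        sleep(1);                            /* avoid a tight crash/restart loop */
    }
}

The same structure works if the child exec()s a separate binary instead of calling a function; the important part is that the crash-prone work happens in a process the watchdog can observe and restart.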

Is it possible to trap a segmentation fault?

I had a similar problem, rendering cad files via pythonocc.
Sometimes when opening a file the script just segfaulted. Really annoying. You had to remove the file manually and restart the batch.

So basically the idea is to start an extra process for the task and check its exit code:

import multiprocessing as mp


def do_stuff_that_segfaults(param):
    call_shitty_library(param)


def main(param):
    # Run the crash-prone call in a separate process so a segfault
    # only kills the child, never the main script.
    p = mp.Process(target=do_stuff_that_segfaults, args=(param,))  # args must be a tuple
    p.start()
    p.join()
    if p.exitcode == -11:  # -11 == -SIGSEGV: the child died from a segmentation fault
        do_stuff_in_case_of_segfault()

I've also tried other suggestions, like the Segmentation Fault Catch you linked to, but to no avail. I really would have liked to use mp.Pool() to use all cores, but you don't get the exit status from mp.Pool().

So far the code runs well and I moved the files resulting in a segfault into another folder via do_stuff_in_case_of_segfault() without getting my main script killed.

Is there any way to guarantee a segfault?


  1. Are ALL segfaults undefined behavior?

This question is trickier than it might seem, because "undefined behavior" is a description of either a C source program, or the result of running a C program in the "abstract machine" that describes behavior of C programs in general; but "segmentation fault" is a possible behavior of a particular operating system, often with help from particular CPU features.

The C Standard doesn't say anything at all about segmentation faults. The one nearly relevant thing it does say is that if a program execution does not have undefined behavior, then a real implementation's execution of the program will have the same observable behavior as the abstract machine's execution. And "observable behavior" is defined to include just accesses to volatile objects, data written into files, and input and output of interactive devices.

If we can assume that a "segmentation fault" always prevents further actions by a program, then any segmentation fault without the presence of undefined behavior could only happen after all of the observable behavior has completed as expected. (But note that valid optimizations can sometimes cause things to happen in a different order from the obvious one.)

So a situation where a program causes a segmentation fault (for the OS) although there is no undefined behavior (according to the C Standard) doesn't make much sense for a real compiler and OS, but we can't rule it out completely.

But also, all that is assuming perfect computers. If RAM is bad, an intended address value might end up changed. There are even very infrequent but measurable events where cosmic rays can change a bit within otherwise good RAM. Soft errors like those could cause a segmentation fault (on a system where "segmentation fault" is a thing), for practically any perfectly written C program, with no undefined behavior possible on any implementation or input.


  2. If no, is there any way to ensure a segfault?

That depends on the context, and what you mean by "ensure".

Can you write a C program that will always cause a segfault? No, because some computers might not even have such a concept.

Can you write a C program that always causes a segfault if it is possible on a computer? No, because some compilers might do things to avoid the actual problem in some cases. And since the program's behavior is undefined, not causing a segfault is just as valid a result as causing a segfault. In particular, one real obstacle you might run into, doing even simple things like deliberately dereferencing a null pointer value, is that compiler optimizations sometimes assume that the inputs and logic will always be such that undefined behavior never happens, since it's acceptable not to do what the program says for inputs that do lead to undefined behavior.

Knowing details about how one specific OS, and possibly the CPU, handle memory and sometimes generate segmentation faults, can you write assembly instructions that will always cause a segfault? Certainly, if the segfault handling is of any value at all. Can you write a C program that will trigger a segfault in roughly the same manner? Most probably.
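To make that concrete, here is a hedged C sketch (my own illustration) of the two usual approaches: the classic invalid write, which is undefined behavior and therefore guarantees nothing, versus simply delivering the signal yourself, which avoids the undefined-behavior question entirely:

#include <signal.h>

int main(void)
{
    /* Option 1: the classic null-pointer write. This is undefined behavior,
       so a compiler may do anything with it; the volatile cast usually keeps
       the store from being optimized away, but a segfault is not guaranteed. */
    *(volatile int *)0 = 0;

    /* Option 2 (normally unreachable if option 1 crashed): skip the undefined
       behavior entirely and deliver the signal yourself. raise(SIGSEGV) gets
       you the same default action (terminate + core dump) on POSIX systems
       without relying on an invalid memory access. */
    raise(SIGSEGV);
    return 0;
}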

After Segfault: Is there a way to check if a pointer is still valid?

Of course if the stack or other memory that you rely upon has been corrupted then there could be problems, but that is true for any code.

Assuming that there is no problem with the stack or other memory that you rely upon, and assuming that you do not call any functions like malloc() that are not async-signal safe, and assuming that you do not attempt to return from your signal handler, then there should be no problem reading or writing your buffer from within your signal handler.

If you are trying to test whether a particular address is valid, you could use a system call such as mincore() and check for an error result.
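For illustration only (not part of the answer above), a Linux-flavoured sketch of that mincore() idea; note that it only tells you whether the page is mapped, not whether the pointer still points at a live object, and mincore() is not async-signal-safe, so don't call it from inside the handler:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

/* Rough, Linux-specific check: is the page containing 'p' mapped at all? */
static int page_is_mapped(const void *p)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    uintptr_t page = (uintptr_t)p & ~((uintptr_t)pagesize - 1);
    unsigned char vec;

    if (mincore((void *)page, (size_t)pagesize, &vec) == 0)
        return 1;                         /* page is mapped */
    return (errno == ENOMEM) ? 0 : -1;   /* ENOMEM: unmapped; anything else: error */
}

int main(void)
{
    int x = 0;
    printf("page holding a stack variable mapped: %d\n", page_is_mapped(&x));
    printf("page at address 0x1 mapped:           %d\n", page_is_mapped((void *)1));
    return 0;
}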

Catching segfaults in C

You have to define a signal handler. This is done on Unix systems using the function sigaction. I've done this with the same code on Fedora 64- and 32-bit, and on Sun Solaris.
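A minimal sketch of what that looks like (my own example, assuming Linux/POSIX): install a SIGSEGV handler with sigaction, do only async-signal-safe work inside it, then re-raise the signal with the default action so the process still dies and dumps core:

#include <signal.h>
#include <string.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
    /* Only async-signal-safe calls in here; info->si_addr holds the faulting
       address if you want to format it by hand. */
    static const char msg[] = "caught SIGSEGV\n";
    (void)info;
    (void)ucontext;
    write(STDERR_FILENO, msg, sizeof msg - 1);

    /* Restore the default action and re-raise, so the process still
       terminates and produces a core dump. */
    signal(sig, SIG_DFL);
    raise(sig);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 0;   /* deliberately fault to exercise the handler */
    return 0;
}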

Why is a segmentation fault not recoverable?


When exactly does segmentation fault happen (=when is SIGSEGV sent)?

When you attempt to access memory you don't have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The signal SIGSEGV is standardized, but different OSes might implement it differently. "Segmentation fault" is mainly a term used on *nix systems; Windows calls it "access violation".

Why is the process in undefined behavior state after that point?

Because one or several of the variables in the program didn't behave as expected. Let's say you have some array that is supposed to store a number of values, but you didn't allocate enough room for all of them. So only those you allocated room for get written correctly, and the rest, written out of bounds of the array, can hold any values. How exactly is the OS to know how critical those out-of-bounds values are for your application to function? It knows nothing of their purpose.

Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack overflows, for example, are segmentation faults of this kind, prone to overwriting adjacent variables unless the error is caught by protection mechanisms.

If we look at the behavior of "bare metal" microcontroller systems without any OS and no virtual memory features, just raw physical memory - they will just silently do exactly as told, for example overwriting unrelated variables and carrying on. Which in turn could cause disastrous behavior in case the application is mission-critical.

Why is it not recoverable?

Because the OS doesn’t know what your program is supposed to be doing.

Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.

Why does this solution avoid that unrecoverable state? Does it even?

That solution just ignores the error and keeps going. It doesn't fix the problem that caused it. It's a very dirty patch, and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.

We have to realize that a segmentation fault is a symptom of a bug, not the cause.

Segmentation fault handling

The default action for things like SIGSEGV is to terminate your process, but as you've installed a handler for it, your handler will be called instead, overriding the default behavior. The problem is that the segfaulting instruction may be retried after your handler finishes, and if you haven't taken measures to fix the first segfault, the retried instruction will fault again, and so on.

So first spot the instruction that resulted in SIGSEGV and try to fix it (you can call something like backtrace() in the handler and see for yourself what went wrong).
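A hedged sketch of that "log a backtrace, then die" pattern (my own example, glibc-specific): backtrace_symbols_fd() writes straight to a file descriptor and so avoids malloc() inside the handler, although backtrace() itself is not formally async-signal-safe, so treat this as best-effort diagnostics before terminating:

#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void segv_backtrace_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);

    /* Write the symbolized frames directly to stderr, no heap allocation. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);

    signal(sig, SIG_DFL);
    raise(sig);               /* let the default action terminate the process */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_backtrace_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 0;   /* trigger the handler for demonstration */
    return 0;
}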

Also, the POSIX standard says that:

The behavior of a process is undefined after it returns normally from a signal-catching function for a [XSI] SIGBUS, SIGFPE, SIGILL, or SIGSEGV signal that was not generated by kill(), [RTS] sigqueue(), or raise().

So, the ideal thing to do is to fix your segfault in the first place. A handler for SIGSEGV is not meant to bypass the underlying error condition.

So the best suggestion would be: don't catch the SIGSEGV. Let it dump core. Analyze the core. Fix the invalid memory reference, and there you go!

Why are segfaults called faults (and not aborts) if they are not recoverable?

At a CPU level, modern OSes don't use x86 segment limits for memory protection. (And in fact they couldn't even if they wanted to in long mode (x86-64); segment base is fixed at 0 and limit at -1).

OSes use virtual memory page tables, so the real CPU exception on an out-of-bounds memory access is a page fault.

x86 manuals call this a #PF(fault-code) exception, e.g. see the list of exceptions that the add instruction can raise. Fun fact: the x86 exception for access outside of a segment limit is #GP(0).

It's up to the OS's page-fault handler to decide how to handle it. Many #PF exceptions happen as part of normal operation:

  • copy-on-write mapping got written: copy the page and mark it writeable in the page table, then return to user-space to retry the instruction that faulted. (This is a type of "soft" aka "minor" page fault.)
  • other soft page fault, e.g. the kernel was lazy and didn't actually have the page table updated to reflect the mappings the process made. (e.g. mmap(2) without MAP_POPULATE).
  • hard page fault: find some physical memory and read the file from disk (a file mapping or from swap file/partition for anonymous pages).

After sorting out any of the above, update the page table that the CPU reads on its own, and invalidate that TLB entry if necessary. (e.g. valid but read-only changed to valid + read-write).

Only if the kernel finds that the process really doesn't logically have anything mapped to that address (or that it's a write to a read-only mapping) will the kernel deliver a SIGSEGV to the process. This is purely a software thing, after sorting out the cause of the hardware exception.
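To see that page faults are routine rather than fatal, here is a small Linux-flavoured sketch (my own illustration, not part of the answer): it counts the soft faults taken when freshly mmap'ed anonymous memory is touched for the first time, i.e. exactly the "kernel was lazy" case described above:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;
    size_t len = 64 * 1024 * 1024;   /* 64 MiB */

    /* Anonymous mapping without MAP_POPULATE: the kernel hands out the address
       range, but no physical pages are attached yet. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    getrusage(RUSAGE_SELF, &before);
    memset(p, 1, len);               /* first touch: roughly one soft fault per page */
    getrusage(RUSAGE_SELF, &after);

    printf("minor (soft) page faults taken: %ld\n",
           after.ru_minflt - before.ru_minflt);
    return 0;
}

Every one of those faults went through the kernel's page-fault handler and came back without the process ever noticing; SIGSEGV is only the case where the handler has nothing sensible to do.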


The English text for SIGSEGV (from strsignal(3)) is "Segmentation fault" on all Unix/Linux systems, so that's what's printed (by the shell) when a child process dies from that signal.

This term is well understood, so it has stuck, even though it mostly exists for historical reasons and the hardware doesn't actually use segmentation.

Note that you also get a SIGSEGV for stuff like trying to execute privileged instructions in user-space (like wbinvd or wrmsr (write model-specific register)). At a CPU level, the x86 exception is #GP(0) for privileged instructions when you're not in ring 0 (kernel mode).

Also for misaligned SSE instructions (like movaps), although some Unixes on other platforms send SIGBUS for misaligned access faults (e.g. Solaris on SPARC).



Why do we call it a segmentation fault and not a segmentation abort then?

It is recoverable. It doesn't crash the whole machine / kernel; it just means that a user-space process tried to do something that the kernel doesn't allow.

Even for the process that segfaulted, it can be recoverable. This is why it's a catchable signal, unlike SIGKILL. Usually you can't just resume execution, but you can usefully record where the fault was (e.g. print a precise exception error message and even a stack backtrace).

The signal handler for SIGSEGV could longjmp or whatever. Or if the SIGSEGV was expected, then modify the code or the pointer used for the load, before returning from the signal handler. (e.g. for a Meltdown exploit, although there are much more efficient techniques that do the chained loads in the shadow of a mispredict or something else that suppresses the exception, instead of actually letting the CPU raise an exception and catching the SIGSEGV the kernel delivers)

Most programming languages (other than assembly) aren't low-level enough to give well defined behaviour when optimizing around an access that might segfault in a way that would let you write a handler that recovers. This is why usually you don't do anything more than print an error message (and maybe a stack backtrace) in a SIGSEGV handler if you install one at all.


Some JIT compilers for sandboxed languages (like Javascript) use hardware memory access checks to eliminate NULL pointer checks. In the normal case there's no fault, so it doesn't matter how slow the faulting case is.

A Java JVM can turn a SIGSEGV received by a thread of the JVM into a NullPointerException for the Java code it's running, without any problems for the JVM.

  • Effective Null Pointer Check Elimination Utilizing Hardware Trap, a research paper on this for Java, from three IBM scientists.

  • SableVM: 6.2.4 Hardware Support on Various Architectures about NULL pointer checks
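For illustration, here is a minimal C sketch of the general mechanism (my own example, not how any particular JVM is implemented): let the hardware catch the null access, then turn the SIGSEGV into a language-level error with sigsetjmp/siglongjmp. Jumping out of a handler for a hardware-generated SIGSEGV is not strictly portable, but it works on common Linux/x86-64 setups and shows the idea:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf recover_point;

static void segv_to_exception(int sig)
{
    (void)sig;
    /* Jump back past the faulting access instead of returning into it. */
    siglongjmp(recover_point, 1);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_to_exception;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    volatile int *p = NULL;       /* stand-in for an unchecked reference */
    if (sigsetjmp(recover_point, 1) == 0) {
        int v = *p;               /* faults; control reappears in the else branch */
        printf("read %d\n", v);
    } else {
        puts("null access reported as a recoverable error, "
             "roughly what a JVM does for NullPointerException");
    }
    return 0;
}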

A further trick is to put the end of an array at the end of a page (followed by a large-enough unmapped region), so bounds-checking on every access is done for free by the hardware. If you can statically prove the index is always positive, and that it can't be larger than 32 bit, you're all set.

  • Implicit Java Array Bounds Checking on 64-bit
    Architectures. They talk about what to do when array size isn't a multiple of the page size, and other caveats.
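A hedged sketch of that end-of-page trick (my own, Linux-flavoured illustration): place the array flush against a PROT_NONE guard page, so any access past the end is bounds-checked by the MMU for free:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);

    /* Two pages: data lives in the first, the second becomes an inaccessible
       guard page, so any access past the end of the array faults in hardware. */
    char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return 1;
    mprotect(base + page, page, PROT_NONE);

    /* Place a 16-element array flush against the end of the first page. */
    int *array = (int *)(base + page - 16 * sizeof(int));
    array[15] = 42;                  /* last valid element: fine */
    printf("%d\n", array[15]);
    /* array[16] = 0; */             /* would land in the guard page -> SIGSEGV */
    return 0;
}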


Trap vs. abort

I don't think there's standard terminology to make that distinction. It depends what kind of recovery you're talking about. Obviously the OS can keep running after anything user-space can make the hardware do, otherwise unprivileged user-space could crash the machine.

Related: on "When an interrupt occurs, what happens to instructions in the pipeline?", Andy Glew (a CPU architect who worked on Intel's P6 microarchitecture) says "trap" is basically any interrupt that's caused by the code that's running (rather than an external signal), and that happens synchronously (e.g. when a faulting instruction reaches the retirement stage of the pipeline without an earlier branch mispredict or other exception being detected first).

"Abort" isn't standard CPU-architecture terminology. Like I said, you want the OS to be able to continue no matter what, and only hardware failure or kernel bugs normally prevent that.

AFAIK, "abort" is not very standard operating-systems terminology either. Unix has signals, and some of them are uncatchable (like SIGKILL and SIGSTOP), but most can be caught.

SIGABRT can be caught by a signal handler. The process exits if the handler returns, so if you don't want that you can longjmp out of it. But AFAIK no error condition raises SIGABRT; it's only sent manually by software, e.g. by calling the abort() library function. (It often results in a stack backtrace.)



x86 exception terminology

If you look at x86 manuals or this exception table on the osdev wiki, there are specific meanings in this context (thanks to @MargaretBloom for the descriptions):

  • trap: raised after an instruction successfully completes; the return address points after the trapping instruction. #DB debug and #OF overflow (from the into instruction) exceptions are traps. (Some sources of #DB are faults instead.) But int 0x80 or other software-interrupt instructions are also traps, as is syscall (but it puts the return address in rcx instead of pushing it; syscall is not an exception, and thus not really a trap in this sense)

  • fault: raised after an attempted execution is made and then rolled back; the return address points to the faulting instruction. (Most exception types are faults)

  • abort is when the return address points to an unrelated location (i.e. for #DF double-fault and #MC machine-check). Triple fault can't be handled; it's what happens when the CPU hits an exception trying to run the double-fault handler, and really does stop the whole CPU.

Note that even Intel CPU architects like Andy Glew sometimes use the term "trap" more generally, I think meaning any synchronous exception, when discussing computer-architecture theory. Don't expect people to stick to the above terminology unless you're actually talking about handling specific exceptions on x86. Although it is useful and sensible terminology, and you could use it in other contexts. But if you want to make the distinction, you should clarify what you mean by each term so everyone's on the same page.

Dangers of stack overflow and segmentation fault in C++


Accordingly, whenever the stack storage is used it needs to be deallocated manually, but if the heap is used, then the deallocation is automatically done.

When you use stack - local variables in the function - they are deallocated automatically when the function ends (returns).

When you allocate from the heap, the memory allocated remains "in use" until it is freed. If you don't do that, your program, if it runs for long enough and keeps allocating "stuff", will use all the memory available to it and eventually fail.

Note that "stackfault" is almost impossible to recover from in an application, because the stack is no longer usable when it's full, and most operations to "recover from error" will involve using SOME stack memory. The processor typically has a special trap to recover from stack fault, but that lands insise the operating system, and if the OS determines the application has run out of stack, it often shows no mercy at all - it just "kills" the application immediately.

1.- Let's suppose that I run a program with a recursive solution using an infinite chain of function calls. Theoretically the program crashes (stack overflow), but does it cause some trouble to the computer itself? (To the RAM maybe, or to the OS?)

No, the computer itself is not harmed by this in and of itself. There may of course be data-loss if your program wasn't saving something that the user was working on.

Unless the hardware is very badly designed, it's very hard to write code that causes any harm to the computer, beyond loss of stored data (of course, if you write a program that fills the entire hard disk from the first to the last sector, your data will be overwritten with whatever your program fills the disk with - which may well cause the machine to not boot again until you have re-installed an operating system on the disk). But RAM and processors don't get damaged by bad coding (fortunately, as most programmers make mistakes now and again).
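For completeness, a tiny sketch of the scenario in the question (my own example): unbounded recursion that exhausts the stack and, on a typical desktop OS, gets killed with a segmentation fault, harming nothing but the process itself:

#include <stdio.h>

/* Unbounded recursion: each call adds a stack frame until the guard page
   below the stack is hit and the process is killed with SIGSEGV. The
   volatile buffer and the use of the return value keep the compiler from
   turning this into a plain loop, though nothing is guaranteed once the
   behavior is undefined. */
static int recurse(int depth)
{
    volatile char pad[1024];
    pad[0] = (char)depth;
    return recurse(depth + 1) + pad[0];
}

int main(void)
{
    printf("%d\n", recurse(0));   /* recurse() never actually returns */
    return 0;
}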

2.- What happens if I forget to deallocate memory on the heap? I mean, does it just cause trouble to the program, or is it permanent to the computer in general? I mean, it might be that such memory could never be used again or something.

Once the program finishes (and most programs that use "too much memory" do terminate in some way or another, at some point), the memory is returned to the OS, so nothing is lost permanently.

Of course, how well the operating system and other applications handle "there is no memory at all available" varies a little bit. The operating system in itself is generally OK with it, but some drivers that are badly written may well crash, and thus cause your system to reboot if you are unlucky. Applications are more prone to crashing due to there not being enough memory, because allocations end up with NULL (zero) as the "returned address" when there is no memory available. Using address zero in a modern operating system will almost always lead to a "Segmentation fault" or similar problem (see below for more on that).

But these are extreme cases; most systems are set up such that one application gobbling all available memory will in itself fail before the rest of the system is impacted - not always, and it's certainly not guaranteed that the application "causing" the problem is the first one to be killed if the OS kills applications simply because they "eat a lot of memory". Linux does have an "out of memory killer", which is a pretty drastic method to ensure the system can continue to work [by some definition of "work"].
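To connect that to code: when memory runs out, malloc() reports it by returning NULL, and it is dereferencing that unchecked NULL, not the allocation failure itself, that produces the segmentation fault. A small sketch (my own illustration):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* An unrealistically large request so the failure path is obvious:
       on most systems malloc() returns NULL here instead of memory. */
    size_t huge = (size_t)-1 / 2;
    int *p = malloc(huge);

    if (p == NULL) {
        fprintf(stderr, "allocation failed, handled gracefully\n");
        return 1;
    }

    /* Only reached if the allocation succeeded. Skipping the NULL check and
       writing through p anyway is the classic "use of address zero" that
       turns an out-of-memory condition into a segmentation fault. */
    p[0] = 42;
    free(p);
    return 0;
}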

3.- What are the problems of getting a segmentation fault (the heap).

Segmentation faults don't directly have anything to do with the heap. The term segmentation fault comes from older operating systems (Unix-style) that used "segments" of memory for different usages, and a "segmentation fault" was when the program went outside its allocated segment. In modern systems, the memory is split into "pages" - typically 4KB each, but some processors have larger pages, and many modern processors support "large pages" of, for example, 2MB or 1GB, which are used for large chunks of memory.

Now, if you use an address that points to a page that isn't there (or isn't "yours"), you get a segmentation fault. This typically will end the application then and there. You can "trap" segmentation faults, but in all operating systems I'm aware of, it's not valid to try to continue from this "trap" - but you could, for example, store away some files to explain what happened and help troubleshoot the problem later, etc.

How to catch segmentation fault in Linux?

On Linux we can have these as exceptions, too.

Normally, when your program performs a segmentation fault, it is sent a SIGSEGV signal. You can set up your own handler for this signal and mitigate the consequences. Of course you should really be sure that you can recover from the situation. In your case, I think, you should debug your code instead.

Back to the topic. I recently encountered a library (short manual) that transforms such signals to exceptions, so you can write code like this:

try
{
    *(int*) 0 = 0;
}
catch (std::exception& e)
{
    std::cerr << "Exception caught : " << e.what() << std::endl;
}

I didn't check it thoroughly, though - it works on my x86-64 Gentoo box. It has a platform-specific backend (borrowed from gcc's Java implementation), so it can work on many platforms. It only supports x86 and x86-64 out of the box, but you can get backends from libjava, which resides in the gcc sources.


