Best Practices for Recovering from a Segmentation Fault

The best practice is to fix the original issue causing the core dump, recompile and then relaunch the application.

To catch these errors before deploying in the wild, do plenty of peer review and write lots of tests.

Why is a segmentation fault not recoverable?

When exactly does segmentation fault happen (=when is SIGSEGV sent)?

When you attempt to access memory you don’t have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The signal SIGSEGV is standardized, but different operating systems might implement it differently. "Segmentation fault" is mainly a term used on *nix systems; Windows calls it an "access violation".

Why is the process in undefined behavior state after that point?

Because one or several of the variables in the program didn’t behave as expected. Let’s say you have an array that is supposed to store a number of values, but you didn’t allocate enough room for all of them. Only the values you allocated room for get written correctly; the rest are written out of bounds of the array, and that memory can hold anything. How exactly is the OS to know how critical those out-of-bounds values are for your application to function? It knows nothing of their purpose.

Furthermore, writing outside allowed memory can often corrupt other, unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack overflows, for example, are segmentation faults that are prone to overwriting adjacent variables unless the error is caught by protection mechanisms.

If we look at the behavior of "bare metal" microcontroller systems without any OS and no virtual memory features, just raw physical memory, they will silently do exactly as told: for example, overwrite unrelated variables and keep on going. That in turn could cause disastrous behavior if the application is mission-critical.

Why is it not recoverable?

Because the OS doesn’t know what your program is supposed to be doing.

Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.

Why does this solution avoid that unrecoverable state? Does it even?

That solution just ignores the error and keeps on going. It doesn’t fix the problem that caused it. It’s a very dirty patch, and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.

We have to realize that a segmentation fault is a symptom of a bug, not the cause.

How to recover from Segmentation fault in Python?

Call the third-party module in another process, so that a segfault there doesn't crash the main one.

Recover from segfault in Python

I had some unreliable C extensions throw segfaults every once in a while and, since there was no way I was going to be able to fix that, what I did was create a decorator that would run the wrapped function in a separate process. That way you can stop segfaults from killing the main process.

Something like this:
https://gist.github.com/joezuntz/e7e7764e5b591ed519cfd488e20311f1

Mine was a bit simpler, and it did the job for me. Additionally it lets you choose a timeout and a default return value in case there was a problem:

#! /usr/bin/env python3

# std imports
import logging
import multiprocessing as mp

def parametrized(dec):
    """This decorator can be used to create other decorators that accept arguments"""

    def layer(*args, **kwargs):
        def repl(f):
            return dec(f, *args, **kwargs)

        return repl

    return layer

@parametrized
def sigsev_guard(fcn, default_value=None, timeout=None):
    """Used as a decorator with arguments.
    The decorated function will be called with its input arguments in another process.

    If the execution lasts longer than *timeout* seconds, it will be considered failed.

    If the execution fails, *default_value* will be returned.
    """

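    def _fcn_wrapper(*args, **kwargs):
        q = mp.Queue()
        # Note: a lambda target relies on the "fork" start method;
        # with "spawn" the target has to be picklable.
        p = mp.Process(target=lambda q: q.put(fcn(*args, **kwargs)), args=(q,))
        p.start()
        p.join(timeout=timeout)
        exit_code = p.exitcode

        if exit_code == 0:
            return q.get()

        if exit_code is None:
            # Timed out: the child is still running, so stop it before giving up.
            p.terminate()
            p.join()

        logging.warning('Process did not exit correctly. Exit code: {}'.format(exit_code))
        return default_value

    return _fcn_wrapper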

So you would use it like:


@sigsev_guard(default_value=-1, timeout=60)
def your_risky_function(a,b,c,d):
    ...

Coming back to life after Segmentation Violation

#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <signal.h>
#include <stdlib.h>
#include <ucontext.h>

void safe_func(void)
{
    puts("Safe now ?");
    exit(0); // can't return to main, it's where the segfault occurred.
}

void
handler (int cause, siginfo_t * info, void *uap)
{
  /* For test only. Never call stdio functions in a signal handler otherwise. */
  printf ("SIGSEGV raised at address %p\n", info->si_addr);
  ucontext_t *context = uap;
  /* On my particular system, compiled with gcc -O2, the offending instruction
  generated for "*f = 16;" is 6 bytes. Let's try to set the instruction
  pointer to the next instruction (general register 14 is EIP on 32-bit Linux x86). */
  context->uc_mcontext.gregs[14] += 6;
  // Alternatively, try to jump to a "safe place":
  //context->uc_mcontext.gregs[14] = (unsigned int)safe_func;
}

int
main (int argc, char *argv[])
{
  struct sigaction sa;
  sa.sa_sigaction = handler;
  int *f = NULL;
  sigemptyset (&sa.sa_mask);
  sa.sa_flags = SA_SIGINFO;
  if (sigaction (SIGSEGV, &sa, 0)) {
      perror ("sigaction");
      exit(1);
  }
  //cause a segfault
  *f = 16; 
  puts("Still Alive");
  return 0;
}

$ ./a.out
SIGSEGV raised at address (nil)
Still Alive

I would beat someone with a bat if I saw something like this in production code though; it's an ugly, for-fun hack. You'll have no idea whether the segfault has corrupted some of your data, you'll have no sane way of recovering and knowing that everything is OK now, and there's no portable way of doing this. The only mildly sane thing you could do is try to log an error (use write() directly, not any of the stdio functions, as they're not signal safe) and perhaps restart the program. For those cases you're much better off writing a supervisor process that monitors a child process's exit, logs it and starts a new child process.
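
The supervisor idea could look roughly like the sketch below. It is only an illustration under stated assumptions: supervise() and flaky_worker() are hypothetical names, the worker fakes a crash with raise(SIGSEGV) on its first two attempts, and a real supervisor would typically exec a separate program and add back-off between restarts.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker: "crashes" on attempts 0 and 1, succeeds afterwards. */
void flaky_worker(int attempt)
{
    if (attempt < 2) {
        raise(SIGSEGV);   /* stand-in for a real segmentation fault */
    }
    _exit(EXIT_SUCCESS);
}

/* Runs worker(attempt) in a child process and restarts it after an abnormal
   exit. Returns the number of restarts that were needed, or -1 if the worker
   still failed after max_restarts restarts or fork/waitpid failed. */
int supervise(void (*worker)(int), int max_restarts)
{
    for (int attempt = 0; attempt <= max_restarts; attempt++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            /* Child: do the risky work in its own address space. */
            worker(attempt);
            _exit(EXIT_SUCCESS);
        }

        /* Parent: wait for the child and inspect how it ended. */
        int status = 0;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid");
            return -1;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            return attempt;
        }
        if (WIFSIGNALED(status)) {
            fprintf(stderr, "child %ld killed by signal %d, restarting\n",
                    (long)pid, WTERMSIG(status));
        } else {
            fprintf(stderr, "child %ld exited abnormally, restarting\n",
                    (long)pid);
        }
    }
    return -1;
}
```

Calling supervise(flaky_worker, 5) restarts the worker twice and then returns. The parent's memory is never touched by the crashing child, which is the whole point of doing it this way instead of patching the instruction pointer in a signal handler.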

Segmentation fault in CS50 (2020) recovery program

When compiling, always enable the warnings, then fix those warnings.

gcc -ggdb3 -Wall -Wextra -Wconversion -pedantic -std=gnu11 -c "untitled1.c" -o "untitled1.o"

results in several warnings like:

untitled1.c:46:91: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]

if(buffer[0] == 0xff & buffer[1] == 0xd8 & buffer[2] == 0xff & (buffer[3] & 0xf0) == 0xe0)

Note: a single & is a bitwise AND. You really want a logical AND (&&) for all but the last one in this statement.

regarding:
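
A minimal sketch of the corrected check, wrapped in a hypothetical helper named is_jpeg_header() (the name and the len parameter are mine, not from the posted code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the first four bytes look like a JPEG header.
   Logical && joins the comparisons; the one remaining & is the bitwise
   mask that keeps only the high nibble of the fourth byte. */
bool is_jpeg_header(const unsigned char *buffer, size_t len)
{
    return len >= 4
        && buffer[0] == 0xff
        && buffer[1] == 0xd8
        && buffer[2] == 0xff
        && (buffer[3] & 0xf0) == 0xe0;
}
```

With && the comparisons also short-circuit, so the remaining bytes are not examined once one comparison has failed.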

FILE *forensic_image = fopen(argv[1],"r");

Always check (!=NULL) the returned value to assure the operation was successful. If not successful (==NULL) then call

perror( "fopen failed" );

to output to stderr both your error message and the text reason the system thinks the error occurred.

regarding:

while(!feof(forensic_image))

Please read: "Why is while( !feof(file) ) always wrong?"

regarding:

FILE *forensic_image = fopen(argv[1],"r");

This is already done in the prior code block. There is absolutely no reason to do this again, and it will create problems in the code. Suggest replacing:

if(fopen(argv[1],"r") == NULL)      
{         
    printf("This image cannot be opened for reading\n");
    return 1;     
}

with:

if( (forensic_image = fopen(argv[1],"r") ) == NULL)      
{         
    perror( "fopen for input file failed" );         
    exit( EXIT_FAILURE );     
}

regarding:

BYTE *buffer = malloc( 512 * sizeof(BYTE) );

and later:

free( buffer );

This is a waste of code and resources. The project only needs one such instance. Suggest:

#define RECORD_LEN 512

and

unsigned char buffer[ RECORD_LEN ];

regarding:

fread(buffer, sizeof(BYTE), 512, forensic_image);

The function fread() returns a size_t. You should assign the returned value to a size_t variable and check that value to make sure the operation was successful. In fact, that statement should be in the while() condition.

regarding:
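
Something along these lines, as a sketch only: count_blocks() is a made-up function that just counts complete blocks, where the real program would inspect and write each one. It assumes the stream is already open and uses an element size of 1 so the return value is a byte count.

```c
#include <stdio.h>

#define RECORD_LEN 512

/* Counts the complete RECORD_LEN-byte blocks in an already opened stream.
   Returns -1 if a read error occurred. */
long count_blocks(FILE *in)
{
    unsigned char buffer[RECORD_LEN];
    long blocks = 0;
    size_t got;

    /* fread() sits in the loop condition, so the loop ends on the first
       short read (end of file or error) without consulting feof(). */
    while ((got = fread(buffer, 1, RECORD_LEN, in)) == RECORD_LEN) {
        blocks++;   /* a full block is available in buffer here */
    }

    if (ferror(in)) {
        return -1;
    }
    return blocks;
}
```

A trailing partial block (got less than RECORD_LEN at end of file) is simply ignored here; whether that is right depends on the input format.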

sprintf(filename, "%03i.jpg", JPEG_num);

This results in undefined behavior and can result in a seg fault event because the pointer filename is initialized to NULL. Suggest:

char filename[20];

to avoid that problem
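
A possible shape for this, using snprintf() so that a too-small buffer is reported rather than overrun. make_jpeg_name() and FILENAME_LEN are hypothetical names, not from the posted code:

```c
#include <stdio.h>

#define FILENAME_LEN 20

/* Formats "NNN.jpg" into out. Returns 0 on success, or -1 if the name
   did not fit into out_size bytes (or formatting failed). */
int make_jpeg_name(char *out, size_t out_size, int jpeg_num)
{
    int n = snprintf(out, out_size, "%03i.jpg", jpeg_num);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Usage would be something like: char filename[FILENAME_LEN]; followed by make_jpeg_name(filename, sizeof filename, JPEG_num), checking the return value before calling fopen().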

regarding:

else    // If not first JPEG             
{                 
    fclose(jpeg0);

If you're (for instance) working with the 3rd file, then jpeg0 is already closed, resulting in a run time error. Suggest removing the statement:

FILE *jpeg0;

and always using jpegn

regarding:

else    // If already found JPEG         
{             
    fwrite(buffer, sizeof(BYTE), 512, jpegn);         
}

On the first output file, jpegn is not set, so this results in a crash. Again, ONLY use jpegn for all output file operations.

regarding:

fwrite(buffer, sizeof(BYTE), 512, jpegn);

This returns the number of items (each of the size given by the second parameter) actually written, so this should be:

if( fwrite(buffer, sizeof(BYTE), 512, jpegn) != 512 ) { /* handle error */ }

The posted code contains some 'magic' numbers, like 512. 'Magic' numbers are numbers with no basis. They make the code much more difficult to understand, debug, etc. Suggest using an enum statement or #define statement to give those 'magic' numbers meaningful names, then use those meaningful names throughout the code.

Fixing Segmentation faults in C++

Compile your application with -g, then you'll have debug symbols in the binary file.
Use gdb to open the gdb console.
Use file and pass it your application's binary file in the console.
Use run and pass in any arguments your application needs to start.
Do something to cause a Segmentation Fault.
Type bt in the gdb console to get a stack trace of the Segmentation Fault.

debugging a program with segmentation fault

You should use realloc() to grow the dynamic arrays when you need to put a new element in them. In your code you're doing just one allocation at the start.

For instance, write this at the start (after you scanf() to get total_number_of_shelves and total_number_of_queries):

total_number_of_books = malloc(total_number_of_shelves * sizeof(int));
total_number_of_pages = malloc(total_number_of_shelves * sizeof(int*));

// TODO: you should initialize every element of the first array to `0`,
//   and every element of the second array to `NULL`

And when you need to put a new element inside, you use realloc():

total_number_of_books[x] += 1;

total_number_of_pages[x] = realloc(total_number_of_pages[x], total_number_of_books[x] * sizeof(int));

...
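
One caveat with the realloc() line above: assigning the result straight back to the same pointer loses the original array if realloc() fails. A sketch of a safer variant, with a hypothetical helper add_book() that is not part of the original program:

```c
#include <stdlib.h>

/* Appends `pages` to one shelf's array, growing it by one element.
   *pages_on_shelf may be NULL for an empty shelf (realloc then acts like
   malloc). Returns 0 on success, -1 if the allocation failed; on failure
   the existing array and count are left untouched. */
int add_book(int **pages_on_shelf, int *book_count, int pages)
{
    size_t new_count = (size_t)*book_count + 1;
    int *grown = realloc(*pages_on_shelf, new_count * sizeof **pages_on_shelf);
    if (grown == NULL) {
        return -1;
    }
    grown[*book_count] = pages;
    *pages_on_shelf = grown;
    *book_count += 1;
    return 0;
}
```

In the program above this would be called as add_book(&total_number_of_pages[x], &total_number_of_books[x], y), where y stands for the page count read from input. It relies on every shelf pointer having been initialized to NULL and every count to 0, as the TODO says.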
