Best practices for recovering from a segmentation fault
The best practice is to fix the original issue causing the core dump, recompile and then relaunch the application.
To catch these errors before deploying in the wild, do plenty of peer review and write lots of tests.
Why is a segmentation fault not recoverable?
When exactly does a segmentation fault happen (i.e., when is SIGSEGV sent)?
When you attempt to access memory you don’t have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The signal SIGSEGV is standardized, but different operating systems might implement it differently. "Segmentation fault" is mainly a term used on *nix systems; Windows calls it an "access violation".
Why is the process in undefined behavior state after that point?
Because one or more of the variables in the program didn’t behave as expected. Let’s say you have an array that is supposed to store a number of values, but you didn’t allocate enough room for all of them. Only those you allocated room for get written correctly; the rest are written out of bounds of the array and can end up holding any value. How is the OS to know how critical those out-of-bounds values are for your application to function? It knows nothing of their purpose.
Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause arbitrary behavior. Such bugs are often hard to track down. Stack overflows, for example, are bugs of this kind that are prone to overwriting adjacent variables, unless the error is caught by a protection mechanism.
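One way to keep an out-of-bounds write from happening in the first place is to route every store through a bounds check. The following is only a minimal sketch in C; `int_buffer`, `buffer_store`, `CAPACITY` and the `neighbour` field are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CAPACITY = 4 };           /* hypothetical fixed capacity */

struct int_buffer {
    int values[CAPACITY];
    int neighbour;               /* stands in for an unrelated variable stored nearby */
};

/* Store value at index only if the index is inside the array.
   Returns true on success, false if the write was refused. */
bool buffer_store(struct int_buffer *buf, size_t index, int value)
{
    assert(buf != NULL);
    if (index >= CAPACITY) {
        return false;            /* refuse instead of corrupting adjacent memory */
    }
    buf->values[index] = value;
    return true;
}
```

A caller that gets `false` back knows the value was never stored and can report the error, rather than continuing with silently corrupted state.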
If we look at the behavior of "bare metal" microcontroller systems without any OS or virtual memory features, just raw physical memory, they will silently do exactly as told: for example, overwrite unrelated variables and keep on going. That in turn could cause disastrous behavior if the application is mission-critical.
Why is it not recoverable?
Because the OS doesn’t know what your program is supposed to be doing.
Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.
Why does this solution avoid that unrecoverable state? Does it even?
That solution just ignores the error and keeps on going. It doesn’t fix the problem that caused it. It’s a very dirty patch, and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.
We have to realize that a segmentation fault is a symptom of a bug, not the cause.
How to recover from Segmentation fault in Python?
Call the third-party module in another process, so it doesn't crash the main one when a segfault occurs.
Recover from segfault in Python
I had some unreliable C extensions throw segfaults every once in a while and, since there was no way I was going to be able to fix that, what I did was create a decorator that would run the wrapped function in a separate process. That way you can stop segfaults from killing the main process.
Something like this:
https://gist.github.com/joezuntz/e7e7764e5b591ed519cfd488e20311f1
Mine was a bit simpler, and it did the job for me. Additionally it lets you choose a timeout and a default return value in case there was a problem:
#!/usr/bin/env python3
# std imports
import logging
import multiprocessing as mp


def parametrized(dec):
    """This decorator can be used to create other decorators that accept arguments"""
    def layer(*args, **kwargs):
        def repl(f):
            return dec(f, *args, **kwargs)
        return repl
    return layer


@parametrized
def sigsev_guard(fcn, default_value=None, timeout=None):
    """Used as a decorator with arguments.
    The decorated function will be called with its input arguments in another process.
    If the execution lasts longer than *timeout* seconds, it will be considered failed.
    If the execution fails, *default_value* will be returned.
    """
    def _fcn_wrapper(*args, **kwargs):
        q = mp.Queue()
        # NOTE: a lambda target only works with the "fork" start method;
        # "spawn" and "forkserver" need a picklable module-level function instead.
        p = mp.Process(target=lambda q: q.put(fcn(*args, **kwargs)), args=(q,))
        p.start()
        p.join(timeout=timeout)
        if p.is_alive():
            # Timed out: stop the child so it does not linger in the background.
            p.terminate()
            p.join()
        exit_code = p.exitcode
        if exit_code == 0:
            return q.get()
        logging.warning('Process did not exit correctly. Exit code: {}'.format(exit_code))
        return default_value
    return _fcn_wrapper
So you would use it like:
@sigsev_guard(default_value=-1, timeout=60)
def your_risky_function(a, b, c, d):
    ...
Coming back to life after Segmentation Violation
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <signal.h>
#include <stdlib.h>
#include <ucontext.h>

void safe_func(void)
{
    puts("Safe now ?");
    exit(0); //can't return to main, it's where the segfault occurred.
}

void
handler (int cause, siginfo_t * info, void *uap)
{
    /* For test only. Never call stdio functions in a signal handler otherwise. */
    printf ("SIGSEGV raised at address %p\n", info->si_addr);
    ucontext_t *context = uap;
    /* On my particular system, compiled with gcc -O2, the offending instruction
       generated for "*f = 16;" is 6 bytes. Let's try to set the instruction
       pointer to the next instruction (general register 14 is EIP, on linux x86) */
    context->uc_mcontext.gregs[14] += 6;
    //alternatively, try to jump to a "safe place"
    //context->uc_mcontext.gregs[14] = (unsigned int)safe_func;
}

int
main (int argc, char *argv[])
{
    struct sigaction sa;
    sa.sa_sigaction = handler;
    int *f = NULL;
    sigemptyset (&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    if (sigaction (SIGSEGV, &sa, 0)) {
        perror ("sigaction");
        exit(1);
    }
    //cause a segfault
    *f = 16;
    puts("Still Alive");
    return 0;
}
$ ./a.out
SIGSEGV raised at address (nil)
Still Alive
I would beat someone with a bat if I saw something like this in production code though; it's an ugly, for-fun hack. You'll have no idea whether the segfault has corrupted some of your data, you'll have no sane way of recovering and knowing that everything is OK now, and there's no portable way of doing this. The only mildly sane thing you could do is try to log an error (use write() directly, not any of the stdio functions; they're not signal safe) and perhaps restart the program. For those cases you're much better off writing a supervisor process that monitors a child process's exit, logs it and starts a new child process.
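The supervisor idea can be sketched roughly as follows, assuming a POSIX system with fork() and waitpid(). The names `supervise` and `flaky_worker` are hypothetical, and the worker simulates a crash with raise(SIGSEGV) purely for demonstration; a real supervisor would more likely exec a separate program.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo worker: crashes on its first two attempts, then succeeds. */
int flaky_worker(int attempt)
{
    if (attempt < 2) {
        raise(SIGSEGV);   /* simulate the segfault of an unreliable component */
    }
    return 0;
}

/* Run worker(attempt) in a child process and start a new child each time
   the previous one is killed by a signal, up to max_restarts restarts.
   Returns the number of crashes observed, or -1 if fork/waitpid failed. */
int supervise(int (*worker)(int attempt), int max_restarts)
{
    int crashes = 0;
    assert(worker != NULL);

    for (int attempt = 0; attempt <= max_restarts; attempt++) {
        fflush(NULL);     /* avoid duplicating buffered output in the child */
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            /* Child: do the risky work and report its status via the exit code. */
            _exit(worker(attempt));
        }

        int status = 0;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid");
            return -1;
        }
        if (WIFSIGNALED(status)) {
            /* Killed by a signal such as SIGSEGV: log it and start a new child. */
            fprintf(stderr, "child %ld killed by signal %d, restarting\n",
                    (long)pid, WTERMSIG(status));
            crashes++;
            continue;
        }
        break;            /* child exited on its own; stop supervising */
    }
    return crashes;
}
```

Because the crash happens in the child's address space, the parent's memory is untouched and it can log and restart safely. Calling `supervise(flaky_worker, 5)` would run the worker three times in this sketch.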
Segmentation fault in CS50 (2020) recovery program
when compiling, always enable the warnings, then fix those warnings.
gcc -ggdb3 -Wall -Wextra -Wconversion -pedantic -std=gnu11 -c "untitled1.c" -o "untitled1.o"
results in several warnings like:
untitled1.c:46:91: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]
if(buffer[0] == 0xff & buffer[1] == 0xd8 & buffer[2] == 0xff & (buffer[3] & 0xf0) == 0xe0)
Note: a single & is a bitwise AND. You really want a logical AND && for all but the last one in this statement.
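As a rough sketch of the corrected condition, the check could be moved into a small helper. The name `is_jpeg_header` is hypothetical, and `unsigned char` is used here in place of the posted code's BYTE type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the first four bytes look like a JPEG header:
   0xff 0xd8 0xff followed by a byte whose high nibble is 0xe. */
bool is_jpeg_header(const unsigned char *buffer)
{
    assert(buffer != NULL);
    return buffer[0] == 0xff &&
           buffer[1] == 0xd8 &&
           buffer[2] == 0xff &&
           (buffer[3] & 0xf0) == 0xe0;   /* the bitwise & is intended here */
}
```

The logical && also short-circuits, so the later bytes are only examined when the earlier ones already match.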
regarding:
FILE *forensic_image = fopen(argv[1],"r");
Always check (!=NULL) the returned value to assure the operation was successful. If not successful (==NULL) then call perror( "fopen failed" ); to output to stderr both your error message and the text reason the system thinks the error occurred.
regarding:
while(!feof(forensic_image))
please read: why "while( !feof(file) )" is always wrong
regarding:
FILE *forensic_image = fopen(argv[1],"r");
This is already done in the prior code block. There is absolutely no reason to do this again AND it will create problems in the code. Suggest: replacing:
if(fopen(argv[1],"r") == NULL)
{
printf("This image cannot be opened for reading\n");
return 1;
}
with:
if( (forensic_image = fopen(argv[1],"r") ) == NULL)
{
perror( "fopen for input file failed" );
exit( EXIT_FAILURE );
}
regarding:
BYTE *buffer = malloc( 512 * sizeof(BYTE) );
and later:
free( buffer );
This is a waste of code and resources. The project only needs one such instance. Suggest:
#define RECORD_LEN 512
and
unsigned char buffer[ RECORD_LEN ];
regarding:
fread(buffer, sizeof(BYTE), 512, forensic_image);
The function fread() returns a size_t. You should be assigning the returned value to a size_t variable and checking that value to assure the operation was successful. In fact, that statement should be in the while() condition.
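A minimal sketch of such a read loop, driven by the fread() return value instead of feof(), might look like this. The helper name `count_full_records` is hypothetical; RECORD_LEN is the constant suggested earlier.

```c
#include <assert.h>
#include <stdio.h>

#define RECORD_LEN 512

/* Count the complete RECORD_LEN-byte records in an open stream.
   The fread() return value drives the loop, so a short final read or an
   error ends it without processing stale buffer contents. */
long count_full_records(FILE *fp)
{
    unsigned char buffer[RECORD_LEN];
    long records = 0;
    assert(fp != NULL);

    while (fread(buffer, 1, RECORD_LEN, fp) == RECORD_LEN) {
        records++;   /* a real program would process buffer here */
    }
    if (ferror(fp)) {
        return -1;   /* the loop ended because of a read error, not end of file */
    }
    return records;
}
```

With an element size of 1, the return value is the number of bytes read, which makes a partial trailing block easy to detect.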
regarding:
sprintf(filename, "%03i.jpg", JPEG_num);
This results in undefined behavior and can result in a seg fault event because the pointer filename is initialized to NULL. Suggest: char filename[20]; to avoid that problem.
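One possible way to write this more defensively is to format into a caller-supplied array with snprintf(), which never writes past the given size. This is only a sketch; the helper name `make_jpeg_name` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Format the name of the n-th recovered file ("000.jpg", "001.jpg", ...)
   into a caller-supplied buffer. Returns false if it would not fit. */
bool make_jpeg_name(char *out, size_t out_size, int n)
{
    assert(out != NULL);
    int written = snprintf(out, out_size, "%03i.jpg", n);
    return written >= 0 && (size_t)written < out_size;
}
```

It would be called as `make_jpeg_name(filename, sizeof filename, JPEG_num)` with `char filename[20];` declared as suggested above.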
regarding:
else // If not first JPEG
{
fclose(jpeg0);
if you're (for instance) working with the 3rd file, then jpeg0 is already closed, resulting in a run time error. Suggest removing the statement: FILE *jpeg0; and always using jpegn
regarding:
else // If already found JPEG
{
fwrite(buffer, sizeof(BYTE), 512, jpegn);
}
on the first output file, jpegn is not set, so this results in a crash. Again, ONLY use jpegn for all output file operations.
regarding:
fwrite(buffer, sizeof(BYTE), 512, jpegn);
fwrite() returns the number of items (each of the size given by the second parameter) actually written, so this should be:
if( fwrite(buffer, sizeof(BYTE), 512, jpegn) != 512 ) { /* handle error */ }
the posted code contains some 'magic' numbers, like 512. 'magic' numbers are numbers with no stated basis. 'magic' numbers make the code much more difficult to understand, debug, etc. Suggest using an enum statement or #define statement to give those 'magic' numbers meaningful names, then use those meaningful names throughout the code.
Fixing Segmentation faults in C++
1. Compile your application with -g; then you'll have debug symbols in the binary file.
2. Use gdb to open the gdb console.
3. Use file and pass it your application's binary file in the console.
4. Use run and pass in any arguments your application needs to start.
5. Do something to cause a segmentation fault.
6. Type bt in the gdb console to get a stack trace of the segmentation fault.
debugging a program with segmentation fault
You should use realloc() to grow the dynamic arrays when you need to put a new element in them. In your code you're doing just one allocation at the start.
For instance, write this at the start (after you scanf() to get total_number_of_shelves and total_number_of_queries):
total_number_of_books = malloc(total_number_of_shelves * sizeof(int));
total_number_of_pages = malloc(total_number_of_shelves * sizeof(int*));
// TODO: you should initialize every element of the first array to `0`,
// and every element of the second array to `NULL`
And when you need to put a new element inside, you use realloc():
total_number_of_books[x] += 1;
total_number_of_pages[x] = realloc(total_number_of_pages[x], total_number_of_books[x] * sizeof(int));
...
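The growth step can be sketched with error checking as follows. The helper `add_book` and its parameter names are hypothetical; `books` and `pages` play the roles of total_number_of_books and total_number_of_pages, and `pages[x]` is assumed to start out as NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Append a book with page_count pages to shelf x.
   books[x] holds the number of books on shelf x, pages[x] their page counts.
   Returns 0 on success, -1 if memory could not be allocated (nothing changes). */
int add_book(int *books, int **pages, int x, int page_count)
{
    assert(books != NULL && pages != NULL);
    size_t new_count = (size_t)books[x] + 1;
    /* Use a temporary so the old block is not lost if realloc() fails. */
    int *grown = realloc(pages[x], new_count * sizeof *grown);
    if (grown == NULL) {
        return -1;
    }
    grown[new_count - 1] = page_count;
    pages[x] = grown;
    books[x] = (int)new_count;
    return 0;
}
```

Assigning the result of realloc() straight back to `pages[x]` would overwrite the only pointer to the old block with NULL on failure, which is why the sketch goes through `grown` first.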