In a SIGILL Handler, How to Skip the Offending Instruction

In a SIGILL handler, how can I skip the offending instruction?

It's very hacky and UNPORTABLE but:

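#define _GNU_SOURCE   /* glibc: exposes REG_RIP */
#include <signal.h>
#include <ucontext.h>

void sighandler(int signo, siginfo_t *si, void *data) {
    ucontext_t *uc = (ucontext_t *)data;

    int instruction_length = /* the length of the "instruction" to skip */;

    /* x86-64 Linux only: advance the saved instruction pointer */
    uc->uc_mcontext.gregs[REG_RIP] += instruction_length;
}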

Install the handler like this:

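struct sigaction sa, osa;
memset(&sa, 0, sizeof sa);     /* needs <string.h> */
sigemptyset(&sa.sa_mask);      /* block no extra signals while the handler runs */
sa.sa_flags = SA_ONSTACK | SA_RESTART | SA_SIGINFO;
sa.sa_sigaction = sighandler;
sigaction(SIGILL, &sa, &osa);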

That could work if you know how far to skip (and it's an x86-64 processor on Linux, since REG_RIP is specific to that platform) :-)
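
For a case where the length is known, here is a minimal sketch, assuming x86-64 Linux with glibc and GCC or Clang inline assembly. It deliberately executes ud2, whose encoding is the two bytes 0F 0B, and has the handler step over it. The names skip_ud2, ill_hits and run_ud2_and_skip are made up for this illustration, and on any other platform the function simply returns -1.

```c
#define _GNU_SOURCE /* glibc: exposes REG_RIP; must precede all includes */
#include <signal.h>
#include <string.h>
#include <ucontext.h>

/* Hypothetical counter: how many times the SIGILL handler ran. */
static volatile sig_atomic_t ill_hits;

#if defined(__linux__) && defined(__x86_64__)
static void skip_ud2(int signo, siginfo_t *si, void *data) {
    (void)signo;
    (void)si;
    ucontext_t *uc = (ucontext_t *)data;
    ill_hits++;
    /* ud2 is encoded as 0F 0B, so skipping it means advancing RIP by 2. */
    uc->uc_mcontext.gregs[REG_RIP] += 2;
}
#endif

/* Returns the number of SIGILLs that were skipped (expected: 1),
 * or -1 if the platform is unsupported or sigaction() failed. */
int run_ud2_and_skip(void) {
#if defined(__linux__) && defined(__x86_64__)
    struct sigaction sa, osa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = skip_ud2;
    if (sigaction(SIGILL, &sa, &osa) != 0)
        return -1;

    ill_hits = 0;
    /* Guaranteed-invalid opcode; the handler resumes right after it. */
    __asm__ volatile("ud2" ::: "memory");

    sigaction(SIGILL, &osa, NULL); /* restore the previous disposition */
    return (int)ill_hits;
#else
    return -1;
#endif
}
```

Calling run_ud2_and_skip() from main should return normally instead of killing the process. For an arbitrary faulting instruction the length is not fixed, so a real handler would need an instruction-length decoder, which is the hard part this sketch avoids.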

Can a C program continue execution after a signal is handled?

Yes, that's what signal handlers are for. But some signals need to be handled specially in order to allow the program to continue (e.g. SIGSEGV, SIGFPE, …).

See the manpage of sigaction:

According to POSIX, the behavior of a process is undefined after it ignores a SIGFPE, SIGILL, or SIGSEGV signal that was not
generated by kill(2) or raise(3). Integer division by zero has undefined result. On some architectures it will generate a
SIGFPE signal. (Also dividing the most negative integer by -1 may generate SIGFPE.) Ignoring this signal might lead to an
endless loop.

Right now, you are effectively ignoring the signal, because you do nothing to prevent it from happening again. You need the execution context in the signal handler and must fix it up manually, which involves overwriting some registers.

If SA_SIGINFO is specified in sa_flags, then sa_sigaction (instead of
sa_handler) specifies the signal-handling function for signum. This
function receives the signal number as its first argument, a pointer
to a siginfo_t as its second argument and a pointer to a ucontext_t
(cast to void *) as its third argument. (Commonly, the handler
function doesn't make any use of the third argument. See
getcontext(2) for further information about ucontext_t.)

The context allows access to the registers at the time of fault and needs to be changed to allow your program to continue. See this lkml post. As mentioned there, siglongjmp might also be an option. The post also offers a rather reusable solution for handling the error, without having to make variables global etc.:

And because you handle it yourself, you have any flexibility you want
to with error handling. For example, you can make the fault handler
jump to some specified point in your function with something like
this:

 __label__ error_handler;
 __asm__("divl %2"
         :"=a" (low), "=d" (high)
         :"g" (divisor), "c" (&&error_handler));
 ... do normal cases ...

 error_handler:
     ... check against zero division or overflow, do whatever you want to ..

Then, your handler for SIGFPE needs only to do something like

context.eip = context.ecx;

How can I tell whether SIGILL originated from an illegal instruction or from kill -ILL?

A bona fide SIGILL will have an si_code of one of the ILL_ values (e.g., ILL_ILLADR). A user-requested SIGILL will have an si_code of one of the SI_ values (often SI_USER).

The relevant POSIX values are:

[Kernel-generated]
ILL_ILLOPC  Illegal opcode.
ILL_ILLOPN  Illegal operand.
ILL_ILLADR  Illegal addressing mode.
ILL_ILLTRP  Illegal trap.
ILL_PRVOPC  Privileged opcode.
ILL_PRVREG  Privileged register.
ILL_COPROC  Coprocessor error.
ILL_BADSTK  Internal stack error.

[User-requested]
SI_USER     Signal sent by kill().
SI_QUEUE    Signal sent by the sigqueue().
SI_TIMER    Signal generated by expiration of a timer set by timer_settime().
SI_ASYNCIO  Signal generated by completion of an asynchronous I/O request.
SI_MESGQ    Signal generated by arrival of a message on an empty message queue.

For example, the recipe in this question gives me ILL_ILLOPN, whereas kill(1) and kill(2) give me zero (SI_USER).

Of course, your implementation might add values to the POSIX list. Historically, user- or process-generated si_code values were <= 0, and this is still quite common. Your implementation might have a convenience macro to assist here, too. For example, Linux provides:

#define SI_FROMUSER(siptr)      ((siptr)->si_code <= 0)
#define SI_FROMKERNEL(siptr)    ((siptr)->si_code > 0)

What happens if assembly code jumps to an address that contains a bad instruction?

bits is bits.

The processor cannot possibly know that an address points at a bad instruction. Processors are incredibly dumb. They do what they are told, what they are programmed to do. It is just like a train on tracks: if you happen to leave a gap in one or both rails, or the tracks do not line up, the train is probably going to crash. Or it might roll along upright until it hits a house or something.

The processor (ARM, Intel, etc.; it doesn't matter, the answer is the same) will take the next byte(s) it finds per its rules (linear execution, branching, etc.) and try to decode and execute them as an instruction. If those bytes are "bad" as in an invalid instruction, then some/many/most processors will raise an exception and do whatever the ISA defines for it (call an exception handler, hang, reset, etc.). If the bytes are bad as in not the instruction you intended, but the bit/byte pattern just happens to be a valid instruction, the processor will execute it, because processors are very, very dumb: they do what they are programmed to do, no exceptions.

So there is no wondering what will happen if... The processor will try to execute the bytes/bits it finds, as it does on every single instruction cycle, branch or no branch. If the encoded branch address violates the ISA, same answer: it will do whatever the ISA has defined for that fault.

Now on to disassemblers. For any variable-length instruction set (x86 definitely, but ARM with its mix of ARM, Thumb and Thumb-2 encodings is also a problem), assume you cannot disassemble reliably and assume the disassembly is bad. Put very, very little faith in instructions that look bad or that go off into the weeds (a bl to a bad place, for example: the bl disassembly itself may be the problem, not the destination).

The only good way to deal with a variable-length instruction set is to disassemble from a known-good entry point and in execution order, not linearly through memory. Even then, particularly with ARM but also with others, a good portion of the binary will remain undisassembled, because some execution paths cannot be determined statically; you have to actually execute or simulate the code, or examine it visually as a human, to find them.

Some disassemblers are worse than others, and some combinations of disassembler and instruction set produce unusable output. It is pretty easy to watch GNU objdump fail miserably on x86 code. If you know what you are doing, you can make the objdump output for x86 absolutely dreadful and not even remotely close to correct. The same goes for ARM with Thumb and Thumb-2, RISC-V, etc.
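
To make the linear-versus-execution-order point concrete, here is a toy sketch, not a real disassembler. It assumes a hypothetical length decoder, insn_len, that knows only five real x86 encodings (nop, ret, jmp rel8, mov r32, imm32 and ud2). The byte sequence used with it in the note below starts with a jump over one junk byte, 0xB8, which a linear sweep decodes as the start of a five-byte mov that swallows the ret.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of the instruction at p for a tiny x86 subset; 0 if unknown. */
static size_t insn_len(const uint8_t *p, size_t avail) {
    if (avail == 0) return 0;
    if (p[0] == 0x90 || p[0] == 0xC3) return 1;                  /* nop, ret */
    if (p[0] == 0xEB) return avail >= 2 ? 2 : 0;                 /* jmp rel8 */
    if (p[0] >= 0xB8 && p[0] <= 0xBF) return avail >= 5 ? 5 : 0; /* mov r32, imm32 */
    if (p[0] == 0x0F && avail >= 2 && p[1] == 0x0B) return 2;    /* ud2 */
    return 0;
}

/* Linear sweep: decode instructions back to back from offset 0.
 * Returns 1 if a ret is ever decoded, 0 otherwise. */
int linear_sweep_sees_ret(const uint8_t *code, size_t n) {
    size_t off = 0;
    while (off < n) {
        size_t len = insn_len(code + off, n - off);
        if (len == 0) break;
        if (code[off] == 0xC3) return 1;
        off += len;
    }
    return 0;
}

/* Execution order: start at entry and follow jmp rel8 targets.
 * Returns 1 if a ret is reached, 0 otherwise. The step limit keeps
 * a backwards jump from looping forever. */
int execution_order_sees_ret(const uint8_t *code, size_t n, size_t entry) {
    size_t off = entry;
    for (size_t steps = 0; steps < n && off < n; steps++) {
        size_t len = insn_len(code + off, n - off);
        if (len == 0) return 0;
        if (code[off] == 0xC3) return 1;
        if (code[off] == 0xEB) /* target = next instruction + signed rel8 */
            off = off + 2 + (size_t)(ptrdiff_t)(int8_t)code[off + 1];
        else
            off += len;
    }
    return 0;
}
```

With the bytes EB 01 B8 90 90 C3 90 90, the linear sweep decodes jmp, then a mov whose immediate covers 90 90 C3 90, then a nop, and never reports the ret. Following the jump from offset 0 lands on offset 3 and finds nop, nop, ret.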

Why is a segmentation fault not recoverable?

When exactly does segmentation fault happen (=when is SIGSEGV sent)?

When you attempt to access memory you don’t have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The signal SIGSEGV is standardized, but different operating systems might implement it differently. "Segmentation fault" is mainly a term used in *nix systems; Windows calls it an "access violation".

Why is the process in undefined behavior state after that point?

Because one or several of the variables in the program didn’t behave as expected. Let’s say you have an array that is supposed to store a number of values, but you didn’t allocate enough room for all of them. Only the values you allocated room for get written correctly; the rest are written out of bounds, and whatever is read back from there can hold any value. How exactly is the OS to know how critical those out-of-bounds values are for your application to function? It knows nothing of their purpose.

Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack buffer overflows, for example, are bugs of this kind that are prone to overwriting adjacent variables, unless the error is caught by protection mechanisms.
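
A small sketch of that kind of silent corruption, using made-up names (struct account, set_name_unchecked, set_name_checked). To keep the demonstration itself well-defined, the unchecked copy targets the whole struct and is clamped to its size, so it overruns the name field but never leaves the object. It assumes the common layout where balance sits directly after the 8-byte name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: balance is stored right after name. */
struct account {
    char name[8];
    int  balance;
};

/* Careless copy: trusts len and never checks it against the size of
 * the name field, so a long input spills into balance. No signal is
 * raised, because the write stays inside memory the program owns. */
void set_name_unchecked(struct account *acct, const char *src, size_t len) {
    if (len > sizeof *acct)
        len = sizeof *acct; /* clamp only so the demo stays in the object */
    memcpy(acct, src, len); /* name is at offset 0 of the struct */
}

/* Checked copy: rejects input that does not fit the name field.
 * Returns 0 on success, -1 if the input is too long. */
int set_name_checked(struct account *acct, const char *src, size_t len) {
    if (len > sizeof acct->name)
        return -1;
    memcpy(acct->name, src, len);
    return 0;
}
```

Passing a 12-byte input to set_name_unchecked leaves balance holding bytes from the name, with nothing to flag it, while set_name_checked refuses the same input. The corrupted balance is the bug; a later crash, if one happens at all, would only be a symptom.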

If we look at the behavior of "bare metal" microcontroller systems without any OS and no virtual memory features, just raw physical memory - they will just silently do exactly as told, for example overwriting unrelated variables and carrying on. That in turn could cause disastrous behavior if the application is mission-critical.

Why is it not recoverable?

Because the OS doesn’t know what your program is supposed to be doing.

Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.

Why does this solution avoid that unrecoverable state? Does it even?

That solution just ignores the error and keeps going. It doesn’t fix the problem that caused it. It’s a very dirty patch, and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.

We have to realize that a segmentation fault is a symptom of a bug, not the cause.