Debugging SIGBUS on x86 Linux

You can get a SIGBUS from an unaligned access if you turn on the unaligned access trap (the AC flag in EFLAGS), but that is normally off on x86. You can also get it from accessing a memory-mapped device if an error of some kind occurs.

Your best bet is using a debugger to identify the faulting instruction (SIGBUS is synchronous), and trying to see what it was trying to do.

Can I rule out that the SIGBUS is raised by a minor page fault? (The kernel log shows no allocation failure.)

Thanks everyone for your support. It was indeed a transient IO error. It seems the SIGBUS read-fault path doesn't necessarily log anything in the kernel log, unlike the cases I'm used to seeing for IO errors.

https://marc.info/?l=linux-ide&m=152232081917215&w=2

v4.15 intermittent errors on suspend/resume

To anyone waiting for the other shoe to drop on the SATA LPM work...

I've found something that's at least in the same area. It triggered a
fsck on my system 2 days ago. Evidence suggests it's occurred on many
other machines. I felt that was reason enough to give you a heads up
:).

I checked and I don't seem to have LPM enabled during runtime, even
when running on battery. My errors are all on suspend/resume, so
maybe that behaviour was changed at the same time?

It doesn't always show in kernel logs. What I first noticed was a
mysterious SIGBUS that kills Xwayland (and hence the entire Gnome
session) on resume from suspend. It surprised me to learn that this
SIGBUS can happen, without leaving anything like the read errors I'm
used to seeing in the kernel log!

My coredumps show the SIGBUS fault address is an instruction read
inside the program code of Xwayland. The backtraces vary along the
same call chain - the common factor is that they're always at the
first instruction of the function. I assume it varies according to
which page is not currently in-core, and hence triggers the failing
read request.

There are hundreds of backtraces along this same call chain from
other users, reported automatically to Fedora, that look the same.
At least so far we don't have any more plausible explanation for them. I admit
it's funny that Xwayland is so prominent, and I haven't been swamped
with SIGBUS in other processes, but I stand by this analysis.

These crashes started within 24 hours of Fedora upgrading to kernel
v4.15.

Fedora bug for the Xwayland SIGBUS:
https://bugzilla.redhat.com/show_bug.cgi?id=1553979

My duplicate bug I've been spamming with puzzled comments:
https://bugzilla.redhat.com/show_bug.cgi?id=1557682

The earliest and biggest of the many crash report buckets:

[2018-02-17] https://retrace.fedoraproject.org/faf/reports/2049964/

[315 reports] https://retrace.fedoraproject.org/faf/reports/2055378/

EXT4 filesystem error:

Mar 27 11:28:30 alan-laptop kernel: PM: suspend exit
...
Mar 27 11:28:30 alan-laptop kernel: EXT4-fs error (device dm-2):  ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0
Mar 27 11:28:30 alan-laptop kernel: Buffer I/O error on dev dm-2, logical block 0, lost sync page write

(this marked the FS as needing fsck on next boot)

More frequently, it logs these swap errors:

Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ...
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)

My laptop LPM status, even after removing AC power:

$ head /sys/class/scsi_host/host*/link_power_management_policy
==> /sys/class/scsi_host/host0/link_power_management_policy <==
max_performance

==> /sys/class/scsi_host/host1/link_power_management_policy <==
max_performance

My laptop is a Dell Latitude E5450. CPU is i5-5300U (a Broadwell).

Catch SIGBUS in C and C++

You can't use printf or cout in a signal handler. Nor can you call exit. You got lucky with printf this time, but you weren't as lucky with cout. If your program is in a different state maybe cout will work and printf won't. Or maybe neither, or both. Check the documentation of your operating system to see which functions are async-signal-safe (if such documentation exists at all, it is often very poor; on Linux, the signal-safety(7) man page lists them).

Your safest bet in this case is to call write to STDERR_FILENO directly and then call _exit (not exit, that one is unsafe in a signal handler). On some systems it's safe to call fprintf to stderr, but I'm not sure if glibc is one of them.

Edit: To answer your added question, you need to set up your signal handlers with sigaction to get the additional information. This example is as far as I'd go inside a signal handler; I included an alternative method if you want to go advanced. Notice that write is theoretically unsafe because it can clobber errno, but since we call _exit right afterwards it is safe in this particular case:

#include <stdlib.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>

void
bus_handler(int sig, siginfo_t *si, void *vuctx)
{
        char buf[2];
#if 1
        /*
         * I happen to know that si_code can only be 1, 2 or 3 on this
         * particular system, so we only need to handle one digit.
         */
        buf[0] = '0' + si->si_code;
        buf[1] = '\n';
        write(STDERR_FILENO, buf, sizeof(buf));
#else
        /*
         * This is a trick I sometimes use for debugging: this will
         * be visible in strace while not messing with external state too
         * much except breaking errno.
         */
        write(-1, NULL, si->si_code);
#endif
        _exit(1);
}

int
main(int argc, char **argv)
{
        struct sigaction sa;
        char *cptr;
        int *iptr;

        memset(&sa, 0, sizeof(sa));

        sa.sa_sigaction = bus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigfillset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

#if defined(__GNUC__)
# if defined(__i386__)
        /* Enable Alignment Checking on x86 */
        __asm__("pushf\norl $0x40000,(%esp)\npopf");
# elif defined(__x86_64__)
        /* Enable Alignment Checking on x86_64 */
        __asm__("pushf\norl $0x40000,(%rsp)\npopf");
# endif
#endif

        /* malloc() always provides aligned memory */
        cptr = (char*)malloc(sizeof(int) + 1);

        /* Increment the pointer by one, making it misaligned */
        iptr = (int *) ++cptr;

        /* Dereference it as an int pointer, causing an unaligned access */

        *iptr = 42;

        return 0;
}
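
The handler above prints si_code as a bare digit. As a minimal sketch of how that digit could be made readable, the variant below maps it to the POSIX constants BUS_ADRALN, BUS_ADRERR and BUS_OBJERR from <signal.h>. The names bus_code_name and bus_handler_named are hypothetical, not part of the original answer, and the sketch assumes strlen is acceptable inside a handler (recent POSIX editions list it as async-signal-safe).

```c
/* Ask for the POSIX.1-2008 API (siginfo_t, BUS_* codes). */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: map a SIGBUS si_code to a short description. */
const char *
bus_code_name(int code)
{
        switch (code) {
        case BUS_ADRALN:
                return "BUS_ADRALN (invalid address alignment)";
        case BUS_ADRERR:
                return "BUS_ADRERR (nonexistent physical address)";
        case BUS_OBJERR:
                return "BUS_OBJERR (object-specific hardware error)";
        default:
                return "unknown si_code";
        }
}

/* Same shape as bus_handler above: only write() and _exit() are called. */
void
bus_handler_named(int sig, siginfo_t *si, void *vuctx)
{
        const char *name = bus_code_name(si->si_code);

        (void)sig;
        (void)vuctx;
        /* The strings are literals, so nothing is allocated or formatted. */
        write(STDERR_FILENO, name, strlen(name));
        write(STDERR_FILENO, "\n", 1);
        _exit(1);
}
```

It would be installed exactly like bus_handler, by assigning it to sa.sa_sigaction with SA_SIGINFO set.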

How to catch data-alignment faults on x86 (aka SIGBUS on Sparc)

To expand on Vokuhila-Oliba's answer: judging by the "SOF Mis-aligned pointers on x86." thread, gcc can generate code that performs misaligned memory accesses. AFAIK you don't have any control over this.

Enabling alignment checks on gcc compiled code would be a bad idea. You risk getting SIGBUS errors for good C code.

Can invalid Read/Write cause SIGBUS Error?

int *iptr = (int *) ++cptr; 
*iptr = 42; //SIGBUS

violates multiple parts of the C standard.

You're running afoul of 6.3.2.3 Pointers, paragraph 7:

A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

as well as violating the strict-aliasing rule of 6.5 Expressions, paragraph 7:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

a type compatible with the effective type of the object,

a qualified version of a type compatible with the effective type of the object,

a type that is the signed or unsigned type corresponding to the effective type of the object,

a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,

an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or

a character type.

Per the Valgrind documentation for Memcheck:

4.1. Overview

Memcheck is a memory error detector. It can detect the following
problems that are common in C and C++ programs.

Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.

Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.

Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[].

Overlapping src and dst pointers in memcpy and related functions.

Passing a fishy (presumably negative) value to the size parameter of a memory allocation function.

Memory leaks.

Note that your code

int *iptr = (int *) ++cptr; 
*iptr = 42; //SIGBUS

does none of the things Valgrind claims to detect. You're not accessing memory you don't have permission to access, nor are you accessing memory outside the bounds of the region you created with malloc(). You haven't free()'d the memory yet. You have no uninitialized variables, you're not double-free()ing memory, nor are you using memcpy() improperly with overlapping source and destination regions. And you're not passing negative/"fishy" sizes to allocation functions. And you're not leaking any memory.

So, no, Valgrind doesn't even claim to be able to detect code that will cause a SIGBUS.
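
Since Memcheck won't flag it, one option is to check the pointer by hand before the cast. This is only a sketch: is_aligned_for_int is a hypothetical helper name, it relies on C11 _Alignof, and it assumes that converting a pointer to uintptr_t preserves the low address bits, which holds on mainstream platforms but is implementation-defined.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if p is suitably aligned for an int. */
int
is_aligned_for_int(const void *p)
{
        /* Assumes the integer value of the pointer is its address. */
        return ((uintptr_t)p % _Alignof(int)) == 0;
}
```

With the code from the question, is_aligned_for_int(cptr) would be expected to hold straight after malloc() and to fail after ++cptr.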

Structure assignment in Linux fails in ARM but succeeds in x86

You said it yourself: there are memory alignment restrictions on your particular processor, and buffer is not suitably aligned for reads wider than a byte. The assignment is probably compiled into three moves of larger entities.

memcpy() has no alignment restrictions: it has to be able to copy between any two addresses, so it does whatever is needed to implement that. A common pattern is to copy byte-by-byte until the addresses are aligned.
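
To make the memcpy() route concrete, here is a minimal sketch. The question's struct layout isn't shown here, so foo below is a hypothetical stand-in with three 32-bit members, and read_foo is an illustrative name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the question's struct. */
typedef struct {
        uint32_t a;
        uint32_t b;
        uint32_t c;
} foo;

/* Copy a foo out of a buffer that may start at any byte address. */
foo
read_foo(const void *buffer)
{
        foo out;

        /* memcpy() makes no alignment assumption about buffer. */
        memcpy(&out, buffer, sizeof(out));
        return out;
}
```

The copy lands in a properly aligned local, so every later member access is an ordinary aligned load.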

As an aside, I find it clearer to write your code without array indexing:

extern const void *buffer;
const foo my_foo = *(const foo *) buffer;

What could cause std::difftime to create a SIGBUS crash?

There are only a few reasons that a program may receive SIGBUS on Linux. Several are listed in answers to this question.

Look in /var/log/messages around the time of the crash. You will likely find that there was a disk failure, or some other cause for kernel unhappiness.

Another (unlikely) possibility is that someone updated libstdc++.so.6 while your program was running, and did so incorrectly (by writing over the existing file, rather than removing it and creating a new file in its place).
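
The in-place overwrite case comes down to a mapped file shrinking underneath a running process, for example if the writer truncates it first. The sketch below reproduces that with a scratch file rather than a real library. The names sigbus_after_truncate and on_bus and the /tmp path are assumptions made for illustration, and the sketch relies on siglongjmp out of a SIGBUS handler, which works on Linux but is worth double-checking elsewhere.

```c
/* Ask for the POSIX.1-2008 API (mkstemp, siginfo_t, sigsetjmp). */
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_jmp;
static volatile sig_atomic_t bus_code;

/* Record si_code and jump back to the sigsetjmp() point. */
static void
on_bus(int sig, siginfo_t *si, void *vuctx)
{
        (void)sig;
        (void)vuctx;
        bus_code = si->si_code;
        siglongjmp(bus_jmp, 1);
}

/*
 * Map one page of a scratch file, truncate the file to zero length,
 * then read through the mapping. Returns the si_code of the SIGBUS,
 * 0 if no SIGBUS arrived, or -1 if the setup failed.
 */
int
sigbus_after_truncate(void)
{
        char path[] = "/tmp/sigbus-demo-XXXXXX";
        struct sigaction sa, old;
        long page = sysconf(_SC_PAGESIZE);
        volatile char *map;
        int result = 0;
        int fd;

        fd = mkstemp(path);
        if (fd < 0)
                return -1;
        unlink(path);           /* the open descriptor keeps the file alive */
        if (ftruncate(fd, page) != 0) {
                close(fd);
                return -1;
        }
        map = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                close(fd);
                return -1;
        }

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_bus;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, &old);

        if (ftruncate(fd, 0) == 0) {
                if (sigsetjmp(bus_jmp, 1) == 0)
                        (void)map[0];   /* page is now past end of file */
                else
                        result = bus_code;
        }

        sigaction(SIGBUS, &old, NULL);  /* restore the previous handler */
        munmap((void *)map, (size_t)page);
        close(fd);
        return result;
}
```

On Linux I would expect the return value to be BUS_ADRERR. Nothing needs to appear in the kernel log for this to happen, which fits the advice to consider this cause even when the logs look clean.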
