Debugging Core Files Generated on a Customer's Box

What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?

If the executable is dynamically linked, as yours is, the stack trace GDB produces will (most likely) not be meaningful.

The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. It therefore looks into your copy of libc.so.6, discovers that this address falls in select, and prints that.

But the chances that 0x00454ff1 is also in select in your customer's copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.

You can use disas select and observe that 0x00454ff1 is either in the middle of an instruction, or that the previous instruction is not a CALL. If either of these holds, your stack trace is meaningless.
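
A minimal sketch of that check (the address is the one from the example above; the -5 assumes an x86 near CALL, which is 5 bytes long):

(gdb) disas select                # does 0x00454ff1 land on an instruction boundary?
(gdb) x/i 0x00454ff1 - 5          # if it does: is the instruction just before it a CALL?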

You can, however, help yourself: you just need to get a copy of all the libraries listed in (gdb) info shared from the customer's system. Have the customer tar them up with e.g.

cd /
tar cvzf to-you.tar.gz lib/libc.so.6 lib/ld-linux.so.2 ...

Then, on your system:

mkdir /tmp/from-customer
tar xzf to-you.tar.gz -C /tmp/from-customer
gdb /path/to/binary
(gdb) set solib-absolute-prefix /tmp/from-customer
(gdb) core core # Note: very important to set solib-... before loading core
(gdb) where # Get meaningful stack trace!

We then advise the customer to run a -g binary so it becomes easier to debug.

A much better approach is:

  • build with -g -O2 -o myexe.dbg
  • strip -g myexe.dbg -o myexe
  • distribute myexe to customers
  • when a customer gets a core, use myexe.dbg to debug it

You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
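
A minimal sketch of that workflow (gcc and main.c are placeholders for your actual build):

$ gcc -g -O2 main.c -o myexe.dbg   # full debug build
$ strip -g myexe.dbg -o myexe      # debug info stripped; ship this one
$ gdb myexe.dbg core               # customer's core, your debug binary

Since strip -g removes only the debugging sections, myexe and myexe.dbg contain identical code, so the customer's core matches both.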

How do I debug a core dump that aborted in a dlopen()'ed plugin?

How do I tell gdb where the plugin was loaded, so it can figure out how to show the source and data?

GDB should do that automatically (the load addresses are contained inside the core).

All you need to do is supply the binaries that match the customer's environment exactly. See also this answer.
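
As a quick sanity check (a sketch; /path/to/binary stands in for your executable):

$ gdb /path/to/binary core
(gdb) info sharedlibrary          # the dlopen()'ed plugin should be listed at its load address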

Some basic questions about debugging core files (C++/Linux)

The -g option has nothing to do with core files; it puts debug information into the program. That is, the generated executable will contain all symbols (e.g. function and variable names) as well as line-number information (so you can find out which line a crash occurred on).

The actual core dump contains only a memory dump. Together with the program, you can get a stack trace, but unless the program has debug information you cannot see function names or line numbers, only their addresses.
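
Putting the two together, a minimal sketch (bash assumed; prog.cpp is a placeholder):

$ g++ -g -o prog prog.cpp   # debug info goes into the executable
$ ulimit -c unlimited       # allow the kernel to write a core file
$ ./prog                    # crashes and dumps core
$ gdb prog core
(gdb) bt                    # backtrace with function names and line numbers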

Using gdb to analyze a core dump generated by an Erlang application

This GDB was configured as "x86_64-apple-darwin15.4.0".
"/Users/sad/projects/core" is not a core dump: File format not recognized

$ file core

/Users/sad/projects/core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), ...

Mac OS does not use the ELF file format. We can safely assume that this core came from some other system, not the one you are trying to analyse it on.

It is still possible to analyse that core on the Mac OS system, but you need:

  1. a cross-gdb (i.e. one that can run on a Mac OS host but can deal with ELF files for your target; it is likely that you'll have to build such a GDB yourself; see the sketch below), and
  2. (unless you have a fully static executable) a complete set of shared libraries from the host on which the crash happened. See this answer.

In general, it is much easier to do the post-mortem analysis on the host where the crash happened.
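
For point 1, a sketch of building such a cross-gdb from an unpacked GDB source tree (the target triple is an assumption matching the x86-64 Linux core above):

$ ./configure --target=x86_64-linux-gnu   # Mac OS-hosted gdb for a Linux x86-64 target
$ make
$ ./gdb/gdb /path/to/exe core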

Why does gdb complain that my core files are too small and then fail to produce a meaningful stack trace?

The Core Dump File Format

On a modern Linux system, core dump files are formatted using the ELF object file format, with a specific configuration.
ELF is a structured binary file format, with file offsets used as references between data chunks in the file.

  • Core Dump Files
  • The ELF object file format

For core dump files, the e_type field in the ELF file header will have the value ET_CORE.

Unlike most ELF files, core dump files make all their data available via program headers, and no section headers are present.
You may therefore choose to ignore section headers in calculating the size of the file, if you only need to deal with core files.

Calculating Core Dump File Size

To calculate the ELF file size:

  1. Consider all the chunks in the file, each described as (offset + size):

    • the ELF file header: (0 + e_ehsize) (52 for ELF32, 64 for ELF64)
    • the program header table: (e_phoff + e_phentsize * e_phnum)
    • the program data chunks (aka "segments"): (p_offset + p_filesz)
    • the section header table: (e_shoff + e_shentsize * e_shnum) - not required for core files
    • the section data chunks: (sh_offset + sh_size) - not required for core files
  2. Eliminate any section headers with a sh_type of SHT_NOBITS, as these are merely present to record the position of data that has been stripped and is no longer present in the file (not required for core files).
  3. Eliminate any chunks of size 0, as they contain no addressable bytes and therefore their file offset is irrelevant.
  4. The end of the file will be the end of the last chunk, which is the maximum of the offset + size for all remaining chunks listed above.

If you find the offsets to the program header or section header tables are past the end of the file, then you will not be able to calculate an expected file size, but you will know the file has been truncated.

Although an ELF file could potentially contain unaddressed regions and be longer than the calculated size, in my limited experience the files have been exactly the size calculated by the above method.
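
As a sketch, you can approximate this calculation with readelf, since in a core file the program data chunks dominate (GNU readelf and gawk's strtonum are assumed; the small ELF header and program header table chunks at the start of the file are ignored here):

$ readelf -l --wide core | gawk '
    $1 == "LOAD" || $1 == "NOTE" {
        end = strtonum($2) + strtonum($5)    # p_offset + p_filesz
        if (end > max) max = end
    }
    END { print "expected size:", max }'
$ stat -c %s core    # compare with the actual size on disk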

Truncated Core Files

gdb likely performs a calculation similar to the above to calculate the expected core file size.

In short, if gdb says your core file is truncated, it is very likely truncated.

One of the most likely causes of truncated core dump files is the system ulimit. This can be set on a system-wide basis in /etc/security/limits.conf, or on a per-user basis using the ulimit shell command (I don't know anything about systems other than my own).

Try the command "ulimit -c" to check your effective core file size limit:

$ ulimit -c
unlimited
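
If it prints a small number instead, you can raise the limit for the current shell before reproducing the crash (bash/zsh syntax; a hard limit configured in limits.conf may still cap it):

$ ulimit -c unlimited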

Also, it's worth noting that gdb doesn't actually refuse to operate because of the truncated core file. gdb still attempts to produce a stack backtrace, and in your case it only fails when it tries to access data on the stack and finds that the specific memory locations addressed are off the end of the truncated core file.

gdb symbols loaded but no symbols shown for seg fault

Is the problem the fact that the core was created on another system?

Yes, exactly.

See this answer for possible solutions.

Update:

So does this mean I can only debug the program on the system where it is both built and crashes?

It is certainly not true that you can only debug a core on the system where the binary was both built and crashed -- I debug core dumps from different systems every day, and in my case the build host, the host where the program crashed, and the host on which I debug are all separate.

One thing I just noticed: your style of loading the core (gdb -c core followed by symbol-file) doesn't work for PIE executables (at least when using GDB 10.0); this may be a bug in GDB.

The "regular" way of loading the core is:

gdb ./ovcc core

See if that gives you better results. (You still need to arrange for matching DSOs, as the linked answer shows.)
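
Combining this with the first answer above, a sketch (set sysroot is the current name for solib-absolute-prefix; /tmp/from-customer is the directory used there):

$ gdb ./ovcc
(gdb) set sysroot /tmp/from-customer
(gdb) core-file core
(gdb) where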


