Why does GDB prompt "Unexpected size of section '.reg-xstate/XXXXX' in core file."

Why does gdb complain that my core files are too small and then fail to produce a meaningful stack trace?

The Core Dump File Format

On a modern Linux system, core dump files are formatted using the ELF object file format, with a specific configuration.
ELF is a structured binary file format, with file offsets used as references between data chunks in the file.

  • Core Dump Files
  • The ELF object file format

For core dump files, the e_type field in the ELF file header will have the value ET_CORE.

Unlike most ELF files, core dump files make all their data available via program headers, and no section headers are present.
You may therefore choose to ignore section headers in calculating the size of the file, if you only need to deal with core files.

Calculating Core Dump File Size

To calculate the ELF file size:

  1. Consider all the chunks in the file, each described as (offset + size):

    • the ELF file header (0 + e_ehsize), where e_ehsize is 52 for ELF32 and 64 for ELF64
    • the program header table (e_phoff + e_phentsize * e_phnum)
    • the program data chunks (aka "segments") (p_offset + p_filesz)
    • the section header table (e_shoff + e_shentsize * e_shnum) - not required for core files
    • the section data chunks (sh_offset + sh_size) - not required for core files
  2. Eliminate any section headers with a sh_type of SHT_NOBITS, as these are merely present to record the position of data that has been stripped and is no longer present in the file (not required for core files).
  3. Eliminate any chunks of size 0, as they contain no addressable bytes and therefore their file offset is irrelevant.
  4. The end of the file will be the end of the last chunk, which is the maximum of the offset + size for all remaining chunks listed above.

If you find the offsets to the program header or section header tables are past the end of the file, then you will not be able to calculate an expected file size, but you will know the file has been truncated.

Although an ELF file could potentially contain unaddressed regions and be longer than the calculated size, in my limited experience the files have been exactly the size calculated by the above method.
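
The steps above can be sketched in C++ for the core-file case. This is a minimal sketch under stated assumptions, not what gdb does internally: the helper name expected_core_size is hypothetical, the file is assumed to be native-endian ELF64, the Linux <elf.h> header is assumed to be available, section headers are ignored (as permitted above for core files), and the PN_XNUM extension for files with 0xffff or more program headers is not handled.

```cpp
#include <elf.h>      // Elf64_Ehdr, Elf64_Phdr, ELFMAG (Linux system header)
#include <algorithm>  // std::max
#include <cstddef>
#include <cstdint>
#include <cstring>    // std::memcpy, std::memcmp
#include <optional>
#include <vector>

// Hypothetical helper (not part of gdb or any library).
// `image` holds the leading bytes of the file: at least the ELF header and
// the program header table. Returns the expected file size, or std::nullopt
// when the image is not ELF64 or ends before the program header table does.
std::optional<std::uint64_t> expected_core_size(const std::vector<unsigned char>& image) {
    Elf64_Ehdr eh;
    if (image.size() < sizeof eh) return std::nullopt;
    std::memcpy(&eh, image.data(), sizeof eh);
    if (std::memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        return std::nullopt;

    std::uint64_t end = eh.e_ehsize;  // the ELF file header: 0 + e_ehsize
    if (eh.e_phnum == 0) return end;  // zero-size table: its offset is irrelevant
    if (eh.e_phentsize < sizeof(Elf64_Phdr)) return std::nullopt;

    // Program header table: e_phoff + e_phentsize * e_phnum
    const std::uint64_t table_end =
        eh.e_phoff + std::uint64_t{eh.e_phentsize} * eh.e_phnum;
    if (table_end > image.size()) return std::nullopt;  // headers are cut off
    end = std::max(end, table_end);

    for (std::size_t i = 0; i < eh.e_phnum; ++i) {
        Elf64_Phdr ph;
        std::memcpy(&ph, image.data() + eh.e_phoff + i * eh.e_phentsize, sizeof ph);
        if (ph.p_filesz == 0) continue;                  // step 3: skip empty chunks
        end = std::max(end, ph.p_offset + ph.p_filesz);  // p_offset + p_filesz
    }
    return end;  // step 4: the end of the last chunk
}
```

If the value returned is larger than the actual size of the file on disk (for example as reported by std::filesystem::file_size), the file is shorter than its headers say it should be.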

Truncated Core Files

gdb likely performs a calculation similar to the above to calculate the expected core file size.

In short, if gdb says your core file is truncated, it is very likely truncated.

One of the most likely causes of truncated core dump files is the core file size limit (ulimit). It can be set on a system-wide basis in /etc/security/limits.conf, or for the current shell and the processes started from it using the ulimit shell command [footnote: I don't know anything about systems other than my own].

Try the command "ulimit -c" to check your effective core file size limit:

$ ulimit -c
unlimited
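
The same limit can also be read from inside a process. The following is a minimal sketch using the POSIX getrlimit call with RLIMIT_CORE; the helper names describe_limit and core_limit_description are hypothetical. getrlimit reports the limit in bytes, whereas the shell builtin may print it in blocks, so the two numbers need not match digit for digit.

```cpp
#include <sys/resource.h>  // getrlimit, RLIMIT_CORE, RLIM_INFINITY (POSIX)
#include <string>

// Hypothetical helper: turn a raw limit value into readable text.
std::string describe_limit(rlim_t value) {
    return value == RLIM_INFINITY ? std::string("unlimited")
                                  : std::to_string(value) + " bytes";
}

// Hypothetical helper: describe the soft core file size limit of this process.
// A soft limit of 0 means no core file is written at all.
std::string core_limit_description() {
    struct rlimit lim {};
    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return "unknown";  // getrlimit failed; errno holds the reason
    return describe_limit(lim.rlim_cur);
}
```

A crashing program inherits this limit from the process that started it, so the value seen in an interactive shell is not necessarily the one in effect for a service or a container.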

Also, it's worth noting that gdb doesn't actually refuse to operate because of the truncated core file. gdb still attempts to produce a stack backtrace and in your case only fails when it tries to access data on the stack and finds that the specific memory locations addressed are off the end of the truncated core file.

GDB empty backtrace when using gunicorn core dump

In another container I have built for debugging purposes, I've got this:

It's not clear what you mean by "I have built for debugging".

In general, the binary you use to analyze the core dump must match the binary which produced this core dump exactly.

That means you can't do this:

gcc -O2 -o foo t.c
./foo # crashes, produces core dump

gcc -g -o foo-g t.c # note lack of -O2
gdb ./foo-g core # will not work!

Instead, what you should do is this:

gcc -g -O2 -o foo-g t.c  # optimize with debug info
cp foo-g foo
strip -g foo # make a production binary by removing debug info

./foo # crashes, produces a core dump
gdb ./foo-g core # this works!

To test whether the two binaries are sufficiently "same", you can compare their symbols, e.g.

diff <(readelf -Ws foo-g) <(readelf -Ws foo)

The debug binary can have symbols not present in the stripped binary (such as LOCAL functions), but symbols that are present in both binaries must have the same value.

I am guessing that your "built for debugging" python3.5 is not the same as your "production" python3.5.

Why does boost split cause double free or corruption issue

So, boost::split doesn't crash. You have undefined behavior elsewhere.

Regardless, why are you parsing through a string, allocating a vector of strings, comparing to a temporary string etc. all the time? You could do this on the integer-domain.

Four takes. Starting from a simple skeleton:

#include <shared_mutex>
#include <string>

struct MyClass1 {
    MyClass1(uint32_t owner, std::string admins)
        : m_familyOwner(owner)
        , m_familyAdmins(std::move(admins)) {}

    bool hasFamilyAdminPermission(uint32_t uid) const;

  private:
    mutable std::shared_mutex m_mtx; // guards m_familyOwner and m_familyAdmins
    uint32_t m_familyOwner;
    std::string m_familyAdmins;
};

1. Comparing Ints, No Allocations

I'll use Boost Spirit X3:

#include <boost/spirit/home/x3.hpp>
bool MyClass1::hasFamilyAdminPermission(uint32_t uid) const {
    std::shared_lock mutex(m_mtx);
    if (uid == m_familyOwner)
        return true;

    bool matched = false;
    auto element = boost::spirit::x3::uint32;
    auto check = [uid, &matched](auto& ctx) {
        if (_attr(ctx) == uid) {
            matched = true;
            _pass(ctx) = false; // short circuit for perf
        }
    };

    parse(begin(m_familyAdmins), end(m_familyAdmins), element[check] % ',');
    return matched;
}

This still does quite a lot of work under the lock, but certainly never allocates. Also, it does early-out, which helps if the list of admins can be very large.

2. Comparing Text, But Without Allocations

With a nifty regex you can match the number as text on a constant string (or string view). The overhead here is the allocation(s) for the regex. But arguably, it's much simpler:

#include <regex>
bool MyClass2::hasFamilyAdminPermission(uint32_t uid) const {
    std::shared_lock mutex(m_mtx);
    if (uid == m_familyOwner)
        return true;

    return regex_search(m_familyAdmins, std::regex("(^|,)" + std::to_string(uid) + "(,|$)"));
}

3. Parse Once, At Construction

Why are we dealing with text? We could keep the admins in a set:

#include <set>
struct MyClass3 {
    MyClass3(uint32_t owner, std::string_view admins) : m_familyOwner(owner) {
        parse(admins.begin(), end(admins), boost::spirit::x3::uint32 % ',', m_familyAdmins);
    }
    bool hasFamilyAdminPermission(uint32_t uid) const;

  private:
    mutable std::shared_mutex m_mtx; // guards m_familyOwner and m_familyAdmins
    uint32_t m_familyOwner;
    std::set<uint32_t> m_familyAdmins;
};

bool MyClass3::hasFamilyAdminPermission(uint32_t uid) const {
    std::shared_lock mutex(m_mtx);
    return uid == m_familyOwner || m_familyAdmins.contains(uid);
}

That's even simpler. However, there's some overhead in the set which can be optimized.

4. Parse Once, No Allocations, Speed

std::set has the right semantics. However, for small sets it's sad that there's no locality of reference, and the node allocation overhead is relatively high. We could replace it with:

boost::container::flat_set< //
    uint32_t,               //
    std::less<>,            //
    boost::container::small_vector<uint32_t, 10>>
    m_familyAdmins;

This makes it so that sets <= 10 elements do not allocate at all, and lookup benefits from contiguous storage. However, at this rate - unless you want to deal with duplicate entries - you might keep a linear search and store:

boost::container::small_vector<uint32_t, 10>
    m_familyAdmins;

Combined Demo

Showing all the subtle edge cases. Note that only with the X3 parser

  • it will be easy to perform input validation on the comma-separated string
  • it will be easy to reliably compare differently formatted uid numbers

I snuck in one number that has a leading 0 (089 instead of 89) just to highlight this issue with the std::regex approach. Note that your original code has the same problem.
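
The leading-zero difference can be reproduced in isolation. The following is a minimal sketch, not the code from the demo: the function names contains_as_text and contains_as_int are hypothetical, and std::from_chars stands in for the X3 parser so that the example needs only the standard library.

```cpp
#include <charconv>     // std::from_chars
#include <cstdint>
#include <regex>
#include <string>
#include <string_view>
#include <system_error> // std::errc

// Text comparison, using the same pattern as the regex approach:
// the uid must appear digit for digit between separators.
bool contains_as_text(const std::string& list, std::uint32_t uid) {
    return std::regex_search(list, std::regex("(^|,)" + std::to_string(uid) + "(,|$)"));
}

// Integer comparison: every comma-separated field is converted to a number
// first, so "089" and "89" compare equal. Malformed fields are skipped.
bool contains_as_int(std::string_view list, std::uint32_t uid) {
    while (!list.empty()) {
        const auto comma = list.find(',');
        const auto field = list.substr(0, comma);
        const char* last = field.data() + field.size();
        std::uint32_t value = 0;
        const auto [ptr, ec] = std::from_chars(field.data(), last, value);
        if (ec == std::errc{} && ptr == last && value == uid)
            return true;
        if (comma == std::string_view::npos)
            break; // that was the last field
        list.remove_prefix(comma + 1);
    }
    return false;
}
```

With the list "21,377,34,233,55,089,144", contains_as_int finds 89 while contains_as_text does not, because the text "89" is preceded by "0" rather than by a comma or the start of the string.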

Live On Coliru/Compiler Explorer

#include <shared_mutex>
#include <string>

struct MyClass1 {
    MyClass1(uint32_t owner, std::string admins)
        : m_familyOwner(owner)
        , m_familyAdmins(std::move(admins)) {}

    bool hasFamilyAdminPermission(uint32_t uid) const;

  private:
    mutable std::shared_mutex m_mtx; // guards m_familyOwner and m_familyAdmins
    uint32_t m_familyOwner;
    std::string m_familyAdmins;
};

#include <boost/spirit/home/x3.hpp>
bool MyClass1::hasFamilyAdminPermission(uint32_t uid) const {
    std::shared_lock mutex(m_mtx);
    if (uid == m_familyOwner)
        return true;

    bool matched = false;
    auto element = boost::spirit::x3::uint32;
    auto check = [uid, &matched](auto& ctx) {
        if (_attr(ctx) == uid) {
            matched = true;
            _pass(ctx) = false; // short circuit for perf
        }
    };

    parse(begin(m_familyAdmins), end(m_familyAdmins), element[check] % ',');
    return matched;
}

struct MyClass2 {
    MyClass2(uint32_t owner, std::string admins)
        : m_familyOwner(owner)
        , m_familyAdmins(std::move(admins)) {}
    bool hasFamilyAdminPermission(uint32_t uid) const;

  private:
    mutable std::shared_mutex m_mtx; // guards m_familyOwner and m_familyAdmins
    uint32_t m_familyOwner;
    std::string m_familyAdmins;
};

#include <regex>
bool MyClass2::hasFamilyAdminPermission(uint32_t uid) const {
    std::shared_lock mutex(m_mtx);
    if (uid == m_familyOwner)
        return true;

    return std::regex_search(m_familyAdmins, std::regex("(^|,)" + std::to_string(uid) + "(,|$)"));
}

#include <set>
struct MyClass3 {
    MyClass3(uint32_t owner, std::string_view admins) : m_familyOwner(owner) {
        parse(admins.begin(), end(admins), boost::spirit::x3::uint32 % ',', m_familyAdmins);
    }
    bool hasFamilyAdminPermission(uint32_t uid) const;

  private:
    mutable std::shared_mutex m_mtx; // guards m_familyOwner and m_familyAdmins
    uint32_t m_familyOwner;
    std::set<uint32_t> m_familyAdmins;
};

bool MyClass3::hasFamilyAdminPermission(uint32_t uid) const {
    std::shared_lock mutex(m_mtx);
    return uid == m_familyOwner || m_familyAdmins.contains(uid);
}

#include <algorithm> // std::find
#include <boost/container/flat_set.hpp>
#include <boost/container/small_vector.hpp>
struct MyClass4 {
    MyClass4(uint32_t owner, std::string_view admins) : m_familyOwner(owner) {
        parse(admins.begin(), end(admins), boost::spirit::x3::uint32 % ',', m_familyAdmins);
    }
    bool hasFamilyAdminPermission(uint32_t uid) const;

  private:
    mutable std::shared_mutex m_mtx; // guards m_familyOwner and m_familyAdmins
    uint32_t m_familyOwner;
#ifdef LINEAR_SEARCH
    // likely faster with small sets, anyways
    boost::container::small_vector<uint32_t, 10> m_familyAdmins;
#else
    boost::container::flat_set< //
        uint32_t,               //
        std::less<>,            //
        boost::container::small_vector<uint32_t, 10>>
        m_familyAdmins;
#endif
};

bool MyClass4::hasFamilyAdminPermission(uint32_t uid) const {
    std::shared_lock mutex(m_mtx);
    return uid == m_familyOwner ||
#ifdef LINEAR_SEARCH
        // small_vector has no contains(), so scan it linearly
        std::find(m_familyAdmins.begin(), m_familyAdmins.end(), uid) != m_familyAdmins.end();
#else
        m_familyAdmins.contains(uid);
#endif
}

#include <iostream>
#include <tuple> // std::tie, std::tuple
int main() {
    MyClass1 const mc1{42, "21,377,34,233,55,089,144"};
    MyClass2 const mc2{42, "21,377,34,233,55,089,144"};
    MyClass3 const mc3{42, "21,377,34,233,55,089,144"};
    MyClass4 const mc4{42, "21,377,34,233,55,089,144"};

    std::cout << "uid\tdynamic\tregex\tset\tflat_set\n"
              << "\t(x3)\t-\t(x3)\t(x3)\n"
              << std::string(5 * 8, '-') << "\n";

    auto compare = [&](uint32_t uid) {
        std::cout << uid << "\t" << std::boolalpha
                  << mc1.hasFamilyAdminPermission(uid) << "\t"
                  << mc2.hasFamilyAdminPermission(uid) << "\t"
                  << mc3.hasFamilyAdminPermission(uid) << "\t"
                  << mc4.hasFamilyAdminPermission(uid) << "\n";
    };

    compare(42);
    // https://en.wikipedia.org/wiki/Fibonacci_number
    for (auto i = 3, j = 5; i < 800; std::tie(i, j) = std::tuple{j, i + j}) {
        compare(i);
    }
}

Prints

uid     dynamic regex   set     flat_set
        (x3)    -       (x3)    (x3)
----------------------------------------
42      true    true    true    true
3       false   false   false   false
5       false   false   false   false
8       false   false   false   false
13      false   false   false   false
21      true    true    true    true
34      true    true    true    true
55      true    true    true    true
89      true    false   true    true
144     true    true    true    true
233     true    true    true    true
377     true    true    true    true
610     false   false   false   false

Segmentation fault (core dumped) on tf.Session()

In the nvidia-smi output, the second GPU shows an ECC value of 2. This error shows up regardless of the CUDA or TF version, usually as a segfault and sometimes with the CUDA_ERROR_ECC_UNCORRECTABLE flag in the stack trace.

I came to this conclusion from this post:

"Uncorrectable ECC error" usually refers to a hardware failure. ECC is
Error Correcting Code, a means to detect and correct errors in bits
stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM
every once in a great while, but "uncorrectable ECC error" indicates
that several bits are coming out of RAM storage "wrong" - too many for
the ECC to recover the original bit values.

This could mean that you have a bad or marginal RAM cell in your GPU
device memory.

Marginal circuits of any kind may not fail 100%, but are more likely
to fail under the stress of heavy use - and associated rise in
temperature.

A reboot is usually supposed to clear the ECC error. If it does not, the only option seems to be replacing the hardware.


So, what did I do, and how did I finally fix the issue?

  1. I tested my code on a separate machine with an NVIDIA 1050 Ti,
    and my code executed perfectly fine.
  2. I made the code run only on the first card, for which the ECC
    value was normal, just to narrow down the issue. I did this by
    following this post and setting the
    CUDA_VISIBLE_DEVICES environment variable.
  3. I then requested a restart of the Tesla-K80 server to check
    whether a restart could fix this issue. It took a while, but the
    server was then restarted.

    Now the issue is gone and I can run both cards for my
    TensorFlow implementations.

ROS cv_bridge::toCvCopy fails with segmentation fault

I do not know why, but I solved this issue by replacing /opt/ros/kinetic/lib/libcv_bridge.so with libcv_bridge.so.0d from the libcv-bridge0d package.

Detailed procedure:

$ sudo apt-get -f install libcv-bridge0d
$ cd /opt/ros/kinetic/lib/
$ sudo mv libcv_bridge.so{,.trouble}
$ sudo ln -s /usr/lib/x86_64-linux-gnu/libcv_bridge.so.0d libcv_bridge.so

I also had similar trouble in Python, which was solved by this:

$ sudo apt-get -f install python-cv-bridge

