When Extending a Padded Struct, Why Can't Extra Fields Be Placed in the Tail Padding

When extending a padded struct, why can't extra fields be placed in the tail padding?
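The struct definitions from the original question are not reproduced here; based on the discussion below, they were presumably along these lines (a reconstruction, not a quote — sizes assume a typical ABI with 4-byte int):

```cpp
struct S1 { int a; char b; };          // tail padding after b (3 bytes on a typical LP64 ABI)
struct S2 { int a; char b; char c; };  // c fits where the padding was: same size as S1
struct S3 : S1 { char c; };            // C++ only: c is NOT placed in S1's tail padding
```

With common Itanium-ABI compilers, sizeof(S1) == sizeof(S2) but sizeof(S3) > sizeof(S1) — which is exactly the puzzle the answers below address.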

Short answer (for the C++ part of the question): the Itanium C++ ABI prohibits, for historical reasons, reusing the tail padding of a base subobject of POD type. Note that C++11 itself has no such prohibition: the relevant rule 3.9/2, which allows trivially-copyable types to be copied via their underlying representation, explicitly excludes base subobjects.


Long answer: I will try and treat C++11 and C at once.

  1. The layout of S1 must include padding, since S1::a must be aligned for int, and an array S1[N] consists of contiguously allocated objects of type S1, each of whose a member must be so aligned.
  2. In C++, objects of a trivially-copyable type T that are not base subobjects can be treated as arrays of sizeof(T) bytes (i.e. you can cast an object pointer to an unsigned char * and treat the result as a pointer to the first element of an unsigned char[sizeof(T)], and the values in this array determine the value of the object). Since all objects in C are of this kind, this explains S2 for both C and C++.
  3. The interesting cases remaining for C++ are:

    1. base subobjects, which are not subject to the above rule (cf. C++11 3.9/2), and
    2. any object that is not of trivially-copyable type.
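Point 2 above is what makes byte-wise copies of complete objects legal; a minimal sketch, using the presumed S2 from the question:

```cpp
#include <cstring>

struct S2 { int a; char b; char c; };  // trivially copyable

// Round-trips an S2 through its raw byte representation.
inline S2 roundtrip(const S2& src) {
    S2 dst{};
    // Copying all sizeof(S2) bytes (padding included) reproduces the value,
    // because src and dst are complete objects, not base subobjects.
    std::memcpy(&dst, &src, sizeof(S2));
    return dst;
}
```

For a base subobject the same byte-copy would be allowed to stomp on a derived class's data living in the tail padding — which is why the rule excludes base subobjects.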

For 3.1, there are indeed common, popular "base layout optimizations" in which compilers "compress" the data members of a class into the base subobjects. This is most striking when the base class is empty (∞% size reduction!), but it applies more generally. However, the Itanium C++ ABI mentioned above, which many compilers implement, forbids such tail-padding compression when the respective base type is POD (where POD means trivial and standard-layout).

For 3.2 the same part of the Itanium ABI applies, though I don't currently believe that the C++11 standard actually mandates that arbitrary, non-trivially-copyable member objects must have the same size as a complete object of the same type.


Previous answer kept for reference.

I believe this is because S1 is standard-layout, and so for some reason the S1-subobject of S3 remains untouched. I'm not sure if that's mandated by the standard.

However, if we turn S1 into non-standard layout, we observe a layout optimization:

struct EB { };

struct S1 : EB { // not standard-layout
    EB eb;
    int a;
    char b;
};

struct S3 : S1 {
    char c;
};

Now sizeof(S1) == sizeof(S3) == 12 on my platform.

And here is a simpler example:

struct S1 {
private:
    int a;
public:
    char b;
};

struct S3 : S1 {
    char c;
};

The mixed access makes S1 non-standard-layout. (Now sizeof(S1) == sizeof(S3) == 8.)

Update: The defining factor seems to be triviality as well as standard-layoutness, i.e. the class must be POD. The following non-POD standard-layout class is base-layout optimizable:

struct S1 {
    ~S1() {}
    int a;
    char b;
};

struct S3 : S1 {
    char c;
};

Again sizeof(S1) == sizeof(S3) == 8.

If I lay out the fields of my struct so they shouldn't need any padding, can a conforming C++ compiler add extra anyway?

You are correct that C++ may pad arbitrarily. From C++11 §9.2¶14:

Nonstatic data members of a (non-union) class with the same access control (Clause 11) are allocated so that later members have higher addresses within a class object. The order of allocation of non-static data members with different access control is unspecified (11). Implementation alignment requirements might cause two adjacent members not to be allocated immediately after each other; so might requirements for space for managing virtual functions (10.3) and virtual base classes (10.1).

C is also permitted to add padding bytes, so this is not peculiar to C++. From C11 §6.7.2.1¶15:

Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that increase in the order in which they are declared. A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa. There may be unnamed padding within a structure object, but not at its beginning.

If you want to avoid padding, the only maximally portable way is to pack the data structure yourself into contiguous memory (e.g., a vector) when sending, and unpack the serialized data into your data structure when receiving. Your compiler may also provide extensions to keep all members within your struct contiguous (e.g., GCC's packed attribute, or VC++'s pack pragma).
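A sketch of the compiler-extension route, using GCC/Clang's packed attribute (non-portable, and unaligned member access can be slow or even fault on some CPUs):

```cpp
struct Padded { int a; char b; };                          // typically sizeof == 8
struct __attribute__((packed)) Packed { int a; char b; };  // typically sizeof == 5 with GCC/Clang: all padding removed
```

MSVC spells the same idea as `#pragma pack(push, 1)` … `#pragma pack(pop)` around the struct definition.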

Inlining struct member

We must allow the member to be a potentially-overlapping subobject. There are two ways: (1) instead of using a member, inherit from Bar, or (2) use the [[no_unique_address]] attribute (C++20).

Technically, being potentially-overlapping is sufficient to allow the compiler to reuse the padding. Unfortunately, being allowed to reuse the padding doesn't guarantee the reuse, and in practice certain language implementations do not reuse the padding of trivially-copyable standard-layout types. We can work around this by making the subobject non-trivially-copyable or non-standard-layout.
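A sketch of both routes. Bar's members are invented for illustration, and the user-provided destructor is the workaround just described: it makes Bar non-trivially-copyable, so common Itanium-ABI implementations actually reuse the tail padding:

```cpp
struct Bar {       // hypothetical: members invented for illustration
    int a;
    char b;
    ~Bar() {}      // user-provided destructor: Bar is no longer trivially copyable (not POD)
};

struct ViaInheritance : Bar {       // route 1: Bar is now a base subobject
    char c;                         // common Itanium-ABI compilers place c in Bar's tail padding
};

struct ViaAttribute {               // route 2 (C++20)
    [[no_unique_address]] Bar bar;  // bar becomes a potentially-overlapping subobject
    char c;
};
```

On GCC and Clang, sizeof(ViaInheritance) == sizeof(Bar) here; with a trivially-copyable Bar the tail padding would not be reused.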

I can't modify [Bar]

Then making it a non-trivially-copyable or non-standard-layout type won't be possible, and therefore certain language implementations won't reuse the padding.

Is there a clever way of avoiding extra padding with nested classes in C++?

I explicitly rely on the permission to propose code which is as "dirty or bad looking" as anything. To be even clearer: I only provide an idea; you need to test it yourself and take responsibility for it yourself. I consider this question to explicitly allow untested code.

With this code:

typedef union
{
    struct
    {
        double d;    // 8 bytes
        bool b1;     // +1 byte (+7 bytes padding) = 16 bytes
    } nested;
    struct
    {
        double d;    // 8 bytes
        bool b1, b2; // +2 bytes (+6 bytes padding) = 16 bytes
    } packed;
} t_both;

I would expect the following attributes/features:

  • contains the substruct as potentially typedefed elsewhere (can be used from an included header file)
  • substruct accessible as XXX.nested.d and XXX.nested.b1
  • at the same address as XXX.packed
  • access via XXX.packed.b2 to what is considered padding within nested
  • both substructs have the same total size, which I hope means that even making arrays of this is OK

Whatever you do with this, it probably conflicts with the requirement that when writing and reading a union, all read accesses must be to the same member of the union as the most recent write. Writing one member and reading the other would hence not be strictly allowed. That is what I consider unclean about this code proposal. That said, I have often used this kind of union in environments for which the respective construct has explicitly been tested.

To illustrate, here is a functionally identical and equally unclean version, which better shows that the substruct can be typedefed elsewhere:


/* Inside an included header "whatever.h": */
typedef struct
{
    double d;    // 8 bytes
    bool b1;     // +1 byte (+7 bytes padding) = 16 bytes
} t_ExternDefedStruct;

/* Content of including file: */
#include "whatever.h"

typedef union
{
    t_ExternDefedStruct nested;
    struct
    {
        double d;    // 8 bytes
        bool b1, b2; // +2 bytes (+6 bytes padding) = 16 bytes
    } packed;
} t_both;

Is the memory layout of C++ single inheritance the same as this C code?

According to Adding a default constructor to a base class changes sizeof() a derived type and When extending a padded struct, why can't extra fields be placed in the tail padding?, the memory layout can change even if you just add a constructor to MyDerived or make it non-POD in any other way. So I am afraid there is no such guarantee. In practice you can make it work using compile-time asserts that validate both structures have the same memory layout, but such a solution is not safe by the letter of standard C++.

On another note, why can't your C++ wrapper MyDerived inherit from Derived? That would be safe (to the extent that casting Derived to Base and back is safe, but I assume that part is out of your control). It may make the initialization code in MyDerived::MyDerived() more verbose, but I guess that is a small price for a proper solution.
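A sketch of the compile-time-assert workaround (all names invented; it checks sizes and one offset, not the full layout, and is still not a standard guarantee):

```cpp
#include <cstddef>

// Hypothetical C-side layout:
struct CBase    { int x; };
struct CDerived { CBase base; int y; };

// Hypothetical C++-side wrapper using single inheritance:
struct CppDerived : CBase { int y; };

// If the ABI ever lays these out differently, compilation fails loudly
// instead of data being silently reinterpreted wrong at run time.
static_assert(sizeof(CppDerived) == sizeof(CDerived),
              "C and C++ layouts diverged");
static_assert(offsetof(CDerived, y) == sizeof(CBase),
              "unexpected padding in the C struct");
```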

Adding a default constructor to a base class changes sizeof() a derived type

GCC follows the Itanium ABI for C++, which prevents the tail-padding of a POD being used for storage of derived class data members.

Adding a user-provided constructor means that Foo is no longer POD, so that restriction does not apply to Bar.

See this question for more detail on the ABI.

Why can't a destructor be marked constexpr?

As per the draft, basic.types#10 defines which class types are literal types; the requirement is:

A possibly cv-qualified class type that has all of the following properties:

(10.5.1) - it has a trivial destructor,

(10.5.2) - it is either a closure type, an aggregate type, or has at least one constexpr constructor or constructor template (possibly inherited from a base class) that is not a copy or move constructor,

(10.5.3) - if it is a union, at least one of its non-static data members is of non-volatile literal type,

(10.5.4) - if it is not a union, all of its non-static data members and base classes are of non-volatile literal types.

Ques 1: Why a destructor cannot be marked as constexpr?

Because only trivial destructors qualify for constexpr. Following is the relevant section of the draft:

A destructor is trivial if it is not user-provided and if:

(5.4) — the destructor is not virtual,

(5.5) — all of the direct base classes of its class have trivial destructors, and

(5.6) — for all of the non-static data members of its class that are of class type (or array thereof), each such class has a trivial destructor.

Otherwise, the destructor is non-trivial.

Ques 2: If I do not provide a destructor, is the implicitly generated destructor constexpr?

Yes, because the implicitly generated destructor is trivial, so the class still qualifies as a literal type.
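For illustration: leaving the destructor implicit keeps the class trivial, so it remains a literal type usable in constant expressions (Point is an invented example):

```cpp
struct Point {
    int x, y;
    // no destructor declared: the implicitly generated one is trivial
};

constexpr Point p{3, 4};                         // OK: Point is a literal type
static_assert(p.x * p.x + p.y * p.y == 25, "");  // usable at compile time
```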

Ques 3: If I declare a defaulted destructor (~X() = default;), is it automatically constexpr?

Indeed: this destructor is user-declared but not user-provided, so it remains trivial and thus qualifies for constexpr.


I'm not able to find any direct reference stating that only trivial destructors qualify for constexpr, but if the destructor is not trivial then the class type is certainly not a literal type. So it is kind of implicit: constexpr requires literal types, and a class with a non-trivial destructor cannot be one (before C++20).


C++20 Update

Since C++20, user-defined destructors can also be constexpr under certain conditions.

dcl.constexpr/3:

The definition of a constexpr function shall satisfy the following
requirements:

  • its return type (if any) shall be a literal type;
  • each of its parameter types shall be a literal type;
  • it shall not be a coroutine ([dcl.fct.def.coroutine]);
  • if the function is a constructor or destructor, its class shall not have any
    virtual base classes;
  • its function-body shall not enclose ([stmt.pre])

    • a goto statement,
    • an identifier label ([stmt.label]),
    • a definition of a variable of non-literal type or of static or thread
      storage duration.
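A minimal sketch of a user-defined constexpr destructor under those rules (the Tracer name is invented; guarded so it only compiles as C++20 or later):

```cpp
#if __cplusplus >= 202002L
struct Tracer {
    int value;
    constexpr Tracer(int v) : value(v) {}
    constexpr ~Tracer() {}   // user-provided, yet constexpr: allowed in C++20
};

constexpr int use_tracer() {
    Tracer t{42};            // constructed AND destroyed during constant evaluation
    return t.value;
}
static_assert(use_tracer() == 42);
#endif
```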

How do I organize members in a struct to waste the least space on alignment?

(Don't apply these rules without thinking. See ESR's point about cache locality for members you use together. And in multi-threaded programs, beware false sharing of members written by different threads. Generally you don't want per-thread data in a single struct at all for this reason, unless you're doing it to control the separation with a large alignas(128). This applies to atomic and non-atomic vars; what matters is threads writing to cache lines regardless of how they do it.)


Rule of thumb: largest to smallest alignof(). There's nothing you can do that's perfect everywhere, but by far the most common case these days is a sane "normal" C++ implementation for a normal 32 or 64-bit CPU. All primitive types have power-of-2 sizes.

Most types have alignof(T) = sizeof(T), or alignof(T) capped at the register width of the implementation. So larger types are usually more-aligned than smaller types.

Struct-packing rules in most ABIs give struct members their absolute alignof(T) alignment relative to the start of the struct, and the struct itself inherits the largest alignof() of any of its members.

  • Put always-64-bit members first (like double, long long, and int64_t). ISO C++ of course doesn't fix these types at 64 bits / 8 bytes, but in practice on all CPUs you care about they are. People porting your code to exotic CPUs can tweak struct layouts to optimize if necessary.

  • then pointers and pointer-width integers: size_t, intptr_t, and ptrdiff_t (which may be 32 or 64-bit). These are all the same width on normal modern C++ implementations for CPUs with a flat memory model.

    Consider putting linked-list and tree left/right pointers first if you care about x86 and Intel CPUs. Pointer-chasing through nodes in a tree or linked list has penalties when the struct start address is in a different 4k page than the member you're accessing. Putting them first guarantees that can't be the case.

  • then long (which is sometimes 32-bit even when pointers are 64-bit, in LLP64 ABIs like Windows x64). But it's guaranteed at least as wide as int.

  • then 32-bit int32_t, int, float, enum. (Optionally separate int32_t and float ahead of int if you care about possible 8 / 16-bit systems that still pad those types to 32-bit, or do better with them naturally aligned. Most such systems don't have wider loads (FPU or SIMD) so wider types have to be handled as multiple separate chunks all the time anyway).

    ISO C++ allows int to be as narrow as 16 bits, or arbitrarily wide, but in practice it's a 32-bit type even on 64-bit CPUs. ABI designers found that programs designed to work with 32-bit int just waste memory (and cache footprint) if int was wider. Don't make assumptions that would cause correctness problems, but for "portable performance" you just have to be right in the normal case.

    People tuning your code for exotic platforms can tweak if necessary. If a certain struct layout is perf-critical, perhaps comment on your assumptions and reasoning in the header.

  • then short / int16_t

  • then char / int8_t / bool

  • (for multiple bool flags, especially if read-mostly or if they're all modified together, consider packing them with 1-bit bitfields.)

(For unsigned integer types, find the corresponding signed type in my list.)
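The bool-flags point from the list can be sketched as follows (the allocation unit and exact sizes are ABI-dependent):

```cpp
struct BoolFlags {                 // one byte per flag: 3 bytes of data
    bool dirty, visible, locked;
};

struct BitFlags {                  // all three flags share one allocation unit
    unsigned dirty   : 1;
    unsigned visible : 1;
    unsigned locked  : 1;
};
```

The bitfield version trades a little extra code on each access (masking/shifting) for a smaller footprint, which pays off for read-mostly flags.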

A multiple-of-8 byte array of narrower types can go earlier if you want it to. But if you don't know the exact sizes of types, you can't guarantee that int i + char buf[4] will fill an 8-byte aligned slot between two doubles. But it's not a bad assumption, so I'd do it anyway if there was some reason (like spatial locality of members accessed together) for putting them together instead of at the end.
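Putting the largest-to-smallest rule together (byte counts assume a typical x86-64 ABI; member names are invented):

```cpp
struct Unordered {   // declaration order forces padding
    char   a;        // 1 byte + 7 padding (to align d)
    double d;        // 8 bytes
    char   b;        // 1 byte + 3 padding (to align i)
    int    i;        // 4 bytes                   -> typically 24 bytes
};

struct Ordered {     // sorted by decreasing alignof()
    double d;        // 8 bytes
    int    i;        // 4 bytes
    char   a, b;     // 2 bytes + 2 tail padding  -> typically 16 bytes
};
```

Same members, one third less memory per object just from the declaration order.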

Exotic types: x86-64 System V has alignof(long double) = 16, but i386 System V has only alignof(long double) = 4, sizeof(long double) = 12. It's the x87 80-bit type, which is actually 10 bytes but padded to 12 or 16 so it's a multiple of its alignof, making arrays possible without violating the alignment guarantee.

And in general it gets trickier when your struct members themselves are aggregates (struct or union) with a sizeof(x) != alignof(x).

Another twist is that in some ABIs (e.g. 32-bit Windows if I recall correctly) struct members are aligned to their size (up to 8 bytes) relative to the start of the struct, even though alignof(T) is still only 4 for double and int64_t.

This is to optimize for the common case of separate allocation of 8-byte aligned memory for a single struct, without giving an alignment guarantee. i386 System V also has the same alignof(T) = 4 for most primitive types (but malloc still gives you 8-byte aligned memory because alignof(max_align_t) = 8). But anyway, i386 System V doesn't have that struct-packing rule, so (if you don't arrange your struct from largest to smallest) you can end up with 8-byte members under-aligned relative to the start of the struct.


Most CPUs have addressing modes that, given a pointer in a register, allow access to any byte offset. The max offset is usually very large, but on x86 it saves code size if the byte offset fits in a signed byte ([-128 .. +127]). So if you have a large array of any kind, prefer putting it later in the struct after the frequently used members. Even if this costs a bit of padding.

Your compiler will pretty much always make code that has the struct address in a register, not some address in the middle of the struct to take advantage of short negative displacements.


Eric S. Raymond wrote an article The Lost Art of Structure Packing. Specifically the section on Structure reordering is basically an answer to this question.

He also makes another important point:

9. Readability and cache locality

While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing. There are two more issues: readability and cache locality.

In a large struct that can easily be split across a cache-line boundary, it makes sense to put 2 things nearby if they're always used together. Or even contiguous to allow load/store coalescing, e.g. copying 8 or 16 bytes with one (unaligned) integer or SIMD load/store instead of separately loading smaller members.

Cache lines are typically 32 or 64 bytes on modern CPUs. (On modern x86, always 64 bytes. And Sandybridge-family has an adjacent-line spatial prefetcher in L2 cache that tries to complete 128-byte pairs of lines, separate from the main L2 streamer HW prefetch pattern detector and L1d prefetching).


Fun fact: Rust allows the compiler to reorder structs for better packing, or other reasons. IDK if any compilers actually do that, though. Probably only possible with link-time whole-program optimization if you want the choice to be based on how the struct is actually used. Otherwise separately-compiled parts of the program couldn't agree on a layout.


(@alexis posted a link-only answer linking to ESR's article, so thanks for that starting point.)


