Optimizing Member Variable Order in C++

Two issues here:

  • Whether and when keeping certain fields together is an optimization.
  • How to actually do it.

The reason it might help is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking, the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.

The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.

The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.

Please note that there are some gotchas here. If you are using CPU-level atomic operations (which the atomic types in C++11 generally do), then you may find that the CPU locks the entire cache line in order to lock the field. So if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, all those atomic operations end up serialised, because they all lock the same memory location even though they're operating on different fields. Had they been on different cache lines, they would have worked in parallel and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day; the effect is known as "false sharing". So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on the grounds that cache misses are usually a source of lost speed, but be horribly wrong in your particular case.

Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order in which they are declared. This is to make low-level programming easier. So, if your object contains:

  • an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)
  • followed by an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)

then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.

In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++: adjacent objects in memory. Had the struct been int, int, char, char, the same object could have been 12 bytes, because char has no alignment requirement (only 2 bytes of tail padding are needed at the end).

I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.

The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next largest, and so on down to members with no alignment requirement. For example, if I'm trying to write portable code I might come up with this:

struct some_stuff {
    double d;    // I expect double is 64-bit IEEE, it might not be
    uint64_t l;  // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i;  // 4 bytes, usually 4-aligned
    int32_t j;   // same
    short s;     // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];   // array of 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;      // 1 byte, any alignment
};

If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.

Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.

What is the optimal order of members in a class?

As a rule of thumb you should order members by size, largest to smallest. This creates the minimum padding in the structure, thus minimising the structure size. That matters most if objects of the structure live in contiguously allocated memory, e.g. in a vector, and it helps a lot with the cache (more objects fit in the same cache line).

Another surprising optimization, demonstrated by Andrei Alexandrescu (I think it was in one of his CppCon talks), is to put the most frequently accessed member first. This is faster because its offset is zero. Of course, he was talking about micro-optimization, after benchmarking the hell out of the application and caring about every last scrap of performance.

Does member order make a performance difference in Java like in C or C++?

int is always 32-bit in Java, and references are usually 32-bit even in 64-bit JVMs (thanks to compressed oops).

On the down side, Java has an 8-12 byte header at the start of each object and uses 8-byte alignment. BTW, some C++ environments use 16-byte alignment.

Are unboxed types (such as int and boolean) the same size as references or smaller?

You can expect boolean, byte, char and short to be smaller than a reference, but long and double can be larger.

And if they're smaller, is the compiler allowed to reorder them to avoid inserting padding to align subsequent fields?

The JIT can re-organise the fields or even optimise them away.

Finally, if it is, do any compilers do this?

The javac compiler does next to no optimisation, and looking at the byte code gives you few clues as to what will happen at runtime. The JIT can lay out the fields in an object any way it chooses.

I'm just curious to know if I should bear this in mind when choosing what order to declare my fields, like I do in C.

IMHO, you can assume that just about every optimisation trick you might have used in C no longer applies in Java. Of the few that do, they may not work exactly the same way.

You should assume the JIT will optimise the code as required and use profilers to determine if and when you have a problem. Only then consider altering the code for performance reasons.

When are static member variables optimized away?

The definition of a static data member of a class template specialization is implicitly instantiated only if it is used in such a way that a definition would be required.

For the class B you are, unconditionally, defining the default constructor. The default constructor uses the default constructor of Super<B> to initialize the base, meaning that the definition of the Super<B>::Super() constructor will be implicitly instantiated. This constructor's definition is odr-using m in (void)m; and therefore Super<B>::m's definition will also be implicitly instantiated.

In the case of class A, you are not explicitly defining any constructor. The implicit special member functions will be defined only when they are used in such a way that a definition would be required. In the line A a {}; you are calling the implicit default constructor of A and hence it will be defined. The definition will be calling the default constructor of Super<A> as before, requiring the Super<A>::m's definition to be instantiated. Without A a {}; there is nothing in the code requiring a definition of any special member function of A or the default constructor of Super<A> or the definition of m. Therefore none of them will be defined.

In the case of C, there is no template for which we would need to consider instantiation. C::m is explicitly defined.


Given that the static data member is defined, it must (generally) be initialized eventually. All of the inline static data members here have dynamic initialization with observable side effects, so the initialization must happen at runtime. It is implementation-defined whether they will be initialized before main's body starts execution or whether initialization will be deferred up to the first non-initialization odr-use of the inline static data member. (This is meant to allow for dynamic libraries.)

You aren't actually non-initialization odr-using any of the inline static data members, so it is implementation-defined whether they will actually be initialized at all. If the implementation does define the initialization to not be deferred, then all of these inline static data members which have been defined will also be initialized before main is entered.

The order in which the initializations will happen is indeterminate. The static data members of the class template specializations have unordered initialization, meaning they have no ordering guarantees with any other dynamic initialization. And there is only one static data member which isn't specialized from a template and that one is inline and therefore only partially ordered, although there is nothing else it could be ordered with.

Actually, there is one additional static storage duration object which will be initialized here, a global variable of type std::ios_base::Init included through <iostream>. The initialization of this variable causes the initialization of the standard streams (std::cout, etc.). Because your inline static data members from the templates have unordered initialization, they will not be ordered with this initialization. Similarly if you had multiple translation units containing C::m, it would also not be ordered with it. As a consequence you might be using std::cout before it is initialized, causing undefined behavior. You can cause early initialization of the standard streams by constructing an object of type std::ios_base::Init:

#include <iostream>

class X {
public:
    X(const char* s) {
        [[maybe_unused]] std::ios_base::Init ios_base_init;
        std::cout << s << "\n";
    }
};

Aside from considerations such as the above, the compiler is not allowed to remove static data members if their initialization has observable side effects. Of course the as-if rule still applies as always, meaning that the compiler can emit whatever machine instructions result in the same observable behavior as described above.


For practical purposes you should also be careful: there are some compiler and linker flags, sometimes used for code-size optimization, which will eliminate dynamic initialization if the variable seems to be unused (although that is not standard-conforming behavior). For example, the --gc-sections linker flag together with GCC's -ffunction-sections -fdata-sections can have this effect.


As you can see dynamic initialization of static storage duration objects is kind of complicated in C++. In your case here there are only minor dependency issues, but this can quickly become very messy, which is why it is usually recommended to avoid it as much as possible.

How do I know if the compiler will optimize a variable?

Simple example:

int flag = 1;

while (flag)
{
    /* do something that doesn't involve flag */
}

This can be optimized to:

while (true)
{
    /* do something */
}

because the compiler knows that flag never changes.

with this code:

volatile int flag = 1;

while (flag)
{
    /* do something that doesn't involve flag */
}

nothing will be optimized away, because now the compiler knows: "although the program doesn't change flag inside the while loop, it might be changed anyway".

Any tools to optimize a structure size in C?

Assuming that the size of every member is a multiple of its alignment requirement, and that alignments are powers of 2, the optimal layout can be found by placing the members with the strictest alignment first. There will then be no internal padding between members. The total size of the struct is the sum of its members' sizes rounded up to the strictest alignment, which is a lower bound anyway.


