Why Does the Enhanced GCC 6 Optimizer Break Practical C++ Code?

Why does the enhanced GCC 6 optimizer break practical C++ code?

I guess the question that needs to be answered is why well-intentioned people would write these checks in the first place.

The most common case is probably a class that is part of a naturally recursive data structure, such as a tree.

If you had:

struct Node
{
    Node* left;
    Node* right;
};

in C, you might write:

void traverse_in_order(Node* n) {
    if(!n) return;
    traverse_in_order(n->left);
    process(n);
    traverse_in_order(n->right);
}

In C++, it's nice to make this a member function:

void Node::traverse_in_order() {
    // <--- What check should be put here?
    left->traverse_in_order();
    process();
    right->traverse_in_order();
}

In the early days of C++ (prior to standardization), it was emphasized that member functions were syntactic sugar for a function where the this parameter is implicit. Code was written in C++, converted to equivalent C, and compiled. There were even explicit examples showing that comparing this to null was meaningful, and the original Cfront compiler took advantage of this too. So coming from a C background, the obvious choice for the check is:

if(this == nullptr) return;      

Note: Bjarne Stroustrup even mentions that the rules for this have changed over the years here

And this worked on many compilers for many years. When standardization happened, this changed. And more recently, compilers started taking advantage of the fact that calling a member function where this is nullptr is undefined behavior, which means that this condition is always false, and the compiler is free to omit it.
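
As a concrete illustration of what the optimizer is now allowed to do (a sketch of the transformation, not the exact output of any particular GCC version):

// What the programmer wrote:
void Node::traverse_in_order() {
    if(this == nullptr) return;   // the compiler may assume this is never true
    left->traverse_in_order();
    process();
    right->traverse_in_order();
}

// What the optimizer is entitled to treat it as:
void Node::traverse_in_order() {
    left->traverse_in_order();    // crashes (or worse) when called on a null pointer
    process();
    right->traverse_in_order();
}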

That means that to do any traversal of this tree, you need to either:

  • Do all of the checks before calling traverse_in_order

    void Node::traverse_in_order() {
        if(left) left->traverse_in_order();
        process();
        if(right) right->traverse_in_order();
    }

    This means also checking at EVERY call site if you could have a null root.

  • Don't use a member function

    This means that you're writing the old C-style code (perhaps as a static method), and calling it with the object explicitly as a parameter, e.g. you're back to writing Node::traverse_in_order(node); rather than node->traverse_in_order(); at the call site.

  • I believe the easiest/neatest way to fix this particular example in a way that is standards compliant is to actually use a sentinel node rather than a nullptr.

    // static class member, or global variable
    Node sentinel;

    void Node::traverse_in_order() {
        if(this == &sentinel) return;
        ...
    }
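
    A brief usage sketch (the brace initialization below is an assumption about how the nodes are set up, not code from the question): leaves point at the sentinel instead of nullptr, so every recursive call is made on a valid object and this is never null.

    Node leaf { &sentinel, &sentinel };
    Node root { &leaf, &sentinel };

    root.traverse_in_order();   // recursion bottoms out at &sentinel, not nullptr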

Neither of the first two options seems that appealing, and while code could get away with it, people wrote bad code with this == nullptr instead of using a proper fix.

I'm guessing that's how some of these code bases evolved to have this == nullptr checks in them.

Question about GCC Optimizer and why this code always returns 42?

int main(int argc, char** argv) {
    switch (argc) {
    case 1000: return 42;
    int y = 24;
    default: return y;
    }
    return argc;
}

To explain this a bit more: a switch doesn't execute its body linearly from top to bottom; it jumps directly to the matching case label. The logic equivalent to this would be:

"If argc is 1000, return 42. Otherwise return y"

The int y = 24; initialization is never executed, because when argc isn't 1000 the switch jumps straight to default, skipping over it. The compiler can optimize the initialization out, and since reading the uninitialized y in the default case is undefined behavior, it might as well always return 42.

To fix this and behave the way I suspect you intend, you just need to declare y outside of the switch statement.

int main(int argc, char** argv) {
    int y = 24;
    switch (argc) {
    case 1000: return 42;
    default: return y;
    }
    return argc;
}

Why did GCC break code that calls abs function with short argument?

GCC's and Clang's C++ standard libraries (libstdc++ and libc++, respectively) broke backward compatibility in order to comply with the C++ Standard.

The trouble is caused by this clause:

Moreover, there shall be additional overloads sufficient to ensure:

  1. If any arithmetic argument corresponding to a double parameter has type long double, then all arithmetic arguments corresponding to
    double parameters are effectively cast to long double.

  2. Otherwise, if any arithmetic argument corresponding to a double parameter has type double or an integer type, then all arithmetic
    arguments corresponding to double parameters are effectively cast to
    double.

  3. Otherwise, all arithmetic arguments corresponding to double parameters have type float.

short int is "an integer type", so bullet #2 kicks in and causes generation of a wrapper which calls double abs(double) and this wrapper is a better match than int abs(int).
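
A minimal sketch of the symptom and a portable workaround (which overload is chosen depends on the library version, so this is illustrative rather than definitive):

#include <cstdlib>   // may bring in the additional std::abs overloads

int main() {
    short s = -42;
    // With the extra overloads, std::abs(s) can resolve to double abs(double),
    // silently changing the result type from int to double.
    auto r = std::abs(s);
    // Portable workaround: make the promotion explicit so int abs(int) is chosen.
    int i = std::abs(static_cast<int>(s));
    return (i == 42 && r > 0) ? 0 : 1;
}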

Notably, the latest drafts of the Standard have an explicit exception added to this rule:

For each set of overloaded functions within <cmath>, with the exception of abs, there shall be additional overloads sufficient to ensure:

This exception actually is derived from handling of unsigned types but solves your problem as well.

ARM GCC removing required code during optimization

In this code:

lum = ((RSCALE * r) + (GSCALE * g) + (BSCALE * b));
if(lum > 0x7FFFFFFF)

RSCALE, GSCALE, and BSCALE are 5014709, 9848225, and 1912602, respectively. Assuming int is 32 bits in the C implementation being used, these are all int constants.

r, g, and b are all of type uint8_t, so they are promoted to int in the multiplications.

Then ((RSCALE * r) + (GSCALE * g) + (BSCALE * b)) is a calculation entirely with int operands producing an int result. So the compiler can see that lum is assigned the value of an int result, and it is entitled to assume the result is in the range INT_MIN to INT_MAX. Further, it can see all operands are nonnegative, and therefore negative results are not possible, reducing the possible range to 0 to INT_MAX. This excludes the possibility that assigning a negative value to a uint32_t will cause wrapping to a high value. So the compiler may assume lum > 0x7FFFFFFF is never true.

The calculation may overflow int, and then the behavior is undefined, and the compiler is still allowed to use the assumption.

To correct this, change at least one operand of each multiplication to unsigned.
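
A minimal sketch of that fix, assuming lum is a 32-bit unsigned variable and the constants keep the values quoted above (the function name and the clamping are placeholders, not the original code):

#include <cstdint>

constexpr std::uint32_t RSCALE = 5014709u;
constexpr std::uint32_t GSCALE = 9848225u;
constexpr std::uint32_t BSCALE = 1912602u;

std::uint32_t luminance(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    // Each product and the sum are now computed in unsigned arithmetic, where
    // wrap-around is well defined, so the compiler can no longer assume the
    // range check below is dead code.
    std::uint32_t lum = (RSCALE * r) + (GSCALE * g) + (BSCALE * b);
    if (lum > 0x7FFFFFFFu) {
        lum = 0x7FFFFFFFu;   // placeholder for whatever overflow handling is needed
    }
    return lum;
}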

Why does the C++ optimizer use different delete when deleting the same pointer

The explanation is simple: you invoke undefined behaviour (dereferencing a null pointer) and thus anything can happen. A tell-tale sign for this kind of thing is that optimization level influences the result as you note.

The address you are seeing is rubbish in both cases. As to why exactly they are different, you'll need someone with knowledge of how GCC 7.3 handles this specific case of undefined behaviour, but for all intents and purposes, all useful C++ explanations stop at "it's undefined behaviour".

If I had to guess, the vtable and how it is optimized away in this case may have something to do with it.

Why does -O2 or greater optimization in clang break this code?

The bug is in your constructor:

countdownTimer(duration_t duration)
    : duration{ duration }, paused{ true } {}

You forgot to initialize started. This triggers undefined behavior when you call start().
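
A minimal sketch of the fix (assuming started is a bool member, as the snippet and the diagnostic below suggest):

countdownTimer(duration_t duration)
    : duration{ duration }, paused{ true }, started{ false } {}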

No version of clang that I have convenient access to will diagnose this error, but GCC versions 5 and 6 (on Linux - I don't have GCC on my Mac anymore) will:

$ g++ -O2 -Wall -Wextra -std=c++14 test.cc
test.cc: In function ‘int main()’:
test.cc:18:13: warning: ‘*((void*)& timer +33)’ is used uninitialized in this function [-Wuninitialized]
if (started) return;
^~~~~~~
test.cc:74:20: note: ‘*((void*)& timer +33)’ was declared here
countdownTimer timer(10s);
^~~~~

(My copy of Xcode seems to be a bit out of date, with Apple LLVM version 7.0.2 (clang-700.1.81); it does not change the behavior of the program at -O2. It's possible that your clang would diagnose this error if you turned on the warnings.)

(I have filed a bug report with GCC about the IR gobbledygook in the diagnostics.)

Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

My colleague helped me find a plausible answer to my question. He noticed the importance of the 256 byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).


Short answer:

Is it the padding that is the culprit in this case? Why and how?

It all boils down to alignment. Alignment can have a significant impact on performance; that is why we have the -falign-* flags in the first place.

I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 byte by default but try to align it to 16 byte if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment and the generated code does not show this weird behavior.

Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.

The noise it makes pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source codes?

Simply by telling gcc to do the right alignment:

g++ -O2 -falign-functions=16 -falign-loops=16


Long answer:

The code will run slower if:

  • an XX byte boundary cuts add() in the middle (XX being machine dependent),

  • the call to add() has to jump over an XX byte boundary and the target is not aligned,

  • add() is not aligned,

  • the loop is not aligned.

The first two are beautifully visible in the code and results that Marat Dukhan kindly posted. In this case, gcc-4.8.1 -Os (executes in 0.994 secs):

00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret

a 256 byte boundary cuts add() right in the middle and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!

In case gcc-4.7.3 -Os (executes in 0.822 secs), the 256 byte boundary only cuts into a cold section (but neither the loop, nor add() is cut):

00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

[...]

  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>

Nothing is aligned, and the call to add() has to jump over the 256 byte boundary. This code is the second slowest.

In the case of gcc-4.6.4 -Os (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256 byte boundary and the target is exactly 32 bytes away:

  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>

This is the fastest of all three. Why the 256 byte boundary is special on his machine, I will leave it up to him to figure out. I don't have such a processor.

Now, on my machine I don't get this 256 byte boundary effect. Only the function and loop alignment kick in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16, and the code is not sensitive to that either.

I first noticed in 2009 that gcc (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

A likely explanation is that I had hotspots which were sensitive to alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.

Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of such a tiny function like add() fail?

Consider this:

// add.cpp
int add(const int& x, const int& y) {
    return x + y;
}

and in a separate file:

// main.cpp
int add(const int& x, const int& y);

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

and compiled as: g++ -O2 add.cpp main.cpp.

      gcc won't inline add()!

That's all; it's that easy to unintentionally create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link time optimization, the code runs in 0.19s!

(Inlining is artificially disabled in the OP; hence, the code in the OP was 2x slower.)


