Why does GCC generate 15-20% faster code if I optimize for size instead of speed?
My colleague helped me find a plausible answer to my question. He noticed the importance of the 256 byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).
Short answer:
Is it the padding that is the culprit in this case? Why and how?
It all boils down to alignment. Alignment can have a significant impact on performance; that is why we have the -falign-* flags in the first place.
I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 bytes by default but try to align them to 16 bytes if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment, and the generated code does not show this weird behavior.
Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.
The noise it makes pretty much makes timing micro-optimizations impossible. How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?
Simply by telling gcc to do the right alignment:
g++ -O2 -falign-functions=16 -falign-loops=16
Long answer:
The code will run slower if:
- an XX-byte boundary cuts add() in the middle (XX being machine dependent);
- the call to add() has to jump over an XX-byte boundary and the target is not aligned;
- add() is not aligned;
- the loop is not aligned.
The first two are clearly visible in the code and results that Marat Dukhan kindly posted. In the gcc-4.8.1 -Os case (executes in 0.994 secs):
00000000004004fd <_ZL3addRKiS0_.isra.0>:
4004fd: 8d 04 37 lea eax,[rdi+rsi*1]
400500: c3 ret
A 256-byte boundary cuts add() right in the middle, and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!
In the gcc-4.7.3 -Os case (executes in 0.822 secs), the 256-byte boundary only cuts into a cold section (neither the loop nor add() is cut):
00000000004004fa <_ZL3addRKiS0_.isra.0>:
4004fa: 8d 04 37 lea eax,[rdi+rsi*1]
4004fd: c3 ret
[...]
40051a: e8 db ff ff ff call 4004fa <_ZL3addRKiS0_.isra.0>
Nothing is aligned, and the call to add() has to jump over the 256-byte boundary. This code is the second slowest.
In the gcc-4.6.4 -Os case (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256-byte boundary and the target is exactly 32 bytes away:
4004f2: e8 db ff ff ff call 4004d2 <_ZL3addRKiS0_.isra.0>
4004f7: 01 c3 add ebx,eax
4004f9: ff cd dec ebp
4004fb: 75 ec jne 4004e9 <_ZL4workii+0x13>
This is the fastest of all three. Why the 256-byte boundary is special on his machine, I will leave it to him to figure out; I don't have such a processor.
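A quick way to reason about these three cases: a block of code crosses an N-byte boundary exactly when its first and last bytes fall into different N-byte-aligned chunks. A minimal sketch (the helper name is my own, not from the post):

```cpp
#include <cassert>
#include <cstdint>

// Does a code block of `size` bytes starting at `start` cross an
// `align`-byte boundary? `align` must be a power of two.
bool crosses_boundary(std::uint64_t start, std::uint64_t size, std::uint64_t align) {
    std::uint64_t mask = ~(align - 1);  // keep only the aligned-chunk bits
    return (start & mask) != ((start + size - 1) & mask);
}
```

With the addresses above: gcc-4.8.1's add() starts at 0x4004fd and is 4 bytes long, so it straddles the 0x400500 boundary; gcc-4.7.3's copy at 0x4004fa fits entirely below it.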
Now, on my machine I don't get this 256-byte boundary effect; only function and loop alignment kick in. If I pass g++ -O2 -falign-functions=16 -falign-loops=16, then everything is back to normal: I always get the fastest case, and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16; the code is not sensitive to that either.
I first noticed in 2009 that gcc (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.
A likely explanation is that I had hotspots which were sensitive to alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident, and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.
Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of such a tiny function like add()
fail?
Consider this:
// add.cpp
int add(const int& x, const int& y) {
return x + y;
}
and in a separate file:
// main.cpp
int add(const int& x, const int& y);
const int LOOP_BOUND = 200000000;
__attribute__((noinline))
static int work(int xval, int yval) {
int sum(0);
for (int i=0; i<LOOP_BOUND; ++i) {
int x(xval+sum);
int y(yval+sum);
int z = add(x, y);
sum += z;
}
return sum;
}
int main(int , char* argv[]) {
int result = work(*argv[1], *argv[2]);
return result;
}
and compiled as: g++ -O2 add.cpp main.cpp.
gcc won't inline add()
!
That's all; it's that easy to unintentionally create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link-time optimization, the code runs in 0.19s!
(Inlining is artificially disabled in the OP, hence, the code in the OP was 2x slower).
Which GCC optimization flags affect binary size the most?
Most of the extra code-size in an un-optimized build comes from the fact that the default -O0 also means a debug build: nothing is kept in registers across statements, so that debugging stays consistent even if you use a GDB j command to jump to a different source line in the same function. -O0 means a huge amount of store/reload vs. even the lightest level of optimization, which is especially disastrous for code-size on a non-CISC ISA that can't use memory source operands. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to GCC equally.
Especially for modern C++, a debug build is disastrous, because simple template wrapper functions that normally inline and optimize away to nothing in simple cases (or maybe one instruction) instead compile to actual function calls that have to set up args and run a call instruction. E.g. for a std::vector, the operator[] member function can normally inline to a single ldr instruction, assuming the compiler has the .data() pointer in a register. But without inlining, every call-site takes multiple instructions (footnote 1).
Options that affect code-size in the actual .text section (footnote 2) the most: alignment of branch-targets in general, or just loops, costs some code-size. Other than that:
- -ftree-vectorize - make SIMD versions of loops, also necessitating scalar cleanup if the compiler can't prove that the iteration count will be a multiple of the vector width. (Or that pointed-to arrays are non-overlapping if you don't use restrict; that may also need a scalar fallback.) Enabled at -O3 in GCC11 and earlier; enabled at -O2 in GCC12 and later, like clang.
- -funroll-loops / -funroll-all-loops - not enabled by default even at -O3 in modern GCC. Enabled with profile-guided optimization (-fprofile-use), when it has profiling data from a -fprofile-generate build to know which loops are actually hot and worth spending code-size on. (And which are cold and thus should be optimized for size so you get fewer I-cache misses when they do run, and less eviction of other code.) PGO also influences vectorization decisions.
  Related to loop unrolling are heuristics (tuning knobs) that control loop peeling (fully unrolling) and how much to unroll. The normal way to set these is with -march=native, implying -mtune= whatever as well. -mtune=znver3 may favour big unroll factors (at least clang does), compared to -mtune=sandybridge or -mtune=haswell. But there are GCC options to manually adjust individual things, as discussed in comments on gcc: strange asm generated for simple loop and in How to ask GCC to completely unroll this loop (i.e., peel this loop)?
  There are options to override the weights and thresholds for other decision heuristics like inlining, too, but it's very rare you'd want to fine-tune that much unless you're working on refining the defaults, or finding good defaults for a new CPU.
- -Os - optimize for size and speed, trying not to sacrifice too much speed. A good tradeoff if your code has a lot of I-cache misses; otherwise -O3 is normally faster, or at least that's the design goal for GCC. It can be worth trying different options to see if -O2 or -Os makes your code faster than -O3 across some CPUs you care about; sometimes missed optimizations or quirks of certain microarchitectures make a difference, as in Why does GCC generate 15-20% faster code if I optimize for size instead of speed?, which has actual benchmarks from GCC 4.6 to 4.8 (current at the time) for a specific small loop in a test program, on quite a few different x86 and ARM CPUs, with and without -march=native to actually tune for them. There's zero reason to expect that to be representative of other code, though, so you need to test yourself for your own codebase. (And for any given loop, small code changes could make a different compile option better on any given CPU.) And obviously -Os is very useful if you need your static code-size smaller to fit in some size limit.
- -Oz - optimize for size only, even at a large cost in speed. GCC only very recently added this to current trunk, so expect it in GCC 12 or 13. Presumably what I wrote below about clang's implementation of -Oz being quite aggressive also applies to GCC, but I haven't yet tested it.
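To illustrate the restrict point from the vectorization bullet above, here is a minimal sketch (function name is my own). With `__restrict` (a GCC/Clang extension; C99 spells it `restrict`), the compiler may assume the two arrays never overlap, so -ftree-vectorize can emit a SIMD loop without a runtime overlap check or a scalar fallback path:

```cpp
#include <cassert>
#include <cstddef>

// With __restrict on both pointers, the compiler is entitled to assume
// dst and src don't alias, so the loop can be vectorized without a
// runtime overlap check or a scalar fallback version.
void scale_add(float* __restrict dst, const float* __restrict src,
               float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

Without the qualifiers, GCC typically emits both a vector loop and a scalar loop plus a pointer-comparison check to choose between them, which is exactly the code-size cost the bullet describes.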
Clang has similar options, including -Os. It also has a clang -Oz option to optimize only for size, without caring about speed. It's very aggressive, e.g. on x86 using code-golf tricks like push 1; pop rax (3 bytes total) instead of mov eax, 1 (5 bytes).
GCC's -Os unfortunately chooses to use div instead of a multiplicative inverse for division by a constant, costing lots of speed but not saving much if any size (https://godbolt.org/z/x9h4vx1YG for x86-64). For ARM, GCC -Os still uses an inverse if you don't use a -mcpu= that implies udiv is even available; otherwise it uses udiv: https://godbolt.org/z/f4sa9Wqcj .
Clang's -Os still uses a multiplicative inverse with umull, only using udiv with -Oz (or a call to the __aeabi_uidiv helper function without any -mcpu option). So in that respect, clang -Os makes a better tradeoff than GCC, still spending a little bit of code-size to avoid slow integer division.
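For reference, the multiplicative-inverse trick discussed above can be sketched in source form (function name is my own; the magic constant 0xCCCCCCCD is ceil(2^35 / 10), the same one compilers emit for unsigned division by 10):

```cpp
#include <cassert>
#include <cstdint>

// Division by the constant 10 via a widening multiply and shift instead of
// a div/udiv instruction: x / 10 == (x * ceil(2^35 / 10)) >> 35 for all
// 32-bit unsigned x.
uint32_t div10(uint32_t x) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(x) * UINT64_C(0xCCCCCCCD)) >> 35);
}
```

This is the kind of sequence clang -Os keeps (one umull plus a shift on ARM), while GCC -Os falls back to an actual division instruction.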
Footnote 1: inlining or not for std::vector
#include <vector>
int foo(std::vector<int> &v) {
return v[0] + v[1];
}
Godbolt with gcc, with the default -O0 vs. -Os, for -mcpu=cortex-m7 just to randomly pick something. IDK if it's normal to use dynamic containers like std::vector on an actual microcontroller; probably not.
# -Os (same as -Og for this case, actually, omitting the frame pointer for this leaf function)
foo(std::vector<int, std::allocator<int> >&):
ldr r3, [r0] @ load the _M_start member of the reference arg
ldrd r0, r3, [r3] @ load a pair of words (v[0..1]) from there into r0 and r3
add r0, r0, r3 @ add them into the return-value register
bx lr
vs. a debug build (with name-demangling enabled for the asm)
# GCC -O0 -mcpu=cortex-m7 -mthumb
foo(std::vector<int, std::allocator<int> >&):
push {r4, r7, lr} @ non-leaf function requires saving LR (the return address) as well as some call-preserved registers
sub sp, sp, #12
add r7, sp, #0 @ Use r7 as a frame pointer. -O0 defaults to -fno-omit-frame-pointer
str r0, [r7, #4] @ spill the incoming register arg to the stack
movs r1, #0 @ 2nd arg for operator[]
ldr r0, [r7, #4] @ reload the pointer to the control block as the first arg
bl std::vector<int, std::allocator<int> >::operator[](unsigned int)
mov r3, r0 @ useless copy, but hey we told GCC not to spend any time optimizing.
ldr r4, [r3] @ deref the reference (pointer) it returned, into a call-preserved register that will survive across the next call
movs r1, #1 @ arg for the v[1] operator[]
ldr r0, [r7, #4]
bl std::vector<int, std::allocator<int> >::operator[](unsigned int)
mov r3, r0
ldr r3, [r3] @ deref the returned reference
add r3, r3, r4 @ v[1] + v[0]
mov r0, r3 @ and copy into the return value reg because GCC didn't bother to add into it directly
adds r7, r7, #12 @ tear down the stack frame
mov sp, r7
pop {r4, r7, pc} @ and return by popping saved-LR into PC
@ and there's an actual implementation of the operator[] function
@ it's 15 instructions long.
@ But only one instance of this is needed for each type your program uses (vector<int>, vector<char*>, vector<my_foo>, etc.)
@ so it doesn't add up as much as each call-site
std::vector<int, std::allocator<int> >::operator[](unsigned int):
push {r7}
sub sp, sp, #12
...
As you can see, un-optimized GCC cares more about fast compile times than about even the simplest things, like avoiding useless mov reg,reg instructions within code for evaluating one expression.
Footnote 2: metadata
If you count a whole ELF executable with metadata, not just the .text + .rodata + .data you'd need to burn to flash, then of course -g debug info is very significant for the size of the file, but basically irrelevant because it's not mixed in with the parts that are needed while running; it just sits there on disk.
Symbol names and debug info can be stripped with gcc -s
or strip
.
Stack-unwind info is an interesting tradeoff between code-size and metadata. -fno-omit-frame-pointer wastes extra instructions and a register as a frame pointer, leading to larger machine-code size but smaller .eh_frame stack-unwind metadata. (strip does not consider that "debug" info by default, even for C programs, as opposed to C++, where exception handling might need it in non-debugging contexts.)
How to remove "noise" from GCC/clang assembly output? mentions how to get the compiler to omit some of that: -fno-asynchronous-unwind-tables omits .cfi directives in the asm output, and thus the metadata that goes into the .eh_frame section. Also -fno-exceptions -fno-rtti with C++ can reduce metadata. (Run-Time Type Information for reflection takes space.)
Linker options that control alignment of sections / ELF segments can also take extra space; this is relevant for tiny executables, but it's basically a constant amount of space, not scaling with the size of the program. See also Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?
Why does gcc use the size-aware delete operator by default when optimizing?
References are to the post-C++20 draft (n4861). I am also assuming C++14 or later, which introduced size-aware deallocation functions.
For your particular example, the delete expression is required to call the size-aware operator delete, since the type to be destroyed is complete. So GCC is behaving correctly (see [expr.delete]/10.5).
It is, however, not a problem that the size-aware one is chosen, because the default behavior of the global operator delete(void*, size_t) overload, if not replaced, is to just call the corresponding operator delete(void*), so your custom implementation will still be used in the end (see [new.delete.single]/16).
There is, however, a recommendation in [new.delete.single]/11 that a program which replaces the operator delete version without the size_t parameter should also replace the one with the size_t parameter. A note clarifies that although the standard-library-supplied default behavior of the size-aware versions currently calls the custom non-size-aware implementation anyway, that may change in future standard revisions.
Also, the compiler is allowed to elide both the call to operator new and the call to operator delete in the given example, either to provide storage in a different manner or, more likely, to just elide the whole body of main, which has no other observable side effects. So it is possible that no operator delete is called at all.
So, to future-proof the code and avoid linter warnings, add a replacement for the size-aware global overload as well:
void operator delete(void *p, std::size_t) noexcept
{
::operator delete(p);
}
Also note that the standard requires the replacement of this overload to behave in such a way that it could always be replaced by a call to the size-unaware version without affecting memory allocation. (see [new.delete.single]/15)
Although required since C++14, Clang doesn't seem to enable size-aware deallocation functions by default yet. You need to add the -fsized-deallocation flag to enable them.
Some discussion of a patch enabling it by default seems to be going on here.
Also note that your implementation of operator new is broken. The throwing version of operator new is not allowed to return a null pointer, so you must check the return value of malloc and throw std::bad_alloc if it is null.
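Putting the answer's two points together, here is a minimal sketch (the counter name is my own, for illustration) of a malloc-based replacement set that checks for null in the throwing operator new and forwards the size-aware delete to the unsized one, as [new.delete.single]/15 permits:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

static std::size_t g_unsized_deletes = 0;  // counts runs of the unsized overload

void* operator new(std::size_t n) {
    // The throwing operator new must never return null.
    void* p = std::malloc(n ? n : 1);
    if (!p) throw std::bad_alloc();
    return p;
}

void operator delete(void* p) noexcept {
    ++g_unsized_deletes;
    std::free(p);
}

// Per [new.delete.single]/11, replace the sized overload too; forwarding to
// the unsized version is always a valid implementation.
void operator delete(void* p, std::size_t) noexcept {
    ::operator delete(p);
}
```

Whichever overload the delete expression picks, the memory ends up in the unsized replacement, so the custom behavior runs exactly once per deallocation.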
Why does Visual Studio C++ take so long to compile
Sometimes, it doesn't matter what the specifications of your computer are. The size of some large scale Windows C++ projects can just take a lot longer because there is simply a lot more code, more functions to compile & link and or analyze. There are some things you can do to speed up the process by, using PCH (Pre-compiled Headers) or if your compiling in debug you can use the /debug:fastlink linker switch etc.
You can read more about ways to make MSVC faster here: Microsoft Devblogs Recommendations