What is the default optimization level, if I use the cargo build --release command?
According to the Cargo manual, the default optimization level for a release build is opt-level = 3 (the equivalent of -O3):
# The release profile, used for `cargo build --release` (and the dependencies
# for `cargo test --release`, including the local library or binary).
[profile.release]
opt-level = 3
debug = false
split-debuginfo = '...' # Platform-specific.
debug-assertions = false
overflow-checks = false
lto = false
panic = 'unwind'
incremental = false
codegen-units = 16
rpath = false
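If you want a different level for release builds, you can override individual keys in your own Cargo.toml; anything you don't set keeps the defaults above. A minimal sketch (using opt-level = "s", which optimizes for binary size instead of the default 3):

```toml
# Cargo.toml
[profile.release]
opt-level = "s"   # "z" is even more size-focused; 0-3 trade size for speed
```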
What's the best g++ optimization level when building a debug target?
GCC 4.8 introduced a new optimization level, -Og, for the best of both worlds.
-Og
Optimize debugging experience. -Og enables optimizations that do not interfere with
debugging. It should be the optimization level of choice for the standard
edit-compile-debug cycle, offering a reasonable level of optimization while maintaining
fast compilation and a good debugging experience.
This way some optimization is done so you get better performance, better possibly-uninitialized variable detection and you can also step through a program in GDB without jumping back-and-forth through the function.
How many GCC optimization levels are there?
To be pedantic, there are 8 different valid -O options you can give to gcc, though there are some that mean the same thing.
The original version of this answer stated there were 7 options. GCC has since added -Og
to bring the total to 8.
From the man page:
-O (same as -O1)
-O0 (do no optimization; the default if no optimization level is specified)
-O1 (optimize minimally)
-O2 (optimize more)
-O3 (optimize even more)
-Ofast (optimize very aggressively, to the point of breaking standard compliance)
-Og (optimize the debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.)
-Os (optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size. -Os disables the following optimization flags: -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version)
There may also be platform-specific optimizations; as @pauldoo notes, OS X has -Oz.
What is the difference between User Defined SWIFT_WHOLE_MODULE_OPTIMIZATION and Swift Optimization Level?
The main difference is that whole-module-optimization refers to how the compiler optimizes across an entire module, while the Swift Optimization Level applies to the compilation of each individual file. You can read more about the different flags for the Swift Optimization Level here. SWIFT_WHOLE_MODULE_OPTIMIZATION improves compilation time because the compiler has a more global view of all functions, methods, and relations between files, allowing it to ignore unused functions, optimize compile order, and make other improvements. It also focuses on compiling only modified files, which means that even with the flag activated, if you clean your project and delete the derived-data folder, you will still see a longer compilation time on the first run.
Clang optimization levels
I found this related question.
To sum it up, to find out about compiler optimization passes:
llvm-as < /dev/null | opt -O3 -disable-output -debug-pass=Arguments
As pointed out in Geoff Nixon's answer (+1), clang
additionally runs some higher level optimizations, which we can retrieve with:
echo 'int;' | clang -xc -O3 - -o /dev/null -\#\#\#
Documentation of individual passes is available here.
You can compare the effect of changing high-level flags such as -O
like this:
diff -wy --suppress-common-lines \
<(echo 'int;' | clang -xc - -o /dev/null -\#\#\# 2>&1 | tr " " "\n" | grep -v /tmp) \
<(echo 'int;' | clang -xc -O0 - -o /dev/null -\#\#\# 2>&1 | tr " " "\n" | grep -v /tmp)
# will tell you that -O0 is indeed the default.
With version 6.0 the passes are as follows:
baseline (-O0):
- opt sets: -tti -verify -ee-instrument -targetlibinfo -assumption-cache-tracker -profile-summary-info -forceattrs -basiccg -always-inline -barrier
- clang adds: -mdisable-fp-elim -mrelax-all
-O1 is based on -O0
- opt adds: -targetlibinfo -tti -tbaa -scoped-noalias -assumption-cache-tracker -profile-summary-info -forceattrs -inferattrs -ipsccp -called-value-propagation -globalopt -domtree -mem2reg -deadargelim -basicaa -aa -loops -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instcombine -simplifycfg -basiccg -globals-aa -prune-eh -always-inline -functionattrs -sroa -memoryssa -early-cse-memssa -speculative-execution -lazy-value-info -jump-threading -correlated-propagation -libcalls-shrinkwrap -branch-prob -block-freq -pgo-memop-opt -tailcallelim -reassociate -loop-simplify -lcssa-verification -lcssa -scalar-evolution -loop-rotate -licm -loop-unswitch -indvars -loop-idiom -loop-deletion -loop-unroll -memdep -memcpyopt -sccp -demanded-bits -bdce -dse -postdomtree -adce -barrier -rpo-functionattrs -globaldce -float2int -loop-accesses -loop-distribute -loop-vectorize -loop-load-elim -alignment-from-assumptions -strip-dead-prototypes -loop-sink -instsimplify -div-rem-pairs -verify -ee-instrument -early-cse -lower-expect
- clang adds: -momit-leaf-frame-pointer
- clang drops: -mdisable-fp-elim -mrelax-all
-O2 is based on -O1
- opt adds: -inline -mldst-motion -gvn -elim-avail-extern -slp-vectorizer -constmerge
- opt drops: -always-inline
- clang adds: -vectorize-loops -vectorize-slp
-O3 is based on -O2
- opt adds: -callsite-splitting -argpromotion
-Ofast is based on -O3, valid in clang but not in opt
- clang adds: -fno-signed-zeros -freciprocal-math -ffp-contract=fast -menable-unsafe-fp-math -menable-no-nans -menable-no-infs -mreassociate -fno-trapping-math -ffast-math -ffinite-math-only
-Os is similar to -O2
- opt drops: -libcalls-shrinkwrap -pgo-memop-opt
-Oz is based on -Os
- opt drops: -slp-vectorizer
With version 3.8 the passes are as follows:
baseline (-O0):
- opt sets: -targetlibinfo -tti -verify
- clang adds: -mdisable-fp-elim -mrelax-all
-O1 is based on -O0
- opt adds: -globalopt -demanded-bits -branch-prob -inferattrs -ipsccp -dse -loop-simplify -scoped-noalias -barrier -adce -deadargelim -memdep -licm -globals-aa -rpo-functionattrs -basiccg -loop-idiom -forceattrs -mem2reg -simplifycfg -early-cse -instcombine -sccp -loop-unswitch -loop-vectorize -tailcallelim -functionattrs -loop-accesses -memcpyopt -loop-deletion -reassociate -strip-dead-prototypes -loops -basicaa -correlated-propagation -lcssa -domtree -always-inline -aa -block-freq -float2int -lower-expect -sroa -loop-unroll -alignment-from-assumptions -lazy-value-info -prune-eh -jump-threading -loop-rotate -indvars -bdce -scalar-evolution -tbaa -assumption-cache-tracker
- clang adds: -momit-leaf-frame-pointer
- clang drops: -mdisable-fp-elim -mrelax-all
-O2 is based on -O1
- opt adds: -elim-avail-extern -mldst-motion -slp-vectorizer -gvn -inline -globaldce -constmerge
- opt drops: -always-inline
- clang adds: -vectorize-loops -vectorize-slp
-O3 is based on -O2
- opt adds: -argpromotion
-Ofast is based on -O3, valid in clang but not in opt
- clang adds: -fno-signed-zeros -freciprocal-math -ffp-contract=fast -menable-unsafe-fp-math -menable-no-nans -menable-no-infs
-Os is the same as -O2
-Oz is based on -Os
- opt drops: -slp-vectorizer
- clang drops: -vectorize-loops
----------
With version 3.7 the passes are as follows (parsed output of the command above):
default (-O0): -targetlibinfo -verify -tti
-O1 is based on -O0
- adds: -sccp -loop-simplify -float2int -lazy-value-info -correlated-propagation -bdce -lcssa -deadargelim -loop-unroll -loop-vectorize -barrier -memcpyopt -loop-accesses -assumption-cache-tracker -reassociate -loop-deletion -branch-prob -jump-threading -domtree -dse -loop-rotate -ipsccp -instcombine -scoped-noalias -licm -prune-eh -loop-unswitch -alignment-from-assumptions -early-cse -inline-cost -simplifycfg -strip-dead-prototypes -tbaa -sroa -no-aa -adce -functionattrs -lower-expect -basiccg -loops -loop-idiom -tailcallelim -basicaa -indvars -globalopt -block-freq -scalar-evolution -memdep -always-inline
-O2 is based on -O1
- adds: -elim-avail-extern -globaldce -inline -constmerge -mldst-motion -gvn -slp-vectorizer
- removes: -always-inline
-O3 is based on -O2
- adds: -argpromotion -verify
-Os is identical to -O2
-Oz is based on -Os
- removes: -slp-vectorizer
----------
For version 3.6 the passes are as documented in GYUNGMIN KIM's post.
----------
With version 3.5 the passes are as follows (parsed output of the command above):
default (-O0): -targetlibinfo -verify -verify-di
-O1 is based on -O0
- adds: -correlated-propagation -basiccg -simplifycfg -no-aa -jump-threading -sroa -loop-unswitch -ipsccp -instcombine -memdep -memcpyopt -barrier -block-freq -loop-simplify -loop-vectorize -inline-cost -branch-prob -early-cse -lazy-value-info -loop-rotate -strip-dead-prototypes -loop-deletion -tbaa -prune-eh -indvars -loop-unroll -reassociate -loops -sccp -always-inline -basicaa -dse -globalopt -tailcallelim -functionattrs -deadargelim -notti -scalar-evolution -lower-expect -licm -loop-idiom -adce -domtree -lcssa
-O2 is based on -O1
- adds: -gvn -constmerge -globaldce -slp-vectorizer -mldst-motion -inline
- removes: -always-inline
-O3 is based on -O2
- adds: -argpromotion
-Os is identical to -O2
-Oz is based on -Os
- removes: -slp-vectorizer
----------
With version 3.4 the passes are as follows (parsed output of the command above):
-O0: -targetlibinfo -preverify -domtree -verify
-O1 is based on -O0
- adds: -adce -always-inline -basicaa -basiccg -correlated-propagation -deadargelim -dse -early-cse -functionattrs -globalopt -indvars -inline-cost -instcombine -ipsccp -jump-threading -lazy-value-info -lcssa -licm -loop-deletion -loop-idiom -loop-rotate -loop-simplify -loop-unroll -loop-unswitch -loops -lower-expect -memcpyopt -memdep -no-aa -notti -prune-eh -reassociate -scalar-evolution -sccp -simplifycfg -sroa -strip-dead-prototypes -tailcallelim -tbaa
-O2 is based on -O1
- adds: -barrier -constmerge -domtree -globaldce -gvn -inline -loop-vectorize -preverify -slp-vectorizer -targetlibinfo -verify
- removes: -always-inline
-O3 is based on -O2
- adds: -argpromotion
-Os is identical to -O2
-Oz is based on -O2
- removes: -barrier -loop-vectorize -slp-vectorizer
----------
With version 3.2 the passes are as follows (parsed output of the command above):
-O0: -targetlibinfo -preverify -domtree -verify
-O1 is based on -O0
- adds: -sroa -early-cse -lower-expect -no-aa -tbaa -basicaa -globalopt -ipsccp -deadargelim -instcombine -simplifycfg -basiccg -prune-eh -always-inline -functionattrs -simplify-libcalls -lazy-value-info -jump-threading -correlated-propagation -tailcallelim -reassociate -loops -loop-simplify -lcssa -loop-rotate -licm -loop-unswitch -scalar-evolution -indvars -loop-idiom -loop-deletion -loop-unroll -memdep -memcpyopt -sccp -dse -adce -strip-dead-prototypes
-O2 is based on -O1
- adds: -inline -globaldce -constmerge
- removes: -always-inline
-O3 is based on -O2
- adds: -argpromotion
-Os is identical to -O2
-Oz is identical to -Os
-------------
Edit [March 2014]: removed duplicates from lists.
Edit [April 2014]: added documentation link + options for 3.4.
Edit [September 2014]: added options for 3.5.
Edit [December 2015]: added options for 3.7 and a mention of the existing answer for 3.6.
Edit [May 2016]: added options for 3.8, for both opt and clang, and a mention of the existing answer for clang (versus opt).
Edit [November 2018]: added options for 6.0.
Is optimisation level -O3 dangerous in g++?
In the early days of gcc (2.8 etc.), in the times of egcs and Red Hat's gcc 2.96, -O3 was sometimes quite buggy. But this is over a decade ago, and -O3 is not much different from the other optimization levels in terms of bugginess.
It does, however, tend to reveal cases where people rely on undefined behavior, since it leans more heavily on the rules, and especially the corner cases, of the language(s).
As a personal note, I have been running production software in the financial sector for many years now with -O3 and have not yet encountered a bug that would not also have been there had I used -O2.
By popular demand, here is an addition:
-O3 and especially additional flags like -funroll-loops (not enabled by -O3) can sometimes lead to more machine code being generated. Under certain circumstances (e.g. on a cpu with exceptionally small L1 instruction cache) this can cause a slowdown due to all the code of e.g. some inner loop now not fitting anymore into L1I. Generally gcc tries quite hard to not to generate so much code, but since it usually optimizes the generic case, this can happen. Options especially prone to this (like loop unrolling) are normally not included in -O3 and are marked accordingly in the manpage. As such it is generally a good idea to use -O3 for generating fast code, and only fall back to -O2 or -Os (which tries to optimize for code size) when appropriate (e.g. when a profiler indicates L1I misses).
If you want to take optimization into the extreme, you can tweak in gcc via --param the costs associated with certain optimizations. Additionally note that gcc now has the ability to put attributes at functions that control optimization settings just for these functions, so when you find you have a problem with -O3 in one function (or want to try out special flags for just that function), you don't need to compile the whole file or even whole project with O2.
On the other hand, it seems that care must be taken when using -Ofast, whose documentation states:
-Ofast enables all -O3 optimizations.
It also enables optimizations that are not valid for all standard
compliant programs.
which makes me conclude that -O3 is intended to be fully standards compliant.
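One quick way to confirm that distinction: -Ofast (unlike -O3) implies -ffast-math, which GCC advertises through the __FAST_MATH__ predefined macro (a sketch; assumes gcc is installed):

```shell
# At -O3, standard-compliant FP is kept: __FAST_MATH__ stays undefined.
echo | gcc -O3 -dM -E - | grep __FAST_MATH__ || echo "-O3: strict FP math"

# At -Ofast, -ffast-math is in effect, so the macro is defined.
echo | gcc -Ofast -dM -E - | grep __FAST_MATH__
```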
Which GCC optimization flags affect binary size the most?
Most of the extra code-size for an un-optimized build comes from the fact that the default -O0 also means a debug build: nothing is kept in registers across statements, for consistent debugging even if you use a GDB j command to jump to a different source line in the same function. -O0 means a huge amount of store/reload vs. even the lightest level of optimization, which is especially disastrous for code-size on a non-CISC ISA that can't use memory source operands. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to GCC equally.
Especially for modern C++, a debug build is disastrous because simple template wrapper functions that normally inline and optimize away to nothing in simple cases (or maybe one instruction) instead compile to actual function calls that have to set up args and run a call instruction. E.g. for a std::vector, the operator[] member function can normally inline to a single ldr instruction, assuming the compiler has the .data() pointer in a register. But without inlining, every call-site takes multiple instructions (footnote 1).
Options that affect code-size in the actual .text section the most: alignment of branch-targets in general, or just loops, costs some code-size. Other than that:
- -ftree-vectorize - make SIMD versions of loops, also necessitating scalar cleanup if the compiler can't prove that the iteration count will be a multiple of the vector width (or that pointed-to arrays are non-overlapping if you don't use restrict; that may also need a scalar fallback). Enabled at -O3 in GCC11 and earlier; enabled at -O2 in GCC12 and later, like clang.
- -funroll-loops / -funroll-all-loops - not enabled by default even at -O3 in modern GCC. Enabled with profile-guided optimization (-fprofile-use), when it has profiling data from a -fprofile-generate build to know which loops are actually hot and worth spending code-size on (and which are cold and thus should be optimized for size, so you get fewer I-cache misses when they do run, and less eviction of other code). PGO also influences vectorization decisions.
  Related to loop unrolling are heuristics (tuning knobs) that control loop peeling (fully unrolling) and how much to unroll. The normal way to set these is with -march=native, implying -mtune=whatever as well. -mtune=znver3 may favour big unroll factors (at least clang does), compared to -mtune=sandybridge or -mtune=haswell. But there are GCC options to manually adjust individual things, as discussed in comments on gcc: strange asm generated for simple loop and in How to ask GCC to completely unroll this loop (i.e., peel this loop)?
  There are options to override the weights and thresholds for other decision heuristics like inlining, too, but it's very rare you'd want to fine-tune that much unless you're working on refining the defaults, or finding good defaults for a new CPU.
- -Os - optimize for size and speed, trying not to sacrifice too much speed. A good tradeoff if your code has a lot of I-cache misses; otherwise -O3 is normally faster, or at least that's the design goal for GCC. It can be worth trying different options to see if -O2 or -Os makes your code faster than -O3 across some CPUs you care about; sometimes missed optimizations or quirks of certain microarchitectures make a difference, as in Why does GCC generate 15-20% faster code if I optimize for size instead of speed?, which has actual benchmarks from GCC 4.6 to 4.8 (current at the time) for a specific small loop in a test program, on quite a few different x86 and ARM CPUs, with and without -march=native to actually tune for them. There's zero reason to expect that to be representative of other code, though, so you need to test yourself for your own codebase. (And for any given loop, small code changes could make a different compile option better on any given CPU.)
  And obviously -Os is very useful if you need your static code-size smaller to fit in some size limit.
- -Oz - optimize for size only, even at a large cost in speed. GCC only very recently added this to current trunk, so expect it in GCC12 or 13. Presumably what I wrote below about clang's implementation of -Oz being quite aggressive also applies to GCC, but I haven't yet tested it.
Clang has similar options, including -Os. It also has a clang -Oz option to optimize only for size, without caring about speed. It's very aggressive, e.g. on x86 using code-golf tricks like push 1; pop rax (3 bytes total) instead of mov eax, 1 (5 bytes).
GCC's -Os
unfortunately chooses to use div
instead of a multiplicative inverse for division by a constant, costing lots of speed but not saving much if any size. (https://godbolt.org/z/x9h4vx1YG for x86-64). For ARM, GCC -Os
still uses an inverse if you don't use a -mcpu=
that implies udiv
is even available, otherwise it uses udiv
: https://godbolt.org/z/f4sa9Wqcj .
Clang's -Os
still uses a multiplicative inverse with umull
, only using udiv
with -Oz
. (or a call to __aeabi_uidiv
helper function without any -mcpu
option). So in that respect, clang -Os
makes a better tradeoff than GCC, still spending a little bit of code-size to avoid slow integer division.
Footnote 1: inlining or not for std::vector
#include <vector>
int foo(std::vector<int> &v) {
return v[0] + v[1];
}
Godbolt with gcc
with the default -O0
vs. -Os
for -mcpu=cortex-m7
just to randomly pick something. IDK if it's normal to use dynamic containers like std::vector
on an actual microcontroller; probably not.
# -Os (same as -Og for this case, actually, omitting the frame pointer for this leaf function)
foo(std::vector<int, std::allocator<int> >&):
ldr r3, [r0] @ load the _M_start member of the reference arg
ldrd r0, r3, [r3] @ load a pair of words (v[0..1]) from there into r0 and r3
add r0, r0, r3 @ add them into the return-value register
bx lr
vs. a debug build (with name-demangling enabled for the asm)
# GCC -O0 -mcpu=cortex-m7 -mthumb
foo(std::vector<int, std::allocator<int> >&):
push {r4, r7, lr} @ non-leaf function requires saving LR (the return address) as well as some call-preserved registers
sub sp, sp, #12
add r7, sp, #0 @ Use r7 as a frame pointer. -O0 defaults to -fno-omit-frame-pointer
str r0, [r7, #4] @ spill the incoming register arg to the stack
movs r1, #0 @ 2nd arg for operator[]
ldr r0, [r7, #4] @ reload the pointer to the control block as the first arg
bl std::vector<int, std::allocator<int> >::operator[](unsigned int)
mov r3, r0 @ useless copy, but hey we told GCC not to spend any time optimizing.
ldr r4, [r3] @ deref the reference (pointer) it returned, into a call-preserved register that will survive across the next call
movs r1, #1 @ arg for the v[1] operator[]
ldr r0, [r7, #4]
bl std::vector<int, std::allocator<int> >::operator[](unsigned int)
mov r3, r0
ldr r3, [r3] @ deref the returned reference
add r3, r3, r4 @ v[1] + v[0]
mov r0, r3 @ and copy into the return value reg because GCC didn't bother to add into it directly
adds r7, r7, #12 @ tear down the stack frame
mov sp, r7
pop {r4, r7, pc} @ and return by popping saved-LR into PC
@ and there's an actual implementation of the operator[] function
@ it's 15 instructions long.
@ But only one instance of this is needed for each type your program uses (vector<int>, vector<char*>, vector<my_foo>, etc.)
@ so it doesn't add up as much as each call-site
std::vector<int, std::allocator<int> >::operator[](unsigned int):
push {r7}
sub sp, sp, #12
...
As you can see, un-optimized GCC cares more about fast compile times than about even the simplest things, like avoiding useless mov reg,reg instructions within the code for evaluating one expression.
Footnote 2: metadata
If you count a whole ELF executable with metadata, not just the .text + .rodata + .data you'd need to burn to flash, then of course -g debug info is very significant for the size of the file, but it's basically irrelevant because it's not mixed in with the parts that are needed while running; it just sits there on disk.
Symbol names and debug info can be stripped with gcc -s
or strip
.
Stack-unwind info is an interesting tradeoff between code-size and metadata. -fno-omit-frame-pointer
wastes extra instructions and a register as a frame pointer, leading to larger machine-code size, but smaller .eh_frame
stack unwind metadata. (strip
does not consider that "debug" info by default, even for C programs not C++ where exception-handling might need it in non-debugging contexts.)
How to remove "noise" from GCC/clang assembly output? mentions how to get the compiler to omit some of that: -fno-asynchronous-unwind-tables
omits .cfi
directives in the asm output, and thus the metadata that goes into the .eh_frame
section. Also -fno-exceptions -fno-rtti
with C++ can reduce metadata. (Run-Time Type Information for reflection takes space.)
Linker options that control alignment of sections / ELF segments can also take extra space, relevant for tiny executables but is basically a constant amount of space, not scaling with the size of the program. See also Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?
'Project Name' was compiled with optimization - stepping may behave oddly; variables may not be available
It's been a long time, but I finally solved the issue. There is a third optimization flag, LTO or Link Time Optimization, and surprisingly no one has mentioned it here; for some reason I didn't pay attention to it either. It's right there above the Optimization Level setting, as you can see in many screenshots posted here.
So to summarize, there are 3 different optimization flags you want to turn off for debugging:
- LLVM Link Time Optimization (-flto)
- LLVM Optimization Level (-O)
- Swift Compiler Optimization Level
More information about LTO:
http://llvm.org/docs/LinkTimeOptimization.html
Compile with different optimization levels for different parts of the code
In terms of correctness, there shouldn't be any problem. It would be a compiler bug if an optimization level changed the ABI.
It's 100% normal to make test/debug builds at -O0
or -Og
and link against optimized libraries (including but not limited to system libraries like libc).
You only need -O3 -flto -march=native
etc. etc. when testing / optimizing performance.