Investigating the size of an extremely small C program
du reports the disk space used by a file, whereas ls reports the actual size of the file. Typically the size reported by du is significantly larger for small files, because du counts whole allocated filesystem blocks.
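The difference is easy to observe directly. A quick sketch, assuming a typical filesystem with 4 KiB blocks (the exact du figure depends on your filesystem; tiny.txt is my own example file):

```shell
# Create a 5-byte file
printf 'hello' > tiny.txt

# ls -l reports the file's length in bytes: 5
ls -l tiny.txt

# du reports allocated disk space; with 4 KiB blocks this is
# typically 4096 bytes (one whole block), not 5
du -B1 tiny.txt

# stat shows both views at once: length in bytes and blocks allocated
stat -c 'size=%s bytes, blocks=%b' tiny.txt
```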
You can significantly reduce the size of the binary by changing compile and linking options and stripping out unnecessary sections.
$ cat test.c
void _start() {
    asm("movl $1,%eax;"
        "xorl %ebx,%ebx;"
        "int $0x80");
}
$ gcc -s -nostdlib test.c -o test
$ ./test
$ ls -l test
-rwxrwxr-x 1 fpm fpm 8840 Dec 9 04:09 test
$ readelf -W --section-headers test
There are 7 section headers, starting at offset 0x20c8:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .note.gnu.build-id NOTE 0000000000400190 000190 000024 00 A 0 0 4
[ 2] .text PROGBITS 0000000000401000 001000 000010 00 AX 0 0 1
[ 3] .eh_frame_hdr PROGBITS 0000000000402000 002000 000014 00 A 0 0 4
[ 4] .eh_frame PROGBITS 0000000000402018 002018 000038 00 A 0 0 8
[ 5] .comment PROGBITS 0000000000000000 002050 00002e 01 MS 0 0 1
[ 6] .shstrtab STRTAB 0000000000000000 00207e 000045 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)
$
$ gcc -s -nostdlib -Wl,--nmagic test.c -o test
$ ls -l test
-rwxrwxr-x 1 fpm fpm 984 Dec 9 16:55 test
$ strip -R .comment -R .note.gnu.build-id test
$ strip -R .eh_frame_hdr -R .eh_frame test
$ ls -l test
-rwxrwxr-x 1 fpm fpm 520 Dec 9 17:03 test
$
Note that clang can produce a significantly smaller binary than gcc by default in this particular instance. However, after compiling with clang and stripping unnecessary sections, the final binary is 736 bytes, which is larger than the 520 bytes achievable with gcc -s -nostdlib -Wl,--nmagic test.c -o test.
$ clang -static -nostdlib -flto -fuse-ld=lld -o test test.c
$ ls -l test
-rwxrwxr-x 1 fpm fpm 1344 Dec 9 04:15 test
$
$ readelf -W --section-headers test
There are 9 section headers, starting at offset 0x300:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .note.gnu.build-id NOTE 0000000000200190 000190 000018 00 A 0 0 4
[ 2] .eh_frame_hdr PROGBITS 00000000002001a8 0001a8 000014 00 A 0 0 4
[ 3] .eh_frame PROGBITS 00000000002001c0 0001c0 00003c 00 A 0 0 8
[ 4] .text PROGBITS 0000000000201200 000200 00000f 00 AX 0 0 16
[ 5] .comment PROGBITS 0000000000000000 00020f 000040 01 MS 0 0 1
[ 6] .symtab SYMTAB 0000000000000000 000250 000048 18 8 2 8
[ 7] .shstrtab STRTAB 0000000000000000 000298 000055 00 0 0 1
[ 8] .strtab STRTAB 0000000000000000 0002ed 000012 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)
$
$ strip -R .eh_frame_hdr -R .eh_frame test
$ strip -R .comment -R .note.gnu.build-id test
strip: test: warning: empty loadable segment detected at vaddr=0x200000, is this intentional?
$ ls -l test
-rwxrwxr-x 1 fpm fpm 736 Dec 9 04:19 test
$ readelf -W --section-headers test
There are 3 section headers, starting at offset 0x220:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000201200 000200 00000f 00 AX 0 0 16
[ 2] .shstrtab STRTAB 0000000000000000 00020f 000011 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)
$
.text is your code, and .shstrtab is the section header string table. Each ELF header contains an e_shstrndx member, which is the index of the section header for the .shstrtab table. Using that index, you can find the names of all the sections.
Why are my results different when following along with the tiny asm example?
-static is not the default, even with -nostdlib, when GCC is configured to make PIEs by default. Use gcc -m32 -static -nostdlib to get the historical behaviour. (-static implies -no-pie.) See What's the difference between "statically linked" and "not a dynamic executable" from Linux ldd? for more.
Also, you may need to disable alignment of other sections with gcc -Wl,--nmagic or a custom linker script, and maybe disable the extra metadata sections that GCC adds. See Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?
You probably don't have an .eh_frame section if you're not linking any compiler-generated (from C) .o files. But if you were, you could disable it with gcc -fno-asynchronous-unwind-tables. (See also How to remove "noise" from GCC/clang assembly output? for general tips, aimed more at reading the compiler's asm text output than at executable size.)
See also GCC + LD + NDISASM = huge amount of assembler instructions. (ndisasm doesn't handle metadata at all, only flat binaries, so it "disassembles" metadata; the answer there includes info on how to avoid other sections.)
GCC's -Wl,--build-id=none will avoid including a .note.gnu.build-id section in the executable.
$ nasm -felf32 foo.asm
$ gcc -m32 -static -nostdlib -Wl,--build-id=none -Wl,--nmagic foo.o
$ ll a.out
-rwxr-xr-x 1 peter peter 488 Dec 26 18:47 a.out
$ strip a.out
$ ll a.out
-rwxr-xr-x 1 peter peter 248 Dec 26 18:47 a.out
(Tested on x86-64 Arch GNU/Linux, NASM 2.15.05, gcc 10.2, ld from GNU Binutils 2.35.1.)
You can check the sections in your executable with readelf -a a.out (or use a more specific option to get only part of readelf's large output). e.g. before stripping,
$ readelf -S unstripped_a.out
...
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 08048060 000060 00000c 00 AX 0 0 16
[ 2] .symtab SYMTAB 00000000 00006c 000070 10 3 3 4
[ 3] .strtab STRTAB 00000000 0000dc 000021 00 0 0 1
[ 4] .shstrtab STRTAB 00000000 0000fd 000021 00 0 0 1
And BTW, you definitely do not want to use nasm -felf64 on a file that uses BITS 32, unless you're writing a kernel or something that switches from 64-bit long mode to 32-bit compat mode. Putting 32-bit machine code in a 64-bit object file is not helpful. Only ever use BITS when you want raw binary mode to work (as later in that tiny-ELF tutorial). When you're making a .o to link, it only makes it possible to shoot yourself in the foot; don't do it. (Although it's not harmful if you do properly use nasm -felf32 to match your BITS 32 directive.)
gcc: passing -nostartfiles to ld via gcc for minimal binary size
-nostartfiles is not an ld option. It parses as ld -n -o startfiles.
I tried your commands, and they don't create a file called go; they create an executable called startfiles.
$ cat > go.s
(paste the source, then press Ctrl-D)
$ as -o go.o go.s
$ ld -o go -s -nostartfiles go.o
$ ll go
ls: cannot access 'go': No such file or directory
$ ll -clrt
-rw-r--r-- 1 peter peter 193 May 13 11:33 go.s
-rw-r--r-- 1 peter peter 704 May 13 11:33 go.o
-rwxr-xr-x 1 peter peter 344 May 13 11:33 startfiles
Your go must have been left over from your ld -s -nostartfiles -o go go.o, where -o go was the last instance of -o on the command line, not overridden by -o startfiles.
The option that makes your binary small is ld -n:
-n, --nmagic
    Turn off page alignment of sections, and disable linking against shared libraries. If the output format supports Unix-style magic numbers, mark the output as "NMAGIC".
Related: Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?
As a GCC option, -nostartfiles tells the gcc front-end not to link the crt*.o files. If you're running ld manually, you just omit mentioning them. There's no need to tell ld what you're not linking; the ld command itself doesn't link anything it's not explicitly told to. Linking the CRT files and libc / libgcc are gcc defaults, not ld defaults.
$ gcc -s -nostdlib -static -Wl,--nmagic,--build-id=none go.s
$ ll a.out
-rwxr-xr-x 1 peter peter 344 May 13 12:35 a.out
You want -nostdlib to omit libraries as well as CRT start files. -nostartfiles is only a subset of what -nostdlib does.
(Although when statically linking, ld doesn't pull in any code from libc.a or libgcc.a because your file doesn't reference any external symbols. So you actually still get the same 344-byte file from -nostartfiles as from -nostdlib. But -nostdlib replicates your manual ld command more exactly, not passing any extra files.)
You need -static to avoid dynamic linking, on a GCC where -pie is the default. (This also implies -no-pie; -static-pie won't be enabled by default.) --nmagic fails with "error: PHDR segment not covered by LOAD segment" if you let GCC try to dynamically link a PIE executable (even with no shared libraries).
Which GCC optimization flags affect binary size the most?
Most of the extra code-size for an un-optimized build comes from the fact that the default -O0 also means a debug build: nothing is kept in registers across statements, for consistent debugging even if you use a GDB j command to jump to a different source line in the same function. -O0 means a huge amount of store/reload versus even the lightest level of optimization, which is especially disastrous for code-size on a non-CISC ISA that can't use memory source operands. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to GCC equally.
Especially for modern C++, a debug build is disastrous because simple template wrapper functions that normally inline and optimize away to nothing (or maybe one instruction) in simple cases instead compile to actual function calls that have to set up args and run a call instruction. e.g. for a std::vector, the operator[] member function can normally inline to a single ldr instruction, assuming the compiler has the .data() pointer in a register. But without inlining, every call-site takes multiple instructions.[1]
Options that affect code-size in the actual .text section[2] the most: alignment of branch targets in general, or just loops, costs some code-size. Other than that:

-ftree-vectorize - make SIMD versions of loops, also necessitating scalar cleanup if the compiler can't prove that the iteration count will be a multiple of the vector width. (Or that pointed-to arrays are non-overlapping if you don't use restrict; that may also need a scalar fallback.) Enabled at -O3 in GCC11 and earlier; enabled at -O2 in GCC12 and later, like clang.

-funroll-loops / -funroll-all-loops - not enabled by default even at -O3 in modern GCC. Enabled with profile-guided optimization (-fprofile-use), when it has profiling data from a -fprofile-generate build to know which loops are actually hot and worth spending code-size on. (And which are cold and thus should be optimized for size so you get fewer I-cache misses when they do run, and less eviction of other code.) PGO also influences vectorization decisions.

Related to loop unrolling are heuristics (tuning knobs) that control loop peeling (fully unrolling) and how much to unroll. The normal way to set these is with -march=native, implying -mtune=whatever as well. -mtune=znver3 may favour big unroll factors (at least clang does), compared to -mtune=sandybridge or -mtune=haswell. But there are GCC options to manually adjust individual things, as discussed in comments on gcc: strange asm generated for simple loop and in How to ask GCC to completely unroll this loop (i.e., peel this loop)?

There are options to override the weights and thresholds for other decision heuristics like inlining, too, but it's very rare you'd want to fine-tune that much unless you're working on refining the defaults, or finding good defaults for a new CPU.

-Os - optimize for size and speed, trying not to sacrifice too much speed. A good tradeoff if your code has a lot of I-cache misses; otherwise -O3 is normally faster, or at least that's the design goal for GCC. It can be worth trying different options to see if -O2 or -Os makes your code faster than -O3 across some CPUs you care about; sometimes missed optimizations or quirks of certain microarchitectures make a difference, as in Why does GCC generate 15-20% faster code if I optimize for size instead of speed?, which has actual benchmarks from GCC 4.6 to 4.8 (current at the time) for a specific small loop in a test program, on quite a few different x86 and ARM CPUs, with and without -march=native to actually tune for them. There's zero reason to expect that to be representative of other code, though, so you need to test yourself for your own codebase. (And for any given loop, small code changes could make a different compile option better on any given CPU.) And obviously -Os is very useful if you need your static code-size smaller to fit in some size limit.

-Oz - optimize for size only, even at a large cost in speed. GCC only very recently added this to current trunk, so expect it in GCC 12 or 13. Presumably what I wrote below about clang's implementation of -Oz being quite aggressive also applies to GCC, but I haven't yet tested it.
Clang has similar options, including -Os. It also has a clang -Oz option to optimize only for size, without caring about speed. It's very aggressive, e.g. on x86 using code-golf tricks like push 1; pop rax (3 bytes total) instead of mov eax, 1 (5 bytes).
GCC's -Os unfortunately chooses to use div instead of a multiplicative inverse for division by a constant, costing lots of speed but not saving much, if any, size. (https://godbolt.org/z/x9h4vx1YG for x86-64). For ARM, GCC -Os still uses an inverse if you don't use a -mcpu= that implies udiv is even available; otherwise it uses udiv: https://godbolt.org/z/f4sa9Wqcj .
Clang's -Os still uses a multiplicative inverse with umull, only using udiv with -Oz (or a call to the __aeabi_uidiv helper function without any -mcpu option). So in that respect, clang -Os makes a better tradeoff than GCC, still spending a little bit of code-size to avoid slow integer division.
Footnote 1: inlining or not for std::vector
#include <vector>
int foo(std::vector<int> &v) {
    return v[0] + v[1];
}
Godbolt with gcc, with the default -O0 vs. -Os, for -mcpu=cortex-m7 just to randomly pick something. IDK if it's normal to use dynamic containers like std::vector on an actual microcontroller; probably not.
# -Os (same as -Og for this case, actually, omitting the frame pointer for this leaf function)
foo(std::vector<int, std::allocator<int> >&):
ldr r3, [r0] @ load the _M_start member of the reference arg
ldrd r0, r3, [r3] @ load a pair of words (v[0..1]) from there into r0 and r3
add r0, r0, r3 @ add them into the return-value register
bx lr
vs. a debug build (with name-demangling enabled for the asm)
# GCC -O0 -mcpu=cortex-m7 -mthumb
foo(std::vector<int, std::allocator<int> >&):
push {r4, r7, lr} @ non-leaf function requires saving LR (the return address) as well as some call-preserved registers
sub sp, sp, #12
add r7, sp, #0 @ Use r7 as a frame pointer. -O0 defaults to -fno-omit-frame-pointer
str r0, [r7, #4] @ spill the incoming register arg to the stack
movs r1, #0 @ 2nd arg for operator[]
ldr r0, [r7, #4] @ reload the pointer to the control block as the first arg
bl std::vector<int, std::allocator<int> >::operator[](unsigned int)
mov r3, r0 @ useless copy, but hey we told GCC not to spend any time optimizing.
ldr r4, [r3] @ deref the reference (pointer) it returned, into a call-preserved register that will survive across the next call
movs r1, #1 @ arg for the v[1] operator[]
ldr r0, [r7, #4]
bl std::vector<int, std::allocator<int> >::operator[](unsigned int)
mov r3, r0
ldr r3, [r3] @ deref the returned reference
add r3, r3, r4 @ v[1] + v[0]
mov r0, r3 @ and copy into the return value reg because GCC didn't bother to add into it directly
adds r7, r7, #12 @ tear down the stack frame
mov sp, r7
pop {r4, r7, pc} @ and return by popping saved-LR into PC
@ and there's an actual implementation of the operator[] function
@ it's 15 instructions long.
@ But only one instance of this is needed for each type your program uses (vector<int>, vector<char*>, vector<my_foo>, etc.)
@ so it doesn't add up as much as each call-site
std::vector<int, std::allocator<int> >::operator[](unsigned int):
push {r7}
sub sp, sp, #12
...
As you can see, un-optimized GCC cares more about fast compile times than about even the simplest things, like avoiding useless mov reg,reg instructions even within the code for evaluating one expression.
Footnote 2: metadata
If you count a whole ELF executable with metadata, not just the .text + .rodata + .data you'd need to burn to flash, then of course -g debug info is very significant for the size of the file, but it's basically irrelevant because it's not mixed in with the parts that are needed while running; it just sits there on disk.
Symbol names and debug info can be stripped with gcc -s or strip.
Stack-unwind info is an interesting tradeoff between code-size and metadata. -fno-omit-frame-pointer wastes extra instructions and a register as a frame pointer, leading to larger machine-code size, but smaller .eh_frame stack-unwind metadata. (strip does not consider that "debug" info by default, even for C programs, not just C++ where exception handling might need it in non-debugging contexts.)
How to remove "noise" from GCC/clang assembly output? mentions how to get the compiler to omit some of that: -fno-asynchronous-unwind-tables omits .cfi directives in the asm output, and thus the metadata that goes into the .eh_frame section. Also -fno-exceptions -fno-rtti with C++ can reduce metadata. (Run-Time Type Information for reflection takes space.)
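On x86-64 GNU/Linux, the effect of -fno-asynchronous-unwind-tables is visible directly in the object file. A sketch (ut.c is my own trivial example; the result is target-dependent, since some ABIs require unwind tables):

```shell
cat > ut.c <<'EOF'
int add(int a, int b) { return a + b; }
EOF

gcc -c ut.c -o with_unwind.o
gcc -c -fno-asynchronous-unwind-tables ut.c -o without_unwind.o

readelf -WS with_unwind.o    | grep -c '\.eh_frame'   # normally 1 or more
readelf -WS without_unwind.o | grep -c '\.eh_frame'   # 0 on x86-64: section omitted
```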
Linker options that control alignment of sections / ELF segments can also cost extra space. That's relevant for tiny executables, but it's basically a constant amount of space, not scaling with the size of the program. See also Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?