How to Decrease the Size of Generated Binaries

How to decrease the size of generated binaries?

Apart from the obvious (-Os -s), aligning functions to the smallest possible value that will not crash (I don't know ARM alignment requirements) might squeeze out a few bytes per function.

-Os should already disable aligning functions, but this might still default to a value like 4 or 8. If aligning e.g. to 1 is possible with ARM, that might save some bytes.

-ffast-math (or the less abrasive -fno-math-errno) stops math functions from setting errno and avoids some checks, which reduces code size. If, like most people, you don't read errno anyway, that's an option.
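A minimal sketch of the kind of code where -fno-math-errno can matter. The function name hypotenuse is made up for illustration, and the code generation described in the comments is what GCC typically does on x86-64 with glibc; other targets may behave differently.

```cpp
#include <cmath>

// With errno handling enabled, GCC typically emits the hardware square-root
// instruction plus a fallback call to the library sqrt() for negative inputs,
// so that errno can be set to EDOM. With -fno-math-errno the fallback call
// and the compare-and-branch guarding it can be dropped.
double hypotenuse(double a, double b) {
    return std::sqrt(a * a + b * b);
}
```

Comparing the output of g++ -Os -S with and without -fno-math-errno on a file like this is a quick way to check whether the flag helps on your target.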

Properly using __restrict (or restrict) and const removes redundant loads, making code both faster and smaller (and more correct). Properly marking pure functions as such eliminates redundant function calls.
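A small sketch of both ideas, assuming GCC or Clang (__restrict and __attribute__((pure)) are compiler extensions in C++). All function names here are hypothetical.

```cpp
#include <cstddef>

// Declared pure: the result depends only on the arguments and on readable
// memory, and the function has no side effects. The compiler is then allowed
// to merge repeated calls that have the same arguments.
__attribute__((pure)) int table_lookup(const int* table, std::size_t i) {
    return table[i];
}

// __restrict promises that dst, src and scale never alias each other, so
// *scale may be loaded once instead of being reloaded after every store
// to dst.
void scale_into(int* __restrict dst, const int* __restrict src,
                const int* __restrict scale, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * *scale;
    }
}

// Two identical calls with no store in between: a compiler that honors the
// pure attribute may emit a single call and double the result.
int twice(const int* table, std::size_t i) {
    return table_lookup(table, i) + table_lookup(table, i);
}
```

Note that __restrict is a promise, not a check: calling scale_into with overlapping buffers is undefined behavior.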

Enabling LTO may help, and if that is not available, compiling all source files into a binary in one go (gcc foo.c bar.c baz.c -o program instead of compiling foo.c, bar.c, and baz.c to object files first and then linking) will have a similar effect. It makes everything visible to the optimizer at one time, possibly allowing it to work better.

-fdelete-null-pointer-checks may be an option (note that this is normally enabled with any "O", but not on embedded targets).

Putting static globals (you hopefully don't have that many, but still) into a struct can eliminate a lot of the overhead of initializing them. I learned that when writing my first OpenGL loader. Having all the function pointers in a struct and initializing the struct with = {} generates one call to memset, whereas initializing the pointers the "normal way" generates a hundred kilobytes of code just to set each one to zero individually.
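A minimal sketch of the struct approach. The type and member names are invented for illustration and are not taken from any real OpenGL loader; how the compiler lowers the initialization (a memset call, a few stores, or plain zero-filled storage) depends on the compiler, the flags and the storage class.

```cpp
// Generic function-pointer type standing in for the real GL signatures.
using gl_proc = void (*)();

// All loader entry points live in one aggregate (hypothetical names).
struct GlFunctions {
    gl_proc clear;
    gl_proc draw_arrays;
    gl_proc bind_buffer;
    // ... a real loader would have hundreds more entries
};

// One value-initialization covers every member; all pointers start as null.
static GlFunctions gl = {};

// Resetting the whole table is a single aggregate assignment, which the
// compiler can implement as one block clear instead of one store per pointer.
void reset_loader() {
    gl = GlFunctions{};
}
```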

Avoid static local variables with non-trivial constructors like the devil (POD types are no problem). Unless you compile with -fno-threadsafe-statics, gcc initializes such static locals in a thread-safe way, and that thread-safe initialization links in a lot of extra code (even if you don't use threads at all).
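To make the difference concrete, here is a sketch with two hypothetical functions. The description of the guard calls in the comments reflects what GCC typically emits; the exact result depends on the compiler version, language standard and flags.

```cpp
#include <string>

// Static local with a non-trivial constructor: it is built on the first call,
// and GCC typically wraps that in __cxa_guard_acquire/__cxa_guard_release
// (unless -fno-threadsafe-statics is given) and registers the destructor
// to run at exit.
const std::string& greeting_heavy() {
    static const std::string s = "hello";
    return s;
}

// Constant-initialized POD data: no guard variable, no runtime constructor,
// no exit-time destructor. The bytes simply sit in read-only data.
const char* greeting_light() {
    static const char s[] = "hello";
    return s;
}
```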

Using something like libowfat instead of the normal crt can greatly reduce your binary size.

How to reduce the size of the executable?

As with any other library, there is a fixed cost and a per-call cost. The fixed cost for the {fmt} library is indeed around 100-150k without debug info (it depends on the compiler flags). Your example compares only this fixed cost of linking with the library. iostreams appears to be smaller because it is part of the standard library, which is linked dynamically and therefore not counted in the binary size of the executable.

Note that a large part of this size comes from floating-point formatting functionality which doesn't even exist in iostreams (shortest round-trip representation).

If you want to compare per-call binary size, which is more important for real-world code with a large number of formatting function calls, you can look at object files or generated assembly. For example:

#include <fmt/core.h>

int main() {
    fmt::print("Oh hi!");
}

generates (https://godbolt.org/z/qWTKEMqoG)

.LC0:
        .string "Oh hi!"
main:
        sub     rsp, 24
        pxor    xmm0, xmm0
        xor     edx, edx
        mov     edi, OFFSET FLAT:.LC0
        mov     rcx, rsp
        mov     esi, 6
        movaps  XMMWORD PTR [rsp], xmm0
        call    fmt::v8::vprint(fmt::v8::basic_string_view<char>, fmt::v8::basic_format_args<fmt::v8::basic_format_context<fmt::v8::appender, char> >)
        xor     eax, eax
        add     rsp, 24
        ret

while

#include <iostream>

int main() {
    std::cout << "Oh hi!";
}

generates (https://godbolt.org/z/frarWvzhP)

.LC0:
        .string "Oh hi!"
main:
        sub     rsp, 8
        mov     edx, 6
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:_ZSt4cout
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        xor     eax, eax
        add     rsp, 8
        ret
_GLOBAL__sub_I_main:
        sub     rsp, 8
        mov     edi, OFFSET FLAT:_ZStL8__ioinit
        call    std::ios_base::Init::Init() [complete object constructor]
        mov     edx, OFFSET FLAT:__dso_handle
        mov     esi, OFFSET FLAT:_ZStL8__ioinit
        mov     edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev
        add     rsp, 8
        jmp     __cxa_atexit

Other than static initialization for cout there is not much difference because there is virtually no formatting here, so it's just one function call in both cases. Once you add formatting you'll quickly see the benefits of {fmt}, see e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0645r10.html#BinaryCode.

How to reduce compiled file size?

Note: This answer is outdated. Please refer to the other, higher-voted answers. I would like to delete this post, but accepted answers can't be deleted.

Is it a problem that the file is larger? I don't know Go but I would assume that it statically links some runtime lib which is not the case for the C program. But probably that is nothing to worry about as soon as your program gets larger.

As described here, statically linking the Go runtime is the default. That page also tells you how to set up for dynamic linking.

Process for reducing the size of an executable

General list:

  • Make sure that you have the compiler and linker debug options disabled
  • Compile and link with all size options turned on (-Os in gcc)
  • Run strip on the executable
  • Generate a map file and check your function sizes. You can either get your linker to generate your map file (-M when using ld), or you can use objdump on the final executable (note that this will only work on an unstripped executable!). This won't actually fix the problem, but it will show you the worst offenders.
  • Use nm to investigate the symbols that are called from each of your object files. This should help in finding who's calling functions that you don't want called.

The original question included a sub-question about including only relevant functions. By default, gcc will include all functions within every object file that is used. To put that another way, if you have an object file that contains 10 functions, all 10 functions are included in your executable even if only one is actually called.

The standard libraries (eg. libc) will split functions into many separate object files, which are then archived. The executable is then linked against the archive.
By splitting into many object files the linker is able to include only the functions that are actually called. (this assumes that you're statically linking)

There is no reason why you can't do the same trick. Of course, you could argue that if the functions aren't called then you can probably remove them yourself.

If you're statically linking against other libraries you can run the tools listed above over them too to make sure that they're following similar rules.

g++ compiler flag to minimize binary size

There are lots of techniques to reduce binary size in addition to what us2012 and others mentioned in the comments, summing them up with some points of my own:

  • Use -Os to make gcc/g++ optimize for size.
  • Use -ffunction-sections -fdata-sections to separate each function or data into distinct sections within the translation unit. Combine it with the linker option -Wl,--gc-sections to get rid of any unreferenced sections.
  • Run strip with at least the following options: -s -R .comment -R .gnu.version. It can be combined with --strip-unneeded to remove all symbols that are not necessary for relocation processing.
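As a rough illustration of what the section flags do, consider a translation unit like the sketch below. The function names are hypothetical, and the build line in the comment simply combines the flags listed above with a made-up file name.

```cpp
// Assumed build line (demo.cpp is a made-up name; main lives in another file):
//   g++ -Os -ffunction-sections -fdata-sections -Wl,--gc-sections demo.cpp main.cpp -o demo

// Never referenced anywhere. With -ffunction-sections it gets its own
// .text.<mangled-name> section, which --gc-sections should then discard
// at link time.
int unused_helper(int x) {
    return x * 1000 + 7;
}

// Referenced from main, so its section is kept.
int used_helper(int x) {
    return x + 1;
}
```

Without -ffunction-sections both functions normally share one .text section of the object file, so the linker has to keep or drop them together. Running nm on the resulting executable before stripping shows whether unused_helper survived.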

How to reduce the size of executable produced by MinGW g++ compiler?

Flags to use:

  • -s like you've been doing to strip symbols
  • -lstdc++_s to specify dynamically linking against the libstdc++.dll
  • -Os to optimize the binary for size.

By default, MinGW statically links to libstdc++.a on Windows.

Note that the -lstdc++_s flag is only available in MinGW with GCC > 4.4, I believe.

How to reduce the TensorFlow Lite binary size with only the operators needed

Tensorflow Lite

If you are using Tensorflow Lite, the only solution I have found is to work at the level of the Interpreter and customize the Kernel Library (OpResolver). I don't think there is an automatic way of doing this, and the only available example (here the header) is not so easy to understand IMHO. I think that more improvements on this topic will be included in the next releases. Also, I'm not sure this will reduce the size of the final library. In the API notes this approach is considered equivalent to selective registration, which is explained in the next part of the answer for Tensorflow Mobile.

Tensorflow Mobile

The answer to the question "How can I enable only the ops used by my model" is in the Tensorflow Mobile Documentation (in the subsection Binary Size).

The usual size for Tensorflow Mobile seems to be 12MB, but it is possible to reduce it by including only the ops the model requires. Obviously this requires building Tensorflow Lite as a Framework using Bazel.

You can create a header of required ops (ops_to_register.h) using the tool print_selective_registration_header.py, which is available here. The generated header should be placed in the root of the Tensorflow source directory.
You are now ready to compile the library, passing the SELECTIVE_REGISTRATION definition to the compiler (building with Bazel, you should add the option: --copt="-DSELECTIVE_REGISTRATION").

I think this procedure will give the library with minimal ops inside. Some other compiler optimization flags may help you with the size (sometimes penalizing performance).

Compile options

I don't actually know how you are compiling your code (static lib or dynamic lib), what your needs are in terms of performance, or what the default options in the Tensorflow bazelfile are, but you may try:

  • to reduce the optimization to -O1 or -Os (sometimes helps with the binary size, and I think the default for Tensorflow is -O2 for the framework and -O3 for the single kernels, I don't know for the lite version though).
  • use the flags -fdata-sections and --gc-sections: quoting gcc documentation: "[-fdata-sections] Together with a linker garbage collection (linker --gc-sections option) these options may lead to smaller statically-linked executables (after stripping)." (It seems that at least --gc-sections is used in linker options for Raspberry Pi)
  • -fvisibility-inlines-hidden should impact on performance of inline functions, but decreases the size of the export table of the shared object. This option may break the library. Some explanations can be read here.
  • Even more dangerous is -fvisibility=hidden. Look at it here.
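The reason -fvisibility=hidden can break a library is that every symbol not explicitly marked as exported disappears from the shared object's export table. A minimal sketch of the usual annotation pattern, assuming GCC or Clang; the macro and function names are hypothetical:

```cpp
// Hypothetical export macro; the attribute itself is a GCC/Clang extension.
#define MYLIB_API __attribute__((visibility("default")))

// Explicitly exported: stays in the dynamic symbol table even when the
// library is built with -fvisibility=hidden.
MYLIB_API int mylib_version() {
    return 3;
}

// Not annotated: with -fvisibility=hidden this symbol is not exported, so
// code outside the shared object can no longer link against it.
int internal_helper(int x) {
    return x * 2;
}
```

Inside the library both functions remain callable as usual; only users of the shared object are affected, which is why a missing annotation tends to show up as an undefined-symbol error at link or load time.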

What modifications will lead to size reduction of binary size in C++ code

The following methods are commonly used to reduce the size of programs:

  1. Use your compiler-specific techniques to reduce the size.

  2. Compile using gcc -S program.c to get the Assembler file. You can now perform assembler based space optimizations.

  3. Reduce the number of global variables in C.

  4. Instead of complex algorithms which give you only very small improvements in execution time, use simple algorithms. For example, use bubble sort instead of merge sort if the number of elements in the list is not very large.

  5. Remove simple functions which are used just once or twice.

  6. Eliminate dead code. Large projects very often contain some.

  7. Be careful about the library functions you include in your program.


