Effects of Removing All Symbol Table and Relocation Information from an Executable

Effects of removing all symbol table and relocation information from an executable?

It seems pretty clear that removing relocation information would interfere with ASLR.

However, I've taken a look at man strip on a couple of my systems, and none of them suggest that strip does (or indeed can?) remove relocation information. It is mainly about removing debugging symbols.

What is the difference between gcc -s and a strip command?

gcc being a compiler/linker, its -s option is something done while linking. It's also not configurable - it has a set of information which it removes, no more no less.

strip is something which can be run on an object file which is already compiled. It also has a variety of command-line options which you can use to configure which information will be removed. For example, -g strips only the debug information which gcc -g adds.

Note that strip is not a bash command, though you may be running it from a bash shell. It is a command totally separate from bash, part of the GNU binary utilities suite.

How to remove unused C/C++ symbols with GCC and ld?

For GCC, this is accomplished in two stages:

First compile the data but tell the compiler to separate the code into separate sections within the translation unit. This will be done for functions, classes, and external variables by using the following two compiler flags:

-fdata-sections -ffunction-sections

Link the translation units together using the linker optimization flag (this causes the linker to discard unreferenced sections):

-Wl,--gc-sections

So if you had one file called test.cpp that had two functions declared in it, but one of them was unused, you could omit the unused one with the following command to gcc(g++):

gcc -Os -fdata-sections -ffunction-sections test.cpp -o test -Wl,--gc-sections

(Note that -Os is an additional compiler flag that tells GCC to optimize for size)

How to reduce the size of the executable?

As with any other library there is a fixed cost and a per-call cost. The fixed cost for the {fmt} library is indeed around 100-150k without debug info (it depends on the compiler flags). In your example you are comparing this fixed cost of linking with the library and the reason why iostreams appears to be smaller is because it is included in the standard library itself which is linked dynamically and not counted to the binary size of the executable.

Note that a large part of this size comes from floating-point formatting functionality which doesn't even exist in iostreams (shortest round-trip representation).

If you want to compare per-call binary size which is more important for real-world code with large number of formatting function calls, you can look at object files or generated assembly. For example:

#include <fmt/core.h>

int main() {
fmt::print("Oh hi!");
}

generates (https://godbolt.org/z/qWTKEMqoG)

.LC0:
.string "Oh hi!"
main:
sub rsp, 24
pxor xmm0, xmm0
xor edx, edx
mov edi, OFFSET FLAT:.LC0
mov rcx, rsp
mov esi, 6
movaps XMMWORD PTR [rsp], xmm0
call fmt::v8::vprint(fmt::v8::basic_string_view<char>, fmt::v8::basic_format_args<fmt::v8::basic_format_context<fmt::v8::appender, char> >)
xor eax, eax
add rsp, 24
ret

while

#include <iostream>

int main() {
std::cout << "Oh hi!";
}

generates (https://godbolt.org/z/frarWvzhP)

.LC0:
.string "Oh hi!"
main:
sub rsp, 8
mov edx, 6
mov esi, OFFSET FLAT:.LC0
mov edi, OFFSET FLAT:_ZSt4cout
call std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
xor eax, eax
add rsp, 8
ret
_GLOBAL__sub_I_main:
sub rsp, 8
mov edi, OFFSET FLAT:_ZStL8__ioinit
call std::ios_base::Init::Init() [complete object constructor]
mov edx, OFFSET FLAT:__dso_handle
mov esi, OFFSET FLAT:_ZStL8__ioinit
mov edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev
add rsp, 8
jmp __cxa_atexit

Other than static initialization for cout there is not much difference because there is virtually no formatting here, so it's just one function call in both cases. Once you add formatting you'll quickly see the benefits of {fmt}, see e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0645r10.html#BinaryCode.

does the dynamic linker modify the reference after executable is copied to memory?

Fairly close. The dynamic linker will modify something in the data segment, not specifically the .data section - segments are a coarser-grained thing corresponding to how the file is mapped into memory rather than the original semantic breakdown. The actual section is usually called .got or .got.plt but may vary by platform. The modification is not "relocating it to the instruction address" but resolving a relocation reference to the function name to get the address it was loaded at, and filling that address in.

Why use the Global Offset Table for symbols defined in the shared library itself?

The Global Offset Table serves two purposes. One is to allow the dynamic linker "interpose" a different definition of the variable from the executable or other shared object. The second is to allow position independent code to be generated for references to variables on certain processor architectures.

ELF dynamic linking treats the entire process, the executable and all of the shared objects (dynamic libraries), as sharing one single global namespace. If multiple components (executable or shared objects) define the same global symbol then the dynamic linker normally chooses one definition of that symbol and all references to that symbol in all components refer to that one definition. (However, the ELF dynamic symbol resolution is complex and for various reasons different components can end up using different definitions of the the same global symbol.)

To implement this, when building a shared library the compiler will access global variables indirectly through the GOT. For each variable an entry in the GOT will be created containing a pointer to the variable. As your example code shows, the compiler will then use this entry to obtain the address of variable instead of trying to access it directly. When the shared object is loaded into a process the dynamic linker will determine whether any of the global variables have been superseded by variable definitions in another component. If so those global variables will have their GOT entries updated to point at the superseding variable.

By using the "hidden" or "protected" ELF visibility attributes it's possible to prevent global defined symbol from being superseded by a definition in another component, and thus removing the need to use the GOT on certain architectures. For example:

extern int global_visible;
extern int global_hidden __attribute__((visibility("hidden")));
static volatile int local; // volatile, so it's not optimized away

int
foo() {
return global_visible + global_hidden + local;
}

when compiled with -O3 -fPIC with the x86_64 port of GCC generates:

foo():
mov rcx, QWORD PTR global_visible@GOTPCREL[rip]
mov edx, DWORD PTR local[rip]
mov eax, DWORD PTR global_hidden[rip]
add eax, DWORD PTR [rcx]
add eax, edx
ret

As you can see, only global_visible uses the GOT, global_hidden and local don't use it. The "protected" visibility works similarly, it prevents the definition from being superseded but makes it still visible to the dynamic linker so it can be accessed by other components. The "hidden" visibility hides the symbol completely from the dynamic linker.

The necessity of making code relocatable in order allow shared objects to be loaded a different addresses in different process means that statically allocated variables, whether they have global or local scope, can't be accessed with directly with a single instruction on most architectures. The only exception I know of is the 64-bit x86 architecture, as you see above. It supports memory operands that are both PC-relative and have large 32-bit displacements that can reach any variable defined in the same component.

On all the other architectures I'm familiar with accessing variables in position dependent manner requires multiple instructions. How exactly varies greatly by architecture, but it often involves using the GOT. For example, if you compile the example C code above with x86_64 port of GCC using the -m32 -O3 -fPIC options you get:

foo():
call __x86.get_pc_thunk.dx
add edx, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_
push ebx
mov ebx, DWORD PTR global_visible@GOT[edx]
mov ecx, DWORD PTR local@GOTOFF[edx]
mov eax, DWORD PTR global_hidden@GOTOFF[edx]
add eax, DWORD PTR [ebx]
pop ebx
add eax, ecx
ret
__x86.get_pc_thunk.dx:
mov edx, DWORD PTR [esp]
ret

The GOT is used for all three variable accesses, but if you look closely global_hidden and local are handled differently than global_visible. With the later, a pointer to the variable is accessed through the GOT, with former two variables they're accessed directly through the GOT. This a fairly common trick among architectures where the GOT is used for all position independent variable references.

The 32-bit x86 architecture is exceptional in one way here, since it has large 32-bit displacements and a 32-bit address space. This means that anywhere in memory can be accessed through the GOT base, not just the GOT itself. Most other architectures only support much smaller displacements, which makes the maximum distance something can be from the GOT base much smaller. Other architectures that use this trick will only put small (local/hidden/protected) variables in the GOT itself, large variables are stored outside the GOT and the GOT will contain a pointer to the variable just like with normal visibility global variables.

Investigating the size of an extremely small C program

du reports the disk space used by a file whereas ls reports the actual size of a file. Typically the size reported by du is significantly larger for small files.

You can significantly reduce the size of the binary by changing compile and linking options and stripping out unnecessary sections.

$ cat test.c
void _start() {
asm("movl $1,%eax;"
"xorl %ebx,%ebx;"
"int $0x80");
}

$ gcc -s -nostdlib test.c -o test
$ ./test
$ ls -l test
-rwxrwxr-x 1 fpm fpm 8840 Dec 9 04:09 test

$ readelf -W --section-headers test
There are 7 section headers, starting at offset 0x20c8:

Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .note.gnu.build-id NOTE 0000000000400190 000190 000024 00 A 0 0 4
[ 2] .text PROGBITS 0000000000401000 001000 000010 00 AX 0 0 1
[ 3] .eh_frame_hdr PROGBITS 0000000000402000 002000 000014 00 A 0 0 4
[ 4] .eh_frame PROGBITS 0000000000402018 002018 000038 00 A 0 0 8
[ 5] .comment PROGBITS 0000000000000000 002050 00002e 01 MS 0 0 1
[ 6] .shstrtab STRTAB 0000000000000000 00207e 000045 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)
$

$ gcc -s -nostdlib -Wl,--nmagic test.c -o test
$ ls -l test
-rwxrwxr-x 1 fpm fpm 984 Dec 9 16:55 test
$ strip -R .comment -R .note.gnu.build-id test
$ strip -R .eh_frame_hdr -R .eh_frame test
$ ls -l test
-rwxrwxr-x 1 fpm fpm 520 Dec 9 17:03 test
$

Note that clang can produce a significantly smaller binary than gcc by default in this particular instance. However, after compiling with clang and stripping unnecessary sections, the final size of the binary is 736 bytes, which is bigger than the 520 bytes possible with gcc -s -nostdlib -Wl,--nmagic test.c -o test.

$ clang -static -nostdlib -flto -fuse-ld=lld -o test test.c
$ ls -l test
-rwxrwxr-x 1 fpm fpm 1344 Dec 9 04:15 test
$

$ readelf -W --section-headers test
There are 9 section headers, starting at offset 0x300:

Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .note.gnu.build-id NOTE 0000000000200190 000190 000018 00 A 0 0 4
[ 2] .eh_frame_hdr PROGBITS 00000000002001a8 0001a8 000014 00 A 0 0 4
[ 3] .eh_frame PROGBITS 00000000002001c0 0001c0 00003c 00 A 0 0 8
[ 4] .text PROGBITS 0000000000201200 000200 00000f 00 AX 0 0 16
[ 5] .comment PROGBITS 0000000000000000 00020f 000040 01 MS 0 0 1
[ 6] .symtab SYMTAB 0000000000000000 000250 000048 18 8 2 8
[ 7] .shstrtab STRTAB 0000000000000000 000298 000055 00 0 0 1
[ 8] .strtab STRTAB 0000000000000000 0002ed 000012 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)
$

$ strip -R .eh_frame_hdr -R .eh_frame test
$ strip -R .comment -R .note.gnu.build-id test
strip: test: warning: empty loadable segment detected at vaddr=0x200000, is this intentional?
$ ls -l test
-rwxrwxr-x 1 fpm fpm 736 Dec 9 04:19 test
$ readelf -W --section-headers test
There are 3 section headers, starting at offset 0x220:

Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000201200 000200 00000f 00 AX 0 0 16
[ 2] .shstrtab STRTAB 0000000000000000 00020f 000011 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)
$

.text is your code, .shstrtab is the Section Header String table. Each ElfHeader structure contains an e_shstrndx member which is an index into the .shstrtab table. If you use this index, you can find the name of that section.



Related Topics



Leave a reply



Submit