Why Are There Global Offset Tables and Procedure Linkage Tables in Statically Linked Executables

Why are there global offset tables and procedure linkage tables in statically linked executables?

I don't understand why a statically linked executable needs a PLT and GOT.

It doesn't.

I compiled a hello world program on my Ubuntu x86_64 machine, and when I dump the section headers with readelf -S, it shows PLT and GOT sections.

This is an accident of implementation. The sections come from crt1.o, and there isn't a separate crt1s.o for fully-static linking, so you end up with .plt and .got entries from there.

You can strip these sections, and the binary will still work:

objcopy -R.got -R.plt a.out a.out2

Note: do not strip .rela.plt, as that section is still needed to implement IFUNCs.

Why does a fully static Rust ELF binary have a Global Offset Table (GOT) section?

TL;DR summary: the GOT is really a vestigial build artifact, which I was able to get rid of via simple machine code manipulations.

Breakdown

If we look at

$ objdump -dj .text hello

and search for GLOBAL, we see only four distinct types of references to the GOT (constants differ):

  40037c:       ff 15 26 7a 23 00       call   QWORD PTR [rip+0x237a26]        # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
  425903:       ff 25 5f 26 21 00       jmp    QWORD PTR [rip+0x21265f]        # 637f68 <_GLOBAL_OFFSET_TABLE_+0x410>
  41d8b5:       48 3b 1d b4 a5 21 00    cmp    rbx,QWORD PTR [rip+0x21a5b4]    # 637e70 <_GLOBAL_OFFSET_TABLE_+0x318>
  40b259:       48 83 3d 7f cb 22 00    cmp    QWORD PTR [rip+0x22cb7f],0x0    # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
  40b260:       00

All of these instructions only read from the GOT, which means that the GOT is not modified at runtime. This in turn means that we can statically resolve the addresses that the GOT refers to! Let's consider the reference types one by one:

call QWORD PTR [rip+0x237a26] simply says "go to address [rip+0x237a26], take 8 bytes from there, interpret them as a function address and call the function". We can simply replace this instruction with a direct call:

  40037c:       e8 cf 3f 00 00          call   404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>
  400381:       90                      nop

Notice the nop at the end: we need to replace all 6 bytes of the machine code that constitute the first instruction, but the instruction we replace it with is only 5 bytes, so we need to pad it. Fundamentally, as we are patching a compiled binary, we can replace an instruction with another one only if the replacement is not longer.

jmp QWORD PTR [rip+0x21265f] is the same as the previous one, but instead of calling an address it jumps to it. This turns into:

  425903:       e9 b8 f7 ff ff          jmp    4250c0 <_ZN68_$LT$core..fmt..builders..PadAdapter$u20$as$u20$core..fmt..Write$GT$9write_str17hc384e51187942069E>
  425908:       90                      nop
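The displacement in both replacements follows the same rule: a direct call or jmp encodes the distance from the end of the new 5-byte instruction to the target. A minimal sketch of that arithmetic, using the addresses from the listings above (the helper name encode_direct_branch is my own, not something from the script below):

```python
import struct

CALL_REL32 = 0xE8  # opcode of call rel32
JMP_REL32 = 0xE9   # opcode of jmp rel32

def encode_direct_branch(opcode, insn_addr, target, orig_len):
    """Build a rel32 call/jmp and pad it with nops up to orig_len bytes."""
    # rel32 is measured from the end of the 5-byte direct instruction.
    rel = target - (insn_addr + 5)
    if not -2**31 <= rel < 2**31:
        raise ValueError("target is out of rel32 range")
    patch = bytes([opcode]) + struct.pack("<i", rel)
    if len(patch) > orig_len:
        raise ValueError("replacement is longer than the original instruction")
    return patch + b"\x90" * (orig_len - len(patch))

# The call at 0x40037c and the jmp at 0x425903 from the objdump output above.
print(encode_direct_branch(CALL_REL32, 0x40037C, 0x404350, 6).hex(" "))
print(encode_direct_branch(JMP_REL32, 0x425903, 0x4250C0, 6).hex(" "))
```

The two printed byte strings should match the patched listings (e8 cf 3f 00 00 90 and e9 b8 f7 ff ff 90).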

cmp rbx,QWORD PTR [rip+0x21a5b4] - this takes 8 bytes from [rip+0x21a5b4] and compares them to the contents of the rbx register. This one is tricky, since cmp cannot compare register contents to a 64-bit immediate value. We could use another register for that, but we don't know which of the registers are used around this instruction. A careful solution would be something like

push rax
mov rax,0x0000006363c0
cmp rbx,rax
pop rax

But that would be way beyond our limit of 7 bytes. The real solution stems from an observation that the GOT contains only addresses; our address space is (roughly) contained in range [0x400000; 0x650000], which can be seen in the program headers:

$ readelf -l hello
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000035b50 0x0000000000035b50  R E    0x200000
  LOAD           0x0000000000036380 0x0000000000636380 0x0000000000636380
                 0x0000000000001dd0 0x0000000000003918  RW     0x200000
...

It follows that we can (mostly) get away with only comparing 4 bytes of a GOT entry instead of 8. So the substitution is:

  41d8b5:       81 fb c0 63 63 00       cmp    ebx,0x6363c0
  41d8bb:       90                      nop
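The "(mostly)" matters: cmp ebx,imm32 only looks at the low 32 bits of rbx, so the shortcut is sound only while every address involved fits in 32 bits. A small sketch that refuses to build the patch otherwise (encode_cmp_ebx_imm32 is a hypothetical helper name; the range check is my addition, not part of the original approach):

```python
import struct

def encode_cmp_ebx_imm32(got_value, orig_len=7):
    """Encode cmp ebx,imm32 padded with nops to the original 7 bytes."""
    # Assumption: the GOT entry holds an address below 4 GiB, as in the
    # [0x400000; 0x650000] range shown in the program headers.
    if not 0 <= got_value < 2**32:
        raise ValueError("GOT value does not fit in a 32-bit immediate")
    patch = b"\x81\xfb" + struct.pack("<I", got_value)  # 81 /7 id with ebx
    return patch + b"\x90" * (orig_len - len(patch))

print(encode_cmp_ebx_imm32(0x6363C0).hex(" "))
```

Note that this still does not catch the case where rbx holds a value whose upper half is non-zero at runtime; that is the part we are getting away with.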

The last one consists of two lines of objdump output, since 8 bytes do not fit in one line:

  40b259:       48 83 3d 7f cb 22 00    cmp    QWORD PTR [rip+0x22cb7f],0x0    # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
  40b260:       00

It just compares 8 bytes of the GOT to a constant (in this case, 0x0). In fact, we can do the comparison statically; if the operands compare equal, we replace the comparison with

  40b259:       48 39 c0                cmp    rax,rax
  40b25c:       90                      nop
  40b25d:       90                      nop
  40b25e:       90                      nop
  40b25f:       90                      nop
  40b260:       90                      nop

Obviously, a register is always equal to itself. A lot of padding needed here!

If the left operand is greater than the right one, we replace the comparison with

  40b259:       48 83 fc 00             cmp    rsp,0x0 
  40b25d:       90                      nop
  40b25e:       90                      nop
  40b25f:       90                      nop
  40b260:       90                      nop

In practice, rsp is always greater than zero.

If the left operand is smaller than the right one, things get a bit more complicated, but since we have a whole lot of bytes (8!) we can manage:

  40b259:  50                      push   rax
  40b25a:  31 c0                   xor    eax,eax
  40b25c:  83 f8 01                cmp    eax,0x1
  40b25f:  58                      pop    rax
  40b260:  90                      nop

Notice that the second and the third instructions use eax instead of rax, since cmp and xor involving eax take one less byte than with rax.
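The three cases can be summarized as a single choice made at patch time. A rough sketch, assuming an unsigned comparison and a non-negative 8-bit immediate (static_cmp_patch is a name I made up for illustration):

```python
def static_cmp_patch(got_value, imm8, orig_len=8):
    """Pick an 8-byte replacement that sets flags like the original cmp."""
    if got_value == imm8:
        body = bytes.fromhex("4839c0")          # cmp rax,rax -> equal
    elif got_value > imm8:
        body = bytes.fromhex("4883fc00")        # cmp rsp,0x0 -> greater
    else:
        # push rax; xor eax,eax; cmp eax,0x1; pop rax -> smaller
        body = bytes.fromhex("5031c083f80158")
    # Every replacement must cover exactly the bytes of the original cmp.
    return body + b"\x90" * (orig_len - len(body))

for got_value, imm8 in [(0x0, 0x0), (0x404350, 0x0), (0x0, 0x1)]:
    print(static_cmp_patch(got_value, imm8).hex(" "))
```

All three results are exactly 8 bytes long, which is what keeps the following instruction intact.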

Testing

I have written a Python script to do all these substitutions automatically (it's a bit hacky and relies on parsing of objdump output though):

#!/usr/bin/env python3

import re
import sys
import argparse
import subprocess

def read_u64(binary):
    return sum(binary[i] * 256 ** i for i in range(8))

def distance_u32(start, end):
    assert abs(end - start) < 2 ** 31
    diff = end - start
    if diff < 0:
        return 2 ** 32 + diff
    else:
        return diff

def to_u32(x):
    assert 0 <= x < 2 ** 32
    return bytes((x // (256 ** i)) % 256 for i in range(4))

class GotInstruction:
    def __init__(self, lines, symbol_address, symbol_offset):
        self.address = int(lines[0].split(":")[0].strip(), 16)
        self.offset = symbol_offset + (self.address - symbol_address)
        self.got_offset = int(lines[0].split("(File Offset: ")[1].strip().strip(")"), 16)
        self.got_offset = self.got_offset % 0x200000  # No idea why the offset is actually wrong
        self.bytes = []
        for line in lines:
            self.bytes += [int(x, 16) for x in line.split("\t")[1].split()]

class TextDump:
    symbol_regex = re.compile(r"^([0-9a-f]{16}) <(.*)> \(File Offset: 0x([0-9a-f]*)\):")

    def __init__(self, binary_path):
        self.got_instructions = []
        objdump_output = subprocess.check_output(["objdump", "-Fdj", ".text", "-M", "intel",
                                                  binary_path])
        lines = objdump_output.decode("utf-8").split("\n")
        current_symbol_address = 0
        current_symbol_offset = 0
        for line_group in self.group_lines(lines):
            match = self.symbol_regex.match(line_group[0])
            if match is not None:
                current_symbol_address = int(match.group(1), 16)
                current_symbol_offset = int(match.group(3), 16)
            elif "_GLOBAL_OFFSET_TABLE_" in line_group[0]:
                instruction = GotInstruction(line_group, current_symbol_address,
                                             current_symbol_offset)
                self.got_instructions.append(instruction)

    @staticmethod
    def group_lines(lines):
        if not lines:
            return
        line_group = [lines[0]]
        for line in lines[1:]:
            if line.count("\t") == 1:  # this line continues the previous one
                line_group.append(line)
            else:
                yield line_group
                line_group = [line]
        yield line_group

    def __iter__(self):
        return iter(self.got_instructions)

def read_binary_file(path):
    try:
        with open(path, "rb") as f:
            return f.read()
    except (IOError, OSError) as exc:
        print(f"Failed to open {path}: {exc.strerror}")
        sys.exit(1)

def write_binary_file(path, content):
    try:
        with open(path, "wb") as f:
            f.write(content)
    except (IOError, OSError) as exc:
        print(f"Failed to write {path}: {exc.strerror}")
        sys.exit(1)

def patch_got_reference(instruction, binary_content):
    got_data = read_u64(binary_content[instruction.got_offset:])
    code = instruction.bytes
    if code[0] == 0xff:
        assert len(code) == 6
        relative_address = distance_u32(instruction.address, got_data)
        if code[1] == 0x15:  # call QWORD PTR [rip+...]
            patch = b"\xe8" + to_u32(relative_address - 5) + b"\x90"
        elif code[1] == 0x25:  # jmp QWORD PTR [rip+...]
            patch = b"\xe9" + to_u32(relative_address - 5) + b"\x90"
        else:
            raise ValueError(f"unknown machine code: {code}")
    elif code[:3] == [0x48, 0x83, 0x3d]:  # cmp QWORD PTR [rip+...],<BYTE>
        assert len(code) == 8
        if got_data == code[7]:
            patch = b"\x48\x39\xc0" + b"\x90" * 5  # cmp rax,rax
        elif got_data > code[7]:
            patch = b"\x48\x83\xfc\x00" + b"\x90" * 4  # cmp rsp,0x0
        else:
            patch = b"\x50\x31\xc0\x83\xf8\x01\x58\x90"  # push rax
                                                         # xor eax,eax
                                                         # cmp eax,0x1
                                                         # pop rax
    elif code[:3] == [0x48, 0x3b, 0x1d]:  # cmp rbx,QWORD PTR [rip+...]
        assert len(code) == 7
        patch = b"\x81\xfb" + to_u32(got_data) + b"\x90"  # cmp ebx,<DWORD>
    else:
        raise ValueError(f"unknown machine code: {code}")
    return dict(offset=instruction.offset, data=patch)

def make_got_patches(binary_path, binary_content):
    patches = []
    text_dump = TextDump(binary_path)
    for instruction in text_dump.got_instructions:
        patches.append(patch_got_reference(instruction, binary_content))
    return patches

def apply_patches(binary_content, patches):
    for patch in patches:
        offset = patch["offset"]
        data = patch["data"]
        binary_content = binary_content[:offset] + data + binary_content[offset + len(data):]
    return binary_content

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("binary_path", help="Path to ELF binary")
    parser.add_argument("-o", "--output", help="Output file path", required=True)
    args = parser.parse_args()

    binary_content = read_binary_file(args.binary_path)
    patches = make_got_patches(args.binary_path, binary_content)
    patched_content = apply_patches(binary_content, patches)
    write_binary_file(args.output, patched_content)

if __name__ == "__main__":
    main()
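The "% 0x200000" step in GotInstruction can probably be avoided by translating the GOT entry's virtual address into a file offset with the LOAD program headers shown earlier. A minimal sketch, assuming the segment table is copied by hand from the readelf -l output (vaddr_to_file_offset and LOAD_SEGMENTS are illustrative names, not part of the script):

```python
def vaddr_to_file_offset(vaddr, load_segments):
    """Map a virtual address to a file offset using PT_LOAD entries."""
    for file_offset, seg_vaddr, file_size in load_segments:
        # Only bytes that are actually present in the file can be read.
        if seg_vaddr <= vaddr < seg_vaddr + file_size:
            return vaddr - seg_vaddr + file_offset
    raise ValueError(f"address {vaddr:#x} is not backed by the file")

# (Offset, VirtAddr, FileSiz) of the two LOAD segments listed above.
LOAD_SEGMENTS = [
    (0x000000, 0x400000, 0x35B50),
    (0x036380, 0x636380, 0x01DD0),
]

# GOT entry referenced by the call at 0x40037c.
print(hex(vaddr_to_file_offset(0x637DA8, LOAD_SEGMENTS)))
```

For this binary, each segment's virtual address and file offset differ by a multiple of the 0x200000 alignment, so the virtual address modulo 0x200000 gives the same number. That would not hold for every layout, which is why the explicit lookup seems safer.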

Now we can get rid of the GOT for real:

$ cargo build --release --target x86_64-unknown-linux-musl
$ ./resolve_got.py target/x86_64-unknown-linux-musl/release/hello -o hello_no_got
$ objcopy -R.got hello_no_got
$ readelf -e hello_no_got | grep .got
$ ./hello_no_got
Hello, world!

I have also tested it on my ~3k LOC app, and it seems to work alright.

P.S. I am not an expert in assembly, so some of the above might be inaccurate.

Why use the Global Offset Table for symbols defined in the shared library itself?

The Global Offset Table serves two purposes. One is to allow the dynamic linker to "interpose" a different definition of the variable from the executable or another shared object. The second is to allow position independent code to be generated for references to variables on certain processor architectures.

ELF dynamic linking treats the entire process, the executable and all of the shared objects (dynamic libraries), as sharing one single global namespace. If multiple components (executable or shared objects) define the same global symbol, then the dynamic linker normally chooses one definition of that symbol, and all references to that symbol in all components refer to that one definition. (However, ELF dynamic symbol resolution is complex, and for various reasons different components can end up using different definitions of the same global symbol.)

To implement this, when building a shared library the compiler will access global variables indirectly through the GOT. For each variable, an entry in the GOT will be created containing a pointer to the variable. As your example code shows, the compiler will then use this entry to obtain the address of the variable instead of trying to access it directly. When the shared object is loaded into a process, the dynamic linker will determine whether any of the global variables have been superseded by variable definitions in another component. If so, those global variables will have their GOT entries updated to point at the superseding variable.
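As a very rough illustration of that resolution step, here is a toy model in Python. It is not how a real dynamic linker is implemented; the component names, symbol names, and addresses are all made up, and the lookup is reduced to "first exporting component in search order wins":

```python
def resolve(symbol, search_order):
    """Return (component name, address) of the first exported definition."""
    for component in search_order:
        if symbol in component["exports"]:
            return component["name"], component["exports"][symbol]
    raise KeyError(f"undefined symbol: {symbol}")

# Hypothetical process image: the executable is searched before the library.
executable = {"name": "app", "exports": {"counter": 0x601040}}
library = {
    "name": "libfoo.so",
    "exports": {"counter": 0x7F0000201030, "limit": 0x7F0000201034},
}
search_order = [executable, library]

# GOT of libfoo.so: one entry per global variable the library references.
libfoo_got = {sym: resolve(sym, search_order) for sym in ("counter", "limit")}

for sym, (owner, address) in libfoo_got.items():
    print(f"{sym}: {address:#x} (defined in {owner})")
```

In this model the library's own definition of counter is superseded by the executable's, so the library's GOT entry for counter points into the executable, while limit still resolves to the library itself.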

By using the "hidden" or "protected" ELF visibility attributes, it's possible to prevent a globally defined symbol from being superseded by a definition in another component, and thus remove the need to use the GOT on certain architectures. For example:

extern int global_visible;
extern int global_hidden __attribute__((visibility("hidden")));
static volatile int local;  // volatile, so it's not optimized away

int
foo() {
    return global_visible + global_hidden + local;
}

when compiled with -O3 -fPIC with the x86_64 port of GCC generates:

foo():
        mov     rcx, QWORD PTR global_visible@GOTPCREL[rip]
        mov     edx, DWORD PTR local[rip]
        mov     eax, DWORD PTR global_hidden[rip]
        add     eax, DWORD PTR [rcx]
        add     eax, edx
        ret

As you can see, only global_visible uses the GOT; global_hidden and local don't use it. The "protected" visibility works similarly: it prevents the definition from being superseded but keeps it visible to the dynamic linker, so it can still be accessed by other components. The "hidden" visibility hides the symbol completely from the dynamic linker.

The necessity of making code relocatable, in order to allow shared objects to be loaded at different addresses in different processes, means that statically allocated variables, whether they have global or local scope, can't be accessed directly with a single instruction on most architectures. The only exception I know of is the 64-bit x86 architecture, as you see above. It supports memory operands that are both PC-relative and have large 32-bit displacements that can reach any variable defined in the same component.

On all the other architectures I'm familiar with, accessing variables in a position independent manner requires multiple instructions. How exactly varies greatly by architecture, but it often involves using the GOT. For example, if you compile the example C code above with the x86_64 port of GCC using the -m32 -O3 -fPIC options, you get:

foo():
        call    __x86.get_pc_thunk.dx
        add     edx, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_
        push    ebx
        mov     ebx, DWORD PTR global_visible@GOT[edx]
        mov     ecx, DWORD PTR local@GOTOFF[edx]
        mov     eax, DWORD PTR global_hidden@GOTOFF[edx]
        add     eax, DWORD PTR [ebx]
        pop     ebx
        add     eax, ecx
        ret
__x86.get_pc_thunk.dx:
        mov     edx, DWORD PTR [esp]
        ret

The GOT is used for all three variable accesses, but if you look closely, global_hidden and local are handled differently from global_visible. With the latter, a pointer to the variable is loaded from the GOT; the former two are accessed directly, at a fixed offset from the GOT base. This is a fairly common trick among architectures where the GOT is used for all position independent variable references.

The 32-bit x86 architecture is exceptional in one way here, since it has large 32-bit displacements and a 32-bit address space. This means that anywhere in memory can be accessed through the GOT base, not just the GOT itself. Most other architectures only support much smaller displacements, which makes the maximum distance something can be from the GOT base much smaller. Other architectures that use this trick will only put small (local/hidden/protected) variables in the GOT itself; large variables are stored outside the GOT, and the GOT will contain a pointer to the variable, just like with normal visibility global variables.

Understanding GOT (Global Offset Table) and PLT?

(2) - that's exactly what gcc -fno-plt does: it uses call *puts@GOTPCREL(%rip), which references the normal GOT entry, not the part of the GOT that's updated by PLT stubs.

See x86_64: Is it possible to "in-line substitute" PLT/GOT references?

(1) "Each shared library has its own GOT" means as opposed to having one per process. It's not saying that there's only one GOT for the library in shared memory that every process using the library maps.

Remember that Unix-like OSes (like all modern mainstream OSes) use virtual memory to isolate processes from each other, so it normally goes without saying that every process has its own independent copy of read/write data.

Of course global variables like errno or environ aren't shared between processes using the same library; that would break things, so you can rule out that interpretation. (It's also not what you see dynamic linking doing if you strace /bin/ls.)

How are the entries for Global Symbols that are not functions initialized in the Global Offset Table?

This will eventually pass control on to the dynamic linker that will load the library, update the GOT entry and jump to function

This is only partially correct: the library will normally already be loaded, and the loader only resolves the symbol and updates the GOT entry to point to the symbol definition.

Now for other global symbols that are not functions (do not have PLT entries), how or when will they be initialized?

When the library (or executable) referencing the symbol is loaded, the loader resolves all the data symbols in it, before making it available.

Can it be done lazily?

No.
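The difference comes down to where there is a hook: a function call goes through a stub that can run the resolver first, while a plain data load has nothing to intercept. A toy Python model of the two behaviours (the ToyProcess class and its symbol names are invented for illustration and do not mirror any real loader's data structures):

```python
class ToyProcess:
    """Toy model: data slots are bound at load time, function slots lazily."""

    def __init__(self, definitions):
        self.definitions = definitions  # symbol name -> callable or value
        self.got = {}
        self.resolutions = 0            # how many symbol lookups happened

    def _resolve(self, name):
        self.resolutions += 1
        return self.definitions[name]

    def _make_stub(self, name):
        def stub(*args):
            # First call: resolve, overwrite the slot, then forward the call.
            target = self._resolve(name)
            self.got[name] = target
            return target(*args)
        return stub

    def load(self, func_names, data_names):
        for name in data_names:         # eager: resolved before any code runs
            self.got[name] = self._resolve(name)
        for name in func_names:         # lazy: slot starts out as a stub
            self.got[name] = self._make_stub(name)

    def call(self, name, *args):
        return self.got[name](*args)

process = ToyProcess({"puts": lambda text: len(text), "environ": ["A=1"]})
process.load(func_names=["puts"], data_names=["environ"])
print(process.resolutions)   # lookups done at load time
process.call("puts", "hi")
process.call("puts", "hi")
print(process.resolutions)   # lookups after two calls
```

In this model the data symbol is looked up once during load, and the function symbol is looked up once on its first call and never again.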
