Where Do Object File "Version References" Come From

Where do object file Version References come from?

Or from something else?

From something else.

When you build a shared library (say libfoo.so), you can (though don't have to) supply a linker version script giving certain symbols a version tag.

When you later link an executable or a shared library (say libbar.so) against libfoo.so, iff you use a versioned symbol, the version tag of that symbol is recorded in libbar.so (that is what you observed in your question).

This setup allows libfoo.so to change its symbols in an ABI-incompatible way, and still support old client programs that were linked against the old symbols.

For example, libc.so.6 on x86_64 has the following versions of memcpy:

0000000000091620 g   iD  .text  000000000000003d  GLIBC_2.14    memcpy
000000000008c420 g   iD  .text  0000000000000047 (GLIBC_2.2.5)  memcpy

Programs that were linked against glibc-2.13 or older will use the GLIBC_2.2.5 version; programs that were linked against glibc-2.14 or newer will use the GLIBC_2.14 version.

If you try to run a program linked against glibc-2.14 on a system with glibc-2.13, you will get an error at load time (missing symbol version).
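
The binding rule described above can be sketched as a toy lookup. This is only an illustration, not how glibc's dynamic loader is implemented: the struct, the table, and the function toy_lookup are hypothetical names, and the addresses are simply copied from the memcpy listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One exported (name, version) pair: a toy stand-in for a dynamic symbol entry. */
struct versioned_symbol {
    const char *name;
    const char *version;
    unsigned long address;   /* value column copied from the listing above */
};

/* The two memcpy entries from the listing (hypothetical table, for illustration). */
static const struct versioned_symbol toy_libc[] = {
    { "memcpy", "GLIBC_2.14",  0x91620UL },
    { "memcpy", "GLIBC_2.2.5", 0x8c420UL },
};

/* Return the address bound to name@version, or 0 when that version is absent. */
unsigned long toy_lookup(const char *name, const char *version)
{
    assert(name != NULL && version != NULL);
    for (size_t i = 0; i < sizeof toy_libc / sizeof toy_libc[0]; i++) {
        if (strcmp(toy_libc[i].name, name) == 0 &&
            strcmp(toy_libc[i].version, version) == 0)
            return toy_libc[i].address;
    }
    return 0;   /* a real loader would report a missing symbol version here */
}
```

A client that recorded GLIBC_2.2.5 gets the old entry, one that recorded GLIBC_2.14 gets the new one, and a request for a version the table does not contain returns 0, which models the missing-version failure.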

Before the introduction of symbol versioning, changing the ABI of an existing symbol required that you ship an entirely separate library. This is called external library versioning.

What does an object file contain?

Object files can contain a number of things, typically some or all of the following:

  • Symbol names
  • Compiled code
  • Constant data, e.g. strings
  • Imports - which symbols the compiled code references (gets fixed up by the linker)
  • Exports - which symbols the object file makes available to OTHER object files.

The linker turns a bunch of object files into an executable, by matching up all the imports and exports, and modifying the compiled code so the correct functions get called.
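
As a rough sketch of that matching step, the toy model below reduces an object file to two lists of names and counts the imports that nothing exports. The types and functions (toy_object, count_unresolved) are invented for this illustration; a real linker works on symbol tables and relocation records, not plain string lists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of an object file: just the names it exports and imports. */
struct toy_object {
    const char *const *exports;
    size_t n_exports;
    const char *const *imports;
    size_t n_imports;
};

/* Does any object in the set export this name? */
static int is_exported(const struct toy_object *objs, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < objs[i].n_exports; j++)
            if (strcmp(objs[i].exports[j], name) == 0)
                return 1;
    return 0;
}

/* Count imports that no object exports; a real linker would report
 * each of these as an undefined reference. */
size_t count_unresolved(const struct toy_object *objs, size_t n)
{
    size_t missing = 0;
    assert(objs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < objs[i].n_imports; j++)
            if (!is_exported(objs, n, objs[i].imports[j]))
                missing++;
    return missing;
}
```

Linking an object that imports baz on its own leaves one unresolved name; adding a second object that exports baz brings the count to zero.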

What's an object file in C?

An object file is the real output from the compilation phase. It's mostly machine code, but has info that allows a linker to see what symbols are in it as well as symbols it requires in order to work. (For reference, "symbols" are basically names of global objects, functions, etc.)

A linker takes all these object files and combines them to form one executable (assuming that it can, i.e., that there aren't any duplicate or undefined symbols). A lot of compilers will do this for you (read: they run the linker on their own) if you don't tell them to "just compile" using command-line options. (-c is a common "just compile; don't link" option.)

Exactly what is a symbol reference in an object file?

Consider the following source:

static int foo() { return 42; }
static int bar() { return foo() + 1; }

extern int baz();

int main()
{
    return foo() + bar() + baz();
}

After gcc -c foo.c, the output from objdump -d foo.o on x86_64 Linux is:

foo.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 2a 00 00 00 mov $0x2a,%eax
9: 5d pop %rbp
a: c3 retq

000000000000000b <bar>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: e8 e7 ff ff ff callq 0 <foo>
19: 83 c0 01 add $0x1,%eax
1c: 5d pop %rbp
1d: c3 retq

000000000000001e <main>:
1e: 55 push %rbp
1f: 48 89 e5 mov %rsp,%rbp
22: 53 push %rbx
23: 48 83 ec 08 sub $0x8,%rsp
27: b8 00 00 00 00 mov $0x0,%eax
2c: e8 cf ff ff ff callq 0 <foo>
31: 89 c3 mov %eax,%ebx
33: b8 00 00 00 00 mov $0x0,%eax
38: e8 ce ff ff ff callq b <bar>
3d: 01 c3 add %eax,%ebx
3f: b8 00 00 00 00 mov $0x0,%eax
44: e8 00 00 00 00 callq 49 <main+0x2b>
49: 01 d8 add %ebx,%eax
4b: 48 83 c4 08 add $0x8,%rsp
4f: 5b pop %rbx
50: 5d pop %rbp
51: c3 retq

There are a few things to note here:

  1. Notice how bar calls foo at address 0?
    How does objdump know that it's foo that's being called?
    And can it really be at address 0? (Most modern systems map the zero page of virtual memory with PROT_NONE, so no read or write access can happen there.)
  2. Notice how the call to baz from main is different from the calls to foo and bar? The compiler knows where foo and bar are relative to the call instruction itself, but it has no idea where baz will be.

So, given the above info, how can the linker turn this into something sensible? It can't: there is not enough info here.

In order for the linker to be able to turn the reference to baz (which we don't yet see) into a call to baz, it needs additional info. On ELF systems, that additional info is written into a special section, here .rela.text, which contains:

$ readelf -Wr foo.o

Relocation section '.rela.text' at offset 0x5d0 contains 1 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000045  0000000b00000002 R_X86_64_PC32          0000000000000000 baz - 4

That is the "reference" that the book talks about, but doesn't define. It tells the linker: if you can find a definition of baz (in some other object), take its address, subtract 4, subtract the final address of the bytes being patched, and store the result into bytes 0x45-0x48 of the .text section of foo.o. The -4 addend is there because the CALL displacement is relative to the next instruction after the CALL, which begins 4 bytes past the patched location.
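
A minimal sketch of what the linker does with such an entry, assuming the usual R_X86_64_PC32 computation (symbol address plus addend minus the address of the patched bytes). The function names and the example addresses are made up for illustration; only the offset 0x45, the addend -4, and the Info value come from the readelf output above.

```c
#include <assert.h>
#include <stdint.h>

/* Split the 64-bit Info field of an ELF64 relocation into its two halves. */
uint32_t reloc_symbol_index(uint64_t info) { return (uint32_t)(info >> 32); }
uint32_t reloc_type(uint64_t info) { return (uint32_t)(info & 0xffffffffu); }

/*
 * Apply a PC-relative 32-bit relocation: store S + A - P at `offset`,
 * where S is the symbol's final address, A the addend, and P the final
 * address of the patched bytes (section_addr + offset).
 */
void apply_pc32(uint8_t *section, uint64_t section_addr, uint64_t offset,
                uint64_t symbol_addr, int64_t addend)
{
    uint64_t place = section_addr + offset;
    int64_t value = (int64_t)symbol_addr + addend - (int64_t)place;
    assert(value >= INT32_MIN && value <= INT32_MAX);  /* must fit in 32 bits */
    uint32_t bits = (uint32_t)value;                   /* two's-complement pattern */
    for (int i = 0; i < 4; i++)                        /* x86_64 is little-endian */
        section[offset + i] = (uint8_t)(bits >> (8 * i));
}
```

For the Info value 0000000b00000002, the helpers give symbol index 0xb and type 2, which is R_X86_64_PC32. If .text were placed at the hypothetical address 0x400000 and baz at 0x400100, the stored displacement would be 0x400100 - 4 - 0x400045 = 0xb7.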

And if there is no such definition? The linker will produce an error:

$ gcc foo.o
foo.o: In function `main':
foo.c:(.text+0x45): undefined reference to `baz'
collect2: error: ld returned 1 exit status

Finally, getting to point 1 above: can foo really be at address 0?

No, but the CALL instruction at address 0x14 doesn't actually say CALL 0. It says "call the routine at the address of the next instruction after the call, minus 25". If that call instruction in the final binary ends up at address 0x400501, then the target of that call will be 0x4004ed, which is where foo will end up. The distance between foo and the CALL will not change when the linker relocates the .text section of foo.o to a different address (linker relaxation notwithstanding, but that's a complicated topic for another day).
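
The arithmetic can be checked with a small helper that decodes the five bytes of the CALL shown in the disassembly (e8 e7 ff ff ff). The function name call_target is made up for this sketch, and it handles only the 0xe8 near-call encoding.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the target of a 5-byte x86-64 near CALL (opcode 0xe8 followed by
 * a 32-bit little-endian displacement), given the instruction's address.
 */
uint64_t call_target(uint64_t insn_addr, const uint8_t insn[5])
{
    assert(insn[0] == 0xe8);
    uint32_t bits = (uint32_t)insn[1]
                  | (uint32_t)insn[2] << 8
                  | (uint32_t)insn[3] << 16
                  | (uint32_t)insn[4] << 24;
    uint64_t disp = bits;
    if (bits & 0x80000000u)              /* sign-extend the displacement */
        disp |= 0xffffffff00000000ULL;
    uint64_t next_insn = insn_addr + 5;  /* displacement is relative to the next instruction */
    return next_insn + disp;             /* wraps modulo 2^64, like the CPU does */
}
```

With those bytes, call_target(0x14, ...) gives 0 and call_target(0x400501, ...) gives 0x4004ed, matching the figures in the text.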

How to view symbols in object files?

Instead of nm, you can use the powerful objdump. See the man page for details. Try objdump -t myfile or objdump -T myfile. With the -C flag you can also demangle C++ names, like nm does.

Object files vs Library files and why?

Historically, an object file gets linked either completely or not at all into an executable (nowadays, there are exceptions like function level linking or whole program optimization becoming more popular), so if one function of an object file is used, the executable receives all of them.

To keep executables small and free of dead code, the standard library is split into many small object files (typically on the order of hundreds). Having hundreds of small files is undesirable for efficiency reasons: opening many files is slow, and every file has some slack (unused disk space at the end of the file). This is why object files get grouped into libraries, which are a bit like ZIP files with no compression. At link time, the whole library is read, and the linker includes in the output every object file from that library that resolves a symbol that was still unresolved when the linker started reading the library, plus any object files from that library that those object files need in turn. This likely means that the whole library has to be in memory at once to resolve dependencies recursively. As the amount of memory was quite limited, the linker only loads one library at a time, so a library mentioned later on the command line of the linker cannot use functions from a library mentioned earlier on the command line.
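
The ordering effect can be reproduced with a toy simulation of that one-library-at-a-time scan. Everything here is a simplifying assumption: each archive member defines exactly one symbol and needs at most one other, and the names toy_member, toy_archive and link_in_order are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 16

/* Toy archive member: defines one symbol and may need one other (NULL if none). */
struct toy_member { const char *defines; const char *needs; };
struct toy_archive { const struct toy_member *members; size_t n_members; };

/* Index of name in set, or n when absent. */
static size_t find(const char *const *set, size_t n, const char *name)
{
    size_t i = 0;
    while (i < n && strcmp(set[i], name) != 0)
        i++;
    return i;
}

/* Scan archives left to right, each one once; return how many symbols stay unresolved. */
size_t link_in_order(const char *wanted, const struct toy_archive *archives, size_t n_archives)
{
    const char *undefined[MAX_SYMS];
    const char *defined[MAX_SYMS];
    size_t n_undef = 0, n_def = 0;

    assert(wanted != NULL);
    undefined[n_undef++] = wanted;

    for (size_t a = 0; a < n_archives; a++) {
        int pulled;
        do {    /* rescan this archive while pulled members keep adding needs */
            pulled = 0;
            for (size_t m = 0; m < archives[a].n_members; m++) {
                const struct toy_member *mem = &archives[a].members[m];
                size_t u = find(undefined, n_undef, mem->defines);
                if (u == n_undef)
                    continue;                        /* nobody asked for it: leave it out */
                undefined[u] = undefined[--n_undef]; /* reference is now resolved */
                assert(n_def < MAX_SYMS);
                defined[n_def++] = mem->defines;
                if (mem->needs != NULL &&
                    find(defined, n_def, mem->needs) == n_def &&
                    find(undefined, n_undef, mem->needs) == n_undef) {
                    assert(n_undef < MAX_SYMS);
                    undefined[n_undef++] = mem->needs;
                }
                pulled = 1;
            }
        } while (pulled);
    }
    return n_undef;
}
```

With archive A holding a member that defines f and needs g, and archive B holding a member that defines g, scanning A then B leaves nothing unresolved, while scanning B then A leaves g unresolved because B was already passed when g became needed.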

To improve the performance (loading a whole library takes some time, especially from slow media like floppy disks), libraries often contain an index that tells the linker what object files provide which symbols. Indexes are created by tools like ranlib or the library management tool (Borland's tlib has a switch to generate the index). As soon as there is an index, libraries are definitely more efficient to link than single object files, even if all object files are in the disk cache and loading files from the disk cache is free.

You are completely right that I can replace .o or .a files while keeping the header files, and change what the functions do (or how they do it). This is used by the LGPL license, which requires the author of a program that uses an LGPL-licensed library to give the user the possibility to replace that library with a patched, improved or alternative implementation. Shipping the object files of your own application (possibly grouped as library files) is enough to give the user the required freedom; there is no need to ship the source code (as with the GPL).

If two sets of libraries (or object files) can be used successfully with the same header files, they are said to be ABI compatible, where ABI means Application Binary Interface. This is narrower than just having two sets of libraries (or object files) accompanied by their respective headers, and guaranteeing that you can use each library if you use the headers for this specific library. This would be called API compatibility, where API means Application Programming Interface. As an example of the difference, look at the following three header files:

File 1:

typedef struct {
    int a;
    int __undocumented_member;
    int b;
} magic_data;
magic_data* calculate(int);

File 2:

struct __tag_magic_data {
    int a;
    int __padding;
    int b;
};
typedef struct __tag_magic_data magic_data;
magic_data* calculate(const int);

File 3:

typedef struct {
    int a;
    int b;
    int c;
} magic_data;
magic_data* do_calculate(int, void*);
#define calculate(x) do_calculate(x, 0)

The first two files are not identical, but they provide interchangeable definitions that (as far as I can tell) do not violate the "one definition rule", so a library that ships File 1 as its header file can be used just as well with File 2 as the header file. On the other hand, File 3 provides a very similar interface to the programmer (which might be identical in all that the library author promises the user of the library), but code compiled with File 3 fails to link with a library designed to be used with File 1 or File 2, because that code references do_calculate, which such a library does not export (conversely, the library designed for File 3 would not export calculate, but only do_calculate). Also, the structure has a different member layout, so using File 1 or File 2 instead of File 3 will not access b correctly. The libraries providing File 1 and File 2 are ABI compatible, but all three libraries are API compatible (assuming that c and the more capable function do_calculate do not count towards that API).
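
The layout mismatch can be made concrete with offsetof. This is a sketch only: the struct names magic_data_v1 and magic_data_v3 and the helper read_b_with_v3_header are invented here, and the members are renamed without the leading underscores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Layout from File 1 (File 2 declares the same three int members in the same order). */
struct magic_data_v1 { int a; int undocumented_member; int b; };

/* Layout from File 3. */
struct magic_data_v3 { int a; int b; int c; };

/*
 * Read "b" from an object built by a File 1 library, but at the offset
 * that code compiled against File 3 would use.
 */
int read_b_with_v3_header(const struct magic_data_v1 *made_by_library)
{
    int b;
    assert(made_by_library != NULL);
    memcpy(&b,
           (const char *)made_by_library + offsetof(struct magic_data_v3, b),
           sizeof b);
    return b;
}
```

For an object initialized as { 1, 99, 2 }, the helper returns 99 (the undocumented member) rather than 2, because File 3 places b directly after a while File 1 places it one int later.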

For dynamic libraries (.dll, .so) things are completely different: they started appearing on systems where multiple (application) programs can be loaded at the same time (which is not the case on DOS, but it is the case on Windows). It is wasteful to have the same implementation of a library function in memory multiple times, so it is loaded only once and multiple applications use it. For dynamic libraries, the code of the referenced function is not included in the executable file; only a reference to the function inside a dynamic library is included. (For Windows NE/PE, it is specified which DLL has to provide which function. For Unix .so files, only the function names and a set of libraries are specified.) The operating system contains a loader, also known as a dynamic linker, that resolves these references and loads dynamic libraries if they are not already in memory at the time a program is started.

GCC: how to find why an object file is not discarded

Use -Wl,-M to pass -M to the linker, causing it to print a link map. The map shows the reason (or at least the first-found reason) for every object file that gets linked from an archive.

What are the obj and bin folders (created by Visual Studio) used for?

The obj folder holds object, or intermediate, files, which are compiled binary files that haven't been linked yet. They're essentially fragments that will be combined to produce the final executable. The compiler generates one object file for each source file, and those files are placed into the obj folder.

The bin folder holds binary files, which are the actual executable code for your application or library.

Each of these folders is further subdivided into Debug and Release folders, which simply correspond to the project's build configurations. The two types of files discussed above are placed into the appropriate folder, depending on which type of build you perform. This makes it easy for you to determine which executables are built with debugging symbols, and which were built with optimizations enabled and ready for release.

Note that you can change where Visual Studio outputs your executable files during a compile in your project's Properties. You can also change the names and selected options for your build configurations.


