How Does the Compilation/Linking Process Work

How does the compilation/linking process work?

The compilation of a C++ program involves three steps:

  1. Preprocessing: the preprocessor takes a C++ source code file and deals with the #includes, #defines and other preprocessor directives. The output of this step is a "pure" C++ file without pre-processor directives.

  2. Compilation: the compiler takes the pre-processor's output and produces an object file from it.

  3. Linking: the linker takes the object files produced by the compiler and produces either a library or an executable file.
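As a minimal sketch, assuming a GCC-style toolchain and a single source file main.cpp (names chosen purely for illustration), the three steps can be run one at a time:

g++ -E main.cpp -o main.ii   # 1. preprocess: expand #include and #define, output "pure" C++
g++ -c main.ii -o main.o     # 2. compile (and assemble) into an object file
g++ main.o -o main           # 3. link the object file(s) into an executable

In everyday use, a single command such as g++ main.cpp -o main drives all three steps for you.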

Preprocessing

The preprocessor handles the preprocessor directives, like #include and #define. It is agnostic of the syntax of C++, which is why it must be used with care.

It works on one C++ source file at a time by replacing #include directives with the content of the respective files (which is usually just declarations), doing replacement of macros (#define), and selecting different portions of text depending on #if, #ifdef and #ifndef directives.

The preprocessor works on a stream of preprocessing tokens. Macro substitution is defined as replacing tokens with other tokens (the operator ## enables merging two tokens when it makes sense).
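For example, a (hypothetical) macro using ## to paste tokens together might look like this; the names are made up purely for illustration:

#define MAKE_GETTER(name) int get_##name() { return name; }

int width = 42;
MAKE_GETTER(width)   // the preprocessor pastes tokens to produce: int get_width() { return width; }

int main() { return get_width(); }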

After all this, the preprocessor produces a single output that is a stream of tokens resulting from the transformations described above. It also adds some special markers that tell the compiler where each line came from so that it can use those to produce sensible error messages.

Some errors can be produced at this stage with clever use of the #if and #error directives.
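A common pattern (shown here as a sketch) is to reject an unsupported configuration before compilation even starts:

#if !defined(__cplusplus)
#error "This file must be compiled as C++"
#elif __cplusplus < 201103L
#error "This file requires at least C++11"
#endif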

Compilation

The compilation step is performed on each output of the preprocessor. The compiler parses the pure C++ source code (now without any preprocessor directives) and converts it into assembly code. It then invokes the underlying back end (the assembler in the toolchain), which assembles that code into machine code, producing the actual binary file in some format (ELF, COFF, a.out, ...). This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.

Object files can refer to symbols that are not defined. This is the case when you use a declaration but don't provide a definition for it. The compiler doesn't mind this and will happily produce the object file as long as the source code is well-formed.
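A minimal sketch (the function compute is invented for this example): this file compiles to an object file even though compute is declared but never defined here; the object file just records an unresolved reference that the linker must satisfy later.

// use_compute.cpp (hypothetical)
int compute(int x);        // declaration only -- no definition in this translation unit

int use_it() {
    return compute(21);    // fine at compile time; the definition must appear at link time
}

Compiling it with g++ -c use_compute.cpp succeeds; producing an executable from it alone would not.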

Compilers usually let you stop compilation at this point. This is very useful because it means you can compile each source file separately. The advantage it provides is that you don't need to recompile everything if you change only a single file.

The produced object files can be put in special archives called static libraries, for easier reuse later on.
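For instance, with a GNU-style toolchain the ar archiver bundles object files into a static library that can be handed to the linker later (all file and library names here are invented):

g++ -c foo.cpp -o foo.o
g++ -c bar.cpp -o bar.o
ar rcs libmylib.a foo.o bar.o        # pack the object files into a static library
g++ main.o -L. -lmylib -o my_prog    # later, link a program against the archive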

It's at this stage that "regular" compiler errors, like syntax errors or failed overload resolution errors, are reported.

Linking

The linker is what produces the final compilation output from the object files the compiler produced. This output can be either a shared (or dynamic) library (and while the name is similar, they haven't got much in common with static libraries mentioned earlier) or an executable.

It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.

At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don't exist (i.e. they are not written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.
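A sketch of both failure modes, with invented file names; the exact wording of the errors depends on the linker:

// a.cpp
int answer() { return 42; }

// b.cpp
int answer() { return 43; }          // defines the same symbol again

// main.cpp
int answer();                        // declaration only
int main() { return answer(); }

// g++ main.cpp a.cpp b.cpp   -> duplicate (multiple) definition of answer()
// g++ main.cpp               -> undefined reference to answer()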

What does linking in the compilation process actually do?

If I have a tiny C program (for example a hello world program)

Even your hello world program does use #include <stdio.h>, doesn't it? That means you're using a library, and the linking step is there to combine the necessary object code (here the library code) to create a binary for you.


For a detailed description of what the linking step does (and how it compares with compiling), see this question:

What is the difference between 'compiling and linking' and just 'compiling' (with g++)?

The code of a single C or C++ program may be split among multiple C or C++ files. Each of these files, together with the headers it includes, forms a translation unit.

Compiling transforms each translation unit to a special format representing the binary code that belongs to a single translation unit, along with some additional information to connect multiple units together.

For example, one could define a function in a file a.c and call it from a file b.c. The format places the binary code of the function into a.o and also records at what location the code of the function starts. Compiling b.c records all references to the function in b.o.

Linking connects references from b.o to the function from a.o, producing the final executable of the program.
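A minimal version of that a.c/b.c example (the function name add is invented to match the description):

/* a.c */
int add(int x, int y) { return x + y; }

/* b.c */
int add(int x, int y);                 /* declaration of the function defined in a.c */
int main(void) { return add(2, 3); }

/* gcc -c a.c           -> a.o contains the code of add and records where it starts */
/* gcc -c b.c           -> b.o records an unresolved reference to add               */
/* gcc a.o b.o -o prog  -> the linker connects that reference to the definition     */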

Splitting the translation process into two stages, compilation and linking, is done to improve translation speed. When you modify a single file from a set of a dozen files or so, only that one file needs to be compiled, while the remaining .o files can be reused from the previous translation.

Unclear on linking vs compilation

Preprocessing happens before compilation. The preprocessor takes one or more source files and outputs another source file, which is then compiled. The preprocessor is a text-to-text transformer; it has nothing to do with linking.

It is conceptually possible to dump everything into one source file using the preprocessor, and then compile it directly to an executable, skipping the stages of producing object files and linking them together. However, this would be extremely inconvenient in practice. Imagine a program of 100,000,000 lines of code (this includes all the standard library, all the platform libraries and all the third-party libraries). You need to change one line. Would you be willing to compile all 100,000,000 lines again? And when you make an error in that one line, would you do it again (and again and again and again and again)?

Some libraries are distributed entirely as header files. They do not need any binary files, and are compiled with your program every time the program is compiled. But not all libraries are like that. Some are too big to be compiled every time. Some are not written in C or C++ (they require bits of assembly language for example, or perhaps Fortran). Some cannot be distributed as source code because the vendors are unwilling to do so for copyright reasons. In all these cases, the solution is to compile the libraries to object files, and then distribute these object files together with headers that contain just interfaces (declarations with no definitions) of functions and variables they expose.

<iostream> that you mention is a mixed bag. In most implementations it contains both function definitions (templates and small inline functions) that you compile every time when your program is compiled, and declarations of external functions, whose definitions are compiled by the vendor and distributed as a precompiled library.

Compiling process in terms of a simple c++ program [duplicate]

This is a very broad question in general, but I'll try to answer as briefly as possible.
A typical language processing system has the following phases:

1. Preprocessing Phase - In this phase all preprocessor directives and macros are handled, and code free of them is generated. This involves replacing macro calls with the macro body and replacing the formal parameters with the actual parameters.

2. Compilation Phase - This has several smaller phases, such as lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, target code generation, etc.
The compilation phase may or may not produce assembly code; both approaches have their own pros and cons. We will assume in this discussion that assembly code is produced.

3. Assembly Phase - The assembler converts the output of the compiler to target code. Assemblers can be one-pass or two-pass in nature.

4. Linking Phase - The code that has been produced contains many references and calls to subroutines that are defined in other modules. Such modules are linked to the code in this phase, and addresses are assigned to the instructions that have outside references.

5. Loading Phase - In this phase, all the segments produced in the previous phase are loaded into RAM for actual execution, and control is passed to the first instruction.

All the components listed in this answer have many intricacies and sub-parts, so this is in no way a complete explanation of a language processor.

There are useful books on these topics by authors such as D. M. Dhamdhere, Tanenbaum, and Alfred Aho.

How exactly does linking work?

At this point you have an executable.

No. At this point, you have object files, which are not, in themselves, executable.

But if you actually run that executable what happens?

Something like this:

h2co3-macbook:~ h2co3$ clang -Wall -o quirk.o quirk.c -c
h2co3-macbook:~ h2co3$ chmod +x quirk.o
h2co3-macbook:~ h2co3$ ./quirk.o
-bash: ./quirk.o: Malformed Mach-o file

I told you it was not an executable.

Is the problem that you may have included *.h files, and those only contain function prototypes?

Pretty close, actually. A translation unit (.c file) is (generally) transformed to assembly/machine code that represents what it does. If it calls a function, then there will be a reference to that function in the file, but no definition.
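You can see this for yourself with a symbol-listing tool such as nm: symbols defined in the object file are listed with codes like T, while symbols that are merely referenced show up as U (undefined); the exact output format varies by platform.

clang -c quirk.c -o quirk.o
nm quirk.o        # look for the U entries: these are what the linker must resolve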

So if you actually call one of the functions from those files, it won't have a definition and your program will crash?

As I've stated, it won't even run. Let me repeat: an object file is not executable.

what exactly does linking do, under the hood? How does it find the .c file associated with the .h that you included [...]

It doesn't. It looks for other object files generated from .c files, and eventually libraries (which are essentially just collections of other object files).

And it finds them because you tell it what to look for. Assuming you have a project which consists of two .c files which call each other's functions, this won't work:

gcc -c file1.c -o file1.o
gcc -c file2.c -o file2.o
gcc -o my_prog file1.o

It will fail with a linker error: the linker won't find the definition of the functions implemented in file2.c (and file2.o). But this will work:

gcc -c file1.c -o file1.o
gcc -c file2.c -o file2.o
gcc -o my_prog file1.o file2.o

[...] and how does it inject that into your machine code?

Object files contain stub references (usually in the form of function entry point addresses or explicit, human-readable names) to the functions they call. Then the linker looks at each library and object file, finds the references (throwing an error if a function's definition couldn't be found), and substitutes the stub references with actual "call this function" machine code instructions. (Yes, this is largely simplified, but without you asking about a specific architecture and a specific compiler/linker, it's hard to be more precise...)

Is static when you actually recompile the source of the library for every executable you create?

No. Static linkage means that the machine code of the object files of a library is actually copied/merged into your final executable. Dynamic linkage means that a library is loaded into memory once, and the aforementioned stub function references are resolved by the operating system when your executable is launched. No machine code from the library will be copied into your final executable. (So here, the linker in the toolchain only does part of the job.)

The following may help you to achieve enlightenment: if you statically link an executable, it will be self-contained. It will run anywhere (on a compatible architecture anyway). If you link it dynamically, it will only run on a machine if that particular machine has all the libraries installed that the program references.
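A rough sketch with a GNU/Linux-style toolchain (library and file names invented): the first program gets the library's machine code copied in, while the second merely records a dependency on libmylib.so that the dynamic loader resolves at launch.

# static linking: the archive's machine code ends up inside the executable
gcc -c mylib.c -o mylib.o
ar rcs libmylib.a mylib.o
gcc main.c -L. -lmylib -o prog_static

# dynamic linking: build a shared library instead and link against it
gcc -fPIC -shared mylib.c -o libmylib.so
gcc main.c -L. -lmylib -o prog_dynamic   # at run time the loader must be able to find libmylib.so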

So you compile one executable library that is shared by all of your processes that use it? How is that possible, exactly? Wouldn't it be outside of the address space of the processes trying to access it?

The dynamic linker/loader component of the OS takes care of all of that.

Also, for dynamic linking, don't you still need to compile the library at some juncture in time?

As I've already mentioned: yes, it is already compiled. Then it is loaded at some point (typically when it's first used) into memory.

When is it compiled?

Some time before it could be used. Typically, a library is compiled, then installed to a location on your system so that the OS and the compiler/linker know about its existence, then you can start compiling (um, linking) programs that use that library. Not earlier.

How do compilation and linking at runtime happen?

Runtime compilation

The best (most well known) example I'm personally aware of is the just-in-time (JIT) compilation used by Java. As you might know, Java code is compiled into bytecode, which can be interpreted by the Java Virtual Machine (JVM). It's therefore different from, say, C++, which is first fully preprocessed, compiled and linked into an executable that can be run directly by the OS without any virtual machine.

The Java bytecode is instead interpreted by the VM, which maps it to processor-specific instructions. That said, the JVM does JIT compilation, which takes that bytecode and compiles it (during runtime) into machine code. Here we arrive at your second question: even in Java it can depend on which JVM you are using, but basically there are pieces of code called hotspots, the pieces of code that are run frequently and which may be compiled so that the application's performance improves. This is done during runtime because the normal (ahead-of-time) compiler does not have, or might not have, all the necessary data to make a proper judgement about which pieces of code are in fact run frequently. Therefore JIT requires some kind of runtime statistics gathering, which is done by the JVM in parallel with program execution.

What kinds of statistics are gathered and what can be optimised (compiled at runtime) depend on the implementation: you obviously cannot do everything a normal compiler would do, due to memory and time constraints - which partly answers the first question: you don't compile everything, and usually only a limited set of optimisations is supported in runtime compilation. You can try looking for such info, but in my experience it's usually very badly documented and hard to find (at least when it comes to official sources, as opposed to presentations/blogs etc.).

Runtime linking

The linker is a different pair of shoes. We cannot use the Java example any more, since Java doesn't really have a linker like C or C++ do (instead it has a class loader, which takes care of loading class files and putting it all together).

Usually linking is performed by a linker after the compilation step (static linking). This has pros (no external dependencies) and cons (a higher memory footprint, since a shared library cannot be used, and the need to relink your program when the library version changes).

Runtime linking (dynamic/late linking) is performed by the OS: it is the OS's dynamic linker/loader that first loads shared libraries and then attaches them to a running process. Furthermore, there are different types of dynamic linking: explicit and implicit. This has the benefits of not having to rebuild your program when the library version changes and of sharing a single copy of the library, but it also has drawbacks: what if you have different programs that use the same library but require different versions (look up "DLL hell")? So yes, those two concepts are quite different.
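Explicit dynamic linking means the running program itself asks the OS to load a library and look up symbols in it. On POSIX systems that is done with dlopen/dlsym (the library and function names below are invented for illustration); on Windows the equivalents are LoadLibrary/GetProcAddress:

#include <stdio.h>
#include <dlfcn.h>   /* POSIX dynamic loading API; link with -ldl on older glibc */

int main(void) {
    /* load the shared library at run time (explicit dynamic linking) */
    void *lib = dlopen("./libmylib.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* look up a function by name and call it through a function pointer */
    int (*add)(int, int) = (int (*)(int, int))dlsym(lib, "add");
    if (add) printf("%d\n", add(2, 3));

    dlclose(lib);
    return 0;
}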

Again, how it's all done, and how it's decided what should be linked and how, is OS-specific; for instance, Microsoft has the dynamic-link library (DLL) concept.


